Matcher

EDS-NLP simplifies the matching process by exposing an eds.matcher component that can match on terms or regular expressions.

Examples

Let us redefine the pipeline:

import edsnlp

nlp = edsnlp.blank("eds")

terms = dict(
    covid=["coronavirus", "covid19"],  # (1)
    patient="patient",  # (2)
)

regex = dict(
    covid=r"coronavirus|covid[-\s]?19|sars[-\s]cov[-\s]2",  # (3)
)

nlp.add_pipe(
    "eds.matcher",
    config=dict(
        terms=terms,
        regex=regex,
        attr="LOWER",
        term_matcher="exact",
        term_matcher_config={},
    ),
)

(1) Every key in the terms dictionary is mapped to a concept.
(2) The eds.matcher pipeline accepts either a list of expressions or a single expression.
(3) We can also define regular expression patterns.

This snippet is complete, and should run as is.

Patterns, be they terms or regex, are defined as dictionaries whose keys become the labels of the extracted entities. Each dictionary value is either a single expression or a list of expressions that match the concept.
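To make the shape of these dictionaries concrete, here is a minimal sketch that uses only Python's standard `re` module. It is not how eds.matcher is implemented: the `find_matches` helper and the `Patterns` alias defined here are hypothetical names introduced for illustration, and lowercasing the text is only a rough stand-in for `attr="LOWER"`. It shows that a label may map to one expression or to a list of expressions, and that each hit is reported under its dictionary key.

```python
import re
from typing import Dict, Iterator, List, Tuple, Union

# Illustrative alias: label -> a single expression or a list of expressions.
Patterns = Dict[str, Union[str, List[str]]]

regex: Patterns = dict(
    covid=r"coronavirus|covid[-\s]?19|sars[-\s]cov[-\s]2",
)


def find_matches(text: str, patterns: Patterns) -> Iterator[Tuple[str, int, int, str]]:
    """Yield (label, start, end, matched_text) for every regex hit.

    Hypothetical helper, not part of EDS-NLP.
    """
    # Rough stand-in for attr="LOWER"; assumes lowercasing keeps character offsets.
    lowered = text.lower()
    for label, expressions in patterns.items():
        # Accept a single expression as well as a list of expressions.
        if isinstance(expressions, str):
            expressions = [expressions]
        for expression in expressions:
            for match in re.finditer(expression, lowered):
                start, end = match.start(), match.end()
                yield label, start, end, text[start:end]


text = "The patient was admitted for Covid-19 (SARS-CoV-2)."
matches = list(find_matches(text, regex))
for label, start, end, matched in matches:
    print(label, start, end, matched)
```

Both hits come back under the `covid` label, which mirrors how the dictionary key becomes the entity label in the component.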

Parameters

PARAMETER	DESCRIPTION
`nlp`	The pipeline object. TYPE: `PipelineProtocol`
`name`	The name of the component. TYPE: `Optional[str]` DEFAULT: `'eds.matcher'`
`terms`	A dictionary of terms. TYPE: `Optional[Patterns]` DEFAULT: `None`
`regex`	A dictionary of regular expressions. TYPE: `Optional[Patterns]` DEFAULT: `None`
`attr`	The default attribute to use for matching. Can be overridden using the `terms` and `regex` configurations. TYPE: `str` DEFAULT: `TEXT`
`ignore_excluded`	Whether to skip excluded tokens (requires an upstream pipeline to mark excluded tokens). TYPE: `bool` DEFAULT: `False`
`ignore_space_tokens`	Whether to skip space tokens during matching. If this is enabled together with the "spaces"/"newline" option of `eds.normalizer` (enabled by default), you won't be able to match on newlines. TYPE: `bool` DEFAULT: `False`
`term_matcher`	The matcher to use for matching phrases. One of `exact` or `simstring`. TYPE: `Literal['exact', 'simstring']` DEFAULT: `exact`
`term_matcher_config`	Parameters of the matcher class. TYPE: `Dict[str, Any]` DEFAULT: `{}`
`span_setter`	How to set the spans in the doc. TYPE: `SpanSetterArg` DEFAULT: `{'ents': True}`

Authors and citation

The eds.matcher pipeline was developed by AP-HP's Data Science team.