Matcher

EDS-NLP simplifies the matching process by exposing an eds.matcher component that can match on terms or regular expressions.

Examples

Let us redefine the pipeline:

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")

terms = dict(
    covid=["coronavirus", "covid19"],  # (1)
    patient="patient",  # (2)
)

regex = dict(
    covid=r"coronavirus|covid[-\s]?19|sars[-\s]cov[-\s]2",  # (3)
)

nlp.add_pipe(
    eds.matcher(
        terms=terms,
        regex=regex,
        attr="LOWER",
        term_matcher="exact",
        term_matcher_config={},
    ),
)
  1. Every key in the terms dictionary is mapped to a concept.
  2. The eds.matcher pipeline expects a list of expressions, or a single expression.
  3. We can also define regular expression patterns.

This snippet is complete, and should run as is.

Patterns, be they terms or regex, are defined as dictionaries whose keys become the labels of the extracted entities. Each dictionary value is either a single expression or a list of expressions that match the concept.
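The dictionary convention described above can be illustrated with a short sketch in plain Python. It is not how eds.matcher is implemented: it relies only on the standard `re` module, and the names `Patterns`, `normalize` and `find_matches` are hypothetical helpers introduced for this illustration. The `patient` entry in the `regex` dictionary is likewise added here to show a single-expression value.

```python
import re
from typing import Dict, List, Tuple, Union

# Same shape as the `terms` and `regex` arguments: label -> expression(s).
# (Local alias for illustration only.)
Patterns = Dict[str, Union[str, List[str]]]


def normalize(patterns: Patterns) -> Dict[str, List[str]]:
    """Wrap single expressions in a list so every label maps to a list."""
    return {
        label: [exprs] if isinstance(exprs, str) else list(exprs)
        for label, exprs in patterns.items()
    }


def find_matches(text: str, regex: Patterns) -> List[Tuple[str, str]]:
    """Return (label, matched text) pairs, ordered by position in the text."""
    found = []
    for label, exprs in normalize(regex).items():
        for expr in exprs:
            # IGNORECASE loosely mimics matching on the LOWER attribute.
            for m in re.finditer(expr, text, flags=re.IGNORECASE):
                found.append((m.start(), label, m.group()))
    return [(label, span) for _, label, span in sorted(found)]


regex = dict(
    covid=r"coronavirus|covid[-\s]?19|sars[-\s]cov[-\s]2",  # one expression
    patient="patient",  # a single expression is also accepted
)

matches = find_matches("Le patient est atteint de Covid-19.", regex)
print(matches)  # [('patient', 'patient'), ('covid', 'Covid-19')]
```

Each match carries the dictionary key as its label, which mirrors how the keys of `terms` and `regex` become entity labels. Unlike this sketch, eds.matcher works on tokens and honours options such as attr and ignore_excluded.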

Parameters

PARAMETER DESCRIPTION
nlp

The pipeline object.

TYPE: PipelineProtocol

name

The name of the component.

terms

A dictionary of terms.

TYPE: Optional[Patterns] DEFAULT: None

regex

A dictionary of regular expressions.

TYPE: Optional[Patterns] DEFAULT: None

attr

The default attribute to use for matching. Can be overridden using the terms and regex configurations.

TYPE: str DEFAULT: TEXT

ignore_excluded

Whether to skip excluded tokens (requires an upstream pipeline to mark excluded tokens).

TYPE: bool DEFAULT: False

ignore_space_tokens

Whether to skip space tokens during matching.

You won't be able to match on newlines if this option is enabled and the "spaces"/"newline" option of eds.normalizer is also enabled (which it is by default).

DEFAULT: False

term_matcher

The matcher to use for matching phrases. One of exact or simstring.

TYPE: Literal['exact', 'simstring'] DEFAULT: exact

term_matcher_config

Parameters passed to the matcher class.

TYPE: Dict[str, Any] DEFAULT: {}

span_setter

How to set the spans in the doc.

TYPE: SpanSetterArg DEFAULT: {'ents': True}

Authors and citation

The eds.matcher pipeline was developed by AP-HP's Data Science team.