Matcher

EDS-NLP simplifies the matching process by exposing an eds.matcher pipeline that can match on terms or regular expressions.

Usage

Let us redefine the pipeline:

import spacy

nlp = spacy.blank("fr")

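terms = dict(
    covid=["coronavirus", "covid19"],  # each key becomes the label of the matched entities
    patient="patient",  # a value can be a single expression or a list of expressions
)

regex = dict(
    covid=r"coronavirus|covid[-\s]?19|sars[-\s]cov[-\s]2",  # regular expressions are declared the same way
)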

nlp.add_pipe(
    "eds.matcher",
    config=dict(
        terms=terms,
        regex=regex,
        attr="LOWER",
        term_matcher="exact",
        term_matcher_config={},
    ),
)

This snippet is complete, and should run as is.
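
To check what the covid regular expression above accepts before wiring it into a pipeline, it can be exercised with Python's standard re module. This is only a sketch under stated assumptions: case-insensitive matching is used to approximate attr="LOWER", the sample strings are made up for illustration, and eds.matcher itself may treat token boundaries differently.

```python
import re

# Same expression as the `covid` entry of the `regex` dictionary above.
COVID_REGEX = r"coronavirus|covid[-\s]?19|sars[-\s]cov[-\s]2"

# Assumption: IGNORECASE stands in for attr="LOWER". This is not how
# eds.matcher is implemented, only an approximation for quick checks.
pattern = re.compile(COVID_REGEX, flags=re.IGNORECASE)

# Hypothetical sample strings, chosen to probe the optional separators.
samples = ["covid19", "Covid-19", "covid 19", "SARS-CoV-2", "sarscov2", "grippe"]

# Map each sample to whether the whole string is accepted by the pattern.
results = {sample: bool(pattern.fullmatch(sample)) for sample in samples}

for sample, matched in results.items():
    print(f"{sample!r}: {'match' if matched else 'no match'}")
```

In this sketch, sarscov2 is rejected because the separators in the SARS-CoV-2 branch are mandatory ([-\s] rather than [-\s]?), whereas the separator in the covid 19 branch is optional.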

Configuration

The pipeline can be configured using the following parameters:

PARAMETER DESCRIPTION
terms

A dictionary of terms.

TYPE: Optional[Patterns] DEFAULT: None

regex

A dictionary of regular expressions.

TYPE: Optional[Patterns] DEFAULT: None

attr

The default attribute to use for matching. Can be overridden using the terms and regex configurations.

TYPE: str DEFAULT: 'TEXT'

ignore_excluded

Whether to skip excluded tokens (requires an upstream pipeline to mark excluded tokens).

TYPE: bool DEFAULT: False

ignore_space_tokens

Whether to skip space tokens during matching.

You won't be able to match on newlines if this is enabled and the "spaces"/"newline" option of eds.normalizer is enabled (which it is by default).

TYPE: bool DEFAULT: False

term_matcher

The matcher to use for matching phrases. One of exact or simstring.

TYPE: GenericTermMatcher DEFAULT: GenericTermMatcher.exact

term_matcher_config

Parameters passed to the matcher class.

TYPE: Dict[str, Any] DEFAULT: {}

Patterns, be they terms or regex, are defined as dictionaries where keys become the label of the extracted entities. Dictionary values are either a single expression or a list of expressions that match the concept (see the example above).
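
The shape of these dictionaries can be illustrated with a small standalone sketch. The normalize_patterns and find_labels helpers below are hypothetical and not part of EDS-NLP, and the patient expressions are invented for the example. They only show how a value that is either a single expression or a list can be handled uniformly, with the dictionary key used as the label.

```python
import re
from typing import Dict, List, Tuple, Union

# Simplified stand-in for the Patterns type (an assumption, not the real definition).
Patterns = Dict[str, Union[str, List[str]]]


def normalize_patterns(patterns: Patterns) -> Dict[str, List[str]]:
    """Wrap single expressions in a list so every label maps to a list."""
    return {
        label: [value] if isinstance(value, str) else list(value)
        for label, value in patterns.items()
    }


def find_labels(text: str, regex: Patterns) -> List[Tuple[str, str]]:
    """Return (label, matched text) pairs, grouped by label in dictionary order."""
    found = []
    for label, expressions in normalize_patterns(regex).items():
        for expression in expressions:
            for match in re.finditer(expression, text, flags=re.IGNORECASE):
                found.append((label, match.group()))
    return found


regex = dict(
    covid=r"coronavirus|covid[-\s]?19|sars[-\s]cov[-\s]2",  # single expression
    patient=["patiente?", "malade"],  # list of expressions (hypothetical)
)

normalized = normalize_patterns(regex)
matches = find_labels("Le patient est atteint de Covid-19.", regex)
print(matches)
```

Normalising every value to a list up front keeps the matching loop free of type checks, which is one plausible way to support both forms behind a single interface.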

Authors and citation

The eds.matcher pipeline was developed by AP-HP's Data Science team.