Contextual Matcher

EDS-NLP provides simple pattern matchers such as eds.matcher to extract regular expressions or specific phrases, or to perform lexical similarity matching on documents. However, some use cases require examining the context around a matched entity, either to filter out irrelevant matches or to enrich them with additional information. For example, to extract mentions of malignant cancers, we need to exclude matches that have “bénin” (benign) mentioned nearby. eds.contextual_matcher was built to address such needs.

Example

The following example demonstrates how to configure and use eds.contextual_matcher to extract mentions of solid cancers and lymphomas, while filtering out irrelevant mentions (e.g., benign tumors) and enriching entities with contextual information such as stage or metastasis status.

Let's dive in with the full code example:

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")

nlp.add_pipe(eds.sentences())
nlp.add_pipe(eds.normalizer())
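nlp.add_pipe(
    eds.contextual_matcher(
        patterns=[
            dict(
                terms=["cancer", "tumeur"],  # exact terms to match
                regex=[r"adeno(carcinom|[\s-]?k)", "neoplas", "melanom"],  # regexes to match
                regex_attr="NORM",  # match the regexes on the normalized text
                exclude=dict(
                    regex="benign|benin",  # discard the match if this is found...
                    window=3,  # ...within this context window
                ),
                assign=[
                    dict(
                        name="stage",  # key used in ent._.assigned
                        regex="stade (I{1,3}V?|[1234])",  # the captured group is stored
                        window="words[-10:10]",  # context window searched for the stage
                        replace_entity=False,  # keep the main match as the entity
                        reduce_mode=None,  # keep all matches in a list
                    ),
                    dict(
                        name="metastase",  # key used in ent._.assigned
                        regex="(metasta)",
                        window=10,
                        replace_entity=False,
                        reduce_mode="keep_last",  # keep only the furthest match
                    ),
                ],
                source="Cancer solide",  # exposed as ent._.source
            ),
            dict(
                regex=["lymphom", "lymphangio"],
                regex_attr="NORM",
                exclude=dict(
                    regex=["hodgkin"],
                    window=3,
                ),
                source="Lymphome",  # exposed as ent._.source
            ),
        ],
        label="cancer",  # label given to every extracted entity
    ),
)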

Let's explore some examples using this pipeline:

txt = "Le patient a eu un cancer il y a 5 ans"
doc = nlp(txt)
ent = doc.ents[0]

ent.label_
# Out: cancer

ent._.source
# Out: Cancer solide

ent.text, ent.start, ent.end
# Out: ('cancer', 5, 6)

Check exclusion with a benign mention:

txt = "Le patient a eu un cancer relativement bénin il y a 5 ans"
doc = nlp(txt)

doc.ents
# Out: ()
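
To make the exclusion step more concrete, here is a minimal standard-library sketch of the idea: look at a few tokens around the anchor and drop the match if the exclusion regex is found there. This is not the component's implementation. The helpers normalize and is_excluded, the whitespace tokenization, and the explicit before/after token counts are all assumptions made for illustration.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and strip accents, roughly imitating matching on NORM."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def is_excluded(tokens, anchor_index, pattern, before, after):
    """Return True if `pattern` matches in the token window around the anchor."""
    start = max(0, anchor_index - before)
    end = min(len(tokens), anchor_index + after + 1)
    context = " ".join(normalize(tok) for tok in tokens[start:end])
    return re.search(pattern, context) is not None


# Hypothetical whitespace tokenization, much cruder than a real tokenizer
benign_tokens = "Le patient a eu un cancer relativement bénin il y a 5 ans".split()
plain_tokens = "Le patient a eu un cancer il y a 5 ans".split()

benign_excluded = is_excluded(
    benign_tokens, benign_tokens.index("cancer"), r"benign|benin", before=0, after=3
)
plain_excluded = is_excluded(
    plain_tokens, plain_tokens.index("cancer"), r"benign|benin", before=0, after=3
)
```

In this sketch the accented "bénin" is matched by the unaccented regex only because the context is normalized first, which is the role played by regex_attr="NORM" in the pipeline above.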

Additional information extracted via assign configurations is available in the assigned attribute:

txt = "Le patient a eu un cancer de stade 3."
doc = nlp(txt)

doc.ents[0]._.assigned  
# Out: {'stage': ['3']}
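
The output above contains '3' rather than 'stade 3', which is consistent with the capturing group of the stage regex being the stored value. The following sketch reproduces that behaviour with Python's re module alone. It is an illustration of the regex, not of how the component builds the assigned attribute.

```python
import re

# Same regex as the "stage" assign key in the example pipeline
stage_regex = r"stade (I{1,3}V?|[1234])"
text = "Le patient a eu un cancer de stade 3."

# With a single capturing group, findall returns the group content of each match
assigned = {"stage": re.findall(stage_regex, text)}
print(assigned)
```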

Better control over the final extracted entities

Three main parameters refine how entities are extracted:

include_assigned

Following the previous example, if you want extracted entities to include the cancer stage or metastasis status (if found), set include_assigned=True in the pipe configuration.

For instance, from the sentence "Le patient a un cancer au stade 3":

  • If include_assigned=False, the extracted entity is "cancer"
  • If include_assigned=True, the extracted entity is "cancer au stade 3"
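
One way to picture include_assigned is as a span union: the final entity stretches from the earliest start to the latest end among the main match and its assigned matches. The sketch below works on character offsets, and the function final_span is hypothetical. It assumes that merging behaves like a simple union, which matches the example above but is not taken from the component's source.

```python
def final_span(main, assigned, include_assigned):
    """Return the (start, end) offsets of the final entity, end exclusive."""
    if not include_assigned or not assigned:
        return main
    start = min([main[0]] + [s for s, _ in assigned])
    end = max([main[1]] + [e for _, e in assigned])
    return (start, end)


text = "Le patient a un cancer au stade 3"
cancer_start = text.index("cancer")
main = (cancer_start, cancer_start + len("cancer"))
stage = (text.index("stade 3"), len(text))  # hypothetical assigned span

start, end = final_span(main, [stage], include_assigned=False)
without_assigned = text[start:end]

start, end = final_span(main, [stage], include_assigned=True)
with_assigned = text[start:end]
```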

reduce_mode

Sometimes, an assignment matches multiple times. For example, in the sentence "Le patient a un cancer au stade 3 et au stade 4", both "stade 3" and "stade 4" match the stage key. Depending on your use case:

  • reduce_mode=None (default): Keeps all matched extractions in a list
  • reduce_mode="keep_first": Keeps only the extraction closest to the main matched entity ("stade 3" in this case)
  • reduce_mode="keep_last": Keeps only the furthest extraction
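
The three modes can be sketched in plain Python as follows. The function reduce_matches is hypothetical and only mirrors the behaviour described in the list above: candidates are ordered by character distance to the main entity, which is an assumption about how "closest" is measured.

```python
import re

STAGE_REGEX = r"stade (I{1,3}V?|[1234])"


def reduce_matches(values_with_offsets, anchor_offset, reduce_mode=None):
    """Reduce (offset, value) pairs according to a reduce_mode-like setting."""
    # Order candidates from closest to furthest from the anchor entity
    ordered = sorted(values_with_offsets, key=lambda pair: abs(pair[0] - anchor_offset))
    values = [value for _, value in ordered]
    if reduce_mode is None:
        return values
    if not values:
        return None
    if reduce_mode == "keep_first":
        return values[0]
    if reduce_mode == "keep_last":
        return values[-1]
    raise ValueError(f"Unknown reduce_mode: {reduce_mode!r}")


text = "Le patient a un cancer au stade 3 et au stade 4"
anchor = text.index("cancer")
candidates = [(m.start(), m.group(1)) for m in re.finditer(STAGE_REGEX, text)]

all_stages = reduce_matches(candidates, anchor)
closest = reduce_matches(candidates, anchor, "keep_first")
furthest = reduce_matches(candidates, anchor, "keep_last")
```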

replace_entity

This parameter can be set to True for at most one assign key per pattern dictionary. When it is True, the span matched by that assign key replaces the main match as the extracted entity.

Example using "Le patient a un cancer au stade 3":

  • With replace_entity=True for the stage key, the entity extracted is "stade 3"
  • With replace_entity=False, the entity extracted remains "cancer"

Note: With replace_entity=True, if the corresponding assign key matches nothing, the entity is discarded.
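
The decision described above, including the discard case from the note, can be summarized with a small sketch. The function resolve_entity and its replace_key argument are hypothetical names used only for illustration. Entities are represented by their text, and None stands for a discarded entity.

```python
def resolve_entity(main_text, assigned_spans, replace_key=None):
    """Pick the final entity text, or None if the entity is discarded.

    `assigned_spans` maps an assign key to the text of its matched span.
    `replace_key` is the single assign key with replace_entity=True, if any.
    """
    if replace_key is None:
        # No key replaces the entity: keep the main match
        return main_text
    # The assigned span replaces the main match; a missing span discards it
    return assigned_spans.get(replace_key)


kept_main = resolve_entity("cancer", {"stage": "stade 3"})
replaced = resolve_entity("cancer", {"stage": "stade 3"}, replace_key="stage")
discarded = resolve_entity("cancer", {}, replace_key="stage")
```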

The primary configuration is provided in the patterns key as either a pattern dictionary or a list of pattern dictionaries.
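
Since patterns accepts either form, a single pattern dictionary and a one-element list are presumably equivalent. The sketch below shows that normalization with a hypothetical as_list helper. How the component normalizes its input internally is not described on this page, so treat this as an illustration only.

```python
def as_list(patterns):
    """Wrap a single pattern dictionary in a list; copy lists unchanged."""
    if isinstance(patterns, dict):
        return [patterns]
    return list(patterns)


lymphoma = dict(regex=["lymphom", "lymphangio"], regex_attr="NORM", source="Lymphome")

from_dict = as_list(lymphoma)
from_list = as_list([lymphoma])
```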

Parameters

PARAMETER DESCRIPTION
patterns
The patterns to match. Each pattern dictionary accepts the following keys:
span_getter

A span getter to pick the assigned spans from already extracted entities in the doc.

TYPE: Optional[SpanGetterArg]

regex

A single Regex or a list of Regexes

TYPE: ListOrStr

regex_attr

An attribute that overrides the given attr when matching with regexes.

TYPE: Optional[str]

regex_flags

Regex flags

TYPE: Union[RegexFlag, int]

terms

A single term or a list of terms (for exact matches)

exclude
One or more exclusion patterns. Each exclusion pattern accepts the following keys:
regex

A single Regex or a list of Regexes

TYPE: ListOrStr

regex_attr

An attribute that overrides the given attr when matching with regexes.

TYPE: Optional[str]

regex_flags

Regex flags

TYPE: RegexFlag

span_getter

A span getter to pick the assigned spans from already extracted entities.

TYPE: Optional[SpanGetterArg]

window

Context window to search for patterns around the anchor. Defaults to "sent" (i.e. the sentence of the anchor span).

TYPE: Optional[ContextWindow]

TYPE: AsList[SingleExcludeModel]

include
One or more inclusion patterns. Each inclusion pattern accepts the following keys:
regex

A single Regex or a list of Regexes

TYPE: ListOrStr

regex_attr

An attribute that overrides the given attr when matching with regexes.

TYPE: Optional[str]

regex_flags

Regex flags

TYPE: RegexFlag

span_getter

A span getter to pick the assigned spans from already extracted entities.

TYPE: Optional[SpanGetterArg]

window

Context window to search for patterns around the anchor. Defaults to "sent" (i.e. the sentence of the anchor span).

TYPE: Optional[ContextWindow]

TYPE: AsList[SingleIncludeModel]

assign
One or more assignment patterns. Each assignment pattern accepts the following keys:
span_getter

A span getter to pick the assigned spans from already extracted entities in the doc.

TYPE: Optional[SpanGetterArg]

regex

A single Regex or a list of Regexes

TYPE: ListOrStr

regex_attr

An attribute that overrides the given attr when matching with regexes.

TYPE: Optional[str]

regex_flags

Regex flags

TYPE: RegexFlag

window

Context window to search for patterns around the anchor. Defaults to "sent" (i.e. the sentence of the anchor span).

TYPE: Optional[ContextWindow]

replace_entity

If set to True, the match from the corresponding assign key is used as the entity instead of the main match. See the replace_entity section above.

TYPE: Optional[bool]

reduce_mode

Set how multiple assign matches are handled. See the documentation of the reduce_mode parameter

TYPE: Optional[Flags]

required

If set to True, the assign key must match for the extraction to be kept. If it does not match, the extraction is discarded.

TYPE: Optional[bool]

TYPE: AsList[SingleAssignModel]

source

A label describing the pattern

TYPE: str

TYPE: FullConfig

assign_as_span

Whether to store any extractions defined via the assign key as Spans or as strings

TYPE: bool DEFAULT: False

attr

Attribute to match on, e.g. TEXT, NORM, etc.

TYPE: str DEFAULT: NORM

ignore_excluded

Whether to skip excluded tokens during matching.

TYPE: bool DEFAULT: False

ignore_space_tokens

Whether to skip space tokens during matching.

TYPE: bool DEFAULT: False

alignment_mode

Overwrite alignment mode.

TYPE: str DEFAULT: expand

regex_flags

RegExp flags to use when matching, filtering and assigning

TYPE: Union[RegexFlag, int] DEFAULT: 0

include_assigned

Whether to include assign matches, if any, in the final entity

TYPE: bool DEFAULT: False

label_name

Deprecated, use label instead. The label to assign to the matched entities

TYPE: Optional[str] DEFAULT: None

label

The label to assign to the matched entities

TYPE: str DEFAULT: None

span_setter

How to set matches on the doc

TYPE: SpanSetterArg DEFAULT: {'ents': True}

Authors and citation

The eds.contextual_matcher pipeline component was developed by AP-HP's Data Science team.