Contextual Matcher

During feature extraction, it may be necessary to search for additional patterns in the neighborhood of the extracted entities, namely:

patterns to discard irrelevant entities
patterns to enrich these entities and store some information

For example, to extract mentions of non-benign cancers, we need to discard every extraction that has "benin" (the normalized form of the French "bénin") in its immediate neighborhood. Although such filtering is feasible with regular expressions alone, it would require modifying every one of them.

The ContextualMatcher lets you perform this extraction in a clear and concise way.

The configuration file

The whole ContextualMatcher pipeline component is basically defined as a list of pattern dictionaries. Let us see step by step how to build such a list using the example stated just above.

a. Finding mentions of cancer

To do this, we can build either a list of terms or a list of regular expressions. terms are used to search for exact matches in the text; while less flexible, this is faster than using regex. In our case we could use the following lists (which are by no means exhaustive):

terms = [
    "cancer",
    "tumeur",
]

regex = [
    r"adeno(carcinom|[\s-]?k)",
    "neoplas",
    "melanom",
]
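
To get a feel for what these lists catch, the sketch below applies them with Python's standard `re` module to a roughly normalized string. It is only an illustration: `simple_norm` and `find_mentions` are hypothetical helpers, not part of EDS-NLP, and the real component matches on tokens rather than on a raw string.

```python
import re
import unicodedata

terms = ["cancer", "tumeur"]
regex = [r"adeno(carcinom|[\s-]?k)", "neoplas", "melanom"]


def simple_norm(text):
    """Rough stand-in for a normalized attribute: lowercase, accents removed."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def find_mentions(text):
    """Return exact term matches first, then regex matches, on normalized text."""
    norm = simple_norm(text)
    found = [t for t in terms if re.search(rf"\b{re.escape(t)}\b", norm)]
    found += [m.group(0) for p in regex for m in re.finditer(p, norm)]
    return found


print(find_mentions("Adéno-K du côlon, sans mélanome associé"))
# ['adeno-k', 'melanom']
```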

Maybe we want to exclude mentions of benign cancers:

benign = "benign|benin"

b. Finding mentions of a stage and extracting its value

For this we will build a regex with one capturing group (basically a pattern enclosed in parentheses):

stage = "stade (I{1,3}V?|[1234])"

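This captures a stage between 1 and 4, written with Arabic digits or Roman numerals. Note that the Roman branch is permissive: it also accepts malformed numerals such as "IIV".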
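
As a quick check of the capturing group, the following sketch runs the pattern with Python's `re` module on a few made-up phrases. `extract_stage` is a hypothetical helper used only for this illustration.

```python
import re

stage = "stade (I{1,3}V?|[1234])"


def extract_stage(text):
    """Return the content of the capturing group, or None if nothing matches."""
    match = re.search(stage, text)
    return match.group(1) if match else None


for text in ["cancer de stade 3", "cancer au stade IV", "cancer sans precision"]:
    print(text, "->", extract_stage(text))
# cancer de stade 3 -> 3
# cancer au stade IV -> IV
# cancer sans precision -> None
```

Only the part inside the parentheses is returned, which is why the word "stade" does not appear in the extracted value.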

We can add a second regex to try to capture whether the cancer is metastatic:

metastase = "(metasta)"

c. The complete configuration

We can now put everything together:

cancer = dict(
    source="Cancer solide",
    regex=regex,
    terms=terms,
    regex_attr="NORM",
    exclude=dict(
        regex=benign,
        window=3,
    ),
    assign=[
        dict(
            name="stage",
            regex=stage,
            window=(-10, 10),
            replace_entity=False,
            reduce_mode=None,
        ),
        dict(
            name="metastase",
            regex=metastase,
            window=10,
            replace_entity=False,
            reduce_mode="keep_last",
        ),
    ],
)

Here the configuration consists of a single dictionary. We might want to also include lymphoma in the matcher:

lymphome = dict(
    source="Lymphome",
    regex=["lymphom", "lymphangio"],
    regex_attr="NORM",
    exclude=dict(
        regex=["hodgkin"],  # excludes "Lymphome de Hodgkin"
        window=3,
    ),
)

In this case, the two configurations can be combined into a list:

patterns = [cancer, lymphome]

Available parameters for more flexibility

Three main parameters can be used to refine how entities are formed.

The `include_assigned` parameter

Following the previous example, you might want your extracted entities to include, if found, the cancer stage and the metastasis status. This can be achieved by setting include_assigned=True in the pipe configuration.

For instance, from the sentence "Le patient a un cancer au stade 3", the extracted entity will be:

"cancer" if include_assigned=False
"cancer au stade 3" if include_assigned=True

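One way to picture this is as a merge of token spans. The sketch below is a simplified model and not the library's implementation: spans are hypothetical `(start, end)` token offsets, and `entity_span` is a helper invented for the illustration.

```python
def entity_span(main, assigned, include_assigned):
    """Return the (start, end) token span of the final entity.

    main: (start, end) span of the main match.
    assigned: list of (start, end) spans found by the assign keys.
    """
    if not include_assigned or not assigned:
        return main
    # Stretch the entity so that it covers the main match and every assigned match
    start = min(main[0], *(s for s, _ in assigned))
    end = max(main[1], *(e for _, e in assigned))
    return (start, end)


tokens = "Le patient a un cancer au stade 3".split()
main = (4, 5)        # "cancer"
assigned = [(6, 8)]  # "stade 3"

for flag in (False, True):
    start, end = entity_span(main, assigned, include_assigned=flag)
    print(flag, "->", " ".join(tokens[start:end]))
# False -> cancer
# True -> cancer au stade 3
```
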
The `reduce_mode` parameter

It may happen that an assignment matches more than once. For instance, in the (nonsensical) sentence "Le patient a un cancer au stade 3 et au stade 4", both "stade 3" and "stade 4" will be matched by the stage assign key. Depending on your use case, you may want to keep all the extractions, or just one.

If reduce_mode=None (default), all extractions are kept in a list
If reduce_mode="keep_first", only the extraction closest to the main matched entity will be kept (in this case, it would be "stade 3" since it is the closest to "cancer")
If reduce_mode="keep_last", only the extraction furthest from the main matched entity is kept (here, "stade 4").
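
The three behaviours can be sketched in plain Python. This is a simplified model of the description above, not the library's code: `reduce_matches` is a hypothetical helper, and each assigned match is represented as a made-up `(token_index, text)` pair.

```python
def reduce_matches(matches, entity_index, reduce_mode=None):
    """Apply a reduce_mode-like rule to a list of (token_index, text) matches."""
    if reduce_mode is None:
        # Keep every extraction, as a list
        return [text for _, text in matches]
    if not matches:
        return None
    # Sort by distance (in tokens) to the main matched entity
    by_distance = sorted(matches, key=lambda m: abs(m[0] - entity_index))
    if reduce_mode == "keep_first":
        return by_distance[0][1]   # closest extraction
    if reduce_mode == "keep_last":
        return by_distance[-1][1]  # furthest extraction
    raise ValueError(f"Unknown reduce_mode: {reduce_mode!r}")


# "Le patient a un cancer au stade 3 et au stade 4"
# "cancer" is token 4, "stade 3" starts at token 6, "stade 4" at token 10
matches = [(6, "stade 3"), (10, "stade 4")]

print(reduce_matches(matches, 4))                # ['stade 3', 'stade 4']
print(reduce_matches(matches, 4, "keep_first"))  # stade 3
print(reduce_matches(matches, 4, "keep_last"))   # stade 4
```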

The `replace_entity` parameter

This parameter can be set to True for only one assign key per dictionary. This limitation comes from the purpose of the parameter: if set to True, the match of the corresponding assign key is returned as the entity, instead of the main match itself. For clarity, let's take the same sentence "Le patient a un cancer au stade 3" as an example:

if replace_entity=True in the stage assign key, then the extracted entity will be "stade 3" instead of "cancer"
if replace_entity=False for every assign key, the returned entity will be, as expected, "cancer"

Note that with replace_entity set to True, if the corresponding assign key matches nothing, the entity is discarded.
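
The rule, including the discard case, can be summarized with a few lines of plain Python. `final_entity` is a hypothetical helper written for this illustration only; it works on strings, whereas the component works on spans.

```python
def final_entity(main_match, assigned_match, replace_entity):
    """Return the text kept as the entity, or None if the entity is discarded."""
    if not replace_entity:
        return main_match
    # With replace_entity=True, the assigned match takes the place of the main
    # match; if the assign key matched nothing (None), the entity is dropped.
    return assigned_match


print(final_entity("cancer", "stade 3", replace_entity=False))  # cancer
print(final_entity("cancer", "stade 3", replace_entity=True))   # stade 3
print(final_entity("cancer", None, replace_entity=True))        # None
```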

Examples

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")

nlp.add_pipe(eds.sentences())
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(
    eds.contextual_matcher(
        patterns=patterns,
        label="cancer",
    ),
)

Let us see what we can get from this pipeline with three examples: a simple match, an exclusion rule, and the extraction of additional information.

txt = "Le patient a eu un cancer il y a 5 ans"
doc = nlp(txt)
ent = doc.ents[0]

ent.label_
# Out: cancer

ent._.source
# Out: Cancer solide

ent.text, ent.start, ent.end
# Out: ('cancer', 5, 6)

Let us check that when a benign mention is present, the extraction is excluded:

txt = "Le patient a eu un cancer relativement bénin il y a 5 ans"
doc = nlp(txt)

doc.ents
# Out: ()
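
To make the role of the exclusion window more concrete, here is a minimal sketch in plain Python. It rests on an assumption that is not stated on this page: a positive window is taken to mean "the n tokens after the match" and a negative window "the n tokens before it". `is_excluded` is a hypothetical helper, and whitespace splitting stands in for real tokenization.

```python
import re


def is_excluded(tokens, match_index, exclude_regex, window):
    """Return True if exclude_regex matches in the window next to the match.

    Assumption: window > 0 looks at tokens after the match,
    window < 0 looks at tokens before it.
    """
    if window >= 0:
        context = tokens[match_index + 1 : match_index + 1 + window]
    else:
        context = tokens[max(0, match_index + window) : match_index]
    return re.search(exclude_regex, " ".join(context)) is not None


benign = "benign|benin"

with_benign = "le patient a eu un cancer relativement benin il y a 5 ans".split()
without_benign = "le patient a eu un cancer il y a 5 ans".split()

# "cancer" is token 5 in both sentences
print(is_excluded(with_benign, 5, benign, window=3))     # True
print(is_excluded(without_benign, 5, benign, window=3))  # False
```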

All the information extracted through the assign configuration can be found in the assigned attribute, in the form of a dictionary:

txt = "Le patient a eu un cancer de stade 3."
doc = nlp(txt)

doc.ents[0]._.assigned
# Out: {'stage': '3'}

Most of the configuration is provided through the patterns parameter, as a pattern dictionary or a list of pattern dictionaries.

The pattern dictionary

Description

A pattern is a nested dictionary. The examples on this page use the following keys: `source`, `terms`, `regex`, `regex_attr`, `exclude` and `assign`.

A full pattern dictionary example

dict(
    source="AVC",
    regex=[
        "accidents? vasculaires? cerebr",
    ],
    terms="avc",
    regex_attr="NORM",
    exclude=[
        dict(
            regex=["service"],
            window=3,
        ),
        dict(
            regex=[" a "],
            window=-2,
        ),
    ],
    assign=[
        dict(
            name="neo",
            regex=r"(neonatal)",
            expand_entity=True,
            window=3,
        ),
        dict(
            name="trans",
            regex="(transitoire)",
            expand_entity=True,
            window=3,
        ),
        dict(
            name="hemo",
            regex=r"(hemorragique)",
            expand_entity=True,
            window=3,
        ),
        dict(
            name="risk",
            regex=r"(risque)",
            expand_entity=False,
            window=-3,
        ),
    ],
)

Parameters

PARAMETER	DESCRIPTION
`patterns`	The configuration dictionary, or a list of such dictionaries. TYPE: `Union[Dict[str, Any], List[Dict[str, Any]]]`
`assign_as_span`	Whether to store any extractions defined via the `assign` key as Spans or as strings. TYPE: `bool` DEFAULT: `False`
`attr`	Attribute to match on, e.g. `TEXT`, `NORM`, etc. TYPE: `str` DEFAULT: `NORM`
`ignore_excluded`	Whether to skip excluded tokens during matching. TYPE: `bool` DEFAULT: `False`
`ignore_space_tokens`	Whether to skip space tokens during matching. TYPE: `bool` DEFAULT: `False`
`alignment_mode`	Override the alignment mode. TYPE: `str` DEFAULT: `expand`
`regex_flags`	RegExp flags to use when matching, filtering and assigning (see the flags of Python's `re` module). TYPE: `Union[RegexFlag, int]` DEFAULT: `0`
`include_assigned`	Whether to include any assign matches in the final entity. TYPE: `bool` DEFAULT: `False`
`label_name`	Deprecated, use `label` instead. The label to assign to the matched entities. TYPE: `Optional[str]` DEFAULT: `None`
`label`	The label to assign to the matched entities. TYPE: `str` DEFAULT: `None`
`span_setter`	How to set matches on the doc. TYPE: `SpanSetterArg` DEFAULT: `{'ents': True}`
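
Since `regex_flags` accepts either a `RegexFlag` or an `int`, flags from Python's standard `re` module can be combined with `|`. The sketch below only uses `re` to show the effect of such flags on a made-up pattern; how the component applies them internally is not covered here.

```python
import re

# Flags can be combined with "|"; the result is a RegexFlag that also behaves as an int
flags = re.IGNORECASE | re.MULTILINE

pattern = "stade (iv|[1234])"
text = "Cancer au Stade IV"

print(re.search(pattern, text) is None)           # True: case-sensitive by default
print(re.search(pattern, text, flags).group(1))   # IV
print(int(flags) == flags)                        # True
```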

Authors and citation

The eds.contextual_matcher pipeline component was developed by AP-HP's Data Science team.