Contextual Matcher

When extracting entities, it may be necessary to search for additional patterns in their neighborhood, namely:

  • patterns used to discard irrelevant entities
  • patterns used to enrich these entities and store additional information

For example, to extract mentions of non-benign cancers, we need to discard every extraction that mentions "bénin" in its immediate neighborhood. Although such filtering is feasible with regular expressions alone, it would require modifying each regular expression individually.

The ContextualMatcher makes it possible to perform this kind of extraction in a clear and concise way.

The configuration file

The whole ContextualMatcher pipeline is defined as a list of pattern dictionaries. Let us see, step by step, how to build such a list using the example stated above.

a. Finding mentions of cancer

To do this, we can provide either a list of terms or a list of regexes. terms are used to search for exact matches in the text; while less flexible, this is faster than using regexes. In our case, we could use the following lists (which are of course far from exhaustive):

terms = [
    "cancer",
    "tumeur",
]

regex = [
    "adeno(carcinom|[\s-]?k)",
    "neoplas",
    "melanom",
]
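
Note that the regexes above are written without accents: we will match them on the NORM attribute (via the regex_attr key below), so matching happens on a normalized version of the text. A minimal sketch of what this normalization looks like, assuming the normalizer pipe lowercases the text and strips accents:

import spacy

nlp = spacy.blank("fr")
nlp.add_pipe("normalizer")

doc = nlp("Mélanome bénin")
[token.norm_ for token in doc]
# Out: ['melanome', 'benin']  (assuming the default normalizer settings)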

Maybe we want to exclude mentions of benign cancers:

benign = "benign|benin"

b. Finding mentions of a stage and extracting its value

For this, we write a regex with a single capturing group (a pattern enclosed in parentheses):

stage = "stade (I{1,3}V?|[1234])"

This will extract stages between 1 and 4, written in either Arabic or Roman numerals.
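
As a quick sanity check, we can try this pattern with Python's re module on a small, already normalized string; the captured group is what the matcher will later store:

import re

stage = "stade (I{1,3}V?|[1234])"

re.findall(stage, "carcinome de stade 3, ou stade III")
# Out: ['3', 'III']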

We can add a second regex to capture whether the cancer is metastatic:

metastase = "(metasta)"

c. The complete configuration

We can now put everything together:

cancer = dict(
    source="Cancer solide",  # label describing this pattern
    regex=regex,
    terms=terms,
    regex_attr="NORM",  # match the regexes on the normalized text
    exclude=dict(
        regex=benign,
        window=3,  # look for the exclusion regex in the 3 tokens after the entity
    ),
    assign=[
        dict(
            name="stage",
            regex=stage,
            window=(-10, 10),  # from 10 tokens before to 10 tokens after the entity
            expand_entity=False,
        ),
        dict(
            name="metastase",
            regex=metastase,
            window=10,  # 10 tokens after the entity
            expand_entity=True,
        ),
    ],
)

Here the configuration consists of a single dictionary. We might also want to include lymphomas in the matcher:

lymphome = dict(
    source="Lymphome",
    regex=["lymphom", "lymphangio"],
    regex_attr="NORM",
    exclude=dict(
        regex=["hodgkin"],  # exclude mentions of "Lymphome de Hodgkin"
        window=3,
    ),
)

In this case, the two configurations can simply be gathered in a list:

patterns = [cancer, lymphome]

Usage

import spacy

nlp = spacy.blank("fr")

nlp.add_pipe("sentences")
nlp.add_pipe("normalizer")

nlp.add_pipe(
    "eds.contextual-matcher",
    name="Cancer",
    config=dict(
        patterns=patterns,
    ),
)

Let us see what we can get from this pipeline with a few examples.

txt = "Le patient a eu un cancer il y a 5 ans"
doc = nlp(txt)
ent = doc.ents[0]

ent.label_
# Out: Cancer

ent._.source
# Out: Cancer solide

ent.text, ent.start, ent.end
# Out: ('cancer', 5, 6)

Let us check that when a benign mention is present, the extraction is excluded:

txt = "Le patient a eu un cancer relativement bénin il y a 5 ans"
doc = nlp(txt)

doc.ents
# Out: ()

All the information extracted via the assign configuration can be found in the assigned extension attribute, in the form of a dictionary:

txt = "Le patient a eu un cancer de stade 3."
doc = nlp(txt)

doc.ents[0]._.assigned
# Out: {'stage': '3'}

Configuration

The pipeline can be configured using the following parameters:

| Parameter | Explanation | Default |
|---|---|---|
| patterns | Dictionary or list of dictionaries (see below) | |
| assign_as_span | Whether to store extractions defined via the assign key as Spans or as strings | False |
| attr | spaCy attribute to match on (e.g. NORM, LOWER) | "TEXT" |
| ignore_excluded | Whether to skip excluded tokens during matching | False |
| regex_flags | Regex flags to use when matching, filtering and assigning | 0 (default flags) |
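
These parameters are passed in the same config dictionary as patterns when adding the pipe. A sketch (the parameter values below are only illustrative):

nlp.add_pipe(
    "eds.contextual-matcher",
    name="Cancer",
    config=dict(
        patterns=patterns,
        assign_as_span=True,  # store assign extractions as Span objects
        attr="NORM",  # match terms on the normalized text
    ),
)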

However, most of the configuration is provided via the patterns key, as a pattern dictionary or a list of pattern dictionaries.

The pattern dictionary

Description

A pattern is a nested dictionary with the following keys (the key names below match the examples above):

  • source: a label describing the pattern
  • regex: a single regex or a list of regexes
  • regex_attr: an attribute overwriting the given attr when matching with regexes
  • terms: a single term or a list of terms (for exact matches)
  • exclude: a dictionary (or list of dictionaries) defining exclusion rules. Exclusion rules are given as regexes, and if a match is found in the surrounding context of an extraction, the extraction is removed. Each dictionary should have the following keys:
      • window: size of the context to use (in number of words). You can provide the window as:
          • a positive integer, in which case the context is taken after the extraction
          • a negative integer, in which case the context is taken before the extraction
          • a tuple of integers (start, end), in which case the context is the snippet from start tokens before the extraction to end tokens after the extraction
      • regex: a single regex or a list of regexes
  • assign: a dictionary (or list of dictionaries) used to refine the extraction. Similarly to the exclude key, each rule is applied on the context before and/or after the extraction, and should have the following keys:
      • name: a name (string) under which the extracted value is stored
      • window: size of the context to use (same semantics as above)
      • regex: a regex with a single capturing group; the captured value is what gets stored (cf. the stage example above)
      • expand_entity: if set to True, the initial entity's span is expanded to the furthest assign match

A full pattern dictionary example

dict(
    source="AVC",
    regex=[
        "accidents? vasculaires? cerebr",
    ],
    terms="avc",
    regex_attr="NORM",
    exclude=[
        dict(
            regex=["service"],
            window=3,
        ),
        dict(
            regex=[" a "],
            window=-2,
        ),
    ],
    assign=[
        dict(
            name="neo",
            regex=r"(neonatal)",
            expand_entity=True,
            window=3,
        ),
        dict(
            name="trans",
            regex="(transitoire)",
            expand_entity=True,
            window=3,
        ),
        dict(
            name="hemo",
            regex=r"(hemorragique)",
            expand_entity=True,
            window=3,
        ),
        dict(
            name="risk",
            regex=r"(risque)",
            expand_entity=False,
            window=-3,
        ),
    ]
)
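
Assuming this dictionary is stored in a variable (say avc, a name chosen here purely for illustration), it can be added to the pipeline exactly as in the cancer example above:

nlp.add_pipe(
    "eds.contextual-matcher",
    name="AVC",
    config=dict(patterns=avc),
)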

Authors and citation

The eds.contextual-matcher pipeline was developed by AP-HP's Data Science team.