# Contextual Matcher
During feature extraction, it may be necessary to search for additional patterns in the neighborhood of the extracted entities, namely:
- patterns to discard irrelevant entities
- patterns to enrich these entities and store some extra information

For example, to extract mentions of non-benign cancers, we need to discard all extractions that mention "bénin" in their immediate neighborhood. Although such filtering is feasible using regular expressions alone, it would require modifying each of the regular expressions. The ContextualMatcher allows us to perform this extraction in a clear and concise way.
## The configuration file
The whole ContextualMatcher pipeline is basically defined as a list of pattern dictionaries. Let us see step by step how to build such a list using the example stated just above.
### a. Finding mentions of cancer
To do this, we can build either a set of `terms` or a set of `regex`. `terms` are used to search for exact matches in the text: while less flexible, this is faster than using regular expressions. In our case we could use the following lists (which are of course not exhaustive):
```python
terms = [
    "cancer",
    "tumeur",
]

regex = [
    r"adeno(carcinom|[\s-]?k)",
    "neoplas",
    "melanom",
]
```
Maybe we want to exclude mentions of benign cancers:
```python
benign = "benign|benin"
```
### b. Finding a mention of a stage and extracting its value
For this we will forge a regex with one capturing group (basically a pattern enclosed in parentheses):

```python
stage = "stade (I{1,3}V?|[1234])"
```

This will extract stages between 1 and 4, written either in Roman or Arabic numerals.
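As a quick sanity check, independent of the spaCy pipeline, the capturing group of this pattern can be tested directly with Python's `re` module (the snippet below is purely illustrative):

```python
import re

stage = "stade (I{1,3}V?|[1234])"

# the capturing group isolates the stage value itself
re.search(stage, "cancer de stade 3").group(1)
# Out: '3'
re.search(stage, "stade IV").group(1)
# Out: 'IV'
```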
We can add a second regex to capture whether the cancer has reached a metastasis stage or not:

```python
metastase = "(metasta)"
```
### c. The complete configuration
We can now put everything together:
```python
cancer = dict(
    source="Cancer solide",
    regex=regex,
    terms=terms,
    regex_attr="NORM",
    exclude=dict(
        regex=benign,
        window=3,
    ),
    assign=[
        dict(
            name="stage",
            regex=stage,
            window=(-10, 10),
            expand_entity=False,
        ),
        dict(
            name="metastase",
            regex=metastase,
            window=10,
            expand_entity=True,
        ),
    ],
)
```
Here the configuration consists of a single dictionary. We might want to also include lymphoma in the matcher:
```python
lymphome = dict(
    source="Lymphome",
    regex=["lymphom", "lymphangio"],
    regex_attr="NORM",
    exclude=dict(
        regex=["hodgkin"],  # we are excluding "Lymphome de Hodgkin" here
        window=3,
    ),
)
```
In this case, the configuration can be concatenated in a list:
```python
patterns = [cancer, lymphome]
```
## Usage
```python
import spacy

nlp = spacy.blank("fr")
nlp.add_pipe("sentences")
nlp.add_pipe("normalizer")
nlp.add_pipe(
    "eds.contextual-matcher",
    name="Cancer",
    config=dict(
        patterns=patterns,
    ),
)
```
Let us see what we can get from this pipeline with a few examples.

```python
txt = "Le patient a eu un cancer il y a 5 ans"
doc = nlp(txt)
ent = doc.ents[0]

ent.label_
# Out: Cancer

ent._.source
# Out: Cancer solide

ent.text, ent.start, ent.end
# Out: ('cancer', 5, 6)
```
Let us check that when a benign mention is present, the extraction is excluded:
txt = "Le patient a eu un cancer relativement bénin il y a 5 ans"
doc = nlp(txt)
doc.ents
# Out: ()
All the information extracted via the provided `assign` configuration can be found in the `assigned` attribute, in the form of a dictionary:
txt = "Le patient a eu un cancer de stade 3."
doc = nlp(txt)
doc.ents[0]._.assigned
# Out: {'stage': '3'}
## Configuration
The pipeline can be configured using the following parameters:

| Parameter | Explanation | Default |
| --- | --- | --- |
| `patterns` | Dictionary or list of dictionaries. See below | |
| `assign_as_span` | Whether to store extractions defined via the `assign` key as Spans or as strings | `False` |
| `attr` | spaCy attribute to match on (e.g. `NORM`, `LOWER`) | `"TEXT"` |
| `ignore_excluded` | Whether to skip excluded tokens during matching | `False` |
| `regex_flags` | RegExp flags to use when matching, filtering and assigning (see the Python `re` module flags) | `0` (default flag) |
However, most of the configuration is provided through the `patterns` key, as a pattern dictionary or a list of pattern dictionaries.
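For illustration, here is a minimal sketch of how these top-level parameters and the `patterns` key fit together in `add_pipe`; the values shown are simply the defaults listed in the table above:

```python
nlp.add_pipe(
    "eds.contextual-matcher",
    name="Cancer",
    config=dict(
        patterns=patterns,      # pattern dictionary or list of pattern dictionaries
        assign_as_span=False,   # store assigned extractions as strings rather than Spans
        attr="TEXT",            # spaCy attribute to match on
        ignore_excluded=False,  # do not skip excluded tokens during matching
        regex_flags=0,          # default RegExp flags
    ),
)
```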
## The pattern dictionary
### Description
A pattern is a nested dictionary with the following keys:

- `source`: a label describing the pattern
- `regex`: a single regex or a list of regexes
- `regex_attr`: an attribute to overwrite the given `attr` when matching with regexes
- `terms`: a single term or a list of terms (for exact matches)
- `exclude`: a dictionary (or list of dictionaries) defining exclusion rules. Exclusion rules are given as regexes, and if a match is found in the surrounding context of an extraction, the extraction is removed. Each dictionary should have the following keys:
    - `window`: the size of the context to use (in number of words). You can provide the window as:
        - a positive integer, in which case the context is taken after the extraction
        - a negative integer, in which case the context is taken before the extraction
        - a tuple of integers `(start, end)`, in which case the context is the snippet from `start` tokens before the extraction to `end` tokens after the extraction
    - `regex`: a single regex or a list of regexes
- `assign`: a dictionary (or list of dictionaries) used to refine the extraction. Similarly to the `exclude` key, it is applied to the context before and after the extraction. Each dictionary should have the following keys:
    - `name`: a name (string)
    - `window`: the size of the context to use (in number of words), with the same syntax as for `exclude` above
    - `regex`: a dictionary where keys are labels and values are regexes with a single capturing group
    - `expand_entity`: if set to `True`, the initial entity's span is expanded to the furthest match from the `regex` dictionary
### A full pattern dictionary example
```python
dict(
    source="AVC",
    regex=[
        "accidents? vasculaires? cerebr",
    ],
    terms="avc",
    regex_attr="NORM",
    exclude=[
        dict(
            regex=["service"],
            window=3,
        ),
        dict(
            regex=[" a "],
            window=-2,
        ),
    ],
    assign=[
        dict(
            name="neo",
            regex=r"(neonatal)",
            expand_entity=True,
            window=3,
        ),
        dict(
            name="trans",
            regex="(transitoire)",
            expand_entity=True,
            window=3,
        ),
        dict(
            name="hemo",
            regex=r"(hemorragique)",
            expand_entity=True,
            window=3,
        ),
        dict(
            name="risk",
            regex=r"(risque)",
            expand_entity=False,
            window=-3,
        ),
    ],
)
```
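As a usage sketch, this pattern can be plugged into the pipeline exactly like the cancer example above; the `avc_pattern` variable below is just an illustrative name, holding a trimmed version of the dictionary:

```python
import spacy

# trimmed version of the full pattern dictionary above, kept short for readability
avc_pattern = dict(
    source="AVC",
    regex=["accidents? vasculaires? cerebr"],
    terms="avc",
    regex_attr="NORM",
)

nlp = spacy.blank("fr")
nlp.add_pipe("sentences")
nlp.add_pipe("normalizer")
nlp.add_pipe(
    "eds.contextual-matcher",
    name="AVC",
    config=dict(patterns=[avc_pattern]),
)
```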
## Authors and citation
The `eds.contextual-matcher` pipeline was developed by AP-HP's Data Science team.