Contextual Matcher
EDS-NLP provides simple pattern matchers like eds.matcher to match regular expressions or specific phrases, or to perform lexical similarity matching on documents. However, certain use cases require examining the context around matched entities to filter out irrelevant matches or enrich them with additional information. For example, to extract mentions of malignant cancers, we need to exclude matches that have "bénin" (benign) mentioned nearby. eds.contextual_matcher was built to address such needs.
Example
The following example demonstrates how to configure and use eds.contextual_matcher to extract mentions of solid cancers and lymphomas, while filtering out irrelevant mentions (e.g., benign tumors) and enriching entities with contextual information such as stage or metastasis status.
Let's dive in with the full code example:
import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(
    eds.contextual_matcher(
        patterns=[
            dict(
                terms=["cancer", "tumeur"],  # (1)!
                regex=[r"adeno(carcinom|[\s-]?k)", "neoplas", "melanom"],  # (2)!
                regex_attr="NORM",  # (3)!
                exclude=dict(
                    regex="benign|benin",  # (4)!
                    window=3,  # (5)!
                ),
                assign=[
                    dict(
                        name="stage",  # (6)!
                        regex="stade (I{1,3}V?|[1234])",  # (7)!
                        window="words[-10:10]",  # (8)!
                        replace_entity=False,  # (9)!
                        reduce_mode=None,  # (10)!
                    ),
                    dict(
                        name="metastase",  # (11)!
                        regex="(metasta)",  # (12)!
                        window=10,  # (13)!
                        replace_entity=False,  # (14)!
                        reduce_mode="keep_last",  # (15)!
                    ),
                ],
                source="Cancer solide",  # (16)!
            ),
            dict(
                regex=["lymphom", "lymphangio"],  # (17)!
                regex_attr="NORM",  # (18)!
                exclude=dict(
                    regex=["hodgkin"],  # (19)!
                    window=3,  # (20)!
                ),
                source="Lymphome",  # (21)!
            ),
        ],
        label="cancer",
    ),
)
1. Exact match terms (faster than regex, but less flexible)
2. Regex for flexible matching
3. Apply regex on normalized text
4. Regex to exclude benign mentions
5. Window size for exclusion check
6. Extract cancer stage
7. Stage regex pattern
8. Window range for stage extraction. Visit the documentation of ContextWindow for more information about this syntax.
9. Do not use these matches as replacement for the anchor (default behavior)
10. Keep all matches
11. Detect metastasis
12. Regex for metastasis detection
13. Window size for detection
14. Keep main entity
15. Keep furthest extraction
16. Optional source label for solid tumor. This can be useful to know which pattern matched the entity.
17. Regex patterns for lymphoma
18. Apply regex on normalized text
19. Exclude Hodgkin lymphoma
20. Window size for exclusion
21. Optional source label for lymphoma. This can be useful to know which pattern matched the entity.
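The stage expression used in the configuration can be checked in isolation with Python's standard re module. The following is a minimal sketch on plain strings: extract_stage and the sample sentences are hypothetical names introduced for illustration, and the pipeline itself matches on document tokens rather than on raw strings.

```python
import re

# Same expression as the "stage" assign regex in the configuration.
STAGE_REGEX = re.compile(r"stade (I{1,3}V?|[1234])")


def extract_stage(text):
    """Return the captured stage (hypothetical helper), or None without a match."""
    match = STAGE_REGEX.search(text)
    return match.group(1) if match else None


samples = [
    "cancer de stade 3",
    "tumeur au stade II",
    "melanome stade IV",
    "cancer sans precision",
]
stages = [extract_stage(sample) for sample in samples]
print(stages)  # ['3', 'II', 'IV', None]
```

Only the parenthesized group is returned, which is consistent with the pipeline storing '3' rather than 'stade 3' for the stage key.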
Let's explore some examples using this pipeline:
txt = "Le patient a eu un cancer il y a 5 ans"
doc = nlp(txt)
ent = doc.ents[0]
ent.label_
# Out: cancer
ent._.source
# Out: Cancer solide
ent.text, ent.start, ent.end
# Out: ('cancer', 5, 6)
Check exclusion with a benign mention:
txt = "Le patient a eu un cancer relativement bénin il y a 5 ans"
doc = nlp(txt)
doc.ents
# Out: ()
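The idea behind the exclusion check can be pictured with a small pure-Python model. This is only a sketch under stated assumptions, not EDS-NLP's implementation: is_excluded is a hypothetical helper, the text is lowercased and stripped of accents by hand to mimic the NORM attribute, tokens are split on whitespace, and the window is given as explicit token counts before and after the anchor.

```python
import re


def is_excluded(tokens, anchor_index, exclude_regex, before, after):
    """Return True if a token near the anchor matches exclude_regex.

    Hypothetical helper: looks at `before` tokens to the left and
    `after` tokens to the right of the anchor token.
    """
    start = max(0, anchor_index - before)
    end = min(len(tokens), anchor_index + 1 + after)
    context = tokens[start:anchor_index] + tokens[anchor_index + 1:end]
    return any(re.search(exclude_regex, token) for token in context)


# Hand-normalized version of the example sentence (assumption: mimics NORM).
tokens = "le patient a eu un cancer relativement benin il y a 5 ans".split()
anchor = tokens.index("cancer")

near = is_excluded(tokens, anchor, r"benign|benin", before=3, after=3)
far = is_excluded(tokens, anchor, r"benign|benin", before=1, after=1)
print(near, far)  # True False
```

With three tokens of context the word "benin" falls inside the window and the match is dropped; with a single token it is out of reach and the match would be kept.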
Additional information extracted via assign configurations is available in the assigned attribute:
txt = "Le patient a eu un cancer de stade 3."
doc = nlp(txt)
doc.ents[0]._.assigned # (1)!
# Out: {'stage': ['3']}
1. We get a list for 'stage' because reduce_mode is set to None (default). If you want to keep only the first or last match, set reduce_mode="keep_first" or reduce_mode="keep_last".
Better control over the final extracted entities
Three main parameters refine how entities are extracted:
include_assigned
Following the previous example, if you want extracted entities to include the cancer stage or metastasis status (if found), set include_assigned=True in the pipe configuration.
For instance, from the sentence "Le patient a un cancer au stade 3":
- If include_assigned=False, the extracted entity is "cancer"
- If include_assigned=True, the extracted entity is "cancer au stade 3"
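The effect of include_assigned on this sentence can be modelled as a span-merging step. The sketch below is an assumption drawn from the example, not the library's code: merge_spans is a hypothetical helper that works on (start, end) token offsets and returns the smallest span covering the anchor and every assigned match.

```python
def merge_spans(anchor, assigned, include_assigned):
    """Return the (start, end) token span of the final entity.

    Hypothetical helper: `anchor` and each item of `assigned` are
    (start, end) token offsets, end exclusive.
    """
    if not include_assigned or not assigned:
        return anchor
    starts = [anchor[0]] + [start for start, _ in assigned]
    ends = [anchor[1]] + [end for _, end in assigned]
    return (min(starts), max(ends))


tokens = "Le patient a un cancer au stade 3".split()
anchor = (4, 5)      # "cancer"
assigned = [(6, 8)]  # "stade 3"

without = merge_spans(anchor, assigned, include_assigned=False)
merged = merge_spans(anchor, assigned, include_assigned=True)
print(" ".join(tokens[without[0]:without[1]]))  # cancer
print(" ".join(tokens[merged[0]:merged[1]]))    # cancer au stade 3
```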
reduce_mode
Sometimes, an assignment matches multiple times. For example, in the sentence "Le patient a un cancer au stade 3 et au stade 4", both "stade 3" and "stade 4" match the stage key. Depending on your use case:
- reduce_mode=None (default): Keeps all matched extractions in a list
- reduce_mode="keep_first": Keeps only the extraction closest to the main matched entity ("stade 3" in this case)
- reduce_mode="keep_last": Keeps only the furthest extraction
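The three modes can be summarised with a toy function. It is a sketch, not the library's implementation: reduce_matches is a hypothetical helper, the matches are assumed to be ordered from closest to furthest from the main entity, and returning None for an empty list under keep_first or keep_last is an assumption made for the example.

```python
def reduce_matches(matches, reduce_mode=None):
    """Apply a reduce_mode to matches ordered from closest to furthest."""
    if reduce_mode is None:
        return list(matches)  # keep every extraction
    if not matches:
        return None  # assumption: nothing to keep
    if reduce_mode == "keep_first":
        return matches[0]  # closest to the main entity
    if reduce_mode == "keep_last":
        return matches[-1]  # furthest from the main entity
    raise ValueError(f"Unknown reduce_mode: {reduce_mode!r}")


# "Le patient a un cancer au stade 3 et au stade 4"
stage_matches = ["3", "4"]

print(reduce_matches(stage_matches))                # ['3', '4']
print(reduce_matches(stage_matches, "keep_first"))  # 3
print(reduce_matches(stage_matches, "keep_last"))   # 4
```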
replace_entity
This parameter can be set to True for only one assign key per dictionary. If set to True, the matched assignment replaces the main entity.
Example using "Le patient a un cancer au stade 3":
- With replace_entity=True for the stage key, the entity extracted is "stade 3"
- With replace_entity=False, the entity extracted remains "cancer"
Note: With replace_entity=True, if the corresponding assign key matches nothing, the entity is discarded.
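The rule described here, including the discard case from the note, fits in a few lines. The function below is a hypothetical illustration on plain strings, not part of EDS-NLP: it returns the text of the final entity, or None when the entity would be discarded.

```python
def final_entity(anchor_text, assigned_text, replace_entity):
    """Return the final entity text, or None if the entity is discarded.

    Hypothetical helper: `assigned_text` is None when the assign key
    matched nothing.
    """
    if not replace_entity:
        return anchor_text
    # With replace_entity=True, a missing assignment discards the entity.
    return assigned_text


kept = final_entity("cancer", "stade 3", replace_entity=False)
replaced = final_entity("cancer", "stade 3", replace_entity=True)
discarded = final_entity("cancer", None, replace_entity=True)
print(kept, replaced, discarded)  # cancer stade 3 None
```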
The primary configuration is provided in the patterns key as either a pattern dictionary or a list of pattern dictionaries.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| patterns | The patterns to match |
| assign_as_span | Whether to store any extractions defined via the … |
| attr | Attribute to match on, e.g. … |
| ignore_excluded | Whether to skip excluded tokens during matching |
| ignore_space_tokens | Whether to skip space tokens during matching |
| alignment_mode | Overwrite alignment mode |
| regex_flags | RegExp flags to use when matching, filtering and assigning |
| include_assigned | Whether to include any assign matches in the final entity |
| label_name | Deprecated, use … |
| label | The label to assign to the matched entities |
| span_setter | How to set matches on the doc |
Authors and citation
The eds.contextual_matcher pipeline component was developed by AP-HP's Data Science team.