Terminology
EDS-NLP simplifies the terminology matching process by exposing an eds.terminology pipeline that can match on terms or regular expressions.
The terminology matcher is very similar to the generic matcher, although the use case differs slightly. The generic matcher is designed to extract any entity, while the terminology matcher is specifically tailored towards high volume terminologies.
There are some key differences:
- It labels every matched entity with the same value, provided to the pipeline
- The keys provided in the regex and terms dictionaries are used as the kb_id_ of the entity, which handles fine-grained labelling
For instance, a terminology matcher could detect every drug mention under the top-level label drug, and link each individual mention to a given drug through its kb_id_ attribute.
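To make the split between the shared label and the per-concept kb_id_ concrete, here is a minimal pure-Python sketch. It does not use EDS-NLP and is not how eds.terminology is implemented: match_terminology, Match and the drug synonyms are hypothetical names chosen for illustration, and matching is a naive case-insensitive whole-word search.

```python
import re
from typing import Dict, List, NamedTuple


class Match(NamedTuple):
    text: str
    label: str  # shared top-level label, identical for every match
    kb_id: str  # dictionary key of the concept that matched


def match_terminology(
    text: str, label: str, terms: Dict[str, List[str]]
) -> List[Match]:
    """Toy matcher (assumption, not EDS-NLP code): whole-word, case-insensitive."""
    found = []
    for kb_id, synonyms in terms.items():
        for synonym in synonyms:
            pattern = r"\b" + re.escape(synonym) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((m.start(), Match(m.group(), label, kb_id)))
    # Return matches in order of appearance in the text
    return [match for _, match in sorted(found, key=lambda item: item[0])]


# Illustrative terminology: each key is a concept, each value lists its synonyms
drug_terms = {
    "paracetamol": ["paracetamol", "doliprane"],
    "ibuprofen": ["ibuprofen", "advil"],
}

matches = match_terminology(
    "Patient took Doliprane, then switched to ibuprofen.",
    label="drug",
    terms=drug_terms,
)
for match in matches:
    print(match.text, match.label, match.kb_id)
```

Every match carries the same label (drug), while kb_id tells the concepts apart, so a brand name and its generic name resolve to the same identifier.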
Examples
Let us redefine the pipeline:
import edsnlp

nlp = edsnlp.blank("eds")

terms = dict(
    covid=["coronavirus", "covid19"],  # (1)
    flu=["grippe saisonnière"],  # (2)
)

regex = dict(
    covid=r"coronavirus|covid[-\s]?19|sars[-\s]cov[-\s]2",  # (3)
)

nlp.add_pipe(
    "eds.terminology",
    config=dict(
        label="disease",
        terms=terms,
        regex=regex,
        attr="LOWER",
    ),
)
1. Every key in the terms dictionary is mapped to a concept.
2. The eds.matcher pipeline expects a list of expressions, or a single expression.
3. We can also define regular expression patterns.
This snippet is complete, and should run as is.
Parameters
PARAMETER | DESCRIPTION |
---|---|
nlp | The pipeline object TYPE: |
terms | A dictionary of terms. TYPE: |
regex | A dictionary of regular expressions. TYPE: |
attr | The default attribute to use for matching. Can be overridden using the TYPE: |
ignore_excluded | Whether to skip excluded tokens (requires an upstream pipeline to mark excluded tokens). TYPE: |
ignore_space_tokens | Whether to skip space tokens during matching. DEFAULT: |
term_matcher | The matcher to use for matching phrases. One of exact or simstring. DEFAULT: |
term_matcher_config | Parameters of the matcher class DEFAULT: |
label | Label name to use for the |
span_setter | How to set matches on the doc TYPE: |
Patterns, be they terms or regex, are defined as dictionaries where keys become the kb_id_ of the extracted entities. Dictionary values are either a single expression or a list of expressions that match the concept (see example).
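Since a dictionary value may be either a single expression or a list, code that consumes such patterns typically normalizes them first. The helper below is a hedged sketch of that step: normalize_patterns is a hypothetical name, not an EDS-NLP function, and the library may handle this differently internally.

```python
from typing import Dict, List, Union

# A pattern dictionary maps a concept key to one expression or to several
Patterns = Dict[str, Union[str, List[str]]]


def normalize_patterns(patterns: Patterns) -> Dict[str, List[str]]:
    """Wrap bare strings in a list so every concept maps to a list of expressions."""
    return {
        kb_id: [expressions] if isinstance(expressions, str) else list(expressions)
        for kb_id, expressions in patterns.items()
    }


# Same pattern dictionaries as in the example above
terms = dict(
    covid=["coronavirus", "covid19"],
    flu=["grippe saisonnière"],
)
regex = dict(
    covid=r"coronavirus|covid[-\s]?19|sars[-\s]cov[-\s]2",
)

normalized_terms = normalize_patterns(terms)
normalized_regex = normalize_patterns(regex)
```

After normalization, both dictionaries have the same shape (concept key to list of expressions), so the single-expression regex entry and the list-valued terms entries can be iterated the same way.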
Authors and citation
The eds.terminology pipeline was developed by AP-HP's Data Science team.