Terminology

EDS-NLP simplifies terminology matching by exposing an eds.terminology pipeline that can match on terms or regular expressions.

The terminology matcher is very similar to the generic matcher, although the use case differs slightly. The generic matcher is designed to extract any entity, while the terminology matcher is specifically tailored to high-volume terminologies.

There are some key differences:

It assigns every matched entity the same label, which is provided to the pipeline.
The keys of the regex and terms dictionaries are used as the kb_id_ of the entity, which allows fine-grained labelling.

For instance, a terminology matcher could detect every drug mention under the top-level label drug, and link each individual mention to a given drug through its kb_id_ attribute.
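
The split between a shared label and a per-concept identifier can be sketched in plain Python. This is only an illustration of the idea, not EDS-NLP's implementation: the `Mention` tuple, the `match_terminology` helper and the drug terms are all hypothetical, and matching is done with the standard `re` module rather than with a spaCy pipeline.

```python
import re
from typing import Dict, List, NamedTuple


class Mention(NamedTuple):
    text: str
    label: str  # shared top-level label, e.g. "drug"
    kb_id: str  # dictionary key identifying the specific concept


def match_terminology(
    text: str, label: str, terms: Dict[str, List[str]]
) -> List[Mention]:
    """Find every term in `text` (case-insensitive), in order of appearance."""
    found = []
    for kb_id, expressions in terms.items():
        for expression in expressions:
            pattern = r"\b" + re.escape(expression) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((m.start(), Mention(m.group(), label, kb_id)))
    # Sort by start offset so mentions come out in reading order.
    return [mention for _, mention in sorted(found, key=lambda item: item[0])]


# Hypothetical drug terminology: the keys play the role of kb_id_.
drug_terms = {
    "paracetamol": ["paracétamol", "doliprane"],
    "ibuprofen": ["ibuprofène", "advil"],
}

mentions = match_terminology(
    "Patient sous Doliprane puis ibuprofène.", "drug", drug_terms
)
for mention in mentions:
    print(mention.text, mention.label, mention.kb_id)
```

Every mention carries the label "drug", while the key of the dictionary entry that matched tells the two drugs apart.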

Examples

Let us define the pipeline:

import edsnlp

nlp = edsnlp.blank("eds")

terms = dict(
    covid=["coronavirus", "covid19"],  # (1)
    flu=["grippe saisonnière"],  # (2)
)

regex = dict(
    covid=r"coronavirus|covid[-\s]?19|sars[-\s]cov[-\s]2",  # (3)
)

nlp.add_pipe(
    "eds.terminology",
    config=dict(
        label="disease",
        terms=terms,
        regex=regex,
        attr="LOWER",
    ),
)

(1) Every key in the terms dictionary is mapped to a concept.
(2) The eds.terminology pipeline expects a list of expressions, or a single expression.
(3) We can also define regular expression patterns.

This snippet is complete, and should run as is.
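
To see which surface forms the covid regular expression accepts, it can be exercised on its own with Python's standard `re` module. This is a rough check under stated assumptions: lowercasing the input only approximates attr="LOWER", the `covid_matches` helper is hypothetical, and the sample sentences are invented for illustration. It does not run the EDS-NLP pipeline.

```python
import re

# Same expression as the `covid` entry of the `regex` dictionary in the snippet above.
covid = re.compile(r"coronavirus|covid[-\s]?19|sars[-\s]cov[-\s]2")


def covid_matches(text: str) -> list:
    """Return the matched substrings, working on the lowercased text."""
    return [m.group() for m in covid.finditer(text.lower())]


print(covid_matches("Test COVID-19 positif"))
print(covid_matches("Infection à SARS-CoV-2"))
print(covid_matches("sarscov2"))  # no match: the separators are mandatory here
```

Note that the separator is optional in the covid19 branch (`[-\s]?`) but required in the sars-cov-2 branch (`[-\s]`), so an unseparated spelling such as "sarscov2" is not matched by this pattern.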

Parameters

PARAMETER	DESCRIPTION
`nlp`	The pipeline object TYPE: `PipelineProtocol`
`terms`	A dictionary of terms. TYPE: `Optional[Patterns]` DEFAULT: `None`
`regex`	A dictionary of regular expressions. TYPE: `Optional[Patterns]` DEFAULT: `None`
`attr`	The default attribute to use for matching. Can be overridden using the `terms` and `regex` configurations. TYPE: `str` DEFAULT: `TEXT`
`ignore_excluded`	Whether to skip excluded tokens (requires an upstream pipeline to mark excluded tokens). TYPE: `bool` DEFAULT: `False`
`ignore_space_tokens`	Whether to skip space tokens during matching. DEFAULT: `False`
`term_matcher`	The matcher to use for matching phrases. One of `exact` or `simstring`. DEFAULT: `exact`
`term_matcher_config`	Parameters of the matcher class DEFAULT: `None`
`label`	Label name to use for the `Span` object and the extension
`span_setter`	How to set matches on the doc TYPE: `SpanSetterArg` DEFAULT: `{'ents': True}`

Patterns, be they terms or regex, are defined as dictionaries where keys become the kb_id_ of the extracted entities. Dictionary values are either a single expression or a list of expressions that match the concept (see example).
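
Because a dictionary value may be either a single expression or a list, code that prepares patterns often normalises both shapes to a list first. The following is a minimal sketch of that step. The `PatternDict` alias and the `normalize_patterns` helper are hypothetical names introduced here, not part of the EDS-NLP API.

```python
from typing import Dict, List, Union

# Hypothetical alias mirroring the shape described above.
PatternDict = Dict[str, Union[str, List[str]]]


def normalize_patterns(patterns: PatternDict) -> Dict[str, List[str]]:
    """Wrap single expressions in a list so every value is a list of strings."""
    return {
        key: [value] if isinstance(value, str) else list(value)
        for key, value in patterns.items()
    }


normalized = normalize_patterns(
    {
        "covid": ["coronavirus", "covid19"],  # already a list
        "flu": "grippe saisonnière",  # single expression
    }
)
print(normalized)
```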

Authors and citation

The eds.terminology pipeline was developed by AP-HP's Data Science team.