Terminology

EDS-NLP simplifies terminology matching by exposing an eds.terminology pipeline that can match on terms or regular expressions.

The terminology matcher is very similar to the generic matcher, although the use case differs slightly. The generic matcher is designed to extract any entity, while the terminology matcher is specifically tailored to high-volume terminologies.

There are some key differences:

  1. It assigns the same label, the one provided to the pipeline, to every matched entity.
  2. The keys of the regex and terms dictionaries are used as the kb_id_ of the entity, which provides the fine-grained labelling.

For instance, a terminology matcher could detect every drug mention under the top-level label drug, and link each individual mention to a given drug through its kb_id_ attribute.
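The split between a shared label and a per-concept kb_id_ can be sketched in plain Python. This is only an illustration using the standard re module, not how EDS-NLP implements matching; the Match class, the match_terminology function, and the drug patterns are all hypothetical names introduced for this example.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Match:
    text: str
    label: str  # top-level label shared by every match
    kb_id: str  # key of the pattern that produced the match


def match_terminology(text: str, label: str, regex: Dict[str, str]) -> List[Match]:
    """Tag every regex hit with the shared label and the key of its pattern."""
    matches = []
    for kb_id, pattern in regex.items():
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append(Match(m.group(0), label, kb_id))
    return matches


# Hypothetical mini-terminology: each key stands for one drug concept.
drugs = {
    "paracetamol": r"parac[eé]tamol|doliprane",
    "ibuprofen": r"ibuprof[eè]ne?|advil",
}

found = match_terminology("Doliprane 1g puis Advil si besoin", "drug", drugs)
for m in found:
    print(m.text, m.label, m.kb_id)
```

Both mentions carry the label drug, while kb_id tells them apart (paracetamol for the first, ibuprofen for the second).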

Examples

Let us define the pipeline:

import edsnlp

nlp = edsnlp.blank("eds")

terms = dict(
    covid=["coronavirus", "covid19"],  # (1)
    flu=["grippe saisonnière"],  # (2)
)

regex = dict(
    covid=r"coronavirus|covid[-\s]?19|sars[-\s]cov[-\s]2",  # (3)
)

nlp.add_pipe(
    "eds.terminology",
    config=dict(
        label="disease",
        terms=terms,
        regex=regex,
        attr="LOWER",
    ),
)
  1. Every key in the terms dictionary is mapped to a concept.
  2. The eds.terminology pipeline expects a list of expressions, or a single expression.
  3. We can also define regular expression patterns.

This snippet is complete, and should run as is.

Parameters

PARAMETER DESCRIPTION
nlp

The pipeline object

TYPE: PipelineProtocol

terms

A dictionary of terms.

TYPE: Optional[Patterns] DEFAULT: None

regex

A dictionary of regular expressions.

TYPE: Optional[Patterns] DEFAULT: None

attr

The default attribute to use for matching. Can be overridden using the terms and regex configurations.

TYPE: str DEFAULT: TEXT

ignore_excluded

Whether to skip excluded tokens (requires an upstream pipeline to mark excluded tokens).

TYPE: bool DEFAULT: False

ignore_space_tokens

Whether to skip space tokens during matching.

DEFAULT: False

term_matcher

The matcher to use for matching phrases. One of exact or simstring.

DEFAULT: exact

term_matcher_config

Parameters of the matcher class

DEFAULT: None

label

Label name to use for the Span object and the extension

span_setter

How to set matches on the doc

TYPE: SpanSetterArg DEFAULT: {'ents': True}

Patterns, be they terms or regex, are defined as dictionaries where keys become the kb_id_ of the extracted entities. Dictionary values are either a single expression or a list of expressions that match the concept (see example).
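As a rough illustration of the two accepted value shapes, the sketch below brings a pattern dictionary into a uniform "key to list of expressions" form. It is plain Python and makes no claim about how EDS-NLP handles this internally; PatternDict and normalize_patterns are hypothetical names, and the library's own Patterns type may be defined differently.

```python
from typing import Dict, List, Union

# Hypothetical alias for illustration; the library's own Patterns type may differ.
PatternDict = Dict[str, Union[str, List[str]]]


def normalize_patterns(patterns: PatternDict) -> Dict[str, List[str]]:
    """Wrap single expressions in a list so that every key maps to a list."""
    return {
        kb_id: [exprs] if isinstance(exprs, str) else list(exprs)
        for kb_id, exprs in patterns.items()
    }


terms = {
    "covid": ["coronavirus", "covid19"],  # a list of expressions
    "flu": "grippe saisonnière",  # a single expression
}

normalized = normalize_patterns(terms)
print(normalized)
```

Either way, each dictionary key (covid, flu) is what ends up as the kb_id_ of the entities matched by its expressions.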

Authors and citation

The eds.terminology pipeline was developed by AP-HP's Data Science team.