Matching a terminology

Matching a terminology is perhaps the most basic application of a medical NLP pipeline.

In this tutorial, we will cover:

Matching a terminology using spaCy's matchers, as well as RegExps
Matching on a specific attribute

Consider reading the matcher's specific documentation for a full description.

Comparison to spaCy's matcher

spaCy's Matcher and PhraseMatcher use a very efficient algorithm that compares hashed representations token by token. They are not components by themselves, but can underpin rule-based pipes.

EDS-NLP's RegexMatcher lets the user match entire expressions using regular expressions. To achieve this, the matcher has to get to the text representation, match on it, and get back to spaCy's abstraction.
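
The round trip described above (text, regular expression, back to tokens) can be sketched in plain Python. This is only an illustration under simplifying assumptions, not EDS-NLP's implementation: the `tokenize` and `regex_to_token_spans` helpers are hypothetical, and the tokenizer is a naive stand-in for spaCy's.

```python
import re


def tokenize(text):
    """Naive tokenizer (an assumption, not spaCy's): words and punctuation with offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def regex_to_token_spans(text, pattern):
    """Run a regex on the raw text, then map each match back to token indices."""
    tokens = tokenize(text)
    spans = []
    for match in re.finditer(pattern, text):
        # Indices of the tokens that overlap the character range of the match
        covered = [
            i
            for i, (_, start, end) in enumerate(tokens)
            if start < match.end() and end > match.start()
        ]
        if covered:
            spans.append((covered[0], covered[-1] + 1))
    return spans


text = "probable pneumopathie a covid 19, sans fièvre"
tokens = tokenize(text)
spans = regex_to_token_spans(text, r"covid[-\s]?19")
matched = [[tokens[i][0] for i in range(start, end)] for start, end in spans]
print(spans)    # [(3, 5)]
print(matched)  # [['covid', '19']]
```

The regular expression works on characters, while the result is expressed in token indices, which is what lets the match be turned back into a span of the document.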

The EDSPhraseMatcher lets EDS-NLP reuse spaCy's efficient algorithm, while adding the ability to skip pollution tokens (see the normalizer documentation for details).

A simple use case: finding COVID19

Let's try to find mentions of COVID19 and of respiratory terms within a clinical note.

import edsnlp, edsnlp.pipes as eds

text = (
    "Motif de prise en charge : probable pneumopathie a COVID19, "
    "sans difficultés respiratoires\n"
    "Le père du patient est asthmatique."
)

terms = dict(
    covid=["coronavirus", "covid19"],
    respiratoire=["asthmatique", "respiratoire"],
)

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.matcher(terms=terms))

doc = nlp(text)

doc.ents
# Out: (asthmatique,)

Let's unpack what happened:

We defined a dictionary of terms to look for, in the form {'label': list of terms}.
We declared a spaCy pipeline, and added the eds.matcher component.
We applied the pipeline to the text...
... and explored the extracted entities.

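This example showcases a limitation of our term dictionary: the phrases COVID19 and difficultés respiratoires were not detected by the pipeline, because by default the matcher compares the verbatim text. The term covid19 does not match the uppercase COVID19, and respiratoire does not match the plural respiratoires.

To increase recall, we could just add every possible variation: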

terms = dict(
-    covid=["coronavirus", "covid19"],
+    covid=["coronavirus", "covid19", "COVID19"],
-    respiratoire=["asthmatique", "respiratoire"],
+    respiratoire=["asthmatique", "respiratoire", "respiratoires"],
)

But what if we come across Coronavirus? Surely we can do better!

Matching on normalised text

We can modify the matcher's configuration to match on other attributes instead of the verbatim input. You can refer to spaCy's list of available token attributes.

Let's focus on two:

The LOWER attribute, which lets you match on a lowercased version of the text.
The NORM attribute, which adds some basic normalisation (e.g. œ to oe). EDS-NLP provides an eds.normalizer component that extends the level of cleaning on the NORM attribute.

The `LOWER` attribute

Matching on the lowercased version is extremely easy:

import edsnlp, edsnlp.pipes as eds

text = (
    "Motif de prise en charge : probable pneumopathie a COVID19, "
    "sans difficultés respiratoires\n"
    "Le père du patient est asthmatique."
)

terms = dict(
    covid=["coronavirus", "covid19"],
    respiratoire=["asthmatique", "respiratoire", "respiratoires"],
)

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.matcher(
        terms=terms,
        # `attr` defines the token attribute that the matcher will use.
        # It is set to "TEXT" by default (i.e. the verbatim text).
        attr="LOWER",
    ),
)

doc = nlp(text)

doc.ents
# Out: (COVID19, respiratoires, asthmatique)

This code is complete, and should run as is.

Using the normalisation component

EDS-NLP provides its own normalisation component, which modifies the NORM attribute in place. It handles:

removal of accents;
normalisation of quotes and apostrophes;
lowercasing, which is enabled by default in spaCy (EDS-NLP lets you disable it);
removal of pollution.

Pollution in clinical texts

EDS-NLP is meant to be deployed on clinical reports extracted from hospital information systems. As such, these reports are often riddled with extraction issues or administrative artifacts that "pollute" the text.

As a core principle, EDS-NLP never modifies the input text, and nlp(text).text == text is always true. However, we can tag some tokens as pollution elements, and avoid using them for matching the terminology.
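
To make the idea concrete, here is a minimal sketch of term matching that skips tagged tokens while leaving the text untouched. It is written in plain Python under stated assumptions: the `Token` class, its `excluded` flag and the `find_term` helper are hypothetical stand-ins, not EDS-NLP's actual data structures or algorithm.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    excluded: bool = False  # hypothetical flag standing in for a pollution tag


def find_term(tokens, term, ignore_excluded=True):
    """Return (start, end) indices of the first match of `term` in `tokens`, or None."""
    # Keep the original index of every token that takes part in the matching
    kept = [
        (i, tok.text.lower())
        for i, tok in enumerate(tokens)
        if not (ignore_excluded and tok.excluded)
    ]
    words = [word.lower() for word in term]
    for k in range(len(kept) - len(words) + 1):
        window = kept[k : k + len(words)]
        if [word for _, word in window] == words:
            # The span is reported over the original tokens, pollution included
            return window[0][0], window[-1][0] + 1
    return None


tokens = [
    Token("pneumopathie"),
    Token("a"),
    Token("=====", excluded=True),
    Token("COVID19"),
]
term = ["pneumopathie", "a", "covid19"]

span = find_term(tokens, term)
print(span)  # (0, 4)
print(" ".join(tok.text for tok in tokens[span[0] : span[1]]))
print(find_term(tokens, term, ignore_excluded=False))  # None
```

The pollution token is never deleted: it is only left out of the comparison, so the returned span still covers it in the original token sequence.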

You can activate the normalisation component like any other component.

import edsnlp, edsnlp.pipes as eds

text = (
    # We've modified the example to include a simple pollution.
    "Motif de prise en charge : probable pneumopathie a ===== COVID19, "
    "sans difficultés respiratoires\n"
    "Le père du patient est asthmatique."
)

terms = dict(
    # We've added "pneumopathie à covid19" to the list of synonyms detected by the pipeline.
    # Note that the synonym keeps the accented "à", whereas the example text has an unaccented "a".
    covid=["coronavirus", "covid19", "pneumopathie à covid19"],
    respiratoire=["asthmatique", "respiratoire", "respiratoires"],
)

nlp = edsnlp.blank("eds")

# Add the normalisation component.
# It can be configured: see its specific documentation for details.
nlp.add_pipe(eds.normalizer())

nlp.add_pipe(
    eds.matcher(
        terms=terms,
        # The normalised text lives in the NORM attribute.
        attr="NORM",
        # Optionally, tell the matcher to ignore excluded tokens
        # (tokens tagged as pollution by the normalisation component).
        ignore_excluded=True,
    ),
)

doc = nlp(text)

doc.ents
# Out: (pneumopathie a ===== COVID19, respiratoires, asthmatique)

Using the normalisation component, you can match on a normalised version of the text, as well as skip pollution tokens during the matching process.

Using term matching with the normalisation

If you use the term matcher with the normalisation, bear in mind that the terms of your terminology go through the pipeline too. That's how the matcher was able to recover pneumopathie a ===== COVID19 despite the fact that we used an accented à in the terminology.

The term matcher matches the input text to the provided terminology, using the selected attribute in both cases. The NORM attribute that corresponds to à and a is the same: a.
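
The principle can be sketched with the standard library alone. The `normalise` function below is a hypothetical, rough stand-in for the NORM attribute (lowercasing plus accent removal only); it is not the code used by eds.normalizer.

```python
import unicodedata


def normalise(text):
    """Rough stand-in for NORM (an assumption): lowercase, then strip accents."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks (category "Mn"), which carry the accents after decomposition
    return "".join(char for char in decomposed if unicodedata.category(char) != "Mn")


term = "pneumopathie à covid19"     # as written in the terminology
snippet = "pneumopathie a COVID19"  # as written in the clinical note

print(normalise(term))     # pneumopathie a covid19
print(normalise(snippet))  # pneumopathie a covid19
print(normalise(term) == normalise(snippet))  # True
```

Because the same transformation is applied to both sides, the terminology can be written with or without accents and still match.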

Preliminary conclusion

We have matched all mentions! However, we had to spell out the singular and plural forms of respiratoire... And what if we wanted to detect covid 19, or covid-19? Of course, we could write out every imaginable possibility, but this will quickly become tedious.

Using regular expressions

Let us redefine the pipeline once again, this time using regular expressions, which let us define richer patterns with more compact queries.

import edsnlp, edsnlp.pipes as eds

text = (
    "Motif de prise en charge : probable pneumopathie a COVID19, "
    "sans difficultés respiratoires\n"
    "Le père du patient est asthmatique."
)

regex = dict(
    covid=r"(coronavirus|covid[-\s]?19)",
    respiratoire=r"respiratoires?",
)
terms = dict(respiratoire="asthmatique")

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.matcher(
        # We can now match using regular expressions.
        regex=regex,
        # We can mix and match patterns! Here we keep looking for
        # "asthmatique" using spaCy's term matching.
        terms=terms,
        # RegExp matching is not limited to the verbatim text! You can choose
        # one of spaCy's native attributes, ignore excluded tokens, etc.
        attr="LOWER",
    ),
)

doc = nlp(text)

doc.ents
# Out: (COVID19, respiratoires, asthmatique)
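
The coverage of the covid pattern can also be checked outside the pipeline, using Python's re module. This is a minimal sketch: lowercasing the input stands in for attr="LOWER", and the list of variants is made up for illustration.

```python
import re

# Same pattern as the `covid` entry of the `regex` dictionary above
pattern = re.compile(r"(coronavirus|covid[-\s]?19)")

# Hypothetical spellings, chosen for illustration only
variants = ["COVID19", "covid 19", "Covid-19", "Coronavirus", "covid 2019"]

results = {}
for variant in variants:
    # Lowercase first, mimicking a match on the LOWER attribute
    results[variant] = pattern.search(variant.lower()) is not None
    print(f"{variant!r}: {results[variant]}")
```

The first four spellings match, while "covid 2019" does not, since the pattern requires 19 right after the optional separator.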

To visualize extracted entities, check out the Visualization tutorial.