Matching a terminology
Matching a terminology is perhaps the most basic application of a medical NLP pipeline.
In this tutorial, we will cover :
- Matching a terminology using spaCy's matchers, as well as RegExps
- Matching on a specific attribute
You should consider reading the matcher's specific documentation for a description.
Comparison to spaCy's matcher
spaCy's Matcher
and PhraseMatcher
use a very efficient algorithm that compare a hashed representation token by token. They are not components by themselves, but can underpin rule-based pipes.
EDS-NLP's RegexMatcher
lets the user match entire expressions using regular expressions. To achieve this, the matcher has to get to the text representation, match on it, and get back to spaCy's abstraction.
The EDSPhraseMatcher
lets EDS-NLP reuse spaCy's efficient algorithm, while adding the ability to skip pollution tokens (see the normalizer documentation for detail)
A simple use case : finding COVID19
Let's try to find mentions of COVID19 and references to patients within a clinical note.
import edsnlp, edsnlp.pipes as eds
text = (
"Motif de prise en charge : probable pneumopathie a COVID19, "
"sans difficultés respiratoires\n"
"Le père du patient est asthmatique."
)
terms = dict(
covid=["coronavirus", "covid19"],
respiratoire=["asthmatique", "respiratoire"],
)
nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.matcher(terms=terms))
doc = nlp(text)
doc.ents
# Out: (asthmatique,)
Let's unpack what happened:
- We defined a dictionary of terms to look for, in the form
{'label': list of terms}
. - We declared a spaCy pipeline, and add the
eds.matcher
component. - We applied the pipeline to the texts...
- ... and explored the extracted entities.
This example showcases a limitation of our term dictionary : the phrases COVID19
and difficultés respiratoires
were not detected by the pipeline.
To increase recall, we could just add every possible variation :
terms = dict(
- covid=["coronavirus", "covid19"],
+ covid=["coronavirus", "covid19", "COVID19"],
- respiratoire=["asthmatique", "respiratoire"],
+ respiratoire=["asthmatique", "respiratoire", "respiratoires"],
)
But what if we come across Coronavirus
? Surely we can do better!
Matching on normalised text
We can modify the matcher's configuration to match on other attributes instead of the verbatim input. You can refer to spaCy's list of available token attributes.
Let's focus on two:
- The
LOWER
attribute, which lets you match on a lowercased version of the text. - The
NORM
attribute, which adds some basic normalisation (egœ
tooe
). EDS-NLP provides aeds.normalizer
component that extends the level of cleaning on theNORM
attribute.
The LOWER
attribute
Matching on the lowercased version is extremely easy:
import edsnlp, edsnlp.pipes as eds
text = (
"Motif de prise en charge : probable pneumopathie a COVID19, "
"sans difficultés respiratoires\n"
"Le père du patient est asthmatique."
)
terms = dict(
covid=["coronavirus", "covid19"],
respiratoire=["asthmatique", "respiratoire", "respiratoires"],
)
nlp = edsnlp.blank("eds")
nlp.add_pipe(
eds.matcher(
terms=terms,
attr="LOWER", # (1)
),
)
doc = nlp(text)
doc.ents
# Out: (COVID19, respiratoires, asthmatique)
- The matcher's
attr
parameter defines the attribute that the matcher will use. It is set to"TEXT"
by default (ie verbatim text).
This code is complete, and should run as is.
Using the normalisation component
EDS-NLP provides its own normalisation component, which modifies the NORM
attribute in place. It handles:
- removal of accentuated characters;
- normalisation of quotes and apostrophes;
- lowercasing, which enabled by default in spaCy – EDS-NLP lets you disable it;
- removal of pollution.
Pollution in clinical texts
EDS-NLP is meant to be deployed on clinical reports extracted from hospitals information systems. As such, it is often riddled with extraction issues or administrative artifacts that "pollute" the report.
As a core principle, EDS-NLP never modifies the input text, and nlp(text).text == text
is always true. However, we can tag some tokens as pollution elements, and avoid using them for matching the terminology.
You can activate it like any other component.
import edsnlp, edsnlp.pipes as eds
text = (
"Motif de prise en charge : probable pneumopathie a ===== COVID19, " # (1)
"sans difficultés respiratoires\n"
"Le père du patient est asthmatique."
)
terms = dict(
covid=["coronavirus", "covid19", "pneumopathie à covid19"], # (2)
respiratoire=["asthmatique", "respiratoire", "respiratoires"],
)
nlp = edsnlp.blank("eds")
# Add the normalisation component
nlp.add_pipe(eds.normalizer()) # (3)
nlp.add_pipe(
eds.matcher(
terms=terms,
attr="NORM", # (4)
ignore_excluded=True, # (5)
),
)
doc = nlp(text)
doc.ents
# Out: (pneumopathie a ===== COVID19, respiratoires, asthmatique)
- We've modified the example to include a simple pollution.
- We've added
pneumopathie à covid19
to the list of synonyms detected by the pipeline. Note that in the synonym we provide, we kept the accentuatedà
, whereas the example displays an unaccentuateda
. - The component can be configured. See the specific documentation for detail.
- The normalisation lives in the
NORM
attribute - We can tell the matcher to ignore excluded tokens (tokens tagged as pollution by the normalisation component). This is not an obligation.
Using the normalisation component, you can match on a normalised version of the text, as well as skip pollution tokens during the matching process.
Using term matching with the normalisation
If you use the term matcher with the normalisation, bear in mind that the examples go through the pipeline. That's how the matcher was able to recover pneumopathie a ===== COVID19
despite the fact that we used an accentuated à
in the terminology.
The term matcher matches the input text to the provided terminology, using the selected attribute in both cases. The NORM
attribute that corresponds to à
and a
is the same: a
.
Preliminary conclusion
We have matched all mentions! However, we had to spell out the singular and plural form of respiratoire
... And what if we wanted to detect covid 19
, or covid-19
? Of course, we could write out every imaginable possibility, but this will quickly become tedious.
Using regular expressions
Let us redefine the pipeline once again, this time using regular expressions. Using regular expressions can help define richer patterns using more compact queries.
import edsnlp, edsnlp.pipes as eds
text = (
"Motif de prise en charge : probable pneumopathie a COVID19, "
"sans difficultés respiratoires\n"
"Le père du patient est asthmatique."
)
regex = dict(
covid=r"(coronavirus|covid[-\s]?19)",
respiratoire=r"respiratoires?",
)
terms = dict(respiratoire="asthmatique")
nlp = edsnlp.blank("eds")
nlp.add_pipe(
eds.matcher(
regex=regex, # (1)
terms=terms, # (2)
attr="LOWER", # (3)
),
)
doc = nlp(text)
doc.ents
# Out: (COVID19, respiratoires, asthmatique)
- We can now match using regular expressions.
- We can mix and match patterns! Here we keep looking for patients using spaCy's term matching.
- RegExp matching is not limited to the verbatim text! You can choose to use one of spaCy's native attribute, ignore excluded tokens, etc.
To visualize extracted entities, check out the Visualization tutorial.