Matchers

We implemented three pattern matchers that are fit to clinical documents:

the EDSPhraseMatcher
the RegexMatcher
the SimstringMatcher

However, note that for most use-cases, you should instead use the eds.matcher pipe that wraps these classes to annotate documents.

EDSPhraseMatcher

The EDSPhraseMatcher lets you efficiently match large terminology lists, by comparing tokenx against a given attribute. This matcher differs from the spacy.PhraseMatcher in that it allows to skip pollution tokens. To make it efficient, we have reimplemented the matching algorithm in Cython, like the original spacy.PhraseMatcher.

You can use it as described in the code below.

import edsnlp, edsnlp.pipes as eds
from edsnlp.matchers.phrase import EDSPhraseMatcher

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
doc = nlp("On ne relève pas de signe du Corona =============== virus.")

matcher = EDSPhraseMatcher(nlp.vocab, attr="NORM")
matcher.build_patterns(
    nlp,
    {
        "covid": ["corona virus", "coronavirus", "covid"],
        "diabete": ["diabete", "diabetique"],
    },
)

list(matcher(doc, as_spans=True))[0].text
# Out: Corona =============== virus

RegexMatcher

The RegexMatcher performs full-text regex matching. It is especially useful to handle spelling variations like mammo-?graphies?. Like the EDSPhraseMatcher, this class allows to skip pollution tokens. Note that this class is significantly slower than the EDSPhraseMatcher: if you can, try enumerating lexical variations of the target phrases and feed them to the PhraseMatcher instead.

You can use it as described in the code below.

import edsnlp, edsnlp.pipes as eds
from edsnlp.matchers.regex import RegexMatcher

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
doc = nlp("On ne relève pas de signe du Corona =============== virus.")

matcher = RegexMatcher(attr="NORM", ignore_excluded=True)
matcher.build_patterns(
    {
        "covid": ["corona[ ]*virus", "covid"],
        "diabete": ["diabete", "diabetique"],
    },
)

list(matcher(doc, as_spans=True))[0].text
# Out: Corona =============== virus

SimstringMatcher

The SimstringMatcher performs fuzzy term matching by comparing spans of text with a similarity metric. It is especially useful to handle spelling variations like paracetomol (instead of paracetamol).

The simstring algorithm compares two strings by enumerating their char trigrams and measuring the overlap between the two sets. In the previous example: - paracetomol becomes ##p #pa par ara rac ace cet eto tom omo mol ol# l## - paracetamol becomes ##p #pa par ara rac ace cet eta tam amo mol ol# l## and the Dice (or F1) similarity between the two sets is 0.75.

Like the EDSPhraseMatcher, this class allows to skip pollution tokens. Just like the RegexMatcher, this class is significantly slower than the EDSPhraseMatcher: if you can, try enumerating lexical variations of the target phrases and feed them to the PhraseMatcher instead.

You can use it as described in the code below.

import edsnlp, edsnlp.pipes as eds
from edsnlp.matchers.simstring import SimstringMatcher

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
doc = nlp(
    "On ne relève pas de signe du corona-virus. Historique d'un hepatocellulaire carcinome."
)

matcher = SimstringMatcher(
    nlp.vocab,
    attr="NORM",
    ignore_excluded=True,
    measure="dice",
    threshold=0.75,
    windows=5,
)
matcher.build_patterns(
    nlp,
    {
        "covid": ["coronavirus", "covid"],
        "carcinome": ["carcinome hepatocellulaire"],
    },
)

list(matcher(doc, as_spans=True))[0].text
# Out: corona-virus

list(matcher(doc, as_spans=True))[1].text
# Out: hepatocellulaire carcinome