Matchers

We implemented two pattern matchers suited to clinical documents:

  • the EDSPhraseMatcher
  • the RegexMatcher

For most use cases, however, you should use the eds.matcher pipeline instead: it wraps these classes to annotate documents.

EDSPhraseMatcher

The EDSPhraseMatcher lets you efficiently match large terminology lists by comparing tokens against a given attribute. It differs from the spacy.PhraseMatcher in that it can skip pollution tokens, such as the run of = characters in the example below. To keep it efficient, we reimplemented the matching algorithm in Cython, like the original spacy.PhraseMatcher.

You can use it as described in the code below.

import spacy
from edsnlp.matchers.phrase import EDSPhraseMatcher

nlp = spacy.blank("eds")
nlp.add_pipe("eds.normalizer")
doc = nlp("On ne relève pas de signe du Corona =============== virus.")

matcher = EDSPhraseMatcher(nlp.vocab, attr="NORM")
matcher.build_patterns(
    nlp,
    {
        "covid": ["corona virus", "coronavirus", "covid"],
        "diabete": ["diabete", "diabetique"],
    },
)

list(matcher(doc, as_spans=True))[0].text
# Out: Corona =============== virus
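
To make the idea of skipping pollution tokens concrete, here is a minimal pure-Python sketch. It is not the library's Cython implementation: the function name `match_skipping_excluded`, the boolean `excluded` flags (standing in for whatever the normalizer marks as pollution) and the lowercasing (standing in for the NORM attribute) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def match_skipping_excluded(
    tokens: List[str],
    excluded: List[bool],
    patterns: Dict[str, List[List[str]]],
) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) token spans, ignoring excluded tokens."""
    # Keep only the positions of tokens that are not flagged as pollution.
    kept = [i for i, flag in enumerate(excluded) if not flag]
    # Lowercasing is a crude, hypothetical stand-in for the NORM attribute.
    kept_tokens = [tokens[i].lower() for i in kept]

    matches = []
    for label, phrases in patterns.items():
        for phrase in phrases:
            n = len(phrase)
            for start in range(len(kept_tokens) - n + 1):
                if kept_tokens[start : start + n] == phrase:
                    # Map back to original indices so that the returned span
                    # also covers the tokens that were skipped.
                    matches.append((label, kept[start], kept[start + n - 1] + 1))
    return matches


tokens = ["signe", "du", "Corona", "===============", "virus", "."]
excluded = [False, False, False, True, False, False]
patterns = {"covid": [["corona", "virus"], ["coronavirus"], ["covid"]]}

spans = match_skipping_excluded(tokens, excluded, patterns)
label, start, end = spans[0]
print(label, "|", " ".join(tokens[start:end]))
```

The comparison runs on the filtered token sequence, but the span is reported in original token indices, which is why the matched text still contains the skipped tokens, as in the output above.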

RegexMatcher

The RegexMatcher performs full-text regex matching. It is especially useful for handling spelling variations, as in mammo-?graphies?. Like the EDSPhraseMatcher, this class can skip pollution tokens. Note that it is significantly slower than the EDSPhraseMatcher: if you can, enumerate the lexical variations of the target phrases and feed them to the EDSPhraseMatcher instead.

You can use it as described in the code below.

import spacy
from edsnlp.matchers.regex import RegexMatcher

nlp = spacy.blank("eds")
nlp.add_pipe("eds.normalizer")
doc = nlp("On ne relève pas de signe du Corona =============== virus.")

matcher = RegexMatcher(attr="NORM", ignore_excluded=True)
matcher.build_patterns(
    {
        "covid": ["corona[ ]*virus", "covid"],
        "diabete": ["diabete", "diabetique"],
    },
)

list(matcher(doc, as_spans=True))[0].text
# Out: Corona =============== virus
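
The advice above about enumerating lexical variations, rather than relying on a regex, can be sketched in plain Python. The helper `expand_variants` is hypothetical and not part of edsnlp; it simply builds every combination of the alternatives given for each slot of a phrase, here mirroring the regex mammo-?graphies?.

```python
from itertools import product
from typing import List, Sequence


def expand_variants(parts: Sequence[Sequence[str]]) -> List[str]:
    """Concatenate one alternative from each slot, for every combination."""
    return ["".join(choice) for choice in product(*parts)]


# Slots mirroring the regex `mammo-?graphies?`:
# a fixed stem, an optional hyphen, a fixed middle, and an optional plural "s".
variants = expand_variants([["mammo"], ["", "-"], ["graphie"], ["", "s"]])
print(variants)
# ['mammographie', 'mammographies', 'mammo-graphie', 'mammo-graphies']
```

A list like this can then be used as the terms for one label in the dictionary passed to build_patterns, in the same way as the term lists in the EDSPhraseMatcher example. Whether the hyphenated forms match as expected depends on how the tokenizer splits them, so it is worth checking on a sample document.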