Qualifier Overview
In EDS-NLP, we call qualifiers the suite of components designed to qualify a pre-extracted entity for a linguistic modality.
Available components
Pipeline | Description |
---|---|
eds.negation | Rule-based negation detection |
eds.family | Rule-based family context detection |
eds.hypothesis | Rule-based speculation detection |
eds.reported_speech | Rule-based reported speech detection |
eds.history | Rule-based medical history detection |
Rationale
In a typical medical NLP pipeline, a group of clinicians would define a list of synonyms for a given concept of interest (say, for example, diabetes), and look for that terminology in a corpus of documents.
Now, consider the following examples:
Le patient n'est pas diabétique.
Le patient est peut-être diabétique.
Le père du patient est diabétique.
The patient is not diabetic.
The patient could be diabetic.
The patient's father is diabetic.
There is an obvious problem: none of these examples should lead us to include this particular patient in the cohort.
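The problem can be reproduced with a plain terminology lookup. The snippet below is a minimal sketch, not EDS-NLP code: the synonym list and the `naive_matches` helper are hypothetical and only serve to show that a context-blind search flags all three sentences.

```python
import re

# Hypothetical synonym list for the concept "diabetes" (illustration only).
SYNONYMS = ["diabétique", "diabète"]
PATTERN = re.compile("|".join(map(re.escape, SYNONYMS)), flags=re.IGNORECASE)

SENTENCES = [
    "Le patient n'est pas diabétique.",      # negation
    "Le patient est peut-être diabétique.",  # speculation
    "Le père du patient est diabétique.",    # family context
]


def naive_matches(sentences):
    """Return the sentences in which any synonym appears, ignoring context."""
    return [s for s in sentences if PATTERN.search(s)]


matched = naive_matches(SENTENCES)
print(f"{len(matched)} of {len(SENTENCES)} sentences matched")
```

Every sentence matches, although none of them states that the patient is diabetic.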
Warning
We show an English example just to explain the issue. EDS-NLP remains a French-language medical NLP library.
To address this issue, EDS-NLP provides rule-based pipes that qualify entities, helping the user make an informed decision about which patients should be included in a real-world data cohort.
Where do we get our spans?
A component gets entities from a document by looking up doc.ents or doc.spans[group]. This behavior is set by the span_getter argument in components that support it.
Valid values for the span_getter argument of a component are:
- a (doc) -> spans callable
- a span group name
- a list of span group names
- a dict mapping each group name to True or to a list of labels
The group name "ents" is a special case, and will get the matches from doc.ents.
Examples
- span_getter=["ents", "ckd"] will get the matches from both doc.ents and doc.spans["ckd"]. It is equivalent to {"ents": True, "ckd": True}.
- span_getter={"ents": ["foo", "bar"]} will get the matches with label "foo" and "bar" from doc.ents.
- span_getter="ents" will get all matches from doc.ents.
- span_getter="ckd" will get all matches from doc.spans["ckd"].
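To make these rules concrete, here is a minimal pure-Python sketch of how the accepted span_getter forms could be resolved. It is an assumption-laden illustration, not EDS-NLP's implementation: `FakeSpan`, `FakeDoc` and `get_spans` are hypothetical stand-ins for spaCy objects and for the library's internal logic.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a spaCy Span."""
    text: str
    label: str


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a spaCy Doc."""
    ents: list = field(default_factory=list)
    spans: dict = field(default_factory=dict)


def get_spans(doc, span_getter):
    """Resolve a span_getter value (callable, str, list or dict) to spans."""
    if callable(span_getter):
        return list(span_getter(doc))
    if isinstance(span_getter, str):
        # A single group name keeps every span of that group.
        span_getter = {span_getter: True}
    elif isinstance(span_getter, (list, tuple)):
        # A list of group names keeps every span of each group.
        span_getter = {name: True for name in span_getter}
    result = []
    for group, labels in span_getter.items():
        # "ents" is the special case that reads doc.ents.
        candidates = doc.ents if group == "ents" else doc.spans.get(group, [])
        if labels is True:
            result.extend(candidates)
        else:
            result.extend(s for s in candidates if s.label in labels)
    return result


doc = FakeDoc(
    ents=[FakeSpan("diabète", "foo"), FakeSpan("asthme", "baz")],
    spans={"ckd": [FakeSpan("insuffisance rénale chronique", "ckd")]},
)
print([s.text for s in get_spans(doc, {"ents": ["foo", "bar"]})])
```

With this toy document, the list form ["ents", "ckd"] and the dict form {"ents": True, "ckd": True} return the same three spans.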
Under the hood
Our qualifier pipes all follow the same basic pattern:
- The pipeline extracts cues. We define three (possibly overlapping) kinds:
  - preceding, i.e. cues that precede modulated entities;
  - following, i.e. cues that follow modulated entities;
  - in some cases, verbs, i.e. verbs that convey a modulation (treated as preceding cues).
- The pipeline splits the text into sentences and propositions, using annotations from a sentencizer pipeline and termination patterns, which define syntagma/proposition terminations.
- For each pre-extracted entity, the pipeline checks whether there is a preceding cue between the start of the syntagma and the start of the entity, or a following cue between the end of the entity and the end of the proposition.
Albeit simple, this algorithm can achieve very good performance depending on the modality. For instance, our eds.negation pipeline reaches 88% F1-score on our dataset.
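The pattern described above can be sketched in a few lines of plain Python. This is a simplified illustration working on character offsets, not the actual EDS-NLP implementation: the cue and termination lists are tiny hypothetical samples, sentence boundaries are approximated by punctuation, and `proposition_bounds` and `is_negated` are names invented for this example.

```python
import re

# Hypothetical, deliberately tiny pattern lists (illustration only).
PRECEDING = re.compile(r"\b(?:pas|sans|aucun)\b", re.IGNORECASE)
FOLLOWING = re.compile(r"\b(?:exclu|absent)e?s?\b", re.IGNORECASE)
TERMINATION = re.compile(r"\b(?:mais|cependant)\b|[,;]", re.IGNORECASE)
SENTENCE_END = re.compile(r"[.!?]")  # crude stand-in for a sentencizer


def proposition_bounds(text, start, end):
    """Return the offsets of the proposition that contains text[start:end]."""
    marks = list(SENTENCE_END.finditer(text)) + list(TERMINATION.finditer(text))
    left = max((m.end() for m in marks if m.end() <= start), default=0)
    right = min((m.start() for m in marks if m.start() >= end), default=len(text))
    return left, right


def is_negated(text, entity):
    """Check for a preceding cue before the entity or a following cue after it."""
    start = text.index(entity)
    end = start + len(entity)
    left, right = proposition_bounds(text, start, end)
    has_preceding = PRECEDING.search(text[left:start]) is not None
    has_following = FOLLOWING.search(text[end:right]) is not None
    return has_preceding or has_following


text = "Pas de fièvre, mais le patient est diabétique."
print(is_negated(text, "fièvre"), is_negated(text, "diabétique"))
```

In the last example, the termination patterns stop the cue "Pas" from reaching "diabétique", which sits in a different proposition.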
Dealing with pseudo-cues
The pipeline can also detect pseudo-cues, i.e. phrases that contain cues but that are not cues themselves. For instance, sans doute/without doubt contains sans/without, but does not convey negation.
Detecting pseudo-cues lets the pipeline filter out any cue that overlaps with a pseudo-cue.
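As a rough sketch of this filtering step, the snippet below drops every cue whose character span overlaps a pseudo-cue. The two patterns and the `filter_cues` helper are hypothetical examples, not the patterns or code shipped with EDS-NLP.

```python
import re

# Hypothetical one-entry pattern lists (illustration only).
CUES = re.compile(r"\bsans\b", re.IGNORECASE)
PSEUDO_CUES = re.compile(r"\bsans doute\b", re.IGNORECASE)


def overlaps(a, b):
    """Return True if the half-open intervals a and b share a character."""
    return a[0] < b[1] and b[0] < a[1]


def filter_cues(text):
    """Return the (start, end) spans of cues that overlap no pseudo-cue."""
    cues = [m.span() for m in CUES.finditer(text)]
    pseudo = [m.span() for m in PSEUDO_CUES.finditer(text)]
    return [c for c in cues if not any(overlaps(c, p) for p in pseudo)]


print(filter_cues("Patient sans doute diabétique."))  # cue discarded
print(filter_cues("Patient sans diabète."))           # cue kept
```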
Sentence boundaries are required
The rule-based algorithm detects cues, and propagates their modulation to the rest of the syntagma. For that reason, a qualifier pipeline needs a sentencizer component to be defined, and will fail otherwise.
You may use EDS-NLP's eds.sentences component:
import edsnlp, edsnlp.pipes as eds
...
nlp.add_pipe(eds.sentences())
Persisting the results
Our qualifier pipelines write their results to a custom spaCy extension, defined on both Span and Token objects. We follow the convention of naming said attribute after the pipeline itself, e.g. Span._.negation for the eds.negation pipeline.
We also provide a string representation of the result, computed on the fly by declaring a getter that reads the boolean result of the pipeline. Following spaCy convention, we give this attribute the same name, followed by a _.
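The boolean/string pairing can be mimicked in plain Python with a property. This is only an analogy for the getter mechanism: `QualifiedSpan` is a hypothetical class rather than a spaCy object, and the "NEG"/"AFF" labels are assumed here for illustration.

```python
class QualifiedSpan:
    """Hypothetical stand-in for a span carrying a qualifier result."""

    def __init__(self, text, negation=False):
        self.text = text
        self.negation = negation  # boolean result written by the pipeline

    @property
    def negation_(self):
        # Computed on the fly from the boolean, so the two never drift apart.
        # The label strings are assumptions made for this sketch.
        return "NEG" if self.negation else "AFF"


span = QualifiedSpan("diabétique", negation=True)
print(span.negation, span.negation_)
```

Because the string is derived on each access, updating the boolean is enough to keep both attributes consistent.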