Normalisation
The normalisation scheme used by EDS-NLP adheres to the non-destructive doctrine. In other words,
nlp(text).text == text
is always true.
To achieve this, the input text is never modified. Instead, our normalisation strategy focuses on two axes:
- Only the NORM attribute is modified by the normalizer pipeline;
- Pipelines (e.g. the pollution pipeline) can mark tokens as excluded by setting the extension Token._.excluded to True. This enables downstream matchers to skip excluded tokens.
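Schematically, the exclusion mechanism lets a downstream matcher operate on the subset of non-excluded tokens. Here is a minimal pure-Python sketch of the idea (the Tok class and its excluded field are illustrative stand-ins, not EDS-NLP's API):

```python
from dataclasses import dataclass


@dataclass
class Tok:
    text: str
    excluded: bool = False  # stand-in for spaCy's Token._.excluded extension


# A document where the normaliser has flagged a pollution token
tokens = [Tok("Pneumopathie"), Tok("NBNbWbW", excluded=True), Tok("coronavirus")]

# A downstream matcher simply skips excluded tokens
visible = [t.text for t in tokens if not t.excluded]
print(visible)  # ['Pneumopathie', 'coronavirus']
```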
The normaliser can act on the input text along four dimensions:
- Convert the text to lowercase.
- Remove accents. We use a deterministic approach that preserves the character length of the text, which helps with regex matching.
- Normalise apostrophes and quotation marks, which are often encoded using special characters.
- Remove pollutions.
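The length-preserving property of accent removal can be illustrated in pure Python with a one-to-one translation table (the character list below is a small illustrative subset, not the component's actual list):

```python
# One-to-one mapping: each accented character maps to exactly one ASCII
# character, so the string length, and thus regex match offsets, are preserved.
ACCENTS = str.maketrans("àâäéèêëîïôöûüç", "aaaeeeeiioouuc")

text = "Le patient est admis à l'hôpital"
norm = text.translate(ACCENTS)

assert len(norm) == len(text)  # offsets computed on norm remain valid on text
print(norm)  # Le patient est admis a l'hopital
```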
Note
We recommend you also add an end-of-line classifier to remove excess new line characters (introduced by the PDF layout).
We provide an endlines
pipeline, which requires training an unsupervised model.
Refer to the dedicated page for more information.
Usage
Normalisation is handled by a single pipeline, eds.normalizer. The following code snippet is complete and should run as is.
import spacy
from edsnlp.matchers.utils import get_text
nlp = spacy.blank("fr")
nlp.add_pipe("eds.normalizer")
# Notice the special characters used for the apostrophe and the quotes
text = "Le patient est admis à l'hôpital le 23 août 2021 pour une douleur ʺaffreuse” à l`estomac."
doc = nlp(text)
get_text(doc, attr="NORM")
# Out: le patient est admis a l'hopital le 23 aout 2021 pour une douleur "affreuse" a l'estomac
Utilities
To simplify the use of the normalisation output, we provide the get_text
utility function. It computes the textual representation for a Span
or Doc
object.
Moreover, every span exposes a normalized_variant
extension getter, which computes the normalised representation of an entity on the fly.
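The getter mechanism behind normalized_variant can be sketched with spaCy's extension API. The normalized_demo getter below is a simplified stand-in, not EDS-NLP's actual implementation (which reads each token's NORM attribute):

```python
import spacy
from spacy.tokens import Span

# Simplified on-the-fly normalisation: lowercase plus a few accent removals
ACCENTS = str.maketrans("àâéèêëîïôûüç", "aaeeeeiiouuc")


def get_normalized(span: Span) -> str:
    return span.text.lower().translate(ACCENTS)


# Register a getter: the value is computed lazily, on attribute access
Span.set_extension("normalized_demo", getter=get_normalized, force=True)

nlp = spacy.blank("fr")
doc = nlp("Douleur à l'Estomac")
doc[:]._.normalized_demo
# Out: douleur a l'estomac
```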
Configuration
Parameter | Explanation | Default |
---|---|---|
accents | Whether to strip accents | True |
lowercase | Whether to remove casing | True |
quotes | Whether to normalise quotes | True |
pollution | Whether to tag pollutions as excluded tokens | True |
Pipelines
Let's review each subcomponent.
Lowercase
The eds.lowercase
pipeline transforms every token to lowercase. It is not configurable.
Consider the following example:
import spacy
from edsnlp.matchers.utils import get_text
config = dict(
lowercase=True,
accents=False,
quotes=False,
pollution=False,
)
nlp = spacy.blank("fr")
nlp.add_pipe("eds.normalizer", config=config)
text = "Pneumopathie à NBNbWbWbNbWbNBNbNbWbW `coronavirus'"
doc = nlp(text)
get_text(doc, attr="NORM")
# Out: pneumopathie à nbnbwbwbnbwbnbnbnbwbw `coronavirus'
Accents
The eds.accents
pipeline removes accents. To avoid edge cases, the component uses a fixed list of accented characters and their unaccented counterparts, making it more predictable than a library such as unidecode.
Consider the following example:
import spacy
from edsnlp.matchers.utils import get_text
config = dict(
lowercase=False,
accents=True,
quotes=False,
pollution=False,
)
nlp = spacy.blank("fr")
nlp.add_pipe("eds.normalizer", config=config)
text = "Pneumopathie à NBNbWbWbNbWbNBNbNbWbW `coronavirus'"
doc = nlp(text)
get_text(doc, attr="NORM")
# Out: Pneumopathie a NBNbWbWbNbWbNBNbNbWbW `coronavirus'
Apostrophes and quotation marks
Apostrophes and quotation marks can be encoded using unpredictable special characters. The eds.quotes
component transforms every such special character to '
and "
, respectively.
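The substitution can be sketched as a translation table; the characters listed below are a small illustrative subset, not the component's full list:

```python
# A few of the special characters found in clinical notes,
# mapped to the standard apostrophe and quotation mark
QUOTES = str.maketrans({
    "`": "'",  # grave accent used as an apostrophe
    "´": "'",  # acute accent used as an apostrophe
    "’": "'",  # right single quotation mark
    "ʺ": '"',  # modifier letter double prime
    "“": '"',  # left double quotation mark
    "”": '"',  # right double quotation mark
})

print("une douleur ʺaffreuse” à l`estomac".translate(QUOTES))
# Out: une douleur "affreuse" à l'estomac
```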
Consider the following example:
import spacy
from edsnlp.matchers.utils import get_text
config = dict(
lowercase=False,
accents=False,
quotes=True,
pollution=False,
)
nlp = spacy.blank("fr")
nlp.add_pipe("eds.normalizer", config=config)
text = "Pneumopathie à NBNbWbWbNbWbNBNbNbWbW `coronavirus'"
doc = nlp(text)
get_text(doc, attr="NORM")
# Out: Pneumopathie à NBNbWbWbNbWbNBNbNbWbW 'coronavirus'
Pollution
The pollution pipeline uses a set of regular expressions to detect pollutions (irrelevant non-medical text that hinders text processing). Matching tokens are marked as excluded by setting Token._.excluded to True, which enables matchers, including the phrase matcher, to skip them.
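Schematically, pollution detection runs regular expressions over the raw text and records the character spans they cover; tokens falling within those spans are then flagged. The pattern below is a simplified illustration inspired by the biology-table pollution of the examples, not the component's actual pattern set:

```python
import re

# Simplified pollution pattern: long runs of N/B/W/b characters,
# typical of badly extracted biology result tables
POLLUTION = re.compile(r"[NBWb]{10,}")

text = "Pneumopathie à NBNbWbWbNbWbNBNbNbWbW `coronavirus'"
spans = [(m.start(), m.end()) for m in POLLUTION.finditer(text)]

# Tokens falling within these character spans would be marked as excluded
print(spans)  # [(15, 36)]
```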
Consider the following example:
import spacy
from edsnlp.matchers.utils import get_text
config = dict(
lowercase=False,
accents=True,
quotes=False,
pollution=True,
)
nlp = spacy.blank("fr")
nlp.add_pipe("eds.normalizer", config=config)
text = "Pneumopathie à NBNbWbWbNbWbNBNbNbWbW `coronavirus'"
doc = nlp(text)
get_text(doc, attr="NORM")
# Out: Pneumopathie a NBNbWbWbNbWbNBNbNbWbW `coronavirus'
get_text(doc, attr="TEXT", ignore_excluded=True)
# Out: Pneumopathie à `coronavirus'
The example above shows that the normalisation scheme works on two axes: non-destructive text modification and exclusion of tokens.
The two are independent: a matcher can use the NORM
attribute but keep excluded tokens, and conversely, match on TEXT
while ignoring excluded tokens.
Authors and citation
The eds.normalizer
pipeline was developed by AP-HP's Data Science team.