Rule-based model
Installation
First, clone the repository
git clone https://github.com/aphp/eds-pseudo.git
cd eds-pseudo
And install the dependencies:
poetry install
If you face issues with the installation, try to lower the maximum python version to
<= 3.10 (in pyproject.toml
).
Rule-based model definition
A simple option consists in using the rule-based components of the model.
import edsnlp
nlp = edsnlp.blank("eds")
# Some text cleaning
nlp.add_pipe("eds.normalizer")
# Various simple rules
nlp.add_pipe(
"eds_pseudo.simple_rules",
config={"pattern_keys": ["TEL", "MAIL", "SECU", "PERSON"]},
)
# Address detection
nlp.add_pipe("eds_pseudo.addresses")
# Date detection
nlp.add_pipe("eds_pseudo.dates")
# Contextual rules (requires a dict of info about the patient)
nlp.add_pipe("eds_pseudo.context")
# Date value and format detector
# This is useful to reinsert a new shifted date with the same format in the text
nlp.add_pipe(
"eds_pseudo.dates_normalizer",
config={"format": "java"}
# java format -> will output a format like "yyyy/MM/dd"
# strftime format -> will output a format like "%Y/%m/%d"
)
# Apply it to a text
doc = nlp(
"En 2015, M. Charles-François-Bienvenu "
"Myriel était évêque de Digne. C’était un vieillard "
"d’environ soixante-quinze ans ; il occupait le "
"siège de Digne depuis le 2 janvier 2006."
)
for e in doc.ents:
print(f"{e.text: <30}{e.label_: <10}{str(e._.date): <15}{e._.date_format}")
# Text Label Date Format
# ---------------------------- -------- ------------- ---------
# 2015 DATE 2015-??-?? yyyy
# Charles-François-Bienvenu NOM None None
# Myriel PRENOM None None
# 2 janvier 2006 DATE 2006-01-02 d MMMM yyyy
- The original date is 1815, but the rule-based date detection only matches dates after 1900 to avoid false positives.
You can observe that the model is not flawless : "Digne" is not detected as a city. This can be alleviated by adding contextual information about the patient (see below), or by training a model.
Apply on multiple documents
We recommend you check out the edsnlp's tutorial on how to process multiple documents.
Assuming we have a dataframe df
with columns note_id
, text
and an optional column context
, containing information about the patient, e.g.:
note_id | text | context |
---|---|---|
doc-1 | En 2015, M. Charles-François-Bienvenu ... | {"VILLE": "DIGNE", "zip": "04070"} |
doc-2 | Mme. Ange-Gardien Josephine est admise pour irritation des tendons fléchisseurs | |
doc-3 | josephine.ange-gardien @ test.com |
We can apply the model to all the documents with the following code:
import edsnlp
# Function to convert a row of the dataframe to a Doc object
def converter(row):
tokenizer = edsnlp.data.converters.get_current_tokenizer()
doc = tokenizer(row["text"])
doc._.note_id = row["note_id"]
ctx = row["context"]
if isinstance(ctx, dict):
doc._.context = {k: v if isinstance(v, list) else [v] for k, v in ctx.items()}
return doc
data = edsnlp.data.from_pandas(df, converter=converter)
data = data.map_pipeline(nlp)
data.to_pandas(converter="ents", span_attributes=["date", "date_format"])
and we get the following dataframe:
note_id | start | end | label | lexical_variant |
---|---|---|---|---|
doc-1 | 3 | 7 | DATE | 2015 |
doc-1 | 12 | 37 | NOM | Charles-François-Bienvenu |
doc-1 | 38 | 44 | PRENOM | Myriel |
doc-1 | 61 | 66 | VILLE | Digne |
doc-1 | 145 | 150 | VILLE | Digne |
doc-1 | 158 | 162 | DATE | 2006 |
doc-2 | 5 | 17 | NOM | Ange-Gardien |
doc-2 | 18 | 27 | PRENOM | Joséphine |
doc-3 | 0 | 33 | josephine.ange-gardien @ test.com |