
Tokenizers

In addition to the standard spaCy FrenchLanguage (fr), EDS-NLP offers a new language better suited to French clinical documents: EDSLanguage (eds). Creating a document with EDSLanguage should also be around 5-6 times faster than with the fr language. The main differences between the two lie in the tokenization process.

The table below compares the tokens produced by each language on a few examples:

| Example | FrenchLanguage | EDSLanguage |
| --- | --- | --- |
| ACR5 | [ACR5] | [ACR, 5] |
| 26.5/ | [26.5/] | [26.5, /] |
| \n \n CONCLUSION | [\n \n, CONCLUSION] | [\n, \n, CONCLUSION] |
| l'artère | [l', artère] | [l', artère] (same) |
| Dr. Pichon | [Dr, ., Pichon] | [Dr., Pichon] |
| B.H.HP.A.7.A | [B.H.HP.A.7.A] | [B., H., HP., A, 7, A, 0] |

To instantiate one of the two languages, call the edsnlp.blank method with the corresponding language code.

EDSLanguage:

import edsnlp

nlp = edsnlp.blank("eds")

FrenchLanguage:

import edsnlp

nlp = edsnlp.blank("fr")