Tokenizers
In addition to spaCy's standard FrenchLanguage (fr), EDS-NLP provides EDSLanguage (eds), a language better suited to French clinical documents. Document creation with EDSLanguage should also be around 5-6 times faster than with fr. The two languages differ mainly in how they tokenize text.
The table below compares the tokens each language produces for a few examples:
| Example | FrenchLanguage | EDSLanguage |
|---|---|---|
| `ACR5` | [`ACR5`] | [`ACR`, `5`] |
| `26.5/` | [`26.5/`] | [`26.5`, `/`] |
| `\n \n CONCLUSION` | [`\n \n`, `CONCLUSION`] | [`\n`, `\n`, `CONCLUSION`] |
| `l'artère` | [`l'`, `artère`] | [`l'`, `artère`] (same) |
| `Dr. Pichon` | [`Dr`, `.`, `Pichon`] | [`Dr.`, `Pichon`] |
| `B.H.HP.A.7.A` | [`B.H.HP.A.7.A`] | [`B.`, `H.`, `HP.`, `A`, `7`, `A`, `0`] |
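To make the splitting rules in the EDSLanguage column concrete, here is a toy regular-expression tokenizer that reproduces that column for the first five rows of the table. It is an illustration only: `TOKEN_RE` and `toy_tokenize` are hypothetical names, and this is not how EDS-NLP implements its tokenizer.

```python
import re

# Hypothetical pattern, NOT EDS-NLP's implementation. Alternatives, in order:
#   1. a number, optionally with a decimal part (26.5)
#   2. a run of letters, optionally followed by "." or "'" (Dr., l')
#   3. a single newline
#   4. any other single non-whitespace character (/)
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)?|[^\W\d_]+[.']?|\n|\S")


def toy_tokenize(text):
    """Split text with the toy rules above; spaces between tokens are dropped."""
    return TOKEN_RE.findall(text)


for example in ["ACR5", "26.5/", "\n \n CONCLUSION", "l'artère", "Dr. Pichon"]:
    print(repr(example), toy_tokenize(example))
```

Unlike a real tokenizer, this sketch attaches a trailing period to any word (so a sentence-final word keeps its period) and discards whitespace instead of tracking it, so it should only be read as a summary of the table.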
To instantiate either language, call edsnlp.blank with the corresponding language code:
import edsnlp
nlp_eds = edsnlp.blank("eds")  # EDSLanguage
nlp_fr = edsnlp.blank("fr")  # FrenchLanguage