Getting started

EDS-NLP is a collaborative NLP framework for extracting information from French clinical notes. At its core, it is a collection of components, or pipes, which are either rule-based functions or deep-learning modules. These components are organized into a novel, efficient and modular pipeline system, built for hybrid and multitask models. We use spaCy to represent documents and their annotations, and PyTorch as the deep-learning backend for trainable components.

EDS-NLP is versatile and can be used on any textual document. The rule-based components are fully compatible with spaCy's pipelines, and vice versa. This library is a product of collaborative effort, and we encourage further contributions to enhance its capabilities.

Check out our interactive demo!

Quick start

Installation

You can install EDS-NLP via pip. We recommend pinning the library version in your projects, or using a strict package manager like Poetry.

pip install edsnlp==0.21.0

Or, if you want to use the trainable components (which rely on PyTorch):

pip install "edsnlp[ml]==0.21.0"
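Since we recommend pinning the version, it can be handy to check at runtime which version is actually installed. The sketch below uses only the Python standard library; `installed_version` is a hypothetical helper name introduced for illustration, and it returns `None` when the package is not installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if EDS-NLP is installed, otherwise None
print(installed_version("edsnlp"))
```

Comparing the returned string with the version pinned in your requirements is one simple way to catch an environment that has drifted.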

A first pipeline

Once you've installed the library, let's begin with a very simple example that extracts mentions of COVID19 in a text, and detects whether they are negated.

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")  # (1)

terms = dict(
    covid=["covid", "coronavirus"],  # (2)
)

# Sentencizer component, needed for negation detection
nlp.add_pipe(eds.sentences())  # (3)
# Matcher component
nlp.add_pipe(eds.matcher(terms=terms))  # (4)
# Negation detection
nlp.add_pipe(eds.negation())

# Process your text in one call!
doc = nlp("Le patient n'est pas atteint de covid")

doc.ents  # (5)
# Out: (covid,)

doc.ents[0]._.negation  # (6)
# Out: True

(1) 'eds' is the name of the language, which defines the tokenizer.
(2) This example terminology provides a very simple, and by no means exhaustive, list of synonyms for COVID19.
(3) Similarly to spaCy, pipes are added via the nlp.add_pipe method.
(4) See the matching tutorial for more details.
(5) spaCy stores extracted entities in the Doc.ents attribute.
(6) The eds.negation component adds a negation custom attribute.
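Once a document has been processed, its entities can be turned into plain Python records, for instance to build a table. The following is a minimal sketch: `summarize_entities` is a hypothetical helper (not part of EDS-NLP) that relies only on the `ents` attribute and the `negation` extension used in the example, plus spaCy's standard `text` and `label_` span attributes. It assumes that the matcher labels each entity with its key in `terms`. To keep the block runnable without installing anything, it is exercised on a small stand-in object rather than on a real document.

```python
from types import SimpleNamespace


def summarize_entities(doc):
    """Return one dict per entity found in `doc.ents`."""
    return [
        {
            "text": ent.text,
            "label": ent.label_,
            "negated": ent._.negation,
        }
        for ent in doc.ents
    ]


# Stand-in mimicking the attributes of a processed document (illustration only)
fake_ent = SimpleNamespace(
    text="covid",
    label_="covid",  # assumed label: the key used in `terms`
    _=SimpleNamespace(negation=True),
)
fake_doc = SimpleNamespace(ents=(fake_ent,))

print(summarize_entities(fake_doc))
# Out: [{'text': 'covid', 'label': 'covid', 'negated': True}]
```

In practice you would pass the `doc` returned by `nlp(...)` instead of the stand-in object.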

Available pipeline components

Components are grouped into six families: Core, Qualifiers, Miscellaneous, NER, Trainable and LLM-based.

See the Core components overview for more information.

Component	Description
`eds.normalizer`	Non-destructive input text normalisation
`eds.sentences`	Better sentence boundary detection
`eds.matcher`	A simple yet powerful entity extractor
`eds.terminology`	A simple yet powerful terminology matcher
`eds.contextual_matcher`	A conditional entity extractor
`eds.endlines`	An unsupervised model to classify each end line

See the Qualifiers overview for more information.

Component	Description
`eds.negation`	Rule-based negation detection
`eds.family`	Rule-based family context detection
`eds.hypothesis`	Rule-based speculation detection
`eds.reported_speech`	Rule-based reported speech detection
`eds.history`	Rule-based medical history detection
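To give an intuition of what a rule-based qualifier does, here is a toy sketch in plain Python. It is not EDS-NLP's implementation: the cue list and the `is_negated` helper are hypothetical, and the actual `eds.negation` component operates on spaCy documents with its own rules, which this sketch does not attempt to reproduce.

```python
import re

# Hypothetical, deliberately tiny list of French negation cues
NEGATION_CUES = ("pas", "aucun", "aucune", "sans")


def is_negated(sentence: str, term: str) -> bool:
    """Return True if a negation cue appears before `term` in `sentence`."""
    tokens = re.findall(r"\w+", sentence.lower())
    if term.lower() not in tokens:
        return False
    position = tokens.index(term.lower())
    # Only look at the words that precede the term
    return any(token in NEGATION_CUES for token in tokens[:position])


print(is_negated("Le patient n'est pas atteint de covid", "covid"))
# Out: True
print(is_negated("Le patient est atteint de covid", "covid"))
# Out: False
```

A real qualifier also has to handle sentence boundaries, cues that follow the entity and pseudo-negations, which is why the sentencizer is added before the negation component in the pipeline.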

See the Miscellaneous components overview for more information.

Component	Description
`eds.dates`	Date extraction and normalisation
`eds.consultation_dates`	Identify consultation dates
`eds.quantities`	Quantity extraction and normalisation
`eds.sections`	Section detection
`eds.reason`	Rule-based hospitalisation reason detection
`eds.tables`	Tables detection
`eds.split`	Doc splitting
`eds.explode`	Explode entities across multiple copies of a document

See the NER overview for more information.

Component	Description
`eds.covid`	A COVID mentions detector
`eds.charlson`	A Charlson score extractor
`eds.sofa`	A SOFA score extractor
`eds.elston_ellis`	An Elston & Ellis code extractor
`eds.emergency_priority`	A priority score extractor
`eds.emergency_ccmu`	A CCMU score extractor
`eds.emergency_gemsa`	A GEMSA score extractor
`eds.tnm`	A TNM score extractor
`eds.adicap`	An ADICAP code extractor
`eds.drugs`	A drug mentions extractor
`eds.cim10`	A CIM10 terminology matcher
`eds.umls`	A UMLS terminology matcher
`eds.ckd`	CKD extractor
`eds.copd`	COPD extractor
`eds.cerebrovascular_accident`	Cerebrovascular accident extractor
`eds.congestive_heart_failure`	Congestive heart failure extractor
`eds.connective_tissue_disease`	Connective tissue disease extractor
`eds.dementia`	Dementia extractor
`eds.diabetes`	Diabetes extractor
`eds.hemiplegia`	Hemiplegia extractor
`eds.leukemia`	Leukemia extractor
`eds.liver_disease`	Liver disease extractor
`eds.lymphoma`	Lymphoma extractor
`eds.myocardial_infarction`	Myocardial infarction extractor
`eds.peptic_ulcer_disease`	Peptic ulcer disease extractor
`eds.peripheral_vascular_disease`	Peripheral vascular disease extractor
`eds.solid_tumor`	Solid tumor extractor
`eds.alcohol`	Alcohol consumption extractor
`eds.tobacco`	Tobacco consumption extractor

See the Trainable components overview for more information.

Component	Description
`eds.transformer`	Embed text with a transformer model
`eds.text_cnn`	Contextualize embeddings with a CNN
`eds.span_pooler`	A span embedding component that aggregates word embeddings
`eds.ner_crf`	A trainable component to extract entities
`eds.extractive_qa`	A trainable component for extractive question answering
`eds.span_classifier`	A trainable component for multi-class multi-label span classification
`eds.span_linker`	A trainable entity linker (i.e. to a list of concepts)
`eds.biaffine_dep_parser`	A trainable biaffine dependency parser

See the LLM-based components overview for more information.

Component	Description
`eds.llm_markup_extractor`	Extract structured information using LLMs through markup.
`eds.llm_span_qualifier`	Predict attributes of spans using LLMs.

Tutorials

To learn more about EDS-NLP, we have prepared a series of tutorials that should cover the main features of the library.

spaCy representations

Learn the basics of how documents are represented with spaCy.

Matching a terminology

Extract phrases that belong to a given terminology.

Qualifying entities

Ensure extracted concepts are not invalidated by linguistic modulation.

Detecting dates

Detect and parse dates in a text.

Processing multiple texts

Improve the inference speed of your pipeline.

Running on HPC (e.g. Slurm)

Use an existing model at scale with a High-Performance Computing (HPC) job scheduler like Slurm.

Detecting hospitalisation reason

Identify spans mentioning the reason for hospitalisation or tag entities as the reason.

Detecting false endlines

Classify each line end and add the excluded attribute to these tokens.

Aggregating results

Aggregate the results of your pipeline at the document level.

FastAPI

Deploy your pipeline as an API.

Visualization

Quickly visualize the results of your pipeline as annotations or tables.

Deep-learning tutorials: we also provide tutorials on how to train deep-learning models with EDS-NLP. These tutorials cover the training API, hyperparameter tuning, and more.

Writing a training script

Learn how EDS-NLP handles training deep neural networks, and how to write a training script on your own.

Training a NER model

Learn how to quickly train a NER model with edsnlp.train.

Training a Span Classifier model

Learn how to quickly train a biopsy date classifier model with edsnlp.train.

Hyperparameter Tuning

Learn how to tune hyperparameters of a model with edsnlp.tune.

Disclaimer

The performance of an extraction pipeline may depend on the population and documents under consideration.

Contributing to EDS-NLP

We welcome contributions! Fork the project and propose a pull request. Take a look at the dedicated page for details.

Citation

If you use EDS-NLP, please cite us as below.

@misc{edsnlp,
  author = {Wajsburt, Perceval and Petit-Jean, Thomas and Dura, Basile and Cohen, Ariel and Jean, Charline and Bey, Romain},
  doi    = {10.5281/zenodo.6424993},
  title  = {EDS-NLP: efficient information extraction from French clinical notes},
  url    = {https://aphp.github.io/edsnlp}
}