Detecting end-of-lines

A common problem in medical corpora is that the character \n does not necessarily correspond to a real end of line, unlike in other domains.

For example, it is common to find texts like:

Il doit prendre
le medicament indiqué 3 fois par jour. Revoir médecin
dans 1 mois.

Inserted new line characters

This issue is especially impactful for clinical notes extracted from PDF documents. In that case, the new line character may have been deliberately inserted by the doctor or, more likely, added to preserve the layout when the PDF was generated.
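To make the problem concrete, here is a minimal sketch in plain Python (it does not use EDS-NLP) of what happens when every \n in the example above is treated as a real line boundary. The variable names are illustrative only.

```python
# The example note from above, with the new line characters made explicit.
text = (
    "Il doit prendre\n"
    "le medicament indiqué 3 fois par jour. Revoir médecin\n"
    "dans 1 mois."
)

# Naive approach: treat every "\n" as a real end of line.
lines = text.split("\n")

for line in lines:
    print(repr(line))
```

The first and second fragments both stop in the middle of a sentence, so any line-based processing would see three incomplete units instead of two sentences.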

The aim of this tutorial is to train an unsupervised model to detect these false end-of-lines and to use it for inference. The implemented model is based on the work of Zweigenbaum et al. [1].

Training the model

Let's train the model using an example corpus of three documents:

import spacy
from edsnlp.pipelines.core.endlines import EndLinesModel

nlp = spacy.blank("fr")

text1 = """Le patient est arrivé hier soir.
Il est accompagné par son fils

ANTECEDENTS
Il a fait une TS en 2010;
Fumeur, il est arrêté il a 5 mois
Chirurgie de coeur en 2011
CONCLUSION
Il doit prendre
le medicament indiqué 3 fois par jour. Revoir médecin
dans 1 mois.
DIAGNOSTIC :

Antecedents Familiaux:
- 1. Père avec diabète
"""

text2 = """J'aime le \nfromage...\n"""
text3 = (
    "\n"
    "Intervention(s) - acte(s) réalisé(s) :\n"
    "Parathyroïdectomie élective le [DATE]"
)

texts = [
    text1,
    text2,
    text3,
]

corpus = nlp.pipe(texts)

# Fit the model
endlines = EndLinesModel(nlp=nlp)  # (1)
df = endlines.fit_and_predict(corpus)  # (2)

# Save model
PATH = "/path_to_model"
endlines.save(PATH)
  1. Initialize the EndLinesModel object, then fit it (and predict) on the training corpus.
  2. The corpus should be an iterable of spaCy documents.

Use a trained model for inference

import spacy

nlp = spacy.blank("fr")

PATH = "/path_to_model"
nlp.add_pipe("eds.endlines", config=dict(model_path=PATH))  # (1)

# text1, text2 and text3 are the example texts defined in the training section
docs = list(nlp.pipe([text1, text2, text3]))

doc = docs[1]
doc
# Out: J'aime le
# Out: fromage...

doc.spans["new_lines"][0].label_  # (2)
# Out: 'space'
  1. Specify the path to the trained model here.
  2. All new lines are stored under the doc.spans["new_lines"] key. Whether a given new line character is detected as a false end of line is stored in its label_ attribute.
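Once the labels are available, one possible use is to rebuild a text in which false end-of-lines are replaced by spaces. The sketch below is plain Python and is not part of EDS-NLP: replace_false_endlines is a hypothetical helper, and it assumes you have already collected one label per \n character, for instance with [span.label_ for span in doc.spans["new_lines"]], which in turn assumes each span covers exactly one new line character. Only the "space" label shown above is relied on; any other label is treated as a real end of line, and the "end_line" string used here is a placeholder.

```python
def replace_false_endlines(text, labels):
    """Replace each new line labelled "space" with a single space.

    `labels` holds one label per "\n" in `text`, in order of appearance.
    Any label other than "space" is treated as a real end of line.
    """
    parts = text.split("\n")
    if len(labels) != len(parts) - 1:
        raise ValueError("expected exactly one label per new line character")

    out = [parts[0]]
    for label, part in zip(labels, parts[1:]):
        out.append(" " if label == "space" else "\n")
        out.append(part)
    return "".join(out)


# Hand-written labels for illustration, not actual model output.
note = (
    "Il doit prendre\n"
    "le medicament indiqué 3 fois par jour. Revoir médecin\n"
    "dans 1 mois.\n"
)
cleaned = replace_false_endlines(note, ["space", "space", "end_line"])
print(cleaned)
```

Working on plain strings and labels keeps the helper independent of the pipeline, at the cost of losing the token-level alignment that the Doc object provides.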

Declared extensions

The eds.endlines pipeline declares one spaCy extension, on both Span and Token objects. The end_line attribute is a boolean, set to True if the pipeline predicts that the new line is a real end of line, and to False if the new line is classified as a space.

The pipeline also sets the excluded custom attribute on new lines that are classified as spaces. This lets downstream matchers skip excluded tokens (see normalisation for more detail).


  1. Pierre Zweigenbaum, Cyril Grouin, and Thomas Lavergne. Une catégorisation de fins de lignes non-supervisée (End-of-line classification with no supervision). In Actes de la conférence conjointe JEP-TALN-RECITAL 2016, volume 2 : TALN (Posters), pages 364–371. Paris, France, July 2016. AFCP - ATALA. URL: https://aclanthology.org/2016.jeptalnrecital-poster.7
