
Endlines

The eds.endlines component classifies each newline character as either a true end of line or a mere space. In the latter case, the token is removed from the normalised document.

Behind the scenes, it uses an EndLinesModel instance, an unsupervised algorithm based on the work of Zweigenbaum et al., 2016.

Installation

To use this component, you need to install the scikit-learn library.
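As a quick sanity check before training, the sketch below tests whether scikit-learn is importable in the current environment. The helper name `has_scikit_learn` is made up for this illustration and is not part of EDS-NLP; the only facts relied on are that the package is installed as `scikit-learn` and imported as `sklearn`.

```python
import importlib.util


def has_scikit_learn() -> bool:
    """Return True if the `sklearn` package can be found in this environment."""
    # find_spec looks the package up without actually importing it
    return importlib.util.find_spec("sklearn") is not None


if not has_scikit_learn():
    print("scikit-learn is missing; install it with: pip install scikit-learn")
```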

Training

import edsnlp
from edsnlp.pipes.core.endlines.model import EndLinesModel

nlp = edsnlp.blank("eds")

texts = [
    """
Le patient est arrivé hier soir.
Il est accompagné par son fils

ANTECEDENTS
Il a fait une TS en 2010
Fumeur, il est arreté il a 5 mois
Chirurgie de coeur en 2011
CONCLUSION
Il doit prendre
le medicament indiqué 3 fois par jour. Revoir médecin
dans 1 mois.
DIAGNOSTIC :
Il aime le fromage...

Antecedents Familiaux:
- 1. Père avec diabete
""",
    """
J'aime le
fromage...
""",
]

docs = list(nlp.pipe(texts))

# Train and predict an EndLinesModel
endlines = EndLinesModel(nlp=nlp)

df = endlines.fit_and_predict(docs)
df.head()

PATH = "/tmp/path_to_save"
endlines.save(PATH)

Examples

import edsnlp, edsnlp.pipes as eds
from spacy.tokens import Span
from spacy import displacy

nlp = edsnlp.blank("eds")

PATH = "/tmp/path_to_save"
nlp.add_pipe(eds.endlines(model_path=PATH))

# `texts` is the list defined in the Training section above
docs = list(nlp.pipe(texts))

doc_exemple = docs[1]

doc_exemple.ents = tuple(
    Span(doc_exemple, token.i, token.i + 1, "excluded")
    for token in doc_exemple
    if token.tag_ == "EXCLUDED"
)

# The color key matches the "excluded" label assigned to the spans above
displacy.render(doc_exemple, style="ent", options={"colors": {"excluded": "red"}})

Extensions

The eds.endlines pipe declares one extension, end_line, on both Span and Token objects. This boolean attribute is set to True if the pipe predicts that the newline is a true end of line, and to False if the newline is classified as a space.

The pipe also sets the excluded custom attribute on newlines that are classified as spaces. This lets downstream matchers skip excluded tokens (see normalisation for more detail).
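To make the relationship between the two attributes concrete, here is a library-free sketch. It does not use EDS-NLP or spaCy: `Tok` and `normalise` are hypothetical names invented for this illustration, and the `end_line` flags are set by hand rather than predicted by a model. It only mimics the behaviour described above, where a newline classified as a space is excluded and dropped from the normalised text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tok:
    """Hypothetical stand-in for a token; not the spaCy or EDS-NLP Token class."""

    text: str
    end_line: Optional[bool] = None  # only meaningful on newline tokens

    @property
    def excluded(self) -> bool:
        # A newline classified as a space is excluded from the normalised text
        return self.text == "\n" and self.end_line is False


def normalise(tokens: List[Tok]) -> str:
    """Rebuild the text, dropping excluded newlines and keeping the others."""
    lines: List[List[str]] = [[]]
    for tok in tokens:
        if tok.excluded:
            continue  # newline classified as a space: dropped
        if tok.text == "\n":
            lines.append([])  # newline kept: start a new line
            continue
        lines[-1].append(tok.text)
    return "\n".join(" ".join(words) for words in lines)


# Hand-labelled example (assumed flags, not model output)
tokens = [
    Tok("J'aime"),
    Tok("le"),
    Tok("\n", end_line=False),  # mere space
    Tok("fromage..."),
    Tok("\n", end_line=True),  # true end of line
    Tok("CONCLUSION"),
]
print(normalise(tokens))
```

In this toy example the first newline disappears, so "J'aime le fromage..." ends up on a single line, while the second newline is kept and "CONCLUSION" starts a new line.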

Parameters

nlp: The pipeline object. TYPE: PipelineProtocol

name: The name of the component.

model_path: Path to trained model. If None, it will use a default model. TYPE: Optional[Union[str, EndLinesModel]] DEFAULT: None

Authors and citation

The eds.endlines pipe was developed by AP-HP's Data Science team based on the work of Zweigenbaum et al., 2016.


1. Zweigenbaum P., Grouin C. and Lavergne T., 2016. Une catégorisation de fins de lignes non-supervisée (End-of-line classification with no supervision). https://aclanthology.org/2016.jeptalnrecital-poster.7