Sentences

The eds.sentences matcher provides an alternative to spaCy's default sentencizer, aiming to overcome some of its limitations.

Indeed, the sentencizer merely looks at period characters to detect the end of a sentence, a strategy that often fails in clinical notes. The eds.sentences component also treats an end of line as a sentence boundary when the following token begins with an uppercase character, which slightly improves performance. It can additionally leverage expanded capitalization patterns and bullet-like list starters, which are frequent in structured medical documents.

Moreover, the eds.sentences component uses the outputs of eds.normalizer and eds.endlines by default when these components are added to the pipeline.
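The end-of-line rule described above can be illustrated with a minimal, self-contained sketch. It is a toy approximation, not the eds.sentences implementation: the function name split_sentences and the regular expression are assumptions made for illustration, and the sketch works on raw characters rather than on tokens.

```python
import re
from typing import List

# Hypothetical helper, for illustration only (not part of EDS-NLP).
# A boundary is either whitespace following a period, or one or more
# newlines followed by an uppercase letter (accented capitals included).
_BOUNDARY = re.compile(r"(?<=\.)\s+|\n+(?=\s*[A-ZÀ-ÖØ-Þ])")


def split_sentences(text: str) -> List[str]:
    """Split text on periods and on newlines followed by an uppercase letter."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


text = (
    "Le patient est admis pour une douleur à l'estomac\n"
    "Il lui était arrivé la même chose il y a deux ans.\n"
)

for sentence in split_sentences(text):
    print("<s>", sentence, "</s>")
# <s> Le patient est admis pour une douleur à l'estomac </s>
# <s> Il lui était arrivé la même chose il y a deux ans. </s>
```

A newline followed by a lowercase letter is not treated as a boundary in this sketch, so a line wrapped in the middle of a sentence stays in one piece.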

Examples

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())  # same as nlp.add_pipe("eds.sentences")

text = """Le patient est admis le 23 août 2021 pour une douleur à l'estomac
Il lui était arrivé la même chose il y a deux ans.
"""

doc = nlp(text)

for sentence in doc.sents:
    print("<s>", sentence, "</s>")
# Out: <s> Le patient est admis le 23 août 2021 pour une douleur à l'estomac
# Out:  </s>
# Out: <s> Il lui était arrivé la même chose il y a deux ans. </s>

With spaCy's default sentencizer, for comparison:

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe("sentencizer")

text = """Le patient est admis le 23 août 2021 pour une douleur à l'estomac
Il lui était arrivé la même chose il y a deux ans.
"""

doc = nlp(text)

for sentence in doc.sents:
    print("<s>", sentence, "</s>")
# Out: <s> Le patient est admis le 23 août 2021 pour une douleur à l'estomac
# Out: Il lui était arrivé la même chose il y a deux ans. </s>

Notice how EDS-NLP's implementation is more robust to ill-defined sentence endings.

Parameters

PARAMETER DESCRIPTION
nlp

The EDS-NLP pipeline

TYPE: Optional[PipelineProtocol] DEFAULT: None

name

The name of the component

TYPE: Optional[str] DEFAULT: None

punct_chars

Characters treated as sentence-ending punctuation.

use_endlines

Whether to use endlines prediction.

ignore_excluded

Whether to ignore excluded tokens.

check_capitalized

Whether to check for capitalized words after newlines or full stops.

capitalized_mode

Selects the preset of capitalized shapes used when check_capitalized=True and no explicit capitalized_shapes are provided.

TYPE: (Optional[str], {legacy, expanded}) DEFAULT: "expanded"

capitalized_shapes

Explicit capitalized shapes. When provided, they take precedence over the capitalized_mode preset.

min_newline_count

The minimum number of consecutive newlines required to trigger a newline-based sentence boundary.

hard_newline_count

The minimum number of consecutive newlines to force a sentence boundary, independently of capitalization. Use None to disable this rule.

use_bullet_start

Whether to check for bullet starters after newlines or full stops.

bullet_starters

Bullet starter characters.
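One possible reading of how min_newline_count, hard_newline_count and check_capitalized combine is sketched below. The function is_newline_boundary, its signature, and its default values are hypothetical and chosen for illustration only. They are inferred from the parameter descriptions above and may differ from the component's actual logic and defaults.

```python
from typing import Optional


def is_newline_boundary(
    newline_count: int,
    next_is_capitalized: bool,
    min_newline_count: int = 1,  # placeholder default, not the library's
    hard_newline_count: Optional[int] = None,  # None disables the hard rule
    check_capitalized: bool = True,  # placeholder default, not the library's
) -> bool:
    """Decide whether a run of newlines ends a sentence (illustrative only)."""
    # Enough consecutive newlines force a boundary, whatever follows.
    if hard_newline_count is not None and newline_count >= hard_newline_count:
        return True
    # Too few newlines: no newline-triggered boundary.
    if newline_count < min_newline_count:
        return False
    # Otherwise, optionally require a capitalized word after the newlines.
    return next_is_capitalized or not check_capitalized


# A single newline before a lowercase word is not a boundary...
print(is_newline_boundary(1, next_is_capitalized=False))
# False
# ...unless a blank line (two newlines) reaches the hard threshold.
print(is_newline_boundary(2, next_is_capitalized=False, hard_newline_count=2))
# True
```

Under this reading, hard_newline_count acts as an override for paragraph breaks, while min_newline_count only gates the capitalization-dependent rule.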

Authors and citation

The eds.sentences component was developed by AP-HP's Data Science team.