
Hypothesis

The eds.hypothesis pipeline uses a simple rule-based algorithm to detect spans that are speculations rather than certain statements.

Usage

The following snippet matches a simple terminology and checks whether the extracted entities are part of a speculation. It is complete and can be run as-is.

import spacy

nlp = spacy.blank("fr")
nlp.add_pipe("eds.sentences")
# Dummy matcher
nlp.add_pipe(
    "eds.matcher",
    config=dict(terms=dict(douleur="douleur", fracture="fracture")),
)
nlp.add_pipe("eds.hypothesis")

text = (
    "Le patient est admis le 23 août 2021 pour une douleur au bras. "
    "Possible fracture du radius."
)

doc = nlp(text)

doc.ents
# Out: (douleur, fracture)

doc.ents[0]._.hypothesis
# Out: False

doc.ents[1]._.hypothesis
# Out: True

Configuration

The pipeline can be configured using the following parameters:

| Parameter | Explanation | Default |
| --- | --- | --- |
| attr | spaCy attribute to match on (e.g. NORM, TEXT, LOWER) | "NORM" |
| pseudo | Pseudo-hypothesis patterns | None (use pre-defined patterns) |
| preceding | Preceding hypothesis patterns | None (use pre-defined patterns) |
| following | Following hypothesis patterns | None (use pre-defined patterns) |
| termination | Termination patterns (for syntagma/proposition extraction) | None (use pre-defined patterns) |
| verbs_hyp | Patterns for verbs that imply a hypothesis | None (use pre-defined patterns) |
| verbs_eds | Common verb patterns, checked for the conditional mood | None (use pre-defined patterns) |
| on_ents_only | Whether to qualify pre-extracted entities only | True |
| within_ents | Whether to look for hypothesis cues within entities | False |
| explain | Whether to keep track of the cues for each entity | False |
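
For instance, to supply custom preceding patterns and keep track of the detected cues, the eds.hypothesis line from the usage snippet could be replaced as sketched below. The pattern list is a made-up illustration, not the pre-defined patterns shipped with the pipeline.

# Illustrative configuration - the preceding patterns below are assumptions,
# not the pipeline's pre-defined defaults.
nlp.add_pipe(
    "eds.hypothesis",
    config=dict(
        attr="NORM",  # match on the normalised text
        preceding=["possible", "éventuel"],  # hypothetical custom cues
        on_ents_only=True,  # only qualify pre-extracted entities
        explain=True,  # keep the cues that triggered each decision
    ),
)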

Declared extensions

The eds.hypothesis pipeline declares two spaCy extensions, on both Span and Token objects:

  1. The hypothesis attribute is a boolean, set to True if the pipeline predicts that the span/token is a speculation.
  2. The hypothesis_ property is a human-readable string, computed from the hypothesis attribute. It implements a simple getter function that outputs HYP or CERT, depending on the value of hypothesis.
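
Continuing the usage snippet above, the getter yields the following (the string values follow the HYP/CERT convention described in point 2):

doc.ents[0]._.hypothesis_
# Out: 'CERT'

doc.ents[1]._.hypothesis_
# Out: 'HYP'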

Performance

The pipeline's performance is measured on three datasets:

  • The ESSAI[1] and CAS[2] datasets were developed at the CNRS. The two are concatenated for evaluation.
  • The NegParHyp corpus was developed specifically to test the pipeline on actual clinical notes, using pseudonymised documents from the EDS.

| Dataset | Hypothesis F1 |
| --- | --- |
| CAS/ESSAI | 49% |
| NegParHyp | 52% |

NegParHyp corpus

The NegParHyp corpus was built by matching a subset of the MeSH terminology with around 300 documents from AP-HP's clinical data warehouse. Matched entities were then labelled for negation, speculation and family context.

Authors and citation

The eds.hypothesis pipeline was developed by AP-HP's Data Science team.


  1. Clément Dalloux, Vincent Claveau, and Natalia Grabar. Détection de la négation : corpus français et apprentissage supervisé. In SIIM 2017 - Symposium sur l'Ingénierie de l'Information Médicale, 1–8. Toulouse, France, November 2017. URL: https://hal.archives-ouvertes.fr/hal-01659637

  2. Natalia Grabar, Vincent Claveau, and Clément Dalloux. CAS: French Corpus with Clinical Cases. In LOUHI 2018 - The Ninth International Workshop on Health Text Mining and Information Analysis, Ninth International Workshop on Health Text Mining and Information Analysis (LOUHI) Proceedings of the Workshop, 1–7. Bruxelles, France, October 2018. URL: https://hal.archives-ouvertes.fr/hal-01937096
