Hypothesis

The eds.hypothesis pipeline uses a simple rule-based algorithm to detect spans that are speculations rather than certain statements.

The component looks for five kinds of expressions in the text:

  • preceding hypothesis, i.e. cues that precede a hypothetical expression
  • following hypothesis, i.e. cues that follow a hypothetical expression
  • pseudo hypothesis: expressions that contain a hypothesis cue but do not express a hypothesis (e.g. "pas de doute"/"no doubt")
  • hypothetical verbs: verbs that indicate a hypothesis (e.g. "douter"/"to doubt")
  • classic verbs conjugated in the conditional, which therefore indicate a hypothesis
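
To make the interplay between cue types more concrete, here is a minimal sketch of a cue-based detector in plain Python. It is not the eds.hypothesis implementation: the cue lists, the function names and the character-offset logic are assumptions made for illustration, and the two verb-based cue types are left out. It only shows how a pseudo cue can mask a real cue and how preceding and following cues are checked against an entity's position.

```python
import re
from typing import List, Tuple

# Illustrative cue lists (assumptions, NOT the lists shipped with eds.hypothesis).
PRECEDING = ["possible", "doute", "suspicion de"]
FOLLOWING = ["à confirmer"]
PSEUDO = ["pas de doute"]


def find_cues(text: str, cues: List[str]) -> List[Tuple[int, int]]:
    """Return the (start, end) character offsets of every cue occurrence."""
    lowered = text.lower()
    spans = []
    for cue in cues:
        for match in re.finditer(r"\b" + re.escape(cue) + r"\b", lowered):
            spans.append((match.start(), match.end()))
    return spans


def is_hypothesis(sentence: str, ent_start: int, ent_end: int) -> bool:
    """Toy rule: an entity is speculative if an unmasked cue precedes or follows it."""
    pseudo = find_cues(sentence, PSEUDO)

    def masked(span: Tuple[int, int]) -> bool:
        # A cue lying inside a pseudo expression is ignored.
        return any(p[0] <= span[0] and span[1] <= p[1] for p in pseudo)

    preceding = [s for s in find_cues(sentence, PRECEDING) if not masked(s)]
    following = [s for s in find_cues(sentence, FOLLOWING) if not masked(s)]

    has_preceding = any(end <= ent_start for _, end in preceding)
    has_following = any(start >= ent_end for start, _ in following)
    return has_preceding or has_following


def check(sentence: str, entity: str) -> bool:
    """Convenience wrapper: locate the entity by substring search, then apply the rule."""
    start = sentence.lower().index(entity)
    return is_hypothesis(sentence, start, start + len(entity))
```

In this sketch, check("Possible fracture du radius.", "fracture") is speculative because "possible" precedes the entity, whereas in "Pas de doute sur la fracture." the cue "doute" is masked by the pseudo expression "pas de doute".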

Examples

The following snippet matches a simple terminology, and checks whether the extracted entities are part of a speculation. It is complete and can be run as is.

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.sentences")
# Dummy matcher
nlp.add_pipe(
    "eds.matcher",
    config=dict(terms=dict(douleur="douleur", fracture="fracture")),
)
nlp.add_pipe("eds.hypothesis")

text = (
    "Le patient est admis le 23 août 2021 pour une douleur au bras. "
    "Possible fracture du radius."
)

doc = nlp(text)

doc.ents
# Out: (douleur, fracture)

doc.ents[0]._.hypothesis
# Out: False

doc.ents[1]._.hypothesis
# Out: True

Extensions

The eds.hypothesis component declares two extensions, on both Span and Token objects:

  1. The hypothesis attribute is a boolean, set to True if the component predicts that the span/token is a speculation.
  2. The hypothesis_ property is a human-readable string, computed from the hypothesis attribute. It implements a simple getter function that outputs HYP or CERT, depending on the value of hypothesis.
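
The relationship between the two extensions can be sketched in plain Python. The class below is a hypothetical stand-in for a Span or Token, used only to show a getter that derives the string label from the boolean; it is not how the library registers its extensions.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedSpan:
    """Hypothetical stand-in for a span carrying the `hypothesis` attribute."""

    text: str
    hypothesis: bool = False

    @property
    def hypothesis_(self) -> str:
        # Read-only label computed from the boolean, as described above.
        return "HYP" if self.hypothesis else "CERT"


certain = AnnotatedSpan("douleur", hypothesis=False)
speculative = AnnotatedSpan("fracture", hypothesis=True)
```

Because the label is computed on access, it cannot drift out of sync with the boolean: changing hypothesis changes hypothesis_ on the next read.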

Performance

The component's performance is measured on three datasets:

  • The ESSAI (Dalloux et al., 2017) and CAS (Grabar et al., 2018) datasets were developed at the CNRS. The two are concatenated for evaluation.
  • The NegParHyp corpus was developed specifically at AP-HP's clinical data warehouse (CDW) to test the component on actual, pseudonymised clinical notes.

Dataset      Hypothesis F1
CAS/ESSAI    49%
NegParHyp    52%
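
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below uses the standard definition; this page does not state how the reported scores were computed (per entity or per token, micro or macro), and the counts in the usage note are made up for illustration.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1: harmonic mean of precision and recall, 0.0 when undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With hypothetical counts of 50 true positives, 50 false positives and 50 false negatives, precision and recall are both 0.5 and so is F1.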

NegParHyp corpus

The NegParHyp corpus was built by matching a subset of the MeSH terminology with around 300 documents from AP-HP's clinical data warehouse. Matched entities were then labelled for negation, speculation and family context.

Parameters

PARAMETER DESCRIPTION
nlp

The pipeline object.

TYPE: PipelineProtocol

name

The component name.

TYPE: Optional[str] DEFAULT: 'eds.hypothesis'

pseudo

List of pseudo hypothesis cues.

TYPE: Optional[List[str]] DEFAULT: None

preceding

List of preceding hypothesis cues.

TYPE: Optional[List[str]] DEFAULT: None

following

List of following hypothesis cues.

TYPE: Optional[List[str]] DEFAULT: None

verbs_hyp

List of hypothetical verbs.

TYPE: Optional[List[str]] DEFAULT: None

verbs_eds

List of mainstream verbs, which indicate a hypothesis when conjugated in the conditional.

TYPE: Optional[List[str]] DEFAULT: None

termination

List of termination terms.

TYPE: Optional[List[str]] DEFAULT: None

attr

spaCy's attribute to use: a string with the value "TEXT" or "NORM", or a dict with the key 'term_attr'

TYPE: str DEFAULT: NORM

span_getter

Which entities should be classified. Defaults to doc.ents.

TYPE: SpanGetterArg DEFAULT: None

on_ents_only

Deprecated, use span_getter instead.

Whether to look for matches around detected entities only. Useful for faster inference in downstream tasks.

  • If True, only the entities in doc.ents are considered.
  • If an iterable of strings is passed, the spans in doc.spans[key] are also considered, for each key in the iterable.

TYPE: Union[bool, str, List[str], Set[str]] DEFAULT: None

within_ents

Whether to consider cues within entities.

TYPE: bool DEFAULT: False

explain

Whether to keep track of cues for each entity.

TYPE: bool DEFAULT: False
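
As an illustration of how these parameters might be passed, the sketch below builds a configuration dictionary and hands it to add_pipe in the same way as the matcher in the example above. The cue values are hypothetical, and this page does not say whether a custom list replaces or extends the built-in cues, so check that before relying on it. The edsnlp import is deferred into the function so that the configuration can be inspected without the library installed.

```python
# Hypothetical custom cues; the values are illustrative only.
hypothesis_config = dict(
    attr="NORM",
    preceding=["possible", "suspicion de"],
    following=["à confirmer"],
    pseudo=["pas de doute"],
    within_ents=False,
    explain=True,
)


def build_pipeline():
    """Build a pipeline with a customised eds.hypothesis component."""
    import edsnlp  # deferred import, see the note above

    nlp = edsnlp.blank("eds")
    nlp.add_pipe("eds.sentences")
    # Dummy matcher, as in the example at the top of the page
    nlp.add_pipe(
        "eds.matcher",
        config=dict(terms=dict(fracture="fracture")),
    )
    nlp.add_pipe("eds.hypothesis", config=hypothesis_config)
    return nlp
```

Calling build_pipeline() returns a pipeline that can be applied to a text as in the earlier example, for instance doc = build_pipeline()("Possible fracture du radius.").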

Authors and citation

The eds.hypothesis pipeline was developed by AP-HP's Data Science team.


  1. Dalloux C., Claveau V. and Grabar N., 2017. Détection de la négation : corpus français et apprentissage supervisé.

  2. Grabar N., Claveau V. and Dalloux C., 2018. CAS: French Corpus with Clinical Cases.