Nested Named Entity Recognition

The default spaCy Named Entity Recognizer (NER) pipeline only allows flat entity recognition, meaning that overlapping and nested entities cannot be extracted.

spaCy's other component, SpanCategorizer, only assigns to a single span group. Neither component is well suited to extracting entities with ill-defined boundaries, which can occur when the training data contains long, hard-to-delimit entities.

We propose the new eds.nested_ner component to extract almost any named entity:

  • flat entities (like spaCy's EntityRecognizer)
  • overlapping entities of different labels (much like spaCy's SpanCategorizer)
  • entities with ill-defined boundaries

However, the model cannot currently extract entities that are nested inside larger entities of the same label.

The pipeline assigns both doc.ents (in which overlapping entities are filtered out) and doc.spans.
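The difference between the two outputs can be sketched in plain Python. This is only an illustration: spans are hypothetical `(start, end, label)` token offsets, and the longest-first rule is an assumption borrowed from spaCy's `spacy.util.filter_spans`, not necessarily the rule the component applies.

```python
def filter_overlaps(spans):
    """Keep the longest spans first and drop any span overlapping a kept one.

    spans: list of (start, end, label) tuples, with `end` exclusive.
    Assumption: longest-first priority, ties broken by earliest start.
    """
    kept = []
    used_tokens = set()
    by_length = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    for start, end, label in by_length:
        if not any(i in used_tokens for i in range(start, end)):
            kept.append((start, end, label))
            used_tokens.update(range(start, end))
    return sorted(kept)


# Hypothetical predictions: two overlapping entities with different labels
predicted = [(2, 3, "drug"), (2, 5, "criteria")]

# What a doc.spans-like structure would hold: everything, grouped by label
span_groups = {}
for start, end, label in predicted:
    span_groups.setdefault(label, []).append((start, end))

# What a doc.ents-like structure would hold: a flat, non-overlapping subset
flat_entities = filter_overlaps(predicted)
```

Here `span_groups` keeps both entities, while `flat_entities` keeps only the longer one.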

Architecture

The model performs token classification using the BIOUL (Begin, Inside, Outside, Unary, Last) tagging scheme. To extract overlapping entities, each label has its own tag sequence, so the model predicts $n_{labels}$ sequences of O, I, B, L, U tags. The architecture is displayed in the figure below.
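To make the per-label tag sequences concrete, here is a minimal sketch that encodes spans as BIOUL tags, one sequence per label. The function name and the token offsets are hypothetical and only serve the illustration; this is not the library's encoding code.

```python
def spans_to_bioul(n_tokens, spans):
    """Encode spans as one BIOUL tag sequence per label.

    spans: list of (start, end, label) tuples, with `end` exclusive.
    Returns a dict mapping each label to a list of n_tokens tags.
    """
    tags = {}
    for start, end, label in spans:
        seq = tags.setdefault(label, ["O"] * n_tokens)
        if end - start == 1:
            seq[start] = "U"  # single-token entity
        else:
            seq[start] = "B"
            for i in range(start + 1, end - 1):
                seq[i] = "I"
            seq[end - 1] = "L"
    return tags


tokens = ["Arret", "du", "folfox", "si", "inefficace"]
# Hypothetical annotations: a one-token "drug" inside a longer "criteria"
annotations = [(2, 3, "drug"), (2, 5, "criteria")]
tags = spans_to_bioul(len(tokens), annotations)
# tags["drug"]     -> O O U O O
# tags["criteria"] -> O O B I L
```

Because each label owns a single sequence, two overlapping entities with different labels never compete for the same tag slot. Two nested entities sharing a label would have to be written into the same sequence, which is why that case cannot be represented.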

To enforce the tagging scheme (for example, I cannot follow O; it can only follow B or another I), we use a stack of CRF (Conditional Random Field) layers, one per label, during both training and prediction.
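The effect of such constraints can be sketched for a single label with a small Viterbi decoder in which forbidden transitions are simply never considered. This is an illustrative sketch only: the names are hypothetical, the scores are made up, and a real CRF layer also carries transition scores rather than only hard constraints.

```python
import math

TAGS = ("O", "B", "I", "L", "U")
# Allowed next tags for each tag under the BIOUL scheme
ALLOWED = {
    "O": {"O", "B", "U"},
    "B": {"I", "L"},
    "I": {"I", "L"},
    "L": {"O", "B", "U"},
    "U": {"O", "B", "U"},
}
START = ("O", "B", "U")  # tags that may open a sequence
END = ("O", "L", "U")  # tags that may close a sequence


def constrained_decode(scores):
    """Return the best-scoring valid tag sequence.

    scores: list of {tag: score} dicts, one per token.
    """
    best = {t: (scores[0][t] if t in START else -math.inf) for t in TAGS}
    backpointers = []
    for token_scores in scores[1:]:
        new_best, pointers = {}, {}
        for tag in TAGS:
            # Only predecessors that are allowed to precede `tag`
            candidates = [p for p in TAGS if tag in ALLOWED[p]]
            prev = max(candidates, key=lambda p: best[p])
            new_best[tag] = best[prev] + token_scores[tag]
            pointers[tag] = prev
        best = new_best
        backpointers.append(pointers)
    path = [max(END, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Made-up scores for three tokens
scores = [
    {"O": 2.0, "B": 1.0, "I": 0.0, "L": 0.0, "U": 0.0},
    {"O": 0.0, "B": 0.0, "I": 3.0, "L": 1.0, "U": 0.5},
    {"O": 2.0, "B": 0.0, "I": 0.0, "L": 1.0, "U": 0.0},
]
greedy = [max(s, key=s.get) for s in scores]  # O, I, O: not a valid sequence
decoded = constrained_decode(scores)  # B, I, L: best valid sequence
```

Picking the best tag per token independently yields an I directly after an O, whereas the constrained decoder returns a well-formed entity.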

Nested NER architecture

Usage

Let us define the pipeline and train it:

from pathlib import Path

import spacy

from edsnlp.connectors.brat import BratConnector
from edsnlp.utils.training import train, make_spacy_corpus_config

tmp_path = Path("/tmp/test-nested-ner")

nlp = spacy.blank("eds")
# ↓ below is the nested ner pipeline ↓
# you can configure it using the `add_pipe(..., config=...)` parameter
nlp.add_pipe("nested_ner")

# Train the model, with additional training configuration
nlp = train(
    nlp,
    output_path=tmp_path / "model",
    config=dict(
        **make_spacy_corpus_config(
            train_data="/path/to/the/training/set/brat/files",
            dev_data="/path/to/the/dev/set/brat/files",
            nlp=nlp,
            data_format="brat",
        ),
        training=dict(
            max_steps=4000,
        ),
    ),
)

# Finally, we can run the pipeline on a new document
doc = nlp("Arret du folfox si inefficace")
doc.spans["drug"]
# Out: [folfox]

doc.spans["criteria"]
# Out: [folfox si inefficace]

# And export new predictions as Brat annotations
predicted_docs = BratConnector("/path/to/the/new/files", run_pipe=True).brat2docs(nlp)
BratConnector("/path/to/predictions").docs2brat(predicted_docs)

Configuration

The pipeline component can be configured using the following parameters:

| Parameter | Explanation | Default |
| --- | --- | --- |
| ent_labels | Labels to search in and assign to doc.ents. Expects a list. | None (inits to all labels in doc.ents) |
| spans_labels | Labels to search in and assign to doc.spans. Expects a dict of lists. | None (inits to all span groups and their labels in doc.spans) |
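As a sketch of how these two parameters might be filled in, the configuration below restricts doc.ents to one label and declares two span groups. The label names are hypothetical, and the reading of `spans_labels` as a mapping from span group name to the labels it holds is an assumption based on the "dict of lists" description.

```python
# Hypothetical labels, for illustration only
nested_ner_config = dict(
    # Only "drug" entities are written to doc.ents
    ent_labels=["drug"],
    # Assumed layout: span group name -> labels stored in that group
    spans_labels={
        "drug": ["drug"],
        "criteria": ["criteria"],
    },
)

# With EDS-NLP installed, the dict would be passed through `config`:
# nlp.add_pipe("nested_ner", config=nested_ner_config)
```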

The default model eds.nested_ner_model.v1 can be configured using the following parameters:

| Parameter | Explanation | Default |
| --- | --- | --- |
| loss_mode | How the CRF loss is computed | joint |

loss_mode accepts the following values:

  • joint: Loss accounts for CRF transitions
  • independent: Loss does not account for CRF transitions (softmax loss)
  • marginal: Tag scores are smoothly updated with CRF transitions, and softmax loss is applied
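As a rough guide to the first two modes, the standard linear-chain CRF formulation for a single label can be written as below. The notation is introduced here and is not taken from the EDS-NLP source: $e_t(k)$ is the score of tag $k$ at token $t$, $T$ is the transition score matrix, $y$ is the gold tag sequence of length $n$, and $y'$ ranges over all tag sequences. The library's exact implementation may differ.

$$
s(x, y) = \sum_{t=1}^{n} e_t(y_t) + \sum_{t=2}^{n} T(y_{t-1}, y_t)
$$

$$
\mathcal{L}_{joint} = -\log \frac{\exp s(x, y)}{\sum_{y'} \exp s(x, y')}
\qquad
\mathcal{L}_{independent} = -\sum_{t=1}^{n} \log \frac{\exp e_t(y_t)}{\sum_{k} \exp e_t(k)}
$$

The joint loss normalizes over whole sequences, so the transitions $T$ take part in training, while the independent loss treats every token as a separate softmax classification.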

Authors and citation

The eds.nested_ner pipeline was developed by AP-HP's Data Science team.

The deep learning model was adapted from Wajsbürt [1].

  1. Perceval Wajsbürt. Extraction and normalization of simple and structured entities in medical documents. PhD thesis, Sorbonne Université, December 2021. URL: https://hal.archives-ouvertes.fr/tel-03624928