Trainable NER

The eds.ner_crf component is a general purpose trainable named entity recognizer. It can extract:

flat entities
overlapping entities of different labels

However, the model cannot currently extract entities that are nested inside larger entities of the same label.

It is based on a CRF (Conditional Random Field) layer and should therefore work well on datasets composed of entities with ill-defined boundaries. We offer a compromise between speed and performance by allowing the user to specify a window size for the CRF layer. The smaller the window, the faster the model will be, but at the cost of degraded performance.

The pipeline assigns both doc.ents (in which overlapping entities are filtered out) and doc.spans. These destinations can be inferred from the target_span_getter parameter, combined with the post_init step.

Architecture

The model performs token classification using the BIOUL (Begin, Inside, Outside, Unary, Last) tagging scheme. To extract overlapping entities, each label has its own tag sequence, so the model predicts n_labels sequences of O, I, B, L, U tags. The architecture is displayed in the figure below.
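To make the per-label tag sequences concrete, here is a minimal sketch in plain Python. It is not the component's actual implementation: the function `bioul_tags`, the `(start, end, label)` span representation and the example labels are hypothetical and serve only to illustrate the encoding.

```python
from typing import Dict, List, Tuple

# Hypothetical span representation: (start token, end token exclusive, label)
Span = Tuple[int, int, str]


def bioul_tags(
    n_tokens: int, spans: List[Span], labels: List[str]
) -> Dict[str, List[str]]:
    """Build one BIOUL tag sequence per label from token-level spans."""
    tags = {label: ["O"] * n_tokens for label in labels}
    for start, end, label in spans:
        seq = tags[label]
        if end - start == 1:
            seq[start] = "U"  # single-token entity
        else:
            seq[start] = "B"  # first token of the entity
            for i in range(start + 1, end - 1):
                seq[i] = "I"  # tokens strictly inside the entity
            seq[end - 1] = "L"  # last token of the entity
    return tags


# Two entities with different labels overlap on token 3
tags = bioul_tags(
    n_tokens=6,
    spans=[(1, 4, "disease"), (3, 5, "anatomy"), (5, 6, "disease")],
    labels=["disease", "anatomy"],
)
print(tags["disease"])  # ['O', 'B', 'I', 'L', 'O', 'U']
print(tags["anatomy"])  # ['O', 'O', 'O', 'B', 'L', 'O']
```

Because each label owns a separate sequence, the overlap on token 3 is represented without conflict. In this sketch, two nested spans with the same label would write into the same sequence, which is consistent with the limitation on same-label nesting stated above.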

To enforce the tagging scheme (e.g., I cannot follow O, only B or I), we use a stack of CRF (Conditional Random Field) layers, one per label, during both training and prediction.
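The constraints a CRF layer can enforce may be pictured as a table of allowed transitions. The sketch below uses the usual BIOUL rules and assumes they match the ones applied by the component. The names `ALLOWED_NEXT` and `is_valid` are hypothetical and not part of the library.

```python
from typing import Dict, List, Set

# Assumed BIOUL transition rules (illustrative, not taken from the library):
# after O, L or U we are outside any entity, so the next tag is O, B or U;
# after B or I we are inside an entity, so the next tag is I or L.
ALLOWED_NEXT: Dict[str, Set[str]] = {
    "O": {"O", "B", "U"},
    "L": {"O", "B", "U"},
    "U": {"O", "B", "U"},
    "B": {"I", "L"},
    "I": {"I", "L"},
}


def is_valid(seq: List[str]) -> bool:
    """Check that a tag sequence for one label respects the BIOUL scheme."""
    if not seq:
        return True
    if seq[0] in {"I", "L"}:  # cannot start in the middle of an entity
        return False
    if seq[-1] in {"B", "I"}:  # cannot end with an unfinished entity
        return False
    return all(nxt in ALLOWED_NEXT[cur] for cur, nxt in zip(seq, seq[1:]))


print(is_valid(["O", "B", "I", "L", "U", "O"]))  # True
print(is_valid(["O", "I", "L"]))  # False: I follows O
```

In a CRF, forbidden transitions of this kind are typically given a very low score, so that decoding never selects them.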

Examples

Let us define a pipeline composed of a transformer and a NER component.

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.ner_crf(
        embedding=eds.transformer(
            model="prajjwal1/bert-tiny",
            window=128,
            stride=96,
        ),
        mode="joint",
        target_span_getter="ner-gold",
        span_setter="ents",
        window=10,
    ),
    name="ner"
)

To train the model, refer to the Training tutorial.

Extensions

Experimental Confidence Score

The NER confidence score feature is experimental and the API and underlying algorithm may change.

The eds.ner_crf pipeline declares one extension on the Span object:

span._.ner_confidence_score: The confidence score of the Named Entity Recognition (NER) model for the given span.

The ner_confidence_score is computed based on the Average Entity Confidence Score using the following formula:

$$ \text{Average Entity Confidence Score} = \frac{1}{n} \sum_{i \in \text{tokens}} (1 - p(O)_i) $$

Where:

$n$ is the number of tokens in the span.
$\text{tokens}$ refers to the tokens within the span.
$p(O)_i$ represents the probability of token $i$ belonging to class 'O' (Outside entity).
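As a worked illustration of the formula, the following sketch computes the score from a list of per-token probabilities of the 'O' class. The function name `average_entity_confidence` and the input values are hypothetical. How the component obtains $p(O)_i$ internally is not shown here.

```python
from typing import Sequence


def average_entity_confidence(p_outside: Sequence[float]) -> float:
    """Mean of (1 - p(O)_i) over the tokens of a span."""
    if not p_outside:
        raise ValueError("the span must contain at least one token")
    return sum(1.0 - p for p in p_outside) / len(p_outside)


# Hypothetical span of three tokens with p(O) = 0.1, 0.2 and 0.3
score = average_entity_confidence([0.1, 0.2, 0.3])
print(round(score, 3))  # approximately 0.8
```

A span whose tokens are all confidently inside an entity (low $p(O)_i$) gets a score close to 1, while a span containing tokens the model considers likely to be outside gets a lower score.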

Confidence score is not computed by default

By default, the confidence score is not computed, as it adds around 5% to inference time. You can enable its computation with:

nlp.pipes.ner.compute_confidence_score = True

Parameters

PARAMETER	DESCRIPTION
`nlp`	The pipeline object TYPE: `PipelineProtocol` DEFAULT: `None`
`name`	Name of the component TYPE: `str` DEFAULT: `'ner_crf'`
`embedding`	The word embedding component TYPE: `WordEmbeddingComponent`
`target_span_getter`	Method to call to get the gold spans from a document, for scoring or training. By default, takes all entities in `doc.ents`, but we recommend you specify a given span group name instead. TYPE: `SpanGetterArg` DEFAULT: `{'ents': True}`
`labels`	The labels to predict. The labels can also be inferred from the data during `nlp.post_init(...)` TYPE: `List[str]` DEFAULT: `None`
`span_setter`	The span setter to use to set the predicted spans on the Doc object. If None, the component will infer the span setter from the target_span_getter config. TYPE: `Optional[SpanSetterArg]` DEFAULT: `None`
`infer_span_setter`	Whether to complete the span setter from the target_span_getter config. False by default, unless the span_setter is None. TYPE: `Optional[bool]` DEFAULT: `None`
`context_getter`	What context to use when computing the span embeddings (defaults to the whole document). For example `{"section": "conclusion"}` to only extract the entities from the conclusion. TYPE: `Optional[SpanGetterArg]` DEFAULT: `None`
`mode`	The CRF mode to use: independent, joint or marginal TYPE: `Literal['independent', 'joint', 'marginal']`
`window`	The window size to use for the CRF. If 0, will use the whole document, at the cost of a longer computation time. If 1, this is equivalent to assuming that the tags are independent, and the component will be faster, but with degraded performance. Empirically, we found that a window size of 10 or 20 works well. TYPE: `int` DEFAULT: `40`
`stride`	The stride to use for the CRF windows. Defaults to `window // 2`. TYPE: `Optional[int]` DEFAULT: `None`

Authors and citation

The eds.ner_crf pipeline was developed by AP-HP's Data Science team.

The deep learning model was adapted from Wajsbürt, 2021.

Wajsbürt P., 2021. Extraction and normalization of simple and structured entities in medical documents. https://hal.archives-ouvertes.fr/tel-03624928