NER Metrics
We provide several metrics to evaluate the performance of Named Entity Recognition (NER) components. Let's look at an example and see how they differ. We'll use the following two documents: a reference document (ref) and a document with predicted entities (pred).
Shared example
| pred | ref |
|---|---|
| [La](PER) [patiente](PER) a une [fièvre aiguë](DIS). | La [patiente](PER) a [une fièvre](DIS) aiguë. |
Let's create matching documents in EDS-NLP using the following code snippet:
from edsnlp.data.converters import MarkupToDocConverter
conv = MarkupToDocConverter(preset="md", span_setter="entities")
pred = conv("[La](PER) [patiente](PER) a une [fièvre aiguë](DIS).")
ref = conv("La [patiente](PER) a [une fièvre](DIS) aiguë.")
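To check what was parsed, we can list the annotated spans. Assuming the converter stores them in the "entities" span group (as the span_setter argument above suggests), this prints the (text, label) pairs used throughout this page:
print([(e.text, e.label_) for e in pred.spans["entities"]])
# [('La', 'PER'), ('patiente', 'PER'), ('fièvre aiguë', 'DIS')]
print([(e.text, e.label_) for e in ref.spans["entities"]])
# [('patiente', 'PER'), ('une fièvre', 'DIS')]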
Summary of metrics
The table below shows the different scores depending on the metric used.
| Metric | Precision | Recall | F1 |
|---|---|---|---|
| Span-level exact | 0.33 | 0.5 | 0.40 |
| Token-level | 0.50 | 0.67 | 0.57 |
| Span-level overlap | 0.67 | 1.0 | 0.80 |
Span-level NER metric with exact match
The eds.ner_exact metric scores the extracted entities (that may be overlapping or nested) by looking in the spans returned by a given SpanGetter object and comparing predicted spans to gold spans for exact boundary and label matches.
Let's view these elements as collections of (span → label) and count how many of the predicted spans match the gold spans exactly (and vice versa):
| pred | ref |
|---|---|
| La (PER) | patiente (PER) ✓ |
| patiente (PER) ✓ | une fièvre (DIS) |
| fièvre aiguë (DIS) | |
Only patiente (PER) matches exactly on both sides: 1 of the 3 predicted spans and 1 of the 2 gold spans are counted as correct.
Precision, Recall and F1 (micro-average and per-label) are computed as follows:
- Precision: p = |matched items of pred| / |pred|
- Recall: r = |matched items of ref| / |ref|
- F1: f = 2 / (1/p + 1/r)
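Plugging in the counts from this example:
p = 1 / 3 ≈ 0.33
r = 1 / 2 = 0.50
f = 2 / (1/p + 1/r) = 2 / (3 + 2) = 0.40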
Examples
from edsnlp.metrics.ner import NerExactMetric
metric = NerExactMetric(span_getter=conv.span_setter, micro_key="micro")
metric([ref], [pred])
# Out: {
# 'micro': {'f': 0.4, 'p': 0.33, 'r': 0.5, 'tp': 1, 'support': 2, 'positives': 3},
# 'PER': {'f': 0.67, 'p': 0.5, 'r': 1, 'tp': 1, 'support': 1, 'positives': 2},
# 'DIS': {'f': 0.0, 'p': 0.0, 'r': 0.0, 'tp': 0, 'support': 1, 'positives': 1},
# }
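In these results, tp counts the matched spans, positives the predicted spans and support the gold spans, so that p = tp / positives and r = tp / support.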
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| span_getter | The span getter to use to extract the spans from the document. TYPE: SpanGetterArg |
| micro_key | The key to use to store the micro-averaged results for spans of all types. TYPE: str |
| filter_expr | The filter expression to use to filter the documents. TYPE: Optional[str] |
Span-level NER metric with approximate match
The eds.ner_overlap metric scores the extracted entities (which may be overlapping or nested) by looking in the spans returned by a given SpanGetter object and counting a predicted span as correct if its Dice overlap with a gold span of the same label reaches the given threshold.
This metric is useful for evaluating NER systems where the exact boundaries do not matter too much, but the presence of the entity at the same spot is important. For instance, you may not want to penalize a system that forgets determiners if the rest of the entity is correctly identified.
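For reference, the Dice coefficient between two spans used here is twice the number of tokens they share, divided by the sum of their lengths in tokens. Below is a minimal sketch of that computation for spaCy-style spans (using their start and end token indices); it only illustrates the formula and is not the metric's internal implementation:
def token_dice(a, b):
    # Number of tokens shared by the two contiguous spans
    # (assumes both spans are indexed against the same tokenization)
    shared = max(0, min(a.end, b.end) - max(a.start, b.start))
    # Dice coefficient: 2 * |A ∩ B| / (|A| + |B|)
    return 2 * shared / ((a.end - a.start) + (b.end - b.start))

# In the shared example, "fièvre aiguë" (pred) and "une fièvre" (ref)
# share one token out of 2 + 2 tokens, so their Dice coefficient is
# 2 * 1 / (2 + 2) = 0.5, just enough to pass a threshold of 0.5.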
Let's view these elements as sets of (span → label) and count how many of the predicted spans match the gold spans by at least the given Dice coefficient (and vice versa):
| pred | ref |
|---|---|
| La (PER) | patiente (PER) ✓ |
| patiente (PER) ✓ | une fièvre (DIS) ✓ |
| fièvre aiguë (DIS) ✓ | |
With a threshold of 0.5, patiente (PER) matches exactly and fièvre aiguë (DIS) overlaps une fièvre (DIS) with a Dice coefficient of 0.5, so 2 of the 3 predicted spans and both gold spans are counted as correct.
Precision, Recall and F1 (micro-average and per-label) are computed as follows:
- Precision: p = |matched items of pred| / |pred|
- Recall: r = |matched items of ref| / |ref|
- F1: f = 2 / (1/p + 1/r)
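Plugging in the counts from this example:
p = 2 / 3 ≈ 0.67
r = 2 / 2 = 1.00
f = 2 / (1/p + 1/r) = 2 / (1.5 + 1) = 0.80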
Overlap threshold
The threshold is the minimum Dice coefficient to consider two spans as overlapping. Setting it to 1.0 will yield the same results as the eds.ner_exact metric, while setting it to a near-zero value (e.g., 1e-14) will match any two spans that share at least one token.
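As a sanity check, raising the threshold to 1.0 should give back the eds.ner_exact figures on our example (this reuses the conv, ref and pred objects defined above):
from edsnlp.metrics.ner import NerOverlapMetric
strict = NerOverlapMetric(
    span_getter=conv.span_setter, micro_key="micro", threshold=1.0
)
strict([ref], [pred])
# Expected micro scores: the same as with NerExactMetric above,
# i.e. f=0.4, p=0.33, r=0.5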
Examples
from edsnlp.metrics.ner import NerOverlapMetric
metric = NerOverlapMetric(
span_getter=conv.span_setter, micro_key="micro", threshold=0.5
)
metric([ref], [pred])
# Out: {
# 'micro': {'f': 0.8, 'p': 0.67, 'r': 1.0, 'tp': 2, 'support': 2, 'positives': 3},
# 'PER': {'f': 0.67, 'p': 0.5, 'r': 1.0, 'tp': 1, 'support': 1, 'positives': 2},
# 'DIS': {'f': 1.0, 'p': 1.0, 'r': 1.0, 'tp': 1, 'support': 1, 'positives': 1}
# }
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| span_getter | The span getter to use to extract the spans from the document. TYPE: SpanGetterArg |
| micro_key | The key to use to store the micro-averaged results for spans of all types. TYPE: str |
| filter_expr | The filter expression to use to filter the documents. TYPE: Optional[str] |
| threshold | The threshold on the Dice coefficient to consider two spans as overlapping. TYPE: float |
Token-level NER metric
The eds.ner_token metric scores the extracted entities (which may be overlapping or nested) by looking in the spans returned by a given SpanGetter object and comparing the predicted and gold entities at the token level.
Assuming we use the eds (or fr or en) tokenizer, in the above example, there are 3 annotated tokens in the reference, and 4 annotated tokens in the prediction. Let's view these elements as sets of (token, label) and count how many of the predicted tokens match the gold tokens exactly (and vice versa):
| pred | ref |
|---|---|
| La (PER) | patiente (PER) ✓ |
| patiente (PER) ✓ | une (DIS) |
| fièvre (DIS) ✓ | fièvre (DIS) ✓ |
| aiguë (DIS) | |
Only the tokens patiente (PER) and fièvre (DIS) appear on both sides: 2 of the 4 predicted tokens and 2 of the 3 gold tokens are counted as correct.
Precision, Recall and F1 (micro-average and per-label) are computed as follows:
- Precision: p = |matched items of pred| / |pred|
- Recall: r = |matched items of ref| / |ref|
- F1: f = 2 / (1/p + 1/r)
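Plugging in the counts from this example:
p = 2 / 4 = 0.50
r = 2 / 3 ≈ 0.67
f = 2 / (1/p + 1/r) = 2 / (2 + 1.5) ≈ 0.57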
Examples
from edsnlp.metrics.ner import NerTokenMetric
metric = NerTokenMetric(span_getter=conv.span_setter, micro_key="micro")
metric([ref], [pred])
# Out: {
# 'micro': {'f': 0.57, 'p': 0.5, 'r': 0.67, 'tp': 2, 'support': 3, 'positives': 4},
# 'PER': {'f': 0.67, 'p': 0.5, 'r': 1, 'tp': 1, 'support': 1, 'positives': 2},
# 'DIS': {'f': 0.5, 'p': 0.5, 'r': 0.5, 'tp': 1, 'support': 2, 'positives': 2}
# }
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| span_getter | The span getter to use to extract the spans from the document. TYPE: SpanGetterArg |
| micro_key | The key to use to store the micro-averaged results for spans of all types. TYPE: str |
| filter_expr | The filter expression to use to filter the documents. TYPE: Optional[str] |