NER Metrics
We provide several metrics to evaluate the performance of Named Entity Recognition (NER) components. Let's look at an example and see how they differ. We'll use the following two documents: a reference document (ref) and a document with predicted entities (pred).
Shared example
| pred | ref |
|---|---|
| [La](PER) [patiente](PER) a une [fièvre aiguë](DIS). | La [patiente](PER) a [une fièvre](DIS) aiguë. |
Let's create matching documents in EDS-NLP using the following code snippet:
from edsnlp.data.converters import MarkupToDocConverter
conv = MarkupToDocConverter(preset="md", span_setter="entities")
pred = conv("[La](PER) [patiente](PER) a une [fièvre aiguë](DIS).")
ref = conv("La [patiente](PER) a [une fièvre](DIS) aiguë.")
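To check what was parsed, we can list the annotated spans. Assuming the converter stores them in the "entities" span group (as the span_setter argument above suggests), this prints the (text, label) pairs used throughout this page:
print([(e.text, e.label_) for e in pred.spans["entities"]])
# [('La', 'PER'), ('patiente', 'PER'), ('fièvre aiguë', 'DIS')]
print([(e.text, e.label_) for e in ref.spans["entities"]])
# [('patiente', 'PER'), ('une fièvre', 'DIS')]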
Summary of metrics
The table below shows the different scores depending on the metric used.
| Metric | Precision | Recall | F1 |
|---|---|---|---|
| Span-level exact | 0.33 | 0.5 | 0.40 |
| Token-level | 0.50 | 0.67 | 0.57 |
| Span-level overlap | 0.67 | 1.0 | 0.80 |
Span-level NER metric with exact match
The eds.ner_exact metric scores the extracted entities (that may be overlapping or nested) by looking in the spans returned by a given SpanGetter object and comparing predicted spans to gold spans for exact boundary and label matches.
Let's view these elements as collections of (span → label) and count how many of the predicted spans match the gold spans exactly (and vice versa):
| pred | ref |
|---|---|
| La (PER) | patiente (PER) ✓ |
| patiente (PER) ✓ | une fièvre (DIS) |
| fièvre aiguë (DIS) | |
Only patiente (PER) matches exactly on both sides: 1 of the 3 predicted spans and 1 of the 2 gold spans are counted as correct.
Precision, Recall and F1 (micro-average and per-label) are computed as follows:
- Precision: p = |matched items of pred| / |pred|
- Recall: r = |matched items of ref| / |ref|
- F1: f = 2 / (1/p + 1/r)
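Plugging in the counts from this example:
p = 1 / 3 ≈ 0.33
r = 1 / 2 = 0.50
f = 2 / (1/p + 1/r) = 2 / (3 + 2) = 0.40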
Examples
from edsnlp.metrics.ner import NerExactMetric
metric = NerExactMetric(span_getter=conv.span_setter, micro_key="micro")
metric([ref], [pred])
# Out: {
# 'micro': {'f': 0.4, 'p': 0.33, 'r': 0.5, 'tp': 1, 'support': 2, 'positives': 3},
# 'PER': {'f': 0.67, 'p': 0.5, 'r': 1, 'tp': 1, 'support': 1, 'positives': 2},
# 'DIS': {'f': 0.0, 'p': 0.0, 'r': 0.0, 'tp': 0, 'support': 1, 'positives': 1},
# }
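In these results, tp counts the matched spans, positives the predicted spans and support the gold spans, so that p = tp / positives and r = tp / support.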
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| span_getter | The span getter to use to extract the spans from the document. TYPE: SpanGetterArg |
| micro_key | The key to use to store the micro-averaged results for spans of all types. TYPE: str |
| filter_expr | The filter expression to use to filter the documents. TYPE: Optional[str] |
Span-level NER metric with approximate match
The eds.ner_overlap metric scores the extracted entities (which may be overlapping or nested) by looking in the spans returned by a given SpanGetter object and counting a predicted span as correct if its Dice overlap with a gold span of the same label reaches the given threshold.
This metric is useful for evaluating NER systems where the exact boundaries do not matter too much, but the presence of the entity at the same spot is important. For instance, you may not want to penalize a system that forgets determiners if the rest of the entity is correctly identified.
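For reference, the Dice coefficient between two spans used here is twice the number of tokens they share, divided by the sum of their lengths in tokens. Below is a minimal sketch of that computation for spaCy-style spans (using their start and end token indices); it only illustrates the formula and is not the metric's internal implementation:
def token_dice(a, b):
    # Number of tokens shared by the two contiguous spans
    # (assumes both spans are indexed against the same tokenization)
    shared = max(0, min(a.end, b.end) - max(a.start, b.start))
    # Dice coefficient: 2 * |A ∩ B| / (|A| + |B|)
    return 2 * shared / ((a.end - a.start) + (b.end - b.start))

# In the shared example, "fièvre aiguë" (pred) and "une fièvre" (ref)
# share one token out of 2 + 2 tokens, so their Dice coefficient is
# 2 * 1 / (2 + 2) = 0.5, just enough to pass a threshold of 0.5.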
Let's view these elements as sets of (span → label) and count how many of the predicted spans match the gold spans by at least the given Dice coefficient (and vice versa):
| pred | ref |
|---|---|
| La (PER) | patiente (PER) ✓ |
| patiente (PER) ✓ | une fièvre (DIS) ✓ |
| fièvre aiguë (DIS) ✓ | |
With a threshold of 0.5, patiente (PER) matches exactly and fièvre aiguë (DIS) overlaps une fièvre (DIS) with a Dice coefficient of 0.5, so 2 of the 3 predicted spans and both gold spans are counted as correct.
Precision, Recall and F1 (micro-average and per-label) are computed as follows:
- Precision: p = |matched items of pred| / |pred|
- Recall: r = |matched items of ref| / |ref|
- F1: f = 2 / (1/p + 1/r)
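Plugging in the counts from this example:
p = 2 / 3 ≈ 0.67
r = 2 / 2 = 1.00
f = 2 / (1/p + 1/r) = 2 / (1.5 + 1) = 0.80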
Overlap threshold
The threshold is the minimum Dice coefficient to consider two spans as overlapping. Setting it to 1.0 will yield the same results as the eds.ner_exact metric, while setting it to a near-zero value (e.g., 1e-14) will match any two spans that share at least one token.
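As a sanity check, raising the threshold to 1.0 should give back the eds.ner_exact figures on our example (this reuses the conv, ref and pred objects defined above):
from edsnlp.metrics.ner import NerOverlapMetric
strict = NerOverlapMetric(
    span_getter=conv.span_setter, micro_key="micro", threshold=1.0
)
strict([ref], [pred])
# Expected micro scores: the same as with NerExactMetric above,
# i.e. f=0.4, p=0.33, r=0.5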
Examples
from edsnlp.metrics.ner import NerOverlapMetric
metric = NerOverlapMetric(
span_getter=conv.span_setter, micro_key="micro", threshold=0.5
)
metric([ref], [pred])
# Out: {
# 'micro': {'f': 0.8, 'p': 0.67, 'r': 1.0, 'tp': 2, 'support': 2, 'positives': 3},
# 'PER': {'f': 0.67, 'p': 0.5, 'r': 1.0, 'tp': 1, 'support': 1, 'positives': 2},
# 'DIS': {'f': 1.0, 'p': 1.0, 'r': 1.0, 'tp': 1, 'support': 1, 'positives': 1}
# }
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| span_getter | The span getter to use to extract the spans from the document. TYPE: SpanGetterArg |
| micro_key | The key to use to store the micro-averaged results for spans of all types. TYPE: str |
| filter_expr | The filter expression to use to filter the documents. TYPE: Optional[str] |
| threshold | The threshold on the Dice coefficient to consider two spans as overlapping. TYPE: float |
Token-level NER metric
The eds.ner_token metric scores the extracted entities (which may be overlapping or nested) by looking in the spans returned by a given SpanGetter object and comparing the predicted and gold entities at the token level.
Assuming we use the eds (or fr or en) tokenizer, in the above example, there are 3 annotated tokens in the reference, and 4 annotated tokens in the prediction. Let's view these elements as sets of (token, label) and count how many of the predicted tokens match the gold tokens exactly (and vice versa):
| pred | ref |
|---|---|
| La (PER) | patiente (PER) ✓ |
| patiente (PER) ✓ | une (DIS) |
| fièvre (DIS) ✓ | fièvre (DIS) ✓ |
| aiguë (DIS) | |
Only the tokens patiente (PER) and fièvre (DIS) appear on both sides: 2 of the 4 predicted tokens and 2 of the 3 gold tokens are counted as correct.
Precision, Recall and F1 (micro-average and per-label) are computed as follows:
- Precision: p = |matched items of pred| / |pred|
- Recall: r = |matched items of ref| / |ref|
- F1: f = 2 / (1/p + 1/r)
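Plugging in the counts from this example:
p = 2 / 4 = 0.50
r = 2 / 3 ≈ 0.67
f = 2 / (1/p + 1/r) = 2 / (2 + 1.5) ≈ 0.57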
Examples
from edsnlp.metrics.ner import NerTokenMetric
metric = NerTokenMetric(span_getter=conv.span_setter, micro_key="micro")
metric([ref], [pred])
# Out: {
# 'micro': {'f': 0.57, 'p': 0.5, 'r': 0.67, 'tp': 2, 'support': 3, 'positives': 4},
# 'PER': {'f': 0.67, 'p': 0.5, 'r': 1, 'tp': 1, 'support': 1, 'positives': 2},
# 'DIS': {'f': 0.5, 'p': 0.5, 'r': 0.5, 'tp': 1, 'support': 2, 'positives': 2}
# }
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| span_getter | The span getter to use to extract the spans from the document. TYPE: SpanGetterArg |
| micro_key | The key to use to store the micro-averaged results for spans of all types. TYPE: str |
| filter_expr | The filter expression to use to filter the documents. TYPE: Optional[str] |