Span Attribute Classification Metrics
Several NLP tasks consist of classifying existing spans of text into multiple classes, such as negation or hypothesis detection, or span linking. We provide a metric to evaluate the performance of such tasks.
Let's look at an example. We'll use the following two documents: a reference document (ref) and a document with predicted attributes (pred).
pred | ref |
---|---|
Le patient n'est pas fièvreux, son père a du diabète. Pas d'évolution du cancer. | Le patient n'est pas fièvreux, son père a du diabète. Pas d'évolution du cancer. |
We can quickly create matching documents in EDS-NLP using the following code snippet:
```python
from edsnlp.data.converters import MarkupToDocConverter

conv = MarkupToDocConverter(preset="md", span_setter="entities")

# Create a document with predicted attributes and a reference document
pred = conv(
    "Le patient n'est pas [fièvreux](SYMP neg=true), "
    "son père a [du diabète](DIS neg=false carrier=PATIENT). "
    "Pas d'évolution du [cancer](DIS neg=true carrier=PATIENT)."
)
ref = conv(
    "Le patient n'est pas [fièvreux](SYMP neg=true), "
    "son père a [du diabète](DIS neg=false carrier=FATHER). "
    "Pas d'évolution du [cancer](DIS neg=false carrier=PATIENT)."
)
```
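To check what the converter produced, we can peek at the entities and their attributes. This is a minimal sketch: it assumes, as implied by the metric configuration below, that the converter stores the entities in `doc.spans["entities"]` and exposes the markup attributes (`neg`, `carrier`) as span extensions.

```python
# Print each entity with the attributes parsed from the markup
for ent in pred.spans["entities"]:
    print(ent.text, ent.label_, getattr(ent._, "neg", None), getattr(ent._, "carrier", None))
```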
The `eds.span_attribute` metric evaluates span-level attribute classification by comparing predicted and gold attribute values on the same set of spans. For each attribute you specify, it computes Precision, Recall, F1, the number of true positives (`tp`), the number of gold instances (`support`), the number of predicted instances (`positives`), and the Average Precision (`ap`). A micro-average over all attributes is also reported under `micro_key`.
```python
from edsnlp.metrics.span_attribute import SpanAttributeMetric

metric = SpanAttributeMetric(
    span_getter=conv.span_setter,
    # Evaluated attributes
    attributes={
        "neg": True,  # 'neg' on every entity
        "carrier": ["DIS"],  # 'carrier' only on 'DIS' entities
    },
    # Ignore these default values when counting matches
    default_values={
        "neg": False,
    },
    micro_key="micro",
)
```
Let's enumerate the (span → attr = value) items in our documents. Only the items with matching span boundaries, attribute name, and value are counted as true positives. For instance, with the predicted and reference spans of the example above:
pred | ref |
---|---|
fièvreux → neg = True | fièvreux → neg = True |
*du diabète → neg = False* | *du diabète → neg = False* |
du diabète → carrier = PATIENT | du diabète → carrier = FATHER |
cancer → neg = True | *cancer → neg = False* |
cancer → carrier = PATIENT | cancer → carrier = PATIENT |
Default values
Note that we don't count the "neg=False" items, shown in italics in the table above. In EDS-NLP, this is done by setting `default_values={"neg": False}`
when creating the metric. This is quite common in classification tasks, where one of the values is both the most frequent and the "default" (hence the name of the parameter). Counting these values would skew the micro-averaged metrics towards the default value.
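For instance, counting the "neg=False" items in the example above would add one predicted and two gold items, one of which (du diabète → neg = False) matches trivially: we would then get tp = 3 out of 5 predicted and 5 gold items, i.e. p = r = 0.6, a score that partly reflects agreement on the default value rather than on the informative ones.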
Precision, Recall and F1 (micro-averaged and per-attribute) are computed as follows:
- Precision:
  p = |matched items of pred| / |items of pred|
- Recall:
  r = |matched items of ref| / |items of ref|
- F1:
  f = 2 / (1/p + 1/r)
This yields the following metrics:
```python
metric([ref], [pred])
# Out: {
#     'micro': {'f': 0.57, 'p': 0.5, 'r': 0.67, 'tp': 2, 'support': 3, 'positives': 4, 'ap': 0.17},
#     'neg': {'f': 0.67, 'p': 0.5, 'r': 1, 'tp': 1, 'support': 1, 'positives': 2, 'ap': 0.0},
#     'carrier': {'f': 0.5, 'p': 0.5, 'r': 0.5, 'tp': 1, 'support': 2, 'positives': 2, 'ap': 0.25},
# }
```
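As a sanity check, we can recompute the micro-averaged scores by hand from the items enumerated in the table above. Spans are identified by their text here for readability, whereas the metric matches them on their exact boundaries:

```python
# (span, attribute, value) items kept for scoring, with the default
# "neg=False" items left out
pred_items = {
    ("fièvreux", "neg", True),
    ("du diabète", "carrier", "PATIENT"),
    ("cancer", "neg", True),
    ("cancer", "carrier", "PATIENT"),
}
ref_items = {
    ("fièvreux", "neg", True),
    ("du diabète", "carrier", "FATHER"),
    ("cancer", "carrier", "PATIENT"),
}

tp = len(pred_items & ref_items)  # 2 matched items
p = tp / len(pred_items)          # precision: 2 / 4 = 0.5
r = tp / len(ref_items)           # recall:    2 / 3 ≈ 0.67
f = 2 / (1 / p + 1 / r)           # F1:        ≈ 0.57
```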
Parameters
PARAMETER | DESCRIPTION |
---|---|
span_getter | The span getter used to extract the spans to evaluate from each document. |
attributes | Mapping from each attribute name to the entity labels it should be evaluated on, or True to evaluate it on every entity. |
default_values | Attribute values to omit from the micro-average counts (e.g., common negative or default labels). |
include_falsy | Whether falsy attribute values (e.g., False, None or empty strings) should be counted. |
micro_key | Key under which to store the micro-averaged results across all attributes. |
filter_expr | A Python expression (using the doc variable) used to filter the documents on which the metric is computed. |

RETURNS | DESCRIPTION |
---|---|
Dict[str, Dict[str, float]] | A dictionary mapping each attribute name (and the micro_key) to a dictionary with the f, p, r, tp, support, positives and ap values. |