Span Attribute Classification Metrics

Several NLP tasks involve classifying existing spans of text into one of several classes, such as negation or hypothesis detection, or span linking. We provide a metric to evaluate the performance of such tasks.

Let's look at an example. We'll use two documents: a reference document (ref) and a document with predicted attributes (pred).

Both documents share the same text; only the attributes on the annotated spans differ:

Le patient n'est pas fièvreux, son père a du diabète. Pas d'évolution du cancer.

We can quickly create these two documents in EDS-NLP using the following code snippet:

from edsnlp.data.converters import MarkupToDocConverter

conv = MarkupToDocConverter(preset="md", span_setter="entities")
# Create a document with predicted attributes and a reference document
pred = conv(
    "Le patient n'est pas [fièvreux](SYMP neg=true), "
    "son père a [du diabète](DIS neg=false carrier=PATIENT). "
    "Pas d'évolution du [cancer](DIS neg=true carrier=PATIENT)."
)
ref = conv(
    "Le patient n'est pas [fièvreux](SYMP neg=true), "
    "son père a [du diabète](DIS neg=false carrier=FATHER). "
    "Pas d'évolution du [cancer](DIS neg=false carrier=PATIENT)."
)

The eds.span_attribute metric evaluates span‐level attribute classification by comparing predicted and gold attribute values on the same set of spans. For each attribute you specify, it computes Precision, Recall, F1, number of true positives (tp), number of gold instances (support), number of predicted instances (positives), and the Average Precision (ap). A micro‐average over all attributes is also provided under micro_key.

from edsnlp.metrics.span_attribute import SpanAttributeMetric

metric = SpanAttributeMetric(
    span_getter=conv.span_setter,
    # Evaluated attributes
    attributes={
        "neg": True,  # 'neg' on every entity
        "carrier": ["DIS"],  # 'carrier' only on 'DIS' entities
    },
    # Ignore these default values when counting matches
    default_values={
        "neg": False,
    },
    micro_key="micro",
)

Let's enumerate the (span -> attr = value) items in our documents; a short code sketch after the lists below shows how to do this programmatically. Only items with matching span boundaries, attribute name, and value are counted as true positives. For instance, with the predicted and reference spans of the example above:

pred

fièvreux → neg = True
du diabète → neg = False
du diabète → carrier = PATIENT
cancer → neg = True
cancer → carrier = PATIENT

ref

fièvreux → neg = True
du diabète → neg = False
du diabète → carrier = FATHER
cancer → neg = False
cancer → carrier = PATIENT
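
These items can also be listed programmatically. The sketch below reuses the pred and ref documents created above and assumes the converter exposed the markup attributes as Span extensions (e.g. span._.neg, span._.carrier) on the spans of the "entities" group; this is an assumption of the sketch, not a statement about the metric's internals.

# List (span -> attribute = value) items for both documents.
# Assumption: attributes are accessible as Span extensions.
for name, doc in (("pred", pred), ("ref", ref)):
    print(name)
    for span in doc.spans["entities"]:
        for attr in ("neg", "carrier"):
            value = getattr(span._, attr, None)
            if value is not None:
                print(f"  {span.text} → {attr} = {value}")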

Default values

Note that we don't count the "neg=False" items listed above. In EDS-NLP, this is done by setting default_values={"neg": False} when creating the metric. This is quite common in classification tasks, where one of the values is both the most frequent and the "default" (hence the name of the parameter). Counting these values would skew the micro-average metrics towards the default value.

Precision, Recall and F1 (micro-averaged and per attribute) are computed as follows:

  • Precision: p = |matched items of pred| / |pred|
  • Recall: r = |matched items of ref| / |ref|
  • F1: f = 2 / (1/p + 1/r)
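
As a quick sanity check, the micro-level counts from the example above (2 matched items, 4 predicted items, 3 reference items) can be plugged into these formulas directly:

# Recompute the micro-averaged scores by hand from the counts above
tp, positives, support = 2, 4, 3
p = tp / positives           # precision = 0.5
r = tp / support             # recall ≈ 0.67
f = 2 / (1 / p + 1 / r)      # harmonic mean of p and r ≈ 0.57
print(round(p, 2), round(r, 2), round(f, 2))
# Out: 0.5 0.67 0.57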

This yields the following metrics:

metric([ref], [pred])
# Out: {
#   'micro': {'f': 0.57, 'p': 0.5, 'r': 0.67, 'tp': 2, 'support': 3, 'positives': 4, 'ap': 0.17},
#   'neg': {'f': 0.67, 'p': 0.5, 'r': 1, 'tp': 1, 'support': 1, 'positives': 2, 'ap': 0.0},
#   'carrier': {'f': 0.5, 'p': 0.5, 'r': 0.5, 'tp': 1, 'support': 2, 'positives': 2, 'ap': 0.25},
# }

Parameters

span_getter : SpanGetterArg
    The span getter to extract spans from each Doc.

attributes : Mapping[str, Union[bool, Sequence[str]]], default: None
    Map each attribute name to True (evaluate it on all spans) or to a sequence of labels restricting which spans to test.

default_values : Dict[str, Any], default: {}
    Attribute values to omit from micro-average counts (e.g., common negative or default labels).

include_falsy : bool, default: False
    If False, ignore falsy values (e.g., False, None, '') in predictions or gold when computing metrics; if True, count them.

micro_key : str, default: 'micro'
    Key under which to store the micro-averaged results across all attributes.

filter_expr : Optional[str], default: None
    A Python expression (using doc) to filter which examples are scored.
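
As an illustration of filter_expr, the following sketch (with hypothetical values, not taken from the example above) would only score documents that contain at least one annotated span:

filtered_metric = SpanAttributeMetric(
    span_getter="entities",
    attributes={"neg": True},
    default_values={"neg": False},
    # Hypothetical filter: skip documents without annotated spans
    filter_expr="len(doc.spans['entities']) > 0",
)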

Returns

Dict[str, Dict[str, float]]
    A dictionary mapping each attribute name (and the micro_key) to its metrics:

      • p : precision
      • r : recall
      • f : F1 score
      • tp : true positive count
      • support : number of gold instances
      • positives : number of predicted instances
      • ap : average precision
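
For example, with the documents above, individual scores can be read directly from the returned dictionary:

scores = metric([ref], [pred])
print(scores["micro"]["f"])    # micro-averaged F1 ≈ 0.57
print(scores["neg"]["f"])      # F1 for the 'neg' attribute ≈ 0.67
print(scores["carrier"]["f"])  # F1 for the 'carrier' attribute = 0.5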