Span Attribute Classification Metrics

Several NLP tasks involve classifying existing spans of text into one of several classes, such as negation or hypothesis detection, or span linking. We provide a metric to evaluate the performance of such tasks.

Let's look at an example. We'll use two documents: a reference document (ref) and a document with predicted attributes (pred).

Both documents share the same text; only the attributes on the annotated spans differ:

Le patient n'est pas fièvreux, son père a du diabète. Pas d'évolution du cancer.

We can quickly create these two documents in EDS-NLP using the following code snippet:

from edsnlp.data.converters import MarkupToDocConverter

conv = MarkupToDocConverter(preset="md", span_setter="entities")
# Create a document with predicted attributes and a reference document
pred = conv(
    "Le patient n'est pas [fièvreux](SYMP neg=true), "
    "son père a [du diabète](DIS neg=false carrier=PATIENT). "
    "Pas d'évolution du [cancer](DIS neg=true carrier=PATIENT)."
)
ref = conv(
    "Le patient n'est pas [fièvreux](SYMP neg=true), "
    "son père a [du diabète](DIS neg=false carrier=FATHER). "
    "Pas d'évolution du [cancer](DIS neg=false carrier=PATIENT)."
)

The eds.span_attribute metric evaluates span‐level attribute classification by comparing predicted and gold attribute values on the same set of spans. For each attribute you specify, it computes Precision, Recall, F1, number of true positives (tp), number of gold instances (support), number of predicted instances (positives), and the Average Precision (ap). A micro‐average over all attributes is also provided under micro_key.

from edsnlp.metrics.span_attribute import SpanAttributeMetric

metric = SpanAttributeMetric(
    span_getter=conv.span_setter,
    # Evaluated attributes
    attributes={
        "neg": True,  # 'neg' on every entity
        "carrier": ["DIS"],  # 'carrier' only on 'DIS' entities
    },
    # Ignore these default values when counting matches
    default_values={
        "neg": False,
    },
    micro_key="micro",
)

Let's enumerate the (span -> attr = value) items in our documents; a short code sketch after the lists below shows how to do this programmatically. Only items with matching span boundaries, attribute name, and value are counted as true positives. For instance, with the predicted and reference spans of the example above:

pred

fièvreux → neg = True
du diabète → neg = False
du diabète → carrier = PATIENT
cancer → neg = True
cancer → carrier = PATIENT

ref

fièvreux → neg = True
du diabète → neg = False
du diabète → carrier = FATHER
cancer → neg = False
cancer → carrier = PATIENT
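
These items can also be listed programmatically. The sketch below reuses the pred and ref documents created above and assumes the converter exposed the markup attributes as Span extensions (e.g. span._.neg, span._.carrier) on the spans of the "entities" group; this is an assumption of the sketch, not a statement about the metric's internals.

# List (span -> attribute = value) items for both documents.
# Assumption: attributes are accessible as Span extensions.
for name, doc in (("pred", pred), ("ref", ref)):
    print(name)
    for span in doc.spans["entities"]:
        for attr in ("neg", "carrier"):
            value = getattr(span._, attr, None)
            if value is not None:
                print(f"  {span.text} → {attr} = {value}")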

Default values

Note that we don't count the "neg=False" items listed above. In EDS-NLP, this is done by setting default_values={"neg": False} when creating the metric. This is quite common in classification tasks, where one of the values is both the most frequent and the "default" (hence the name of the parameter). Counting these values would skew the micro-average metrics towards the default value.

Precision, Recall and F1 (micro-averaged and per attribute) are computed as follows:

  • Precision: p = |matched items of pred| / |pred|
  • Recall: r = |matched items of ref| / |ref|
  • F1: f = 2 / (1/p + 1/r)
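
As a quick sanity check, the micro-level counts from the example above (2 matched items, 4 predicted items, 3 reference items) can be plugged into these formulas directly:

# Recompute the micro-averaged scores by hand from the counts above
tp, positives, support = 2, 4, 3
p = tp / positives           # precision = 0.5
r = tp / support             # recall ≈ 0.67
f = 2 / (1 / p + 1 / r)      # harmonic mean of p and r ≈ 0.57
print(round(p, 2), round(r, 2), round(f, 2))
# Out: 0.5 0.67 0.57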

This yields the following metrics:

metric([ref], [pred])
# Out: {
#   'micro': {'f': 0.57, 'p': 0.5, 'r': 0.67, 'tp': 2, 'support': 3, 'positives': 4, 'ap': 0.17},
#   'neg': {'f': 0.67, 'p': 0.5, 'r': 1, 'tp': 1, 'support': 1, 'positives': 2, 'ap': 0.0},
#   'carrier': {'f': 0.5, 'p': 0.5, 'r': 0.5, 'tp': 1, 'support': 2, 'positives': 2, 'ap': 0.25},
# }

Parameters

span_getter : SpanGetterArg
    The span getter to extract spans from each Doc.

attributes : Mapping[str, Union[bool, Sequence[str]]], default: None
    Map each attribute name to True (evaluate it on all spans) or to a sequence of labels restricting which spans to test.

default_values : Dict[str, Any], default: {}
    Attribute values to omit from micro-average counts (e.g., common negative or default labels).

include_falsy : bool, default: False
    If False, ignore falsy values (e.g., False, None, '') in predictions or gold when computing metrics; if True, count them.

micro_key : str, default: 'micro'
    Key under which to store the micro-averaged results across all attributes.

filter_expr : Optional[str], default: None
    A Python expression (using doc) to filter which examples are scored.
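
As an illustration of filter_expr, the following sketch (with hypothetical values, not taken from the example above) would only score documents that contain at least one annotated span:

filtered_metric = SpanAttributeMetric(
    span_getter="entities",
    attributes={"neg": True},
    default_values={"neg": False},
    # Hypothetical filter: skip documents without annotated spans
    filter_expr="len(doc.spans['entities']) > 0",
)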

Returns

Dict[str, Dict[str, float]]
    A dictionary mapping each attribute name (and the micro_key) to its metrics:

      • p : precision
      • r : recall
      • f : F1 score
      • tp : true positive count
      • support : number of gold instances
      • positives : number of predicted instances
      • ap : average precision
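
For example, with the documents above, individual scores can be read directly from the returned dictionary:

scores = metric([ref], [pred])
print(scores["micro"]["f"])    # micro-averaged F1 ≈ 0.57
print(scores["neg"]["f"])      # F1 for the 'neg' attribute ≈ 0.67
print(scores["carrier"]["f"])  # F1 for the 'carrier' attribute = 0.5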