
Span qualification

We propose the new span_qualifier component to qualify (i.e. assign attributes on) any span with machine learning. In this context, the span qualification task consists of assigning values (booleans, strings or any complex object) to attributes/extensions of spans such as:

  • span.label_,
  • span._.negation,
  • span._.date.mode
  • etc.

Architecture

The underlying eds.span_multi_classifier.v1 model performs span classification by:

  1. Pooling the word embeddings (mean, max or sum) into a single embedding per span
  2. Computing logits for each possible binding (i.e. qualifier-value assignment)
  3. Splitting these bindings into independent groups such as

    • event_type=start and event_type=stop
    • negated=False and negated=True
  4. Learning or predicting one combination among the legal combinations of these bindings. For instance, in the second group, we can't have both negated=True and negated=False, so the legal combinations are [(1, 0), (0, 1)]

  5. Assigning bindings on spans depending on the predicted results
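
The steps above can be sketched in plain Python. This is only an illustration, not the actual eds.span_multi_classifier.v1 implementation: the function names, the binding vectors and the assumption that exactly one binding per group is active are all hypothetical.

```python
def pool(token_vectors, mode="mean"):
    """Pool the token vectors of one span into a single vector."""
    columns = list(zip(*token_vectors))  # one tuple per embedding dimension
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "sum":
        return [sum(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode!r}")


def binding_logits(span_vector, binding_vectors):
    """One logit per binding: dot product with a (hypothetical) binding vector."""
    return [sum(x * w for x, w in zip(span_vector, vec)) for vec in binding_vectors]


def legal_combinations(group_size):
    """Assumption: exactly one binding of a group may be active at a time."""
    return [tuple(int(i == j) for j in range(group_size)) for i in range(group_size)]


def best_combination(logits):
    """Return the legal combination with the highest total logit."""
    return max(
        legal_combinations(len(logits)),
        key=lambda combo: sum(l * on for l, on in zip(logits, combo)),
    )


# Two tokens with 2-dimensional embeddings, pooled into one span vector
span_vector = pool([[1.0, 0.0], [3.0, 2.0]], mode="mean")
# Hypothetical vectors for the two bindings of a binary group
logits = binding_logits(span_vector, [[0.5, 0.0], [1.0, 1.0]])
print(legal_combinations(2))     # [(1, 0), (0, 1)]
print(best_combination(logits))  # (0, 1)
```

Enumerating the legal combinations explicitly keeps mutually exclusive bindings from being predicted together, at the cost of one score per combination.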

Under the hood

Initialization

During the initialization of the pipeline, the span_qualifier component will gather all spans that match the on_ents and on_span_groups patterns (or that are returned by the candidate_getter function). It will then list all possible values for each qualifier of the qualifiers list and store every possible (qualifier, value) pair (i.e. binding).

For instance, a custom qualifier negation with possible values True and False will result in the following bindings [("_.negation", True), ("_.negation", False)], while a custom qualifier event_type with possible values start, stop, and start-stop will result in the following bindings [("_.event_type", "start"), ("_.event_type", "stop"), ("_.event_type", "start-stop")].
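
As a rough illustration of this enumeration, the sketch below collects bindings from spans represented as plain dictionaries. The collect_bindings helper and the dictionary representation are assumptions made for the example; the component itself works on spaCy spans.

```python
def collect_bindings(spans, qualifiers):
    """Return every (qualifier, value) pair observed on the given spans.

    Each span is a plain dict mapping a qualifier name to its value
    (a stand-in for real spaCy spans, used only for illustration).
    """
    bindings = []
    for qualifier in qualifiers:
        seen = []  # keep first-seen order, avoid duplicates
        for span in spans:
            value = span.get(qualifier)
            if value is not None and value not in seen:
                seen.append(value)
        bindings.extend((qualifier, value) for value in seen)
    return bindings


spans = [
    {"_.negation": True, "_.event_type": "start"},
    {"_.negation": False, "_.event_type": "stop"},
    {"_.negation": False, "_.event_type": "start-stop"},
    {"_.negation": True},  # a span without any event type
]
bindings = collect_bindings(spans, ["_.negation", "_.event_type"])
print(bindings)
```

With these four spans, the result contains the two negation bindings followed by the three event_type bindings listed above.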

Training

During training, the span_qualifier component will gather spans on the documents in a mini-batch and evaluate each binding on each span to build a supervision matrix. This matrix will be fed to the underlying model (most likely an eds.span_multi_classifier.v1). The model will compute logits for each entry of the matrix and compute a cross-entropy loss for each group of bindings sharing the same qualifier. The loss will not be computed for entries that violate the label_constraints parameter (for instance, the event_type qualifier can only be assigned to spans with the event label).
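
The training step can be pictured with the following simplified sketch. It assumes spans are plain dictionaries and that each group receives a softmax cross-entropy; the helper names (supervision_matrix, group_cross_entropy, span_loss) are hypothetical and do not come from the library.

```python
import math


def supervision_matrix(spans, bindings):
    """1 where a span carries a (qualifier, value) binding, 0 elsewhere."""
    return [[int(span.get(q) == v) for q, v in bindings] for span in spans]


def group_cross_entropy(logits, gold_index):
    """Cross-entropy of a softmax taken over the logits of one group."""
    top = max(logits)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_norm - logits[gold_index]


def span_loss(span_label, group_logits, gold, label_constraints):
    """Sum group losses, skipping qualifiers not allowed for this span label."""
    total = 0.0
    for qualifier, logits in group_logits.items():
        allowed = label_constraints.get(qualifier)
        if allowed is not None and span_label not in allowed:
            continue  # constraint violated: no loss for this group
        total += group_cross_entropy(logits, gold[qualifier])
    return total


bindings = [("_.negation", True), ("_.negation", False), ("_.event_type", "start")]
spans = [{"_.negation": False, "_.event_type": "start"}, {"_.negation": True}]
print(supervision_matrix(spans, bindings))  # [[0, 1, 1], [1, 0, 0]]

constraints = {"_.event_type": ("event",)}
group_logits = {"_.negation": [0.0, 0.0], "_.event_type": [0.0, 0.0, 0.0]}
gold = {"_.negation": 1, "_.event_type": 0}
loss_event = span_loss("event", group_logits, gold, constraints)  # both groups
loss_drug = span_loss("drug", group_logits, gold, constraints)    # negation only
```

A span labelled drug contributes no loss for the event_type group, which mirrors the label_constraints behaviour described above.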

Prediction

During prediction, the span_qualifier component will gather spans on a given document and evaluate each binding on each span using the underlying model. Using the same binding exclusion and label constraint mechanisms as during training, scores will be computed for each binding and the best legal combination of bindings will be selected. Finally, the selected bindings will be assigned to the spans.
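
A minimal sketch of this selection step is given below, assuming that each qualifier forms one mutually exclusive group, so that the best legal combination reduces to the highest-scoring value per group. The predict_bindings function and the score dictionaries are illustrative only.

```python
def predict_bindings(span_label, group_scores, label_constraints):
    """Pick, for each qualifier allowed on this span, its best-scoring value.

    group_scores maps a qualifier to a {value: score} dict.
    Qualifiers forbidden by label_constraints are left as None.
    """
    predicted = {}
    for qualifier, scores in group_scores.items():
        allowed = label_constraints.get(qualifier)
        if allowed is not None and span_label not in allowed:
            predicted[qualifier] = None
            continue
        predicted[qualifier] = max(scores, key=scores.get)
    return predicted


constraints = {"_.event_type": ("event",)}
scores = {
    "_.negation": {True: -0.3, False: 1.2},
    "_.event_type": {"start": 2.0, "stop": 0.1, "start-stop": -1.0},
}
print(predict_bindings("event", scores, constraints))
# {'_.negation': False, '_.event_type': 'start'}
print(predict_bindings("drug", scores, constraints))
# {'_.negation': False, '_.event_type': None}
```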

Usage

Let us define the pipeline and train it. We provide utilities to train the model through a Python API, but you can use a spaCy config file as well.

from pathlib import Path

import spacy

from edsnlp.connectors.brat import BratConnector
from edsnlp.utils.training import train, make_spacy_corpus_config
from edsnlp.pipelines.trainable.span_qualifier import SPAN_QUALIFIER_DEFAULTS

tmp_path = Path("/tmp/test-span-qualifier")

nlp = spacy.blank("eds")
# ↓ below is the span qualifier pipeline ↓
# you can configure it using the `add_pipe(..., config=...)` parameter
nlp.add_pipe(
    "span_qualifier",
    config={
        **SPAN_QUALIFIER_DEFAULTS,
        # Two qualifiers: binary `_.negation` and multi-class `_.event_type`
        "qualifiers": ("_.negation", "_.event_type"),
        # Only predict on entities, not on span groups
        "on_ents": True,
        "on_span_groups": False,
        "label_constraints": {
            # Only allow `_.event_type` qualifier on events
            "_.event_type": ("event",),
        },
        "model": {
            **SPAN_QUALIFIER_DEFAULTS["model"],
            "pooler_mode": "mean",
            "projection_mode": "dot",
        },
    },
)

# Train the model, with additional training configuration
nlp = train(
    nlp,
    output_path=tmp_path / "model",
    config=dict(
        **make_spacy_corpus_config(
            train_data="/path/to/the/training/set/brat/files",
            dev_data="/path/to/the/dev/set/brat/files",
            nlp=nlp,
            data_format="brat",
        ),
        training=dict(
            max_steps=100,
        ),
    ),
)

# Finally, we can run the pipeline on a new document
doc = nlp.make_doc("Arret du ttt si folfox inefficace")
doc.ents = [
    # event = "Arret"
    spacy.tokens.Span(doc, 0, 1, "event"),
    # criteria = "si"
    spacy.tokens.Span(doc, 3, 4, "criteria"),
    # drug = "folfox"
    spacy.tokens.Span(doc, 4, 5, "drug"),
]
doc = nlp(doc)

[ent._.negation for ent in doc.ents]
# Out: [True, False, False]

[ent._.event_type for ent in doc.ents]
# Out: ["start", None, None]

# And export new predictions as Brat annotations
predicted_docs = BratConnector("/path/to/the/new/files", run_pipe=True).brat2docs(nlp)
BratConnector("/path/to/predictions").docs2brat(predicted_docs)

Alternatively, the pipeline can be described in a spaCy config file (config.cfg):

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
raw = null

[system]
seed = 0
gpu_allocator = null

[nlp]
lang = "eds"
pipeline = ["span_qualifier"]

[components]

[components.span_qualifier]
factory = "span_qualifier"
label_constraints = null
on_ents = false
on_span_groups = true
qualifiers = ["label_"]
scorer = {"@scorers":"eds.span_qualifier_scorer.v1"}

[components.span_qualifier.model]
@architectures = "eds.span_multi_classifier.v1"
projection_mode = "dot"
pooler_mode = "max"
n_labels = null

[components.span_qualifier.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"

[components.span_qualifier.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 96
rows = [5000,2000,1000,1000]
attrs = ["ORTH","PREFIX","SUFFIX","SHAPE"]
include_static_vectors = false

[components.span_qualifier.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
window_size = 1
maxout_pieces = 3
depth = 4

[corpora]

[corpora.train]
@readers = "test-span-classification-corpus"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.dev]
@readers = "test-span-classification-corpus"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 10000
max_epochs = 0
max_steps = 10
eval_frequency = 5
frozen_components = []
annotating_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
accuracy = 1.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

To train it, run the following command:

spacy train config.cfg --output training/ --paths.train your_corpus/train.spacy --paths.dev your_corpus/dev.spacy

To use it, load the model and process a text:

import spacy

nlp = spacy.load("training/model-best")
doc = nlp.make_doc("Arret du ttt si folfox inefficace")
doc.ents = [
    # event = "Arret"
    spacy.tokens.Span(doc, 0, 1, "event"),
    # criteria = "si"
    spacy.tokens.Span(doc, 3, 4, "criteria"),
    # drug = "folfox"
    spacy.tokens.Span(doc, 4, 5, "drug"),
]
doc = nlp(doc)

[ent._.negation for ent in doc.ents]
# Out: [True, False, False]

[ent._.event_type for ent in doc.ents]
# Out: ["start", None, None]

The same config.cfg, with a transformer in place of the default Tok2Vec embedding:

[paths]
bert = "camembert-base"
train = null
dev = null
vectors = null
init_tok2vec = null
raw = null

[system]
seed = 0
gpu_allocator = "pytorch"

[nlp]
lang = "eds"
pipeline = ["span_qualifier"]

[components]

[components.span_qualifier]
factory = "span_qualifier"
label_constraints = null
on_ents = false
on_span_groups = true
qualifiers = ["label_"]
scorer = {"@scorers":"eds.span_qualifier_scorer.v1"}

[components.span_qualifier.model]
@architectures = "eds.span_multi_classifier.v1"
projection_mode = "dot"
pooler_mode = "max"
n_labels = null

# We use a transformer below instead of the default Tok2Vec
[components.span_qualifier.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecTransformer.v3"
name = ${paths.bert}
tokenizer_config = {"use_fast": false}
transformer_config = {}
grad_factor = 1.0
mixed_precision = true
grad_scaler_config = {"init_scale": 32768}

[corpora]

[corpora.train]
@readers = "test-span-classification-corpus"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.dev]
@readers = "test-span-classification-corpus"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 10000
max_epochs = 0
max_steps = 10
eval_frequency = 5
frozen_components = []
annotating_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
accuracy = 1.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

To train it, run the following command:

spacy train config.cfg --output training/ --paths.train your_corpus/train.spacy --paths.dev your_corpus/dev.spacy


Configuration

The span_qualifier pipeline component can be configured using the following parameters:

PARAMETER DESCRIPTION
model

The model to extract the spans

TYPE: Model

on_ents

Whether to look into doc.ents for spans to classify. If a list of strings is provided, only the spans with the given labels will be considered. If None and on_span_groups is False, the labels mentioned in label_constraints will be used, and all ents will be used if label_constraints is None.

TYPE: Optional[Union[bool, Sequence[str]]] DEFAULT: None

on_span_groups

Whether to look into doc.spans for spans to classify:

  • If True, all span groups will be considered
  • If False, no span group will be considered
  • If a list of str is provided, only these span groups will be kept
  • If a mapping is provided, the keys are the span group names and the values are either a list of allowed labels in the group or True to keep them all

TYPE: Union[bool, Sequence[str], Mapping[str, Union[bool, Sequence[str]]]] DEFAULT: False

qualifiers

The qualifiers to predict or train on. If None, keys from the label_constraints will be used

TYPE: Optional[Sequence[str]] DEFAULT: None

label_constraints

Constraints to select qualifiers for each span depending on their labels. Keys of the dict are the qualifiers and values are the labels for which the qualifier is allowed. If None, all qualifiers will be used for all spans

TYPE: Optional[Dict[str, List[str]]] DEFAULT: None

candidate_getter

Optional method to call to extract the candidate spans and the qualifiers to predict or train on. If None, a candidate getter will be created from the other parameters: on_ents, on_span_groups, qualifiers and label_constraints.

TYPE: Optional[Callable[[Doc], Tuple[Spans, Optional[Spans], SpanGroups, List[List[str]]]]] DEFAULT: None

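The default model eds.span_multi_classifier.v1 can be configured using the parameters that appear in the [components.span_qualifier.model] section of the configuration files above: tok2vec, projection_mode, pooler_mode and n_labels.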

Authors and citation

The eds.span_qualifier pipeline was developed by AP-HP's Data Science team.