# Span qualification

We propose the new `span_qualifier` component to qualify (i.e. assign attributes to) any span with machine learning. In this context, the span qualification task consists in assigning values (booleans, strings or any complex object) to attributes/extensions of spans, such as:

- `span.label_`
- `span._.negation`
- `span._.date.mode`
- etc.
## Architecture

The underlying `eds.span_multi_classifier.v1` model performs span classification by:

- Pooling the word embeddings (`mean`, `max` or `sum`) into a single embedding per span
- Computing logits for each possible binding (i.e. qualifier-value assignment)
- Splitting these bindings into independent groups, such as `event_type=start` and `event_type=stop`, or `negated=False` and `negated=True`
- Learning or predicting a combination amongst the legal combinations of these bindings (see the sketch below). For instance, in the second group, we cannot have both `negated=True` and `negated=False`, so the legal combinations are `[(1, 0), (0, 1)]`
- Assigning the predicted bindings to the spans
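To make the last steps concrete, here is a minimal decoding sketch. This is illustrative only, not the actual implementation: it assumes per-binding logits have already been computed, and that the only constraint is mutual exclusivity within each group.

```python
import torch

# Illustrative setup: 3 spans, 5 bindings split into two exclusive groups
# group 0 -> [negated=False, negated=True]
# group 1 -> [event_type=start, event_type=start-stop, event_type=stop]
logits = torch.randn(3, 5)  # one logit per (span, binding) pair
groups = [[0, 1], [2, 3, 4]]

for group in groups:
    # Bindings within a group are mutually exclusive, so a legal combination
    # is one-hot over the group: pick the highest-scoring binding per span
    best = logits[:, group].argmax(dim=-1)  # shape: (n_spans,)
    print([group[i] for i in best.tolist()])
```

With more complex constraints, the legal combinations can instead be enumerated explicitly, as in the `[(1, 0), (0, 1)]` example above.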
## Under the hood

### Initialization

During the initialization of the pipeline, the `span_qualifier` component gathers all spans that match the `on_ents` and `on_span_groups` patterns (or the `candidate_getter` function). It then lists all possible values for each qualifier of the `qualifiers` list and stores every possible `(qualifier, value)` pair (i.e. binding).

For instance, a custom qualifier `negation` with possible values `True` and `False` will result in the bindings `[("_.negation", True), ("_.negation", False)]`, while a custom qualifier `event_type` with possible values `start`, `stop`, and `start-stop` will result in the bindings `[("_.event_type", "start"), ("_.event_type", "stop"), ("_.event_type", "start-stop")]`.
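As a rough illustration of this enumeration (names and data are made up, not the actual edsnlp internals):

```python
from collections import defaultdict

# Hypothetical (qualifier, value) observations gathered from annotated spans
observations = [
    ("_.negation", True), ("_.negation", False),
    ("_.event_type", "start"), ("_.event_type", "stop"),
    ("_.event_type", "start-stop"),
]

values = defaultdict(set)
for qualifier, value in observations:
    values[qualifier].add(value)

# One binding per possible (qualifier, value) pair
bindings = [(q, v) for q, vs in values.items() for v in sorted(vs, key=repr)]
print(bindings)
# [('_.negation', False), ('_.negation', True),
#  ('_.event_type', 'start'), ('_.event_type', 'start-stop'),
#  ('_.event_type', 'stop')]
```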
### Training

During training, the `span_qualifier` component gathers the spans of the documents in a mini-batch and evaluates each binding on each span to build a supervision matrix. This matrix is fed to the underlying model (most likely an `eds.span_multi_classifier.v1`). The model computes logits for each entry of the matrix and a cross-entropy loss for each group of bindings sharing the same qualifier. The loss is not computed for entries that violate the `label_constraints` parameter (for instance, the `event_type` qualifier can only be assigned to spans with the `event` label).
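For intuition, here is a minimal sketch of the supervision matrix and the per-group cross-entropy, reusing the binding groups from the example above. It is illustrative only; in particular, the real model also masks entries excluded by `label_constraints`.

```python
import torch
import torch.nn.functional as F

groups = [[0, 1], [2, 3, 4]]  # binding indices, grouped by qualifier

# supervision[s][b] = 1 if binding b holds on span s (2 spans, 5 bindings)
supervision = torch.tensor([
    [1., 0., 0., 0., 1.],  # span 0: negated=False, event_type=stop
    [0., 1., 1., 0., 0.],  # span 1: negated=True,  event_type=start
])
logits = torch.randn(2, 5, requires_grad=True)  # produced by the model

# One cross-entropy term per group of mutually exclusive bindings
loss = sum(
    F.cross_entropy(logits[:, g], supervision[:, g].argmax(dim=-1))
    for g in groups
)
loss.backward()
```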
### Prediction

During prediction, the `span_qualifier` component gathers the spans of a given document and scores each binding on each span using the underlying model. Applying the same binding exclusion and label constraint mechanisms as during training, the best legal combination of bindings is selected. Finally, the selected bindings are assigned to the spans.
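The final assignment step boils down to writing each selected binding to its span. A rough sketch, with a hypothetical helper rather than the edsnlp implementation:

```python
from spacy.tokens import Span

# Custom extensions must be registered before they can be written
if not Span.has_extension("negation"):
    Span.set_extension("negation", default=None)

def assign_binding(span: Span, binding: tuple) -> None:
    """Write a selected (qualifier, value) binding to a span."""
    qualifier, value = binding  # e.g. ("_.negation", True)
    if qualifier.startswith("_."):
        setattr(span._, qualifier[2:], value)  # custom extension
    else:
        setattr(span, qualifier, value)  # native attribute, e.g. "label_"
```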
## Usage

Let us define the pipeline and train it. We provide utilities to train the model through an API, but you can use a spaCy config file as well.
```python
from pathlib import Path

import spacy

from edsnlp.connectors.brat import BratConnector
from edsnlp.utils.training import train, make_spacy_corpus_config
from edsnlp.pipelines.trainable.span_qualifier import SPAN_QUALIFIER_DEFAULTS

tmp_path = Path("/tmp/test-span-qualifier")

nlp = spacy.blank("eds")

# ↓ below is the span qualifier pipeline ↓
# you can configure it using the `add_pipe(..., config=...)` parameter
nlp.add_pipe(
    "span_qualifier",
    config={
        **SPAN_QUALIFIER_DEFAULTS,
        # Two qualifiers: binary `_.negation` and multi-class `_.event_type`
        "qualifiers": ("_.negation", "_.event_type"),
        # Only predict on entities, not on span groups
        "on_ents": True,
        "on_span_groups": False,
        "label_constraints": {
            # Only allow the `_.event_type` qualifier on `event` spans
            "_.event_type": ("event",),
        },
        "model": {
            **SPAN_QUALIFIER_DEFAULTS["model"],
            "pooler_mode": "mean",
            "projection_mode": "dot",
        },
    },
)

# Train the model, with additional training configuration
nlp = train(
    nlp,
    output_path=tmp_path / "model",
    config=dict(
        **make_spacy_corpus_config(
            train_data="/path/to/the/training/set/brat/files",
            dev_data="/path/to/the/dev/set/brat/files",
            nlp=nlp,
            data_format="brat",
        ),
        training=dict(
            max_steps=100,
        ),
    ),
)

# Finally, we can run the pipeline on a new document
doc = nlp.make_doc("Arret du ttt si folfox inefficace")
doc.ents = [
    # event = "Arret"
    spacy.tokens.Span(doc, 0, 1, "event"),
    # criteria = "si"
    spacy.tokens.Span(doc, 3, 4, "criteria"),
    # drug = "folfox"
    spacy.tokens.Span(doc, 4, 5, "drug"),
]
doc = nlp(doc)

[ent._.negation for ent in doc.ents]
# Out: [True, False, False]

[ent._.event_type for ent in doc.ents]
# Out: ["start", None, None]

# And export the new predictions as Brat annotations
predicted_docs = BratConnector("/path/to/the/new/files", run_pipe=True).brat2docs(nlp)
BratConnector("/path/to/predictions").docs2brat(predicted_docs)
```
The same pipeline can also be trained with a standalone spaCy config file, here using a standard tok2vec embedding:

```ini
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
raw = null

[system]
seed = 0
gpu_allocator = null

[nlp]
lang = "eds"
pipeline = ["span_qualifier"]

[components]

[components.span_qualifier]
factory = "span_qualifier"
label_constraints = null
on_ents = false
on_span_groups = true
qualifiers = ["label_"]
scorer = {"@scorers":"eds.span_qualifier_scorer.v1"}

[components.span_qualifier.model]
@architectures = "eds.span_multi_classifier.v1"
projection_mode = "dot"
pooler_mode = "max"
n_labels = null

[components.span_qualifier.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"

[components.span_qualifier.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 96
rows = [5000,2000,1000,1000]
attrs = ["ORTH","PREFIX","SUFFIX","SHAPE"]
include_static_vectors = false

[components.span_qualifier.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
window_size = 1
maxout_pieces = 3
depth = 4

[corpora]

[corpora.train]
@readers = "test-span-classification-corpus"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.dev]
@readers = "test-span-classification-corpus"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 10000
max_epochs = 0
max_steps = 10
eval_frequency = 5
frozen_components = []
annotating_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
accuracy = 1.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
```
To train it, run the following command:

```bash
spacy train config.cfg --output training/ --paths.train your_corpus/train.spacy --paths.dev your_corpus/dev.spacy
```
To use it, load the model and process a text:

```python
import spacy

nlp = spacy.load("training/model-best")

doc = nlp.make_doc("Arret du ttt si folfox inefficace")
doc.ents = [
    # event = "Arret"
    spacy.tokens.Span(doc, 0, 1, "event"),
    # criteria = "si"
    spacy.tokens.Span(doc, 3, 4, "criteria"),
    # drug = "folfox"
    spacy.tokens.Span(doc, 4, 5, "drug"),
]
doc = nlp(doc)

[ent._.negation for ent in doc.ents]
# Out: [True, False, False]

[ent._.event_type for ent in doc.ents]
# Out: ["start", None, None]
```
Alternatively, here is a GPU-ready variant of the same config, replacing the tok2vec embedding with a `camembert-base` transformer:

```ini
[paths]
bert = "camembert-base"
train = null
dev = null
vectors = null
init_tok2vec = null
raw = null

[system]
seed = 0
gpu_allocator = "pytorch"

[nlp]
lang = "eds"
pipeline = ["span_qualifier"]

[components]

[components.span_qualifier]
factory = "span_qualifier"
label_constraints = null
on_ents = false
on_span_groups = true
qualifiers = ["label_"]
scorer = {"@scorers":"eds.span_qualifier_scorer.v1"}

[components.span_qualifier.model]
@architectures = "eds.span_multi_classifier.v1"
projection_mode = "dot"
pooler_mode = "max"
n_labels = null

# We use a transformer embedding below, instead of the hash-embedding tok2vec
[components.span_qualifier.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecTransformer.v3"
name = ${paths.bert}
tokenizer_config = {"use_fast": false}
transformer_config = {}
grad_factor = 1.0
mixed_precision = true
grad_scaler_config = {"init_scale": 32768}

[corpora]

[corpora.train]
@readers = "test-span-classification-corpus"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.dev]
@readers = "test-span-classification-corpus"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 10000
max_epochs = 0
max_steps = 10
eval_frequency = 5
frozen_components = []
annotating_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
accuracy = 1.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
```
The training command and the inference snippet are the same as in the previous example.
## Configuration

The `span_qualifier` pipeline component can be configured using the following parameters:

| Parameter | Description |
|---|---|
| `model` | The model used to qualify the spans |
| `on_ents` | Whether to look into `doc.ents` for spans to qualify |
| `on_span_groups` | Whether to look into `doc.spans` (span groups) for spans to qualify |
| `qualifiers` | The qualifiers to predict or train on. If None, the keys of the `label_constraints` parameter are used |
| `label_constraints` | Constraints to select qualifiers for each span depending on their labels. Keys of the dict are the qualifiers and values are the labels for which the qualifier is allowed. If None, all qualifiers will be used for all spans |
| `candidate_getter` | Optional method to call to extract the candidate spans and the qualifiers to predict or train on. If None, a candidate getter will be created from the other parameters above |
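For illustration, a custom `candidate_getter` could look like the sketch below. The exact expected signature is an assumption (a callable taking a `Doc` and returning candidate spans along with their qualifiers); check the component's API before relying on it.

```python
def candidate_getter(doc):
    # Assumed contract: return the candidate spans and the qualifiers
    # to predict for them (hypothetical, not the documented signature)
    spans = [ent for ent in doc.ents if ent.label_ in ("event", "criteria")]
    qualifiers = ["_.negation", "_.event_type"]
    return spans, qualifiers
```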
The default model `eds.span_multi_classifier.v1` can be configured using the following parameters, as seen in the config files above:

| Parameter | Description |
|---|---|
| `tok2vec` | The word embedding sub-model |
| `pooler_mode` | The pooling operation applied to word embeddings to obtain one embedding per span: `"mean"`, `"max"` or `"sum"` |
| `projection_mode` | How binding logits are computed from the pooled span embedding (e.g. `"dot"`) |
| `n_labels` | The number of bindings to predict; left as `null` in the examples above |
## Authors and citation

The `eds.span_qualifier` pipeline was developed by AP-HP's Data Science team.