Deep-learning tutorial
In this tutorial, we'll see how we can write our own deep learning model training script with EDS-NLP. We will implement a script to train a named-entity recognition (NER) model.
If you do not care about the details and just want to train a model, we suggest you use the training API and move on to the next tutorial.
Hardware requirements
Training a modern deep learning model requires a lot of computational resources. We recommend using a machine with a GPU, ideally with at least 16GB of VRAM. If you don't have access to a GPU, you can use a cloud service like Google Colab, Kaggle, Paperspace or Vast.ai.
Under the hood, EDS-NLP uses PyTorch to train deep-learning models. EDS-NLP acts as a sidekick to PyTorch, providing a set of tools to perform preprocessing, composition and evaluation. The trainable TorchComponents
are actually PyTorch modules with a few extra methods to handle the feature preprocessing and postprocessing. Therefore, EDS-NLP is fully compatible with the PyTorch ecosystem.
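For example (a quick sketch, assuming the `nlp` pipeline built in step 1 below), a trainable pipe can be manipulated like any other `torch.nn.Module`:

```python
import torch

# `nlp.pipes.ner` is the trainable NER pipe defined in step 1 below.
ner = nlp.pipes.ner
print(isinstance(ner, torch.nn.Module))          # trainable pipes are torch modules
print(sum(p.numel() for p in ner.parameters()))  # number of trainable parameters
ner.to("cpu")                                    # usual torch device handling applies
```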
Step-by-step walkthrough
Training a supervised deep-learning model consists in feeding batches of annotated samples from a training corpus to a model and optimizing the model's parameters to decrease its prediction error. The process of training a pipeline with EDS-NLP is structured as follows:
1. Defining the model
We first start by seeding the random states and instantiating a new trainable pipeline composed of trainable pipes. The model described here computes text embeddings with a pre-trained transformer followed by a CNN, and performs the NER prediction task using a Conditional Random Field (CRF) token classifier.
```python
import edsnlp, edsnlp.pipes as eds
from confit.utils.random import set_seed

set_seed(42)

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.ner_crf(  # (1)!
        mode="joint",  # (2)!
        target_span_getter="gold-ner",  # (3)!
        window=20,
        embedding=eds.text_cnn(  # (4)!
            kernel_sizes=[3],
            embedding=eds.transformer(  # (5)!
                model="prajjwal1/bert-tiny",  # (6)!
                window=128,
                stride=96,
            ),
        ),
    ),
    name="ner",
)
```
- We use the `eds.ner_crf` NER task module, which classifies word embeddings into NER labels (BIOUL scheme) using a CRF.
- Each component of the pipeline can be configured with a dictionary, using the parameters described in the component's documentation page.
- The `target_span_getter` parameter defines the name of the span group used to train the NER model. In this case, the model will look for the entities to train on in `doc.spans["gold-ner"]`. This is important because we might store entities in other span groups with a different purpose (e.g. `doc.spans["sections"]` contains the section spans, but we don't want to train on these). We will need to make sure the entities from the training dataset are assigned to this span group (next section).
- The word embeddings used by the CRF are computed by a CNN, which builds on top of another embedding layer.
- The base embedding layer is a pretrained transformer, which computes contextualized word embeddings.
- We chose the `prajjwal1/bert-tiny` model in this tutorial for testing purposes, but we recommend using a larger model like `bert-base-cased` or `camembert-base` (French) for real-world applications; a sketch of this swap follows the list.
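For instance, here is what the pipe definition might look like with a larger French encoder, as suggested in the last note. This is a sketch of an alternative to the block above (not something to add on top of it), and it assumes `camembert-base` is available locally or downloadable from the Hugging Face hub:

```python
# Alternative definition (illustrative): same architecture as above, but with a
# larger pretrained French encoder instead of the tiny test model.
nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.ner_crf(
        mode="joint",
        target_span_getter="gold-ner",
        window=20,
        embedding=eds.text_cnn(
            kernel_sizes=[3],
            embedding=eds.transformer(
                model="camembert-base",  # larger French model
                window=128,
                stride=96,
            ),
        ),
    ),
    name="ner",
)
```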
2. Loading the raw dataset and converting it into Doc objects
To train a pipeline, we must convert our annotated data into `Doc` objects that will be used either as training samples or as evaluation samples. We will assume the dataset is in Standoff format, usually produced by the Brat annotation tool, but any format can be used.
At this step, we might also want to perform data augmentation, filtering, splitting or any other data transformation. In this tutorial, we will split on line jumps and filter out empty documents from the training data. We will use our Stream API to handle the data processing, but you can use any method you like, so long as you end up with a collection of Doc
objects.
```python
import edsnlp
import edsnlp.pipes as eds


def skip_empty_docs(batch):
    for doc in batch:
        if len(doc.ents) > 0:
            yield doc


training_data = (
    edsnlp.data.read_standoff(  # (1)!
        train_data_path,
        tokenizer=nlp.tokenizer,  # (2)!
        span_setter=["ents", "gold-ner"],  # (3)!
    )
    .map(eds.split(regex="\n\n"))  # (4)!
    .map_batches(skip_empty_docs)  # (5)!
)
```
- Read the data from the Brat directory and convert it into Docs.
- Tokenize the training docs with the same tokenizer as the trained model.
- Store the annotated Brat entities as spans in `doc.ents` and in `doc.spans["gold-ner"]`.
- Split the documents on line jumps.
- Filter out empty documents.
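Before moving on, it can be helpful to materialize a few documents from the (still lazy) stream and check that the gold entities ended up in the expected span group. This is just an illustrative sanity check, not part of the training script:

```python
from itertools import islice

# Peek at a few converted docs and the gold entities we will train on.
for doc in islice(training_data, 3):
    print(doc.text[:80].replace("\n", " "))
    print([(ent.text, ent.label_) for ent in doc.spans["gold-ner"]])
```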
As for the validation data, we will keep all the documents, even empty ones, to obtain representative metrics.
```python
val_data = edsnlp.data.read_standoff(
    val_data_path,
    tokenizer=nlp.tokenizer,
    span_setter=["ents", "gold-ner"],
)
val_docs = list(val_data)  # (1)!
```
- Cache the stream result into a list of `Doc` objects.
3. Completing the initialization of the model
We initialize the missing or incomplete component attributes (such as label vocabularies) with the training dataset. Indeed, when defining the model, we specified its architecture, but not the types of named entities it will predict. This can be done either:
- explicitly, by setting the `labels` parameter of `eds.ner_crf` in the definition above,
- automatically, with `post_init`: `eds.ner_crf` then looks in `doc.spans[target_span_getter]` of all docs in `training_data` to infer the labels.
nlp.post_init(training_data)
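After this call, the label vocabulary of the NER pipe is filled in. As a quick check (assuming the inferred labels are exposed under the same `labels` attribute as the constructor parameter):

```python
# Inspect the labels inferred from the training data (attribute name assumed
# to mirror the `labels` parameter of eds.ner_crf).
print(nlp.pipes.ner.labels)
```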
4. Making the stream of mini-batches
The training dataset of `Doc` objects is then preprocessed into features to be fed to the model during the training loop. We will continue to use EDS-NLP's streams to handle the data processing:
- We first request the training data stream to loop on the input data, since we want each example to be seen multiple times during training, until a given number of steps is reached.

  Looping in EDS-NLP Streams

  Note that in EDS-NLP, looping on a stream is always done on the input data, no matter when `loop()` is called. This means that shuffling or any further preprocessing step will be applied multiple times, each time we loop. This is usually a good thing if preprocessing contains randomness, since it increases the diversity of the training samples while avoiding loading multiple versions of the same document in memory. To loop after preprocessing, we can collect the stream into a list and loop on that list (`edsnlp.data.from_iterable(list(training_data), loop=True)`).

- We shuffle the data before batching to diversify the samples in each mini-batch.
- We extract the features and labels required by each component (and sub-component) of the pipeline.
- Finally, we group the samples into mini-batches, such that each mini-batch contains at most a given number of tokens (or satisfies any other batching criterion), and assemble (or "collate") the features into tensors.
```python
import torch

from edsnlp.utils.batching import stat_batchify

device = "cuda" if torch.cuda.is_available() else "cpu"  # (1)!

batches = (
    training_data.loop()
    .shuffle("dataset")  # (2)!
    .map(nlp.preprocess, kwargs={"supervision": True})  # (3)!
    .batchify(batch_size=32 * 128, batch_by=stat_batchify("tokens"))  # (4)!
    .map(nlp.collate, kwargs={"device": device})
)
```
- Check if a GPU is available and set the device accordingly.
- Apply shuffling to our stream. If our dataset is too large to fit in memory, instead of "dataset" we can set the shuffle batch size to "100 docs" for example, or to "fragment" for parquet datasets.
- This will call the `preprocess_supervised` method of the TorchComponent class and return a nested dictionary containing the required features and labels.
- Make batches that contain at most 32 * 128 tokens (e.g. 32 samples of 128 tokens each, keeping in mind that samples may have different lengths). We use the `stat_batchify` function to look for a key containing `tokens` in the features' `stats` sub-dictionary and to add samples to the batch until the sum of the `tokens` stats exceeds 32 * 128; a toy illustration of this criterion follows the list.
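To make the batching criterion concrete, here is a toy re-implementation of the idea behind `stat_batchify("tokens")`. This is not EDS-NLP code, and the exact layout of the per-sample `stats` dictionary is an assumption:

```python
# Toy sketch: accumulate samples until the running total of a "tokens"
# statistic would exceed the budget, then start a new mini-batch.
def batch_by_total_tokens(samples, max_tokens=32 * 128):
    batch, total = [], 0
    for sample in samples:
        n = sample["stats"]["tokens"]  # hypothetical per-sample token count
        if batch and total + n > max_tokens:
            yield batch
            batch, total = [], 0
        batch.append(sample)
        total += n
    if batch:
        yield batch
```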
And that's it! We now have a looping stream of mini-batches that we can feed to our model. For better efficiency, we can also run this preprocessing in parallel in a separate worker by setting `num_cpu_workers` to 1 or more. Note that streams in EDS-NLP are lazy, meaning that execution has not started yet and the data is not loaded in memory. This will only happen when we start iterating over the stream in the next section.
```python
batches = batches.set_processing(
    num_cpu_workers=1,
    process_start_method="spawn",  # (1)!
)
```
- Since we use a GPU, we must use the "spawn" method to create the workers. This is because the default multiprocessing "fork" method is not compatible with CUDA.
5. The training loop
We instantiate a PyTorch optimizer and start the training loop:
```python
from itertools import chain, repeat

import torch
from tqdm import tqdm

lr = 3e-4
max_steps = 400

# Move the model to the GPU
nlp.to(device)

optimizer = torch.optim.AdamW(
    params=nlp.parameters(),
    lr=lr,
)

iterator = iter(batches)

for step in tqdm(range(max_steps), "Training model", leave=True):
    batch = next(iterator)
    optimizer.zero_grad()
```
6. Optimizing the weights
Inside the training loop, the trainable components are fed the collated batches from the dataloader by calling the `TorchComponent.forward` method (via a simple call) to compute the losses. If we were training a multitask model (not the case in this tutorial) in which the outputs of a shared embedding are reused between components, we would enable caching by wrapping this step in a cache context. The training loop is otherwise carried out in the same way as a standard PyTorch training loop.
```python
with nlp.cache():
    loss = torch.zeros((), device=device)
    for name, component in nlp.torch_components():
        output = component(batch[name])
        if "loss" in output:
            loss += output["loss"]

loss.backward()
optimizer.step()
```
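The loop above is deliberately minimal. A common optional addition (not part of this tutorial's script) is gradient clipping, placed between `loss.backward()` and `optimizer.step()`:

```python
# Optional (illustrative): clip gradient norms to guard against exploding
# gradients; place this after loss.backward() and before optimizer.step().
torch.nn.utils.clip_grad_norm_(nlp.parameters(), max_norm=1.0)
```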
7. Evaluating the model
Finally, the model is evaluated on the validation dataset and saved at regular intervals. We will use the NerExactMetric
to evaluate the NER performance using Precision, Recall and F1 scores. This metric only counts an entity as correct if it matches the label and boundaries of a target entity.
```python
from edsnlp.metrics.ner import NerExactMetric
from copy import deepcopy

metric = NerExactMetric(span_getter=nlp.pipes.ner.target_span_getter)

...

if ((step + 1) % 100) == 0:
    with nlp.select_pipes(enable=["ner"]):  # (1)!
        preds = deepcopy(val_docs)
        for doc in preds:
            doc.ents = doc.spans["gold-ner"] = []  # (2)!
        preds = nlp.pipe(preds)  # (3)!
        print(metric(val_docs, preds))
    nlp.to_disk("model")  # (4)!
```
- In case we have multiple pipes in our model, we may want to evaluate each pipe selectively, so we use the `select_pipes` method to disable every pipe except "ner".
- Clean the documents that our model will annotate.
- We use the `pipe` method to run the "ner" component on the validation dataset. This method is similar to the `__call__` method of EDS-NLP components, but it runs a component on a list of Docs. It is also equivalent to `preds = edsnlp.data.from_iterable(preds).map_pipeline(nlp)`.
- We could also have saved the model with `torch.save(model, "model.pt")`, but `nlp.to_disk` avoids pickling and allows us to inspect the model's files by saving them into a structured directory.
Full example
Let's wrap the training code in a function, and make it callable from the command line using confit!
train.py
(The full train.py listing, roughly 130 lines assembling the snippets from the previous sections, is omitted here; a rough skeleton is sketched after the note below.)
- The confit command decorator on the `train` function (shown in the skeleton below) will become useful in the next section, when we use the configuration file to define the pipeline. If you don't want to use a configuration file, you can remove this decorator.
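Since the full listing is not reproduced here, the sketch below shows what the overall structure of train.py might look like. The confit usage reflects common EDS-NLP training scripts and the exact signature is an assumption, but the parameter names match the configuration file of the next section:

```python
# Rough skeleton of train.py (illustrative; details may differ from the full
# listing). The parameters mirror the `train` section of the config below.
import edsnlp
from confit import Cli

app = Cli(pretty_exceptions_show_locals=False)


@app.command(name="train")
def train(
    *,
    nlp: edsnlp.Pipeline,
    train_data_path: str,
    val_data_path: str,
    batch_size: int = 32 * 128,
    lr: float = 3e-4,
    max_steps: int = 400,
    num_preprocessing_workers: int = 1,
    evaluation_interval: int = 100,
    seed: int = 42,
):
    # ... data loading, batching, the training loop and the evaluation
    # code from the previous sections go here ...
    ...


if __name__ == "__main__":
    train(
        nlp=...,  # pipeline built as in step 1
        train_data_path="my_brat_data/train",
        val_data_path="my_brat_data/val",
    )
```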
We can now copy the above code in a notebook and run it, or call this script from the command line:
python train.py
At the end of the training, the pipeline is ready to use since every trained component of the pipeline is self-sufficient, i.e. it contains the preprocessing, inference and postprocessing code required to run it.
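For example (an illustrative sketch; the example sentence is made up, and where predictions are stored depends on the component's `span_setter`), the saved pipeline can be reloaded and applied to raw text:

```python
import edsnlp

nlp = edsnlp.load("model")  # reload the pipeline saved with nlp.to_disk("model")
doc = nlp("Le patient présente une anémie sévère.")  # made-up example sentence
for ent in doc.ents:  # predictions may also land in a span group
    print(ent.text, ent.label_)
```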
Configuration
To decouple the configuration and the code of our training script, let's define a configuration file where we will describe both our training parameters and the pipeline. You can either write the config of the pipeline by hand, or generate a pipeline config draft from an instantiated pipeline by running:
print(nlp.config.to_yaml_str())
```yaml
nlp:
  "@core": "pipeline"
  lang: "eds"
  components:
    ner:
      "@factory": "eds.ner_crf"
      mode: "joint"
      target_span_getter: "gold-ner"
      window: 20
      embedding:
        "@factory": "eds.text_cnn"
        kernel_sizes: [3]
        embedding:
          "@factory": "eds.transformer"
          model: "prajjwal1/bert-tiny"
          window: 128
          stride: 96

train:
  nlp: ${ nlp }
  train_data_path: my_brat_data/train
  val_data_path: my_brat_data/val
  batch_size: ${ 32 * 128 }
  lr: 1e-4
  max_steps: 400
  num_preprocessing_workers: 1
  evaluation_interval: 100
```
And replace the end of the script with:
```python
if __name__ == "__main__":
    app.run()
```
That's it! We can now call the training script with the configuration file as a parameter, and override some of its values:
python train.py --config config.cfg --nlp.components.ner.embedding.embedding.transformer.window=64 --seed 43
Going further
EDS-NLP also provides a generic training script that follows the same structure as the one we just wrote. You can learn more about it in the next Training API tutorial.
This tutorial gave you a glimpse of the training API of EDS-NLP. To build a custom trainable component, you can refer to the TorchComponent class or look up the implementation of some of the trainable components on GitHub.
We also recommend looking at an existing project as a reference, such as eds-pseudo or mlg-norm.