Making a training script
In this tutorial, we'll see how we can train a deep learning model with EDS-NLP. We will implement a script to train a named-entity recognition (NER) model.
Step-by-step walkthrough
Training a supervised deep-learning model consists in feeding batches of annotated samples taken from a training corpus to a model, and optimizing the parameters of the model to decrease its prediction error. The process of training a pipeline with EDS-NLP is structured as follows:
1. Defining the model
We first start by seeding the random states and instantiating a new trainable pipeline. The model described here computes text embeddings with a pre-trained transformer followed by a CNN, and performs the NER prediction task using a Conditional Random Field (CRF) token classifier. To compose deep-learning modules, we nest them in a dictionary: each new dictionary instantiates a new module, and the @factory key selects the class of the module.
import edsnlp
from confit.utils.random import set_seed

set_seed(42)

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    "eds.ner_crf",  # the trainable NER component
    name="ner",
    config={
        "mode": "joint",  # CRF decoding mode
        "target_span_getter": "ml-ner",  # span group containing the gold entities
        "window": 20,
        "embedding": {
            "@factory": "eds.text_cnn",  # a CNN applied on top of the transformer output
            "kernel_sizes": [3],
            "embedding": {
                "@factory": "eds.transformer",  # pre-trained transformer embedding
                "model": "prajjwal1/bert-tiny",  # any model from the HuggingFace hub
                "window": 128,
                "stride": 96,
            },
        },
    },
)
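As an optional sanity check, we can verify that the trainable component was added under the expected name; this quick sketch assumes the pipeline exposes the usual spaCy-like pipe_names property:
print(nlp.pipe_names)  # expected: ['ner']
ner = nlp.get_pipe("ner")  # the trainable component, also used for evaluation below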
2. Adapting a dataset
To train a pipeline, we must convert our annotated data into documents that will be used either as training samples or as evaluation samples. This is done by designing a function to convert the dataset into a list of spaCy Doc objects. We will assume the dataset has been annotated using Brat, but any format can be used.
At this step, we might also want to perform data augmentation, filtering, splitting or any other data transformation. Note that this function will be used to load both the training data and the test data.
from pydantic import DirectoryPath

from edsnlp import registry
from edsnlp.connectors.brat import BratConnector


@registry.adapters.register("ner_adapter")
def ner_adapter(
    path: DirectoryPath,
    skip_empty: bool = False,  # whether to skip documents without any annotation
):
    def generator(nlp):
        # Convert the Brat files into Doc objects, tokenized by the pipeline
        docs = BratConnector(path).brat2docs(nlp)
        for doc in docs:
            if skip_empty and len(doc.ents) == 0:
                continue
            # Store the gold entities in the span group targeted by the NER component
            doc.spans["ml-ner"] = doc.ents
            yield doc

    return generator
3. Loading the data
We then load and adapt (i.e., convert into spaCy Doc objects) the training and validation datasets. Since the adaptation of the raw documents depends on the tokenization used by the trained model, we need to pass the model to the adapter function.
train_adapter = ner_adapter(train_data_path)
val_adapter = ner_adapter(val_data_path)
train_docs = list(train_adapter(nlp))
val_docs = list(val_adapter(nlp))
4. Completing the initialization with the training data
We initialize the missing or incomplete component attributes (such as label vocabularies) with the training dataset.
nlp.post_init(train_docs)
5. Preprocessing the data
The training dataset is then preprocessed into features. The resulting preprocessed dataset is wrapped into a PyTorch DataLoader, which uses the model's own collate method to assemble the batches fed to the model during the training loop.
import torch

batch_size = 8

preprocessed = list(
    nlp.preprocess_many(  # extract the features and supervision targets of each document
        train_docs,
        supervision=True,
    )
)
dataloader = torch.utils.data.DataLoader(
    preprocessed,
    batch_size=batch_size,
    collate_fn=nlp.collate,
    shuffle=True,
)
6. Looping through the training data
We instantiate an optimizer and start the training loop:
from itertools import chain, repeat

from tqdm import tqdm

lr = 3e-4
max_steps = 400

optimizer = torch.optim.AdamW(
    params=nlp.parameters(),
    lr=lr,
)

# We will loop over the dataloader
iterator = chain.from_iterable(repeat(dataloader))

for step in tqdm(range(max_steps), "Training model", leave=True):
    batch = next(iterator)
    optimizer.zero_grad()
7. Optimizing the weights
Inside the training loop, the trainable components are fed the collated batches from the dataloader by calling their TorchComponent.module_forward methods to compute the losses. When training a multi-task model (which is not the case in this tutorial), the outputs of the shared embedding are reused between components; to enable this, we wrap this step in a cache context. The training loop is otherwise carried out like a standard PyTorch training loop.
    with nlp.cache():
        loss = torch.zeros((), device="cpu")
        for name, component in nlp.torch_components():
            # Each component computes its loss from its own part of the collated batch
            output = component.module_forward(batch[component.name])
            if "loss" in output:
                loss += output["loss"]

    loss.backward()
    optimizer.step()
8. Evaluating the model
Finally, the model is evaluated on the validation dataset and saved at regular intervals.
from copy import deepcopy

from edsnlp.scorers.ner import create_ner_exact_scorer

scorer = create_ner_exact_scorer(nlp.get_pipe("ner").target_span_getter)

...

    if (step % 100) == 0:
        with nlp.select_pipes(enable=["ner"]):  # only run the trained NER component
            # Compare the gold documents with predictions made on copies of them
            print(scorer(val_docs, nlp.pipe(deepcopy(val_docs))))
        nlp.save("model")  # save the pipeline (weights and configuration) to disk
Full example
Let's wrap the training code in a function, and make it callable from the command line using confit!
train.py
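The full script is not reproduced here, but its skeleton could look like the following sketch. The command name and parameters mirror the [train] section of the configuration shown below; the exact confit decorator arguments are an assumption and may differ in your version.
# A minimal sketch of train.py, assuming confit's Cli wrapper
from confit import Cli

import edsnlp
from edsnlp import registry

app = Cli(pretty_exceptions_show_locals=False)


@app.command(name="train", registry=registry)
def train(
    nlp: edsnlp.Pipeline,
    train_adapter,
    val_adapter,
    seed: int = 42,
    max_steps: int = 400,
    batch_size: int = 8,
    lr: float = 3e-4,
):
    # steps 1 to 8 of the walkthrough go here
    ...


if __name__ == "__main__":
    app.run()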
We can now copy the above code in a notebook and run it, or call this script from the command line:
python train.py --seed 42
At the end of the training, the pipeline is ready to use (with the .pipe method) since every trained component of the pipeline is self-sufficient, i.e. it contains the preprocessing, inference and postprocessing code required to run it.
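For instance, here is a minimal sketch of how the saved pipeline could be loaded back and applied to new texts; the path and the example text are placeholders, and the predictions should appear in the span group configured via target_span_getter:
import edsnlp

# Load the pipeline saved at the end of the training loop
nlp = edsnlp.load("model")

# Run it on new documents
docs = list(nlp.pipe(["Some clinical note ..."]))
for doc in docs:
    print(doc.spans["ml-ner"])  # predicted entities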
Configuration
To decouple the configuration and the code of our training script, let's define a configuration file where we will describe both our training parameters and the pipeline. You can either write the config of the pipeline by hand, or generate a pipeline config draft from an instantiated pipeline by running:
print(nlp.config.to_str())
# This is the equivalent of the API-based declaration
# at the beginning of the tutorial
[nlp]
lang = "eds"
pipeline = ["ner"]
components = ${ components }
[components]
[components.ner]
@factory = "eds.ner_crf"
mode = "joint"
target_span_getter = "ml-ner"
window = 20
embedding = ${ cnn }
[cnn]
@factory = "eds.text_cnn"
kernel_sizes = [3]
embedding = ${ transformer }
[transformer]
@factory = "eds.transformer"
model = "prajjwal1/bert-tiny"
window = 128
stride = ${ transformer.window//2 }
# This is where we define the training script parameters
# the "train" section refers to the name of the command
# in the training script
[train]
nlp = ${ nlp }
train_adapter = { "@adapters": "ner_adapter", "path": "data/train" }
val_adapter = { "@adapters": "ner_adapter", "path": "data/val" }
max_steps = 400
seed = 42
lr = 3e-4
batch_size = 8
And replace the end of the script with:
if __name__ == "__main__":
    app.run()
That's it! We can now call the training script with the configuration file as a parameter, and override some of its values:
python train.py --config config.cfg --transformer.window=64 --seed 43
Going further
This tutorial gave you a glimpse of the training API of EDS-NLP. We provide a more complete example of a training script in our test suite, at tests/training/test_training.py. To build a custom trainable component, you can refer to the TorchComponent class or look up the implementation of some of the trainable components on GitHub.