Deep-learning tutorial

In this tutorial, we'll see how we can write our own deep learning model training script with EDS-NLP. We will implement a script to train a named-entity recognition (NER) model.

If you do not care about the details and just want to train a model, we suggest you use the training API and move on to the next tutorial.

Hardware requirements

Training a modern deep learning model requires a lot of computational resources. We recommend using a machine with a GPU, ideally with at least 16GB of VRAM. If you don't have access to a GPU, you can use a cloud service like Google Colab, Kaggle, Paperspace or Vast.ai.

Under the hood, EDS-NLP uses PyTorch to train deep-learning models. EDS-NLP acts as a sidekick to PyTorch, providing a set of tools to perform preprocessing, composition and evaluation. The trainable TorchComponents are actually PyTorch modules with a few extra methods to handle the feature preprocessing and postprocessing. Therefore, EDS-NLP is fully compatible with the PyTorch ecosystem.

Step-by-step walkthrough

Training a supervised deep-learning model consists in feeding batches of annotated samples taken from a training corpus to a model and optimizing its parameters to decrease its prediction error. The process of training a pipeline with EDS-NLP is structured as follows:

1. Defining the model

We first start by seeding the random states and instantiating a new trainable pipeline composed of trainable pipes. The model described here computes text embeddings with a pre-trained transformer followed by a CNN, and performs the NER prediction task using a Conditional Random Field (CRF) token classifier.

import edsnlp, edsnlp.pipes as eds
from confit.utils.random import set_seed

set_seed(42)

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.ner_crf(  # (1)!
        mode="joint",  # (2)!
        target_span_getter="gold-ner",  # (3)!
        window=20,
        embedding=eds.text_cnn(  # (4)!
            kernel_sizes=[3],
            embedding=eds.transformer(  # (5)!
                model="prajjwal1/bert-tiny",  # (6)!
                window=128,
                stride=96,
            ),
        ),
    ),
    name="ner",
)
  1. We use the eds.ner_crf NER task module, which classifies word embeddings into NER labels (BIOUL scheme) using a CRF.
  2. Each component of the pipeline can be configured with a dictionary, using the parameters described on the component's page.
  3. The target_span_getter parameter defines the name of the span group used to train the NER model. In this case, the model will look for the entities to train on in doc.spans["gold-ner"]. This is important because we might store entities in other span groups for a different purpose (e.g. doc.spans["sections"] contains the section Spans, but we don't want to train on these). We will need to make sure the entities from the training dataset are assigned to this span group (next section).
  4. The word embeddings used by the CRF are computed by a CNN, which builds on top of another embedding layer.
  5. The base embedding layer is a pretrained transformer, which computes contextualized word embeddings.
  6. We chose the prajjwal1/bert-tiny model in this tutorial for testing purposes, but we recommend using a larger model like bert-base-cased or camembert-base (French) for real-world applications.
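
As noted above, trainable pipes are plain PyTorch modules under the hood. As an optional sanity check (a minimal sketch, not required for training), we can verify this and count the parameters of the "ner" pipe we just added:

import torch

ner = nlp.pipes.ner
print(isinstance(ner, torch.nn.Module))  # True: trainable pipes are torch modules
print(sum(p.numel() for p in ner.parameters()))  # so the usual PyTorch API applies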

2. Loading the raw dataset and converting it into Doc objects

To train a pipeline, we must convert our annotated data into Doc objects that will be either used as training samples or evaluation samples. We will assume the dataset is in Standoff format, usually produced by the Brat annotation tool, but any format can be used.

At this step, we might also want to perform data augmentation, filtering, splitting or any other data transformation. In this tutorial, we will split on line jumps and filter out empty documents from the training data. We will use our Stream API to handle the data processing, but you can use any method you like, so long as you end up with a collection of Doc objects.

import edsnlp


def skip_empty_docs(batch):
    for doc in batch:
        if len(doc.ents) > 0:
            yield doc


training_data = (
    edsnlp.data.read_standoff(  # (1)!
        train_data_path,
        tokenizer=nlp.tokenizer,  # (2)!
        span_setter=["ents", "gold-ner"],  # (3)!
    )
    .map(eds.split(regex="\n\n"))  # (4)!
    .map_batches(skip_empty_docs)  # (5)!
)
  1. Read the data from the Brat directory and convert it into Docs.
  2. Tokenize the training docs with the same tokenizer as the trained model.
  3. Store the annotated Brat entities as spans in doc.ents and doc.spans["gold-ner"].
  4. Split the documents on line jumps.
  5. Filter out empty documents.
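
To make the expected result concrete, we can (optionally) peek at the first converted document. Iterating the stream here triggers reading and preprocessing, and the exact entities obviously depend on your own annotations:

first_doc = next(iter(training_data))
print(first_doc.spans["gold-ner"])  # the spans the NER component will be trained on
print(first_doc.ents)               # the same entities, also stored in doc.ents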

As for the validation data, we will keep all the documents, even empty ones, to obtain representative metrics.

val_data = edsnlp.data.read_standoff(
    val_data_path,
    tokenizer=nlp.tokenizer,
    span_setter=["ents", "gold-ner"],
)
val_docs = list(val_data)  # (1)!
  1. Cache the stream result into a list of Doc

3. Complete the initialization of the model

We initialize the missing or incomplete component attributes (such as label vocabularies) with the training dataset. Indeed, when defining the model, we specified its architecture, but we did not specify the types of named entities that the model will predict. This can be done either

  • explicitly by setting the labels parameter in eds.ner_crf in the definition above,
  • automatically with post_init: then eds.ner_crf looks in doc.spans[target_span_getter] of all docs in training_data to infer the labels.
nlp.post_init(training_data)
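
For reference, if we had set the labels explicitly in step 1 instead, the pipe definition would have looked like the following sketch (the label names here are hypothetical and must match those used in your annotations):

nlp.add_pipe(
    eds.ner_crf(
        mode="joint",
        target_span_getter="gold-ner",
        window=20,
        labels=["disease", "drug"],  # hypothetical label names
        embedding=eds.text_cnn(
            kernel_sizes=[3],
            embedding=eds.transformer(
                model="prajjwal1/bert-tiny",
                window=128,
                stride=96,
            ),
        ),
    ),
    name="ner",
)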

4. Making the stream of mini-batches

The training dataset of Doc objects is then preprocessed into features to be fed to the model during the training loop. We will continue to use EDS-NLP's streams to handle the data processing:

  • We first request the training data stream to loop on the input data, since we want each example to be seen multiple times during training, until a given number of steps is reached

    Looping in EDS-NLP Streams

    Note that in EDS-NLP, looping on a stream is always done on the input data, no matter when loop() is called. This means that shuffling or any further preprocessing step will be applied multiple times, each time we loop. This is usually a good thing if preprocessing contains randomness, as it increases the diversity of the training samples while avoiding loading multiple versions of the same document in memory. To loop after preprocessing, we can collect the stream into a list and loop on the list (edsnlp.data.from_iterable(list(training_data), loop=True)).

  • We shuffle the data before batching to diversify the samples in each mini-batch

  • We extract the features and labels required by each component (and sub-components) of the pipeline
  • Finally, we group the samples into mini-batches, such that each mini-batch contains at most a given number of tokens (or satisfies any other batching criterion), and assemble (or "collate") the features into tensors
import torch

from edsnlp.utils.batching import stat_batchify

device = "cuda" if torch.cuda.is_available() else "cpu"  # (1)!
batches = (
    training_data.loop()
    .shuffle("dataset")  # (2)!
    .map(nlp.preprocess, kwargs={"supervision": True})  # (3)!
    .batchify(batch_size=32 * 128, batch_by=stat_batchify("tokens"))  # (4)!
    .map(nlp.collate, kwargs={"device": device})
)
  1. Check if a GPU is available and set the device accordingly.
  2. Apply shuffling to our stream. If our dataset is too large to fit in memory, instead of "dataset" we can set the shuffle batch size to "100 docs" for example, or "fragment" for parquet datasets.
  3. This will call the preprocess_supervised method of the TorchComponent class and return a nested dictionary containing the required features and labels.
  4. Make batches that contain at most 32 * 128 tokens (e.g. 32 samples of 128 tokens each, accounting for the fact that samples may have different lengths). We use the stat_batchify function to look for a key containing "tokens" in the features' stats sub-dictionary and add samples to the batch until the sum of the "tokens" stats exceeds 32 * 128.
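
To build some intuition about this batching criterion, here is a simplified, purely illustrative sketch of token-budget batching. It is not EDS-NLP's actual stat_batchify implementation, and it assumes a stream of (sample, token count) pairs:

def batch_by_token_budget(samples, max_tokens=32 * 128):
    # Group consecutive samples until adding one more would exceed the token budget.
    batch, total = [], 0
    for sample, n_tokens in samples:
        if batch and total + n_tokens > max_tokens:
            yield batch
            batch, total = [], 0
        batch.append(sample)
        total += n_tokens
    if batch:
        yield batch  # last, possibly smaller batch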

And that's it! We now have a looping stream of mini-batches that we can feed to our model. For better efficiency, we can also perform this preprocessing in parallel in a separate worker by setting num_cpu_workers to 1 or more. Note that streams in EDS-NLP are lazy, meaning that the execution has not started yet and no data has been loaded into memory. This will only happen when we start iterating over the stream in the next section.

batches = batches.set_processing(
    num_cpu_workers=1,
    process_start_method="spawn",  # (1)!
)
  1. Since we use a GPU, we must use the "spawn" method to create the workers. This is because the default multiprocessing "fork" method is not compatible with CUDA.

5. The training loop

We instantiate a PyTorch optimizer and start the training loop:

from tqdm import tqdm
import torch

lr = 3e-4
max_steps = 400

# Move the model to the GPU
nlp.to(device)

optimizer = torch.optim.AdamW(
    params=nlp.parameters(),
    lr=lr,
)

iterator = iter(batches)

for step in tqdm(range(max_steps), "Training model", leave=True):
    batch = next(iterator)
    optimizer.zero_grad()

6. Optimizing the weights

Inside the training loop, the trainable components are fed the collated batches by calling the TorchComponent.forward method (via a simple call) to compute the losses. When training a multitask model (not the case in this tutorial) in which the outputs of a shared embedding are reused between components, we enable caching by wrapping this step in a cache context. The training loop is otherwise carried out in a similar fashion to a standard PyTorch training loop.

    with nlp.cache():
        loss = torch.zeros((), device=device)
        for name, component in nlp.torch_components():
            output = component(batch[name])
            if "loss" in output:
                loss += output["loss"]

    loss.backward()

    optimizer.step()

7. Evaluating the model

Finally, the model is evaluated on the validation dataset and saved at regular intervals. We will use the NerExactMetric to evaluate the NER performance using Precision, Recall and F1 scores. This metric only counts an entity as correct if it matches the label and boundaries of a target entity.

from edsnlp.metrics.ner import NerExactMetric
from copy import deepcopy

metric = NerExactMetric(span_getter=nlp.pipes.ner.target_span_getter)

    ...
    if ((step + 1) % 100) == 0:
        with nlp.select_pipes(enable=["ner"]):  # (1)!
            preds = deepcopy(val_docs)
            for doc in preds:
                doc.ents = doc.spans["gold-ner"] = []  # (2)!
            preds = nlp.pipe(preds)  # (3)!
            print(metric(val_docs, preds))

    nlp.to_disk("model")  #(4)!
  1. If we have multiple pipes in our model, we may want to evaluate each pipe selectively, so we use the select_pipes method to disable every pipe except "ner".
  2. Clean the documents that our model will annotate
  3. We use the pipe method to run the "ner" component on the validation dataset. This method is similar to the __call__ method of EDS-NLP components, but it is used to run a component on a list of Docs. This is also equivalent to
    preds = (
        edsnlp.data
        .from_iterable(preds)
        .map_pipeline(nlp)
    )
    
  4. We could also have saved the model with torch.save(model, "model.pt"), but nlp.to_disk avoids pickling and allows us to inspect the model's files by saving them into a structured directory.

Full example

Let's wrap the training code in a function and make it callable from the command line using confit!

train.py
from copy import deepcopy
from typing import Iterator

import torch
from confit import Cli
from tqdm import tqdm

import edsnlp
import edsnlp.pipes as eds
from edsnlp.metrics.ner import NerExactMetric
from edsnlp.utils.batching import stat_batchify

app = Cli(pretty_exceptions_show_locals=False)


@app.command(name="train", registry=edsnlp.registry)  # (1)!
def train_model(
    nlp: edsnlp.Pipeline,
    train_data_path: str,
    val_data_path: str,
    batch_size: int = 32 * 128,
    lr: float = 1e-4,
    max_steps: int = 400,
    num_preprocessing_workers: int = 1,
    evaluation_interval: int = 100,
):
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Define function to skip empty docs
    def skip_empty_docs(batch: Iterator) -> Iterator:
        for doc in batch:
            if len(doc.ents) > 0:
                yield doc

    # Load and process training data
    training_data = (
        edsnlp.data.read_standoff(
            train_data_path,
            span_setter=["ents", "gold-ner"],
            tokenizer=nlp.tokenizer,
        )
        .map(eds.split(regex="\n\n"))
        .map_batches(skip_empty_docs)
    )

    # Load validation data
    val_data = edsnlp.data.read_standoff(
        val_data_path,
        span_setter=["ents", "gold-ner"],
        tokenizer=nlp.tokenizer,
    )
    val_docs = list(val_data)

    # Initialize components
    nlp.post_init(training_data)

    # Prepare the stream of batches
    batches = (
        training_data.loop()
        .shuffle("dataset")
        .map(nlp.preprocess, kwargs={"supervision": True})
        .batchify(batch_size=batch_size, batch_by=stat_batchify("tokens"))
        .map(nlp.collate, kwargs={"device": device})
        .set_processing(num_cpu_workers=num_preprocessing_workers, process_start_method="spawn")
    )

    # Move the model to the GPU if available
    nlp.to(device)

    # Initialize optimizer
    optimizer = torch.optim.AdamW(params=nlp.parameters(), lr=lr)

    metric = NerExactMetric(span_getter=nlp.pipes.ner.target_span_getter)

    # Training loop
    iterator = iter(batches)
    for step in tqdm(range(max_steps), "Training model", leave=True):
        batch = next(iterator)
        optimizer.zero_grad()

        with nlp.cache():
            loss = torch.zeros((), device=device)
            for name, component in nlp.torch_components():
                output = component(batch[name])
                if "loss" in output:
                    loss += output["loss"]

        loss.backward()
        optimizer.step()

        # Evaluation and model saving
        if ((step + 1) % evaluation_interval) == 0:
            with nlp.select_pipes(enable=["ner"]):
                # Clean the documents that our model will annotate
                preds = deepcopy(val_docs)
                for doc in preds:
                    doc.ents = doc.spans["gold-ner"] = []
                preds = nlp.pipe(preds)
                print(metric(val_docs, preds))

            nlp.to_disk("model")


if __name__ == "__main__":
    nlp = edsnlp.blank("eds")
    nlp.add_pipe(
        eds.ner_crf(
            mode="joint",
            target_span_getter="gold-ner",
            window=20,
            embedding=eds.text_cnn(
                kernel_sizes=[3],
                embedding=eds.transformer(
                    model="prajjwal1/bert-tiny",
                    window=128,
                    stride=96,
                ),
            ),
        ),
        name="ner",
    )
    train_model(
        nlp,
        train_data_path="my_brat_data/train",
        val_data_path="my_brat_data/val",
        batch_size=32 * 128,
        lr=1e-4,
        max_steps=1000,
        num_preprocessing_workers=1,
        evaluation_interval=100,
    )
  1. This will become useful in the next section, when we use the configuration file to define the pipeline. If you don't want to use a configuration file, you can remove this decorator.

We can now copy the above code into a notebook and run it, or call this script from the command line:

python train.py

At the end of the training, the pipeline is ready to use since every trained component of the pipeline is self-sufficient, i.e. it contains the preprocessing, inference and postprocessing code required to run it.
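
For instance, we can reload the saved pipeline and apply it to new text (a minimal sketch; the input sentence and the predicted entities obviously depend on your data):

import edsnlp

nlp = edsnlp.load("model")  # reload the pipeline saved by nlp.to_disk
doc = nlp("Le patient présente une fièvre aiguë.")  # illustrative input text
print([(ent.text, ent.label_) for ent in doc.ents])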

Configuration

To decouple the configuration and the code of our training script, let's define a configuration file where we will describe both our training parameters and the pipeline. You can either write the config of the pipeline by hand, or generate a pipeline config draft from an instantiated pipeline by running:

print(nlp.config.to_yaml_str())
config.yml
nlp:
  "@core": "pipeline"
  lang: "eds"
  components:
    ner:
      "@factory": "eds.ner_crf"
      mode: "joint"
      target_span_getter: "gold-ner"
      window: 20

      embedding:
        "@factory": "eds.text_cnn"
        kernel_sizes: [3]

        embedding:
          "@factory": "eds.transformer"
          model: "prajjwal1/bert-tiny"
          window: 128
          stride: 96

train:
  nlp: ${ nlp }
  train_data_path: my_brat_data/train
  val_data_path: my_brat_data/val
  batch_size: ${ 32 * 128 }
  lr: 1e-4
  max_steps: 400
  num_preprocessing_workers: 1
  evaluation_interval: 100

Then, replace the end of the script with:

if __name__ == "__main__":
    app.run()

That's it! We can now call the training script with the configuration file as a parameter, and override some of its values:

python train.py --config config.yml --nlp.components.ner.embedding.embedding.window=64 --seed 43

Going further

EDS-NLP also provides a generic training script that follows the same structure as the one we just wrote. You can learn more about it in the next tutorial on the Training API.

This tutorial gave you a glimpse of the training API of EDS-NLP. To build a custom trainable component, you can refer to the TorchComponent class or look up the implementation of some of the trainable components on GitHub.

We also recommend looking at an existing project as a reference, such as eds-pseudo or mlg-norm.