Training API

In this tutorial, we'll see how we can quickly train a deep learning model with EDS-NLP using the edsnlp.train function.

Hardware requirements

Training a modern deep learning model requires a lot of computational resources. We recommend using a machine with a GPU, ideally with at least 16GB of VRAM. If you don't have access to a GPU, you can use a cloud service like Google Colab, Kaggle, Paperspace or Vast.ai.

If you need a high level of control over the training procedure, we suggest you read the previous "Deep learning tutorial" to understand how to build a training loop from scratch with EDS-NLP.

Creating a project

If you have already installed edsnlp[ml] and do not want to set up a project, you can skip to the next section.

Create a new project:

mkdir my_ner_project
cd my_ner_project

touch README.md pyproject.toml
mkdir -p configs data/dataset

Add a standard pyproject.toml file with the following content. This file will be used to manage the dependencies of the project and its versioning.

pyproject.toml

[project]
name = "my_ner_project"
version = "0.1.0"
description = ""
authors = [
    { name="Firstname Lastname", email="firstname.lastname@domain.com" }
]
readme = "README.md"
requires-python = ">3.7.1,<4.0"

dependencies = [
    "edsnlp[ml]>=0.16.0",
    "sentencepiece>=0.1.96"
]

[project.optional-dependencies]
dev = [
    "dvc>=2.37.0; python_version >= '3.8'",
    "pandas>=1.1.0,<2.0.0; python_version < '3.8'",
    "pandas>=1.4.0,<2.0.0; python_version >= '3.8'",
    "pre-commit>=2.18.1",
    "accelerate>=0.21.0; python_version >= '3.8'",
    "rich-logger>=0.3.0"
]

We recommend using a virtual environment ("venv") to isolate the dependencies of your project and using uv to install the dependencies:

pip install uv
# skip the next two lines if you do not want a venv
uv venv .venv
source .venv/bin/activate
uv pip install -e ".[dev]" -p $(uv python find)

Training the model

EDS-NLP supports training models either from the command line or from a Python script or notebook, and switching between the two is straightforward thanks to the use of Confit.

A word about Confit

EDS-NLP makes heavy use of Confit, a configuration library that lets you call functions from Python or the CLI, and that validates and optionally casts their arguments.

The EDS-NLP function used in this script is the train function of the edsnlp.train module. When passing a dict to a type-hinted argument (either from a config.yml file, or by calling the function in Python), Confit will instantiate the correct class with the arguments provided in the dict. For instance, we pass a dict to the val_data parameter, which is type hinted as a SampleGenerator: this dict is used as keyword arguments to instantiate the SampleGenerator object. You can also instantiate a SampleGenerator object directly and pass it to the function.

You can also tell Confit specifically which class you want to instantiate by using the @register_name = "name_of_the_registered_class" key and value in a dict or config section. We make heavy use of this mechanism to build pipeline architectures.
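The casting behaviour described above can be illustrated without Confit itself. The following is a minimal pure-Python sketch, not Confit's implementation: the REGISTRY dict, the register decorator, the cast helper and the SampleReader class are all hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical registry: maps a registered name to a class or factory.
REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(name: str):
    """Store a class under `name` so that a config can refer to it."""

    def decorator(cls):
        REGISTRY[name] = cls
        return cls

    return decorator


@register("standoff")
@dataclass
class SampleReader:
    """Hypothetical stand-in for a type-hinted argument class."""

    path: str
    shuffle: bool = False


def cast(value: Any, expected_type: type, key: str = "@readers") -> Any:
    """Turn a dict into an instance of the expected (or registered) class."""
    if isinstance(value, expected_type):
        return value  # already an instance: pass it through unchanged
    kwargs = dict(value)
    # An explicit "@..." key selects the registered class by name,
    # otherwise we fall back to the type hint.
    cls = REGISTRY[kwargs.pop(key)] if key in kwargs else expected_type
    return cls(**kwargs)


from_hint = cast({"path": "./data/dataset/test"}, SampleReader)
from_name = cast({"@readers": "standoff", "path": "./data/dataset/test"}, SampleReader)
direct = cast(SampleReader(path="./data/dataset/test"), SampleReader)
print(from_hint == from_name == direct)
```

The three calls mirror the three ways of providing an argument mentioned above: a plain dict cast through the type hint, a dict naming a registered class, and an already instantiated object.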

From the command line

Create a config.yml file in the configs folder with the following content:

configs/config.yml

# Some variables are grouped here for convenience but we could also
# put their values directly in the config in place of their reference
vars:
  train: './data/dataset/train'
  dev: './data/dataset/test'

# 🤖 PIPELINE DEFINITION
nlp:
  # Why '@core': pipeline? The reference used in `optimizer.module = ${ nlp }`
  # must point to the actual Pipeline and not to its keyword arguments: when
  # Confit sees '@core': pipeline, it instantiates the Pipeline class with the
  # arguments provided in this dict.
  # You could also use '@core': eds.pipeline in every config that defines a
  # pipeline, but it is sometimes more convenient to let Confit infer the type
  # of the nlp argument from the type hints of the function. Omitting it is
  # also more aligned with spaCy's pipeline config API. In general, however,
  # explicit is better than implicit, so feel free to always write it.
  '@core': pipeline

  lang: eds  # Word-level tokenization: use the "eds" tokenizer

  # Our pipeline will contain a single NER pipe
  # The NER pipe will be a CRF model
  components:
    ner:
      '@factory': eds.ner_crf
      mode: 'joint'
      target_span_getter: 'gold_spans'
      # Set spans both in ents and in separate `ent.label` groups
      span_setter: [ "ents", "*" ]
      infer_span_setter: true

      # The CRF model will use a CNN to re-contextualize embeddings
      embedding:
        '@factory': eds.text_cnn
        kernel_sizes: [ 3 ]

        # The base embeddings will be computed by a transformer
        embedding:
          '@factory': eds.transformer
          model: 'camembert-base'
          window: 128
          stride: 96

# 📈 SCORERS
scorer:
  ner:
    '@metrics': eds.ner_exact
    span_getter: ${ nlp.components.ner.target_span_getter }

# 🎛️ OPTIMIZER
optimizer:
  "@core": optimizer
  optim: adamw
  groups:
    # Assign parameters starting with transformer (ie the parameters of the transformer component)
    # to a first group
    "^transformer":
      lr:
        '@schedules': linear
        "warmup_rate": 0.1
        "start_value": 0
        "max_value": 5e-5
    # And all other parameters to the second group
    "":
      lr:
        '@schedules': linear
        "warmup_rate": 0.1
        "start_value": 3e-4
        "max_value": 3e-4
  module: ${ nlp }
  total_steps: ${ train.max_steps }

# 📚 DATA
train_data:
  - data:
      # In what kind of files (ie. their extensions) is our
      # training data stored
      '@readers': standoff
      path: ${ vars.train }
      converter:
        # What schema is used in the data files
        - '@factory': eds.standoff_dict2doc
          span_setter: 'gold_spans'
        # How to preprocess each doc for training
        - '@factory': eds.split
          nlp: null
          max_length: 2000
          regex: '\n\n+'
    shuffle: dataset
    batch_size: 4096 tokens  # 32 * 128 tokens
    pipe_names: [ "ner" ]

val_data:
  '@readers': standoff
  path: ${ vars.dev }
  # What schema is used in the data files
  converter:
    - '@factory': eds.standoff_dict2doc
      span_setter: 'gold_spans'

# 🚀 TRAIN SCRIPT OPTIONS
# -> python -m edsnlp.train --config configs/config.yml
train:
  nlp: ${ nlp }
  output_dir: 'artifacts'
  train_data: ${ train_data }
  val_data: ${ val_data }
  max_steps: 2000
  validation_interval: ${ train.max_steps//10 }
  max_grad_norm: 1.0
  scorer: ${ scorer }
  optimizer: ${ optimizer }
  # Do preprocessing in parallel on 1 worker
  num_workers: 1
  # Enable on Mac OS X or if you don't want to use available GPUs
  # cpu: true

# 📦 PACKAGE SCRIPT OPTIONS
# -> python -m edsnlp.package --config configs/config.yml
package:
  pipeline: ${ train.output_dir }
  name: 'my_ner_model'

To train the model, you can use the following command:

python -m edsnlp.train --config configs/config.yml --seed 42

Any option can be set either via the CLI or in config.yml under the train section.
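To get a feel for the learning rate values that the warmup_rate, start_value and max_value options of the config produce, here is a small pure-Python sketch. It assumes that the linear schedule warms up from start_value to max_value during the first warmup_rate fraction of the steps, then decays linearly to zero. The decay-to-zero part is an assumption that this page does not state, and linear_schedule is a hypothetical helper, not the EDS-NLP implementation.

```python
def linear_schedule(step, total_steps, max_value, start_value=0.0, warmup_rate=0.0):
    """Hypothetical helper: value of a linear warmup/decay schedule at `step`."""
    warmup_steps = total_steps * warmup_rate
    if step < warmup_steps:
        # Warmup: interpolate from start_value up to max_value
        progress = step / warmup_steps
        return start_value + (max_value - start_value) * progress
    # Decay: interpolate from max_value down to 0 (assumed end value)
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_value * (1.0 - progress)


total_steps = 2000  # same value as train.max_steps in the config
for step in (0, 100, 200, 1100, 2000):
    # First group: transformer parameters, warmed up from 0 to 5e-5
    transformer_lr = linear_schedule(step, total_steps, 5e-5, 0.0, 0.1)
    # Second group: start_value == max_value, so there is no actual warmup
    other_lr = linear_schedule(step, total_steps, 3e-4, 3e-4, 0.1)
    print(f"step {step:>4}: transformer lr={transformer_lr:.2e}, other lr={other_lr:.2e}")
```

Under these assumptions, setting start_value equal to max_value for the second group keeps its learning rate constant during the warmup phase, while the transformer group starts from zero.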

From a script or a notebook

Create a notebook with the following content:

import edsnlp
from edsnlp.training import train, ScheduledOptimizer, TrainingData
from edsnlp.metrics.ner import NerExactMetric
import edsnlp.pipes as eds
import torch

# 🤖 PIPELINE DEFINITION
nlp = edsnlp.blank("eds")
nlp.add_pipe(
    # The NER pipe will be a CRF model
    eds.ner_crf(
        mode="joint",
        target_span_getter="gold_spans",
        # Set spans both in ents and in separate `ent.label` groups
        span_setter=["ents", "*"],
        infer_span_setter=True,
        # The CRF model will use a CNN to re-contextualize embeddings
        embedding=eds.text_cnn(
            kernel_sizes=[3],
            # The base embeddings will be computed by a transformer
            embedding=eds.transformer(
                model="camembert-base",
                window=128,
                stride=96,
            ),
        ),
    )
)

# 📈 SCORERS
ner_metric = NerExactMetric(span_getter="gold_spans")

# 📚 DATA
train_data = (
    edsnlp.data
    .read_standoff("./data/dataset/train", span_setter="gold_spans")
    .map(eds.split(nlp=None, max_length=2000, regex="\n\n+"))
)
val_data = (
    edsnlp.data
    .read_standoff("./data/dataset/test", span_setter="gold_spans")
)

# 🎛️ OPTIMIZER
max_steps = 2000
optimizer = ScheduledOptimizer(
    optim=torch.optim.AdamW,  # same optimizer as `optim: adamw` in the config
    module=nlp,
    total_steps=max_steps,
    groups={
        # Parameters whose name starts with "transformer"
        "^transformer": {
            "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 0, "max_value": 5e-5},
        },
        # All other parameters
        "": {
            "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 3e-4, "max_value": 3e-4},
        },
    },
)

# 🚀 TRAIN
train(
    nlp=nlp,
    max_steps=max_steps,
    validation_interval=max_steps // 10,
    train_data=TrainingData(
        data=train_data,
        batch_size="4096 tokens",  # 32 * 128 tokens
        pipe_names=["ner"],
        shuffle="dataset",
    ),
    val_data=val_data,
    scorer={"ner": ner_metric},
    optimizer=optimizer,
    max_grad_norm=1.0,
    output_dir="artifacts",
    # Do preprocessing in parallel on 1 worker
    num_workers=1,
    # Enable on Mac OS X or if you don't want to use available GPUs
    # cpu=True,
)

or use the config file:

from edsnlp.train import train
import edsnlp
import confit

cfg = confit.Config.from_disk(
    "configs/config.yml", resolve=True, registry=edsnlp.registry
)
nlp = train(**cfg["train"])
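The config file relies on ${ ... } references such as ${ vars.dev } or ${ train.output_dir }. The sketch below shows one way such dotted references could be resolved in plain Python. It is only an illustration under simplifying assumptions: resolve_refs and lookup are hypothetical helpers, only plain dotted paths are handled (not expressions like ${ train.max_steps//10 }), and values are copied rather than instantiated, whereas Confit also instantiates the registered objects.

```python
import re

# Matches a string made of a single reference such as "${ vars.dev }"
REF = re.compile(r"^\$\{\s*([\w.]+)\s*\}$")


def lookup(config, dotted_path):
    """Follow a dotted path like "train.output_dir" inside nested dicts."""
    node = config
    for key in dotted_path.split("."):
        node = node[key]
    return node


def resolve_refs(node, root):
    """Hypothetical helper: replace every reference by the value it points to."""
    if isinstance(node, dict):
        return {k: resolve_refs(v, root) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_refs(v, root) for v in node]
    if isinstance(node, str):
        match = REF.match(node)
        if match:
            # Resolve the target too, since it may itself contain references
            return resolve_refs(lookup(root, match.group(1)), root)
    return node


config = {
    "vars": {"dev": "./data/dataset/test"},
    "val_data": {"@readers": "standoff", "path": "${ vars.dev }"},
    "train": {"output_dir": "artifacts", "val_data": "${ val_data }"},
    "package": {"pipeline": "${ train.output_dir }", "name": "my_ner_model"},
}
resolved = resolve_refs(config, config)
print(resolved["train"]["val_data"]["path"])
print(resolved["package"]["pipeline"])
```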

Here are the parameters you can pass to the train function:

Parameters

PARAMETER DESCRIPTION

nlp

The pipeline that will be trained in place.

TYPE: Pipeline

train_data

The training data. Can be a single TrainingData object, a dict that will be cast to one, or a list of these.

TrainingData object/dictionary

PARAMETER	DESCRIPTION
`data`	The stream of documents to train on. The documents will be preprocessed and collated according to the pipeline's components. TYPE: `Stream`
`batch_size`	The batch size. Can be a batching expression like "2000 words", an int (number of documents), or a tuple (batch_size, batch_by). The batch_by argument should be a statistic produced by the pipes that will be trained. For instance, the `eds.span_pooler` component produces a "spans" statistic, that can be used to produce batches of no more than 16 spans by setting batch_size to "16 spans". TYPE: `BatchSizeArg`
`shuffle`	The shuffle strategy. Can be "dataset" to shuffle the entire dataset (this can be memory-intensive for large file based datasets), "fragment" to shuffle the fragment-based datasets like parquet files, or a batching expression like "2000 words" to shuffle the dataset in chunks of 2000 words. TYPE: `Union[str, Literal[False]]`
`sub_batch_size`	How to split each batch into sub-batches that will be fed to the model independently to accumulate gradients over. TYPE: `Optional[BatchSizeArg]` DEFAULT: `None`
`pipe_names`	The names of the pipes that should be trained on this data. If None, defaults to all trainable pipes. TYPE: `Optional[Collection[str]]` DEFAULT: `None`
`post_init`	Whether to call the pipeline's post_init method with the data before training. TYPE: `bool` DEFAULT: `True`

TYPE: AsList[TrainingData]
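To make the batching expressions more concrete, here is a pure-Python sketch of how a value such as "4096 tokens" could be parsed and used to group samples greedily under a budget. It is an illustration only: parse_batch_size and batchify are hypothetical helpers, the "docs" unit name for integer batch sizes is made up, and the actual EDS-NLP batching functions may treat edge cases (such as a single sample larger than the budget) differently.

```python
def parse_batch_size(expr):
    """Hypothetical parser: "4096 tokens" -> (4096, "tokens"), 8 -> (8, "docs")."""
    if isinstance(expr, int):
        return expr, "docs"  # an int counts documents
    if isinstance(expr, tuple):
        return expr  # already a (batch_size, batch_by) pair
    size, unit = expr.split()
    return int(size), unit


def batchify(samples, batch_size):
    """Greedily group samples so that each batch stays within the budget."""
    size, unit = parse_batch_size(batch_size)
    batch, total = [], 0
    for sample in samples:
        # Each sample costs 1 document, or its own statistic (e.g. token count)
        cost = 1 if unit == "docs" else sample[unit]
        if batch and total + cost > size:
            yield batch
            batch, total = [], 0
        batch.append(sample)
        total += cost
    if batch:
        yield batch


# Toy samples with a "tokens" statistic (illustrative values)
samples = [{"id": i, "tokens": n} for i, n in enumerate([1500, 2000, 900, 3000, 400])]
for batch in batchify(samples, "4096 tokens"):
    print([s["id"] for s in batch], sum(s["tokens"] for s in batch))
```

Batching by tokens rather than by documents keeps the amount of work per step roughly constant when document lengths vary a lot.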

val_data

The validation data. Can be a single Stream object or a list of Stream.

TYPE: AsList[Stream] DEFAULT: []

seed

The random seed

TYPE: int DEFAULT: 42

max_steps

The maximum number of training steps

TYPE: int DEFAULT: 1000

optimizer

The optimizer. If None, a default optimizer will be used.

ScheduledOptimizer object/dictionary

PARAMETER	DESCRIPTION
`optim`	The optimizer to use. If a string (like "adamw") or a type to instantiate, the `module` and `groups` must be provided. TYPE: `Union[str, Type[Optimizer], Optimizer]`
`module`	The module to optimize. Usually the `nlp` pipeline object. TYPE: `Optional[Union[PipelineProtocol, Module]]` DEFAULT: `None`
`total_steps`	The total number of steps, used for schedules. TYPE: `Optional[int]` DEFAULT: `None`
`groups`	The groups to optimize. The key is a regex selector to match parameters in `module.named_parameters()` and the value is a dictionary with the keys `params` and `schedules`. The matching is performed by running `regex.search(selector, name)` so you do not have to match the full name. Note that the order of dict keys matters. If a parameter name matches multiple selectors, the configurations of these selectors are combined in reverse order (from the last matched selector to the first), allowing later selectors to complete options from earlier ones. If a selector maps to `False`, any parameters matching it are excluded from optimization and not included in any parameter group. TYPE: `Optional[Dict[str, Group]]` DEFAULT: `None`

TYPE: Union[ScheduledOptimizer, Optimizer] DEFAULT: None
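The selector matching and merging rules of the groups argument can be sketched in a few lines of plain Python. This is not the ScheduledOptimizer code: build_param_groups is a hypothetical helper, the parameter names are made up, plain floats stand in for schedules, and two points are assumptions of the sketch (a parameter is dropped as soon as any matching selector maps to False, and parameters matching no selector are skipped).

```python
import re


def build_param_groups(param_names, groups):
    """Hypothetical helper: map each parameter name to its merged options."""
    assigned = {}
    for name in param_names:
        # re.search means a selector only needs to match part of the name
        matched = [sel for sel in groups if re.search(sel, name)]
        if not matched or any(groups[sel] is False for sel in matched):
            continue  # unmatched or explicitly excluded (assumptions)
        options = {}
        # Apply the last match first, so that earlier selectors take
        # precedence and later selectors only complete missing options
        for sel in reversed(matched):
            options.update(groups[sel])
        assigned[name] = options
    return assigned


groups = {
    r"embeddings\.position": False,  # excluded from optimization
    "^transformer": {"lr": 5e-5},
    "": {"lr": 3e-4, "weight_decay": 0.01},  # matches every name
}
names = [
    "transformer.embeddings.position",
    "transformer.encoder.weight",
    "ner.crf.transitions",
]
for name, options in build_param_groups(names, groups).items():
    print(name, options)
```

In this example the transformer weight takes its lr from the first matching selector and inherits weight_decay from the catch-all "" selector, which is the "later selectors complete earlier ones" behaviour described above.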

validation_interval

The number of steps between each evaluation. Defaults to 1/10 of max_steps

TYPE: Optional[int] DEFAULT: None

checkpoint_interval

The number of steps between each model save. Defaults to validation_interval

TYPE: Optional[int] DEFAULT: None
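As a quick check of the two defaults above, the following sketch computes the intervals that would be used when they are left to None. resolve_intervals is a hypothetical helper, and the use of integer (floor) division is an assumption borrowed from the max_steps // 10 expression used earlier in this tutorial.

```python
def resolve_intervals(max_steps=1000, validation_interval=None, checkpoint_interval=None):
    """Hypothetical helper mirroring the documented defaults."""
    if validation_interval is None:
        # Defaults to 1/10 of max_steps (floor division assumed)
        validation_interval = max_steps // 10
    if checkpoint_interval is None:
        # Defaults to validation_interval
        checkpoint_interval = validation_interval
    return validation_interval, checkpoint_interval


print(resolve_intervals(max_steps=2000))
print(resolve_intervals(max_steps=2000, validation_interval=500))
print(resolve_intervals(max_steps=2000, checkpoint_interval=1000))
```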

max_grad_norm

The maximum gradient norm

TYPE: float DEFAULT: 5.0
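This page only describes max_grad_norm as "the maximum gradient norm". Assuming it refers to the usual global-norm gradient clipping (as done by torch.nn.utils.clip_grad_norm_ in PyTorch), the effect can be sketched in pure Python on a flat list of gradient values. clip_grad_norm is a hypothetical helper written for this illustration.

```python
import math


def clip_grad_norm(grads, max_norm):
    """Rescale gradients so that their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        # All values are scaled by the same factor: the direction is preserved
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

# Gradients already below the threshold are left untouched
untouched, _ = clip_grad_norm([0.3, 0.4], max_norm=1.0)
print(untouched)
```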

loss_scales

The loss scales for each component (useful for multi-task learning)

TYPE: Dict[str, float] DEFAULT: {}

scorer

How to score the model. Expects a GenericScorer object or a dict containing a mapping of metric names to metric objects.

TYPE: GenericScorer DEFAULT: GenericScorer()

num_workers

The number of workers to use for preprocessing the data in parallel. Setting it to 0 means no parallelization: data is processed on the main thread, which may add latency and slow down the training. To avoid this, a good practice consists in doing the preprocessing either before training or in parallel in a separate process. Because of how EDS-NLP handles stream multiprocessing, changing this value will affect the order of the documents in the produced batches. A stream [1, 2, 3, 4, 5, 6] split in batches of size 3 will produce:

[1, 2, 3] and [4, 5, 6] with 1 worker
[1, 3, 5] and [2, 4, 6] with 2 workers

TYPE: int DEFAULT: 0
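The change of order in the example above can be reproduced with a small simulation. It assumes that documents are dispatched to workers in a round-robin fashion and that each worker builds batches from its own share. This is a simplified model rather than the actual multiprocessing code (it does not model the order in which batches from different workers arrive), and batches_with_workers is a hypothetical helper.

```python
def batches_with_workers(stream, batch_size, num_workers):
    """Simulate round-robin dispatch to workers, each batching its own share."""
    n = max(num_workers, 1)  # 0 workers: everything runs in the main process
    # Worker i receives documents i, i + n, i + 2n, ...
    shards = [stream[i::n] for i in range(n)]
    batches = []
    for shard in shards:
        for start in range(0, len(shard), batch_size):
            batches.append(shard[start:start + batch_size])
    return batches


stream = [1, 2, 3, 4, 5, 6]
print(batches_with_workers(stream, 3, 1))  # [[1, 2, 3], [4, 5, 6]]
print(batches_with_workers(stream, 3, 2))  # [[1, 3, 5], [2, 4, 6]]
```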

cpu

Whether to force training on CPU. On macOS, this might be necessary to get around some mps backend issues.

TYPE: bool DEFAULT: False

mixed_precision

The mixed precision mode. Can be "no", "fp16", "bf16" or "fp8".

TYPE: Literal['no', 'fp16', 'bf16', 'fp8'] DEFAULT: 'no'

output_dir

The output directory, which will contain a model-last directory with the last model, and a train_metrics.json file with the training metrics and stats.

TYPE: Union[Path, str] DEFAULT: Path('artifacts')

output_model_dir

The directory where to save the model. If None, defaults to output_dir / "model-last".

TYPE: Optional[Union[Path, str]] DEFAULT: None
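The relationship between output_dir and output_model_dir can be summarized with pathlib. This is a sketch of the documented defaults only: resolve_output_paths is a hypothetical helper, and the exact location of train_metrics.json directly under output_dir is inferred from the description of output_dir above.

```python
from pathlib import Path


def resolve_output_paths(output_dir="artifacts", output_model_dir=None):
    """Hypothetical helper: where the model and the metrics would be written."""
    output_dir = Path(output_dir)
    if output_model_dir is None:
        # Documented default: output_dir / "model-last"
        model_dir = output_dir / "model-last"
    else:
        model_dir = Path(output_model_dir)
    return {"model": model_dir, "metrics": output_dir / "train_metrics.json"}


default_paths = resolve_output_paths()
print(default_paths["model"])
print(default_paths["metrics"])

custom_paths = resolve_output_paths("runs/exp1", output_model_dir="models/ner")
print(custom_paths["model"])
```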

save_model

Whether to save the model or not. This can be useful if you are only interested in the metrics, but not the model, and want to avoid spending time dumping the model weights to the disk.

TYPE: bool DEFAULT: True

logger

Whether to log the validation metrics in a rich table.

TYPE: bool DEFAULT: True

on_validation_callback

A callback function invoked during validation steps to handle custom logic.

TYPE: Optional[Callable[[Dict], None]] DEFAULT: None
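A callback matching the Callable[[Dict], None] signature could look like the sketch below, which keeps a history of the reported dictionaries and prints each one as a JSON line. The content of the dictionary passed by train is not described on this page, so the keys used in the stand-in call are purely illustrative, and on_validation and history are hypothetical names.

```python
import json
from typing import Dict, List

# Collected validation reports, one dict per validation step
history: List[Dict] = []


def on_validation(metrics: Dict) -> None:
    """Keep a copy of each validation report and print it as one JSON line."""
    history.append(dict(metrics))
    # default=str avoids failures on values that are not JSON-serializable
    print(json.dumps(metrics, default=str))


# Stand-in for the call made during a validation step (illustrative keys)
on_validation({"step": 200, "example_metric": 0.5})
print(len(history))
```

Such a function would then be passed as train(..., on_validation_callback=on_validation).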

kwargs

Additional keyword arguments.

DEFAULT: {}

Use the model

You can now load the model and use it to process some text:

import edsnlp

nlp = edsnlp.load("artifacts/model-last")
doc = nlp("Some sample text")
for ent in doc.ents:
    print(ent, ent.label_)

Packaging the model

To package the model and share it with friends or family (if the model does not contain sensitive data), you can use the following command:

python -m edsnlp.package --pipeline artifacts/model-last/ --name my_ner_model --distributions sdist

These options can be set either via the CLI or in config.yml under the package section.

The model saved at the train script output path (artifacts/model-last) will be named my_ner_model and will be saved in the dist folder. You can upload it to a package registry or install it directly with:

pip install dist/my_ner_model-0.1.0.tar.gz