Training API
In this tutorial, we'll see how we can quickly train a deep learning model with EDS-NLP using the edsnlp.train function.
Hardware requirements
Training a modern deep learning model requires a lot of computational resources. We recommend using a machine with a GPU, ideally with at least 16GB of VRAM. If you don't have access to a GPU, you can use a cloud service like Google Colab, Kaggle, Paperspace or Vast.ai.
If you need a high level of control over the training procedure, we suggest you read the previous "Deep learning tutorial" to understand how to build a training loop from scratch with EDS-NLP.
Creating a project
If you have already installed edsnlp[ml] and do not want to set up a project, you can skip to the next section.
Create a new project:
mkdir my_ner_project
cd my_ner_project
touch README.md pyproject.toml
mkdir -p configs data/dataset
Add a standard pyproject.toml file with the following content. This file will be used to manage the dependencies of the project and its versioning.
[project]
name = "my_ner_project"
version = "0.1.0"
description = ""
authors = [
{ name="Firstname Lastname", email="firstname.lastname@domain.com" }
]
readme = "README.md"
requires-python = ">3.7.1,<4.0"
dependencies = [
"edsnlp[ml]>=0.15.0",
"sentencepiece>=0.1.96"
]
[project.optional-dependencies]
dev = [
"dvc>=2.37.0; python_version >= '3.8'",
"pandas>=1.1.0,<2.0.0; python_version < '3.8'",
"pandas>=1.4.0,<2.0.0; python_version >= '3.8'",
"pre-commit>=2.18.1",
"accelerate>=0.21.0; python_version >= '3.8'",
"rich-logger>=0.3.0"
]
We recommend using a virtual environment ("venv") to isolate the dependencies of your project and using uv to install the dependencies:
pip install uv
# skip the next two lines if you do not want a venv
uv venv .venv
source .venv/bin/activate
uv pip install -e ".[dev]" -p $(uv python find)
Training the model
EDS-NLP supports training models either from the command line or from a Python script or notebook, and switching between the two is straightforward thanks to the use of Confit.
A word about Confit
EDS-NLP makes heavy use of Confit, a configuration library that allows you to call functions from Python or the CLI, and to validate and optionally cast their arguments.
The EDS-NLP function used in this script is the train function of the edsnlp.train module. When passing a dict to a type-hinted argument (either from a config.yml file, or by calling the function in Python), Confit will instantiate the correct class with the arguments provided in the dict. For instance, we pass a dict to the val_data parameter, which is actually type hinted as a SampleGenerator: this dict will be used as keyword arguments to instantiate this SampleGenerator object. You can also instantiate a SampleGenerator object directly and pass it to the function.
You can also tell Confit specifically which class you want to instantiate by using the @register_name = "name_of_the_registered_class" key and value in a dict or config section. We make heavy use of this mechanism to build pipeline architectures.
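The casting behaviour described above can be pictured with a small standard-library sketch. This is only a toy illustration of the idea, not Confit's implementation: ToyReader, cast_arguments and toy_train are hypothetical names introduced here, and real Confit handles many more cases (validation, registries, nested types).

```python
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class ToyReader:
    # Hypothetical stand-in for a data source class; not an EDS-NLP class.
    path: str
    limit: int = 10


def cast_arguments(func):
    """Wrap func so that dict arguments are turned into the hinted class."""
    hints = get_type_hints(func)

    def wrapper(**kwargs):
        cast = {}
        for name, value in kwargs.items():
            hinted = hints.get(name)
            if isinstance(value, dict) and isinstance(hinted, type):
                # The dict is used as keyword arguments of the hinted class.
                value = hinted(**value)
            cast[name] = value
        return func(**cast)

    return wrapper


@cast_arguments
def toy_train(val_data: ToyReader, max_steps: int = 100):
    return val_data, max_steps


# Passing a dict or a ready-made instance gives the function the same object.
from_dict, _ = toy_train(val_data={"path": "./data/dataset/test"})
from_instance, _ = toy_train(val_data=ToyReader(path="./data/dataset/test"))
print(from_dict == from_instance)
```

Both calls hand an equivalent ToyReader to the function, which is the property that makes switching between a config file and a Python script straightforward.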
Create a config.yml file in the configs folder with the following content:
# Some variables are grouped here for convenience but we could also
# put their values directly in the config in place of their reference
vars:
  train: './data/dataset/train'
  dev: './data/dataset/test'
# 🤖 PIPELINE DEFINITION
nlp:
  '@core': pipeline  # (1)
  lang: eds  # Word-level tokenization: use the "eds" tokenizer
  # Our pipeline will contain a single NER pipe
  # The NER pipe will be a CRF model
  components:
    ner:
      '@factory': eds.ner_crf
      mode: 'joint'
      target_span_getter: 'gold_spans'
      # Set spans both in ents and in separate `ent.label` groups
      span_setter: [ "ents", "*" ]
      infer_span_setter: true
      # The CRF model will use a CNN to re-contextualize embeddings
      embedding:
        '@factory': eds.text_cnn
        kernel_sizes: [ 3 ]
        # The base embeddings will be computed by a transformer
        embedding:
          '@factory': eds.transformer
          model: 'camembert-base'
          window: 128
          stride: 96
# 📈 SCORERS
scorer:
  ner:
    '@metrics': eds.ner_exact
    span_getter: ${ nlp.components.ner.target_span_getter }
# 🎛️ OPTIMIZER
optimizer:
  "@core": optimizer
  optim: adamw
  groups:
    # Assign parameters starting with transformer (i.e. the parameters of the transformer component)
    # to a first group
    "^transformer":
      lr:
        '@schedules': linear
        "warmup_rate": 0.1
        "start_value": 0
        "max_value": 5e-5
    # And all other parameters to the second group
    "":
      lr:
        '@schedules': linear
        "warmup_rate": 0.1
        "start_value": 3e-4
        "max_value": 3e-4
  module: ${ nlp }
  total_steps: ${ train.max_steps }
# 📚 DATA
train_data:
  - data:
      # In what kind of files (i.e. their extensions) is our
      # training data stored
      '@readers': standoff
      path: ${ vars.train }
      converter:
        # What schema is used in the data files
        - '@factory': eds.standoff_dict2doc
          span_setter: 'gold_spans'
        # How to preprocess each doc for training
        - '@factory': eds.split
          nlp: null
          max_length: 2000
          regex: '\n\n+'
    shuffle: dataset
    batch_size: 4096 tokens  # 32 * 128 tokens
    pipe_names: [ "ner" ]
val_data:
  '@readers': standoff
  path: ${ vars.dev }
  # What schema is used in the data files
  converter:
    - '@factory': eds.standoff_dict2doc
      span_setter: 'gold_spans'
# 🚀 TRAIN SCRIPT OPTIONS
# -> python -m edsnlp.train --config configs/config.yml
train:
  nlp: ${ nlp }
  output_dir: 'artifacts'
  train_data: ${ train_data }
  val_data: ${ val_data }
  max_steps: 2000
  validation_interval: ${ train.max_steps//10 }
  max_grad_norm: 1.0
  scorer: ${ scorer }
  optimizer: ${ optimizer }
  # Do preprocessing in parallel on 1 worker
  num_workers: 1
  # Enable on Mac OS X or if you don't want to use available GPUs
  # cpu: true
# 📦 PACKAGE SCRIPT OPTIONS
# -> python -m edsnlp.package --config configs/config.yml
package:
  pipeline: ${ train.output_dir }
  name: 'my_ner_model'
(1) Why do we use '@core': pipeline here? Because we need the reference used in optimizer.module = ${ nlp } to be the actual Pipeline and not its keyword arguments: when Confit sees '@core': pipeline, it will instantiate the Pipeline class with the arguments provided in the dict.
In fact, you could also use '@core': eds.pipeline in every config when you define a pipeline, but sometimes it's more convenient to let Confit infer the type of the nlp argument from the function's type hints. Not specifying '@core': pipeline is also more aligned with spaCy's pipeline config API. However, in general, explicit is better than implicit, so feel free to explicitly write '@core': eds.pipeline when you define a pipeline.
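To make the difference between "the actual Pipeline" and "its keyword arguments" concrete, here is a toy resolver written with the standard library only. It is a sketch under stated assumptions, not Confit's resolution algorithm: ToyPipeline, TOY_REGISTRY, lookup and resolve are hypothetical names, only plain "${ a.b }" references are supported, and the object is re-created for every reference, which a real resolver would presumably avoid.

```python
import re

# Matches strings that consist of a single "${ dotted.path }" reference.
REFERENCE = re.compile(r"^\$\{\s*([\w.]+)\s*\}$")


class ToyPipeline:
    # Hypothetical stand-in for a pipeline class.
    def __init__(self, lang):
        self.lang = lang


# Hypothetical registry mapping '@core' names to classes.
TOY_REGISTRY = {"pipeline": ToyPipeline}


def lookup(root, dotted_path):
    """Follow a dotted path such as "train.max_steps" inside nested dicts."""
    node = root
    for key in dotted_path.split("."):
        node = node[key]
    return node


def resolve(node, root):
    """Resolve references, and instantiate sections that carry an '@core' key."""
    if isinstance(node, str):
        match = REFERENCE.match(node)
        return resolve(lookup(root, match.group(1)), root) if match else node
    if isinstance(node, dict):
        values = {k: resolve(v, root) for k, v in node.items() if k != "@core"}
        if "@core" in node:
            return TOY_REGISTRY[node["@core"]](**values)
        return values
    return node


explicit = {
    "nlp": {"@core": "pipeline", "lang": "eds"},
    "optimizer": {"module": "${ nlp }"},
}
implicit = {
    "nlp": {"lang": "eds"},
    "optimizer": {"module": "${ nlp }"},
}

# With '@core', the reference points to an object; without it, to a plain dict.
print(type(resolve(explicit, explicit)["optimizer"]["module"]).__name__)
print(type(resolve(implicit, implicit)["optimizer"]["module"]).__name__)
```

In the first config the optimizer receives an instantiated object, in the second it only receives the keyword arguments, which is the situation the note above warns about.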
To train the model, you can use the following command:
python -m edsnlp.train --config configs/config.yml --seed 42
Any option can also be set either via the CLI or in config.yml under the train section.
To train from Python instead, create a notebook with the following content:
import edsnlp
from edsnlp.training import train, ScheduledOptimizer, TrainingData
from edsnlp.metrics.ner import NerExactMetric
import edsnlp.pipes as eds
import torch
# 🤖 PIPELINE DEFINITION
nlp = edsnlp.blank("eds")
nlp.add_pipe(
    # The NER pipe will be a CRF model
    eds.ner_crf(
        mode="joint",
        target_span_getter="gold_spans",
        # Set spans both in ents and in separate `ent.label` groups
        span_setter=["ents", "*"],
        infer_span_setter=True,
        # The CRF model will use a CNN to re-contextualize embeddings
        embedding=eds.text_cnn(
            kernel_sizes=[3],
            # The base embeddings will be computed by a transformer
            embedding=eds.transformer(
                model="camembert-base",
                window=128,
                stride=96,
            ),
        ),
    )
)
# 📈 SCORERS
ner_metric = NerExactMetric(span_getter="gold_spans")
# 📚 DATA
train_data = (
    edsnlp.data
    .read_standoff("./data/dataset/train", span_setter="gold_spans")
    .map(eds.split(nlp=None, max_length=2000, regex="\n\n+"))
)
val_data = (
    edsnlp.data
    .read_standoff("./data/dataset/test", span_setter="gold_spans")
)
# 🎛️ OPTIMIZER
max_steps = 2000
optimizer = ScheduledOptimizer(
    optim=torch.optim.Adam,
    module=nlp,
    total_steps=max_steps,
    groups={
        # Parameters whose name starts with "transformer"
        "^transformer": {
            "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 0, "max_value": 5e-5},
        },
        # All other parameters
        "": {
            "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 3e-4, "max_value": 3e-4},
        },
    },
)
# 🚀 TRAIN
train(
    nlp=nlp,
    max_steps=max_steps,
    validation_interval=max_steps // 10,
    train_data=TrainingData(
        data=train_data,
        batch_size="4096 tokens",  # 32 * 128 tokens
        pipe_names=["ner"],
        shuffle="dataset",
    ),
    val_data=val_data,
    scorer={"ner": ner_metric},
    optimizer=optimizer,
    max_grad_norm=1.0,
    output_dir="artifacts",
    # Do preprocessing in parallel on 1 worker
    num_workers=1,
    # Enable on Mac OS X or if you don't want to use available GPUs
    # cpu=True,
)
or use the config file:
from edsnlp.train import train
import edsnlp
import confit
cfg = confit.Config.from_disk(
"configs/config.yml", resolve=True, registry=edsnlp.registry
)
nlp = train(**cfg["train"])
Here are the parameters you can pass to the train function:
Parameters
nlp
The pipeline that will be trained in place.
train_data
The training data. Can be a single TrainingData object, a dict that will be cast or a list of these objects.
TrainingData object/dictionary
PARAMETER | DESCRIPTION |
---|---|
data | The stream of documents to train on. The documents will be preprocessed and collated according to the pipeline's components. |
batch_size | The batch size. Can be a batching expression like "2000 words", an int (number of documents), or a tuple (batch_size, batch_by). The batch_by argument should be a statistic produced by the pipes that will be trained. For instance, the [...] |
shuffle | The shuffle strategy. Can be "dataset" to shuffle the entire dataset (this can be memory-intensive for large file-based datasets), "fragment" to shuffle the fragment-based datasets like parquet files, or a batching expression like "2000 words" to shuffle the dataset in chunks of 2000 words. |
sub_batch_size | How to split each batch into sub-batches that will be fed to the model independently to accumulate gradients over. |
pipe_names | The names of the pipes that should be trained on this data. If None, defaults to all trainable pipes. |
post_init | Whether to call the pipeline's post_init method with the data before training. |
TYPE: AsList[TrainingData]
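As an illustration of what a batching expression such as "4096 tokens" could mean, the sketch below parses the expression and greedily groups documents under a token budget. It is a simplified model, not EDS-NLP's batching code: parse_batch_size and batch_by_tokens are hypothetical helpers, and documents are represented only by their token counts.

```python
def parse_batch_size(expression):
    """Split an expression such as "4096 tokens" into (4096, "tokens")."""
    size, unit = expression.split()
    return int(size), unit


def batch_by_tokens(token_counts, max_tokens):
    """Greedily group items so that each batch stays within max_tokens.

    An item larger than max_tokens still forms a batch on its own.
    """
    batches, current, current_total = [], [], 0
    for count in token_counts:
        if current and current_total + count > max_tokens:
            # Adding this item would exceed the budget: close the batch.
            batches.append(current)
            current, current_total = [], 0
        current.append(count)
        current_total += count
    if current:
        batches.append(current)
    return batches


size, unit = parse_batch_size("4096 tokens")
# 40 hypothetical documents of 128 tokens each
batches = batch_by_tokens([128] * 40, size)
print(unit, [len(batch) for batch in batches])  # tokens [32, 8]
```

With 128-token documents, a 4096-token budget holds 32 documents per batch, which matches the "32 * 128 tokens" comment in the configuration.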
val_data
The validation data. Can be a single Stream object or a list of Stream.
TYPE: AsList[Stream]
DEFAULT: []
seed
The random seed
TYPE: int
DEFAULT: 42
max_steps
The maximum number of training steps
TYPE: int
DEFAULT: 1000
optimizer
The optimizer. If None, a default optimizer will be used.
ScheduledOptimizer object/dictionary
PARAMETER | DESCRIPTION |
---|---|
optim | The optimizer to use. If a string (like "adamw") or a type to instantiate, the [...] |
module | The module to optimize. Usually the [...] |
total_steps | The total number of steps, used for schedules. |
groups | The groups to optimize. The key is a regex selector to match parameters in [...] The matching is performed by running [...] |
TYPE: Union[ScheduledOptimizer, Optimizer]
DEFAULT: None
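The two ideas behind the optimizer configuration, regex-selected parameter groups and a linear learning rate schedule with warmup, can be sketched in plain Python. Several points are assumptions made for this sketch rather than documented behaviour: the first matching selector wins, matching uses re.search, the rate decays linearly to end_value (0 here) after warmup, and the parameter names are made up. assign_groups and linear_schedule are hypothetical helpers, not EDS-NLP functions.

```python
import re


def assign_groups(param_names, selectors):
    """Map each parameter name to the first selector whose regex matches it."""
    assignment = {}
    for name in param_names:
        for selector in selectors:
            if re.search(selector, name):
                assignment[name] = selector
                break
    return assignment


def linear_schedule(step, total_steps, warmup_rate, start_value, max_value,
                    end_value=0.0):
    """Linear warmup from start_value to max_value, then linear decay."""
    warmup_steps = int(total_steps * warmup_rate)
    if step < warmup_steps:
        return start_value + (max_value - start_value) * step / warmup_steps
    # Assumption: after warmup, the value decays linearly to end_value.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_value + (end_value - max_value) * progress


# Hypothetical parameter names, for illustration only
names = ["transformer.encoder.weight", "ner.crf.transitions"]
groups = assign_groups(names, ["^transformer", ""])
print(groups)

# Learning rate of the "^transformer" group at a few steps
for step in (0, 100, 200, 2000):
    print(step, linear_schedule(step, 2000, 0.1, 0.0, 5e-5))
```

The empty selector "" matches every name, so it acts as a catch-all for parameters that no earlier selector claimed.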
validation_interval
The number of steps between each evaluation. Defaults to 1/10 of max_steps
TYPE: Optional[int]
DEFAULT: None
checkpoint_interval
The number of steps between each model save. Defaults to validation_interval
TYPE: Optional[int]
DEFAULT: None
max_grad_norm
The maximum gradient norm
TYPE: float
DEFAULT: 5.0
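For readers unfamiliar with gradient clipping, here is a minimal pure-Python sketch of clipping by global norm, the technique also implemented by torch.nn.utils.clip_grad_norm_. It assumes that max_grad_norm is used as such a threshold; clip_by_global_norm is a hypothetical helper and gradients are represented as a flat list of floats.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Rescale gradients so that their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm <= max_norm:
        # Small enough already: gradients are left untouched.
        return list(gradients), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in gradients], total_norm


# The norm of [3, 4] is 5, so every gradient is scaled by 1 / 5.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Clipping keeps the direction of the update and only shrinks its length, which limits the effect of occasional very large gradients.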
loss_scales
The loss scales for each component (useful for multi-task learning)
TYPE: Dict[str, float]
DEFAULT: {}
scorer
How to score the model. Expects a GenericScorer object or a dict containing a mapping of metric names to metric objects.
TYPE: GenericScorer
DEFAULT: GenericScorer()
num_workers
The number of workers to use for preprocessing the data in parallel. Setting it to 0 means no parallelization: data is processed on the main thread, which may induce latency and slow down the training. To avoid this, a good practice consists in doing the preprocessing either before training or in parallel in a separate process. Because of how EDS-NLP handles stream multiprocessing, changing this value will affect the order of the documents in the produced batches. A stream [1, 2, 3, 4, 5, 6] split in batches of size 3 will produce:
- [1, 2, 3] and [4, 5, 6] with 1 worker
- [1, 3, 5] and [2, 4, 6] with 2 workers
TYPE: int
DEFAULT: 0
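The ordering effect in the example above is consistent with documents being dealt to the workers in turn, each worker then batching its own share. The sketch below reproduces the two cases under that assumption; simulate_batches is a hypothetical helper, it is not how EDS-NLP implements multiprocessing, and batches are listed worker by worker rather than in arrival order.

```python
def simulate_batches(stream, batch_size, num_workers):
    """Deal items to workers in turn, then let each worker batch its share."""
    workers = max(1, num_workers)
    # Worker i receives items i, i + workers, i + 2 * workers, ...
    shares = [stream[i::workers] for i in range(workers)]
    batches = []
    for share in shares:
        for start in range(0, len(share), batch_size):
            batches.append(share[start:start + batch_size])
    return batches


stream = [1, 2, 3, 4, 5, 6]
print(simulate_batches(stream, batch_size=3, num_workers=1))
# [[1, 2, 3], [4, 5, 6]]
print(simulate_batches(stream, batch_size=3, num_workers=2))
# [[1, 3, 5], [2, 4, 6]]
```

If reproducibility across runs matters, keeping num_workers fixed avoids this source of variation.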
cpu
Whether to force training on CPU. On macOS, this might be necessary to get around some mps backend issues.
TYPE: bool
DEFAULT: False
mixed_precision
The mixed precision mode. Can be "no", "fp16", "bf16" or "fp8".
TYPE: Literal['no', 'fp16', 'bf16', 'fp8']
DEFAULT: 'no'
output_dir
The output directory, which will contain a model-last directory with the last model, and a train_metrics.json file with the training metrics and stats.
TYPE: Union[Path, str]
DEFAULT: Path('artifacts')
output_model_dir
The directory where to save the model. If None, defaults to output_dir / "model-last".
TYPE: Optional[Union[Path, str]]
DEFAULT: None
save_model
Whether to save the model or not. This can be useful if you are only interested in the metrics, but not the model, and want to avoid spending time dumping the model weights to the disk.
TYPE: bool
DEFAULT: True
logger
Whether to log the validation metrics in a rich table.
TYPE: bool
DEFAULT: True
kwargs
Additional keyword arguments.
DEFAULT: {}
Use the model
You can now load the model and use it to process some text:
import edsnlp
nlp = edsnlp.load("artifacts/model-last")
doc = nlp("Some sample text")
for ent in doc.ents:
    print(ent, ent.label_)
Packaging the model
To package the model and share it with friends or family (if the model does not contain sensitive data), you can use the following command:
python -m edsnlp.package --pipeline artifacts/model-last/ --name my_ner_model --distributions sdist
These options can be set either via the CLI or in config.yml under the package section.
The model saved at the train script output path (artifacts/model-last) will be named my_ner_model and will be saved in the dist folder. You can upload it to a package registry or install it directly with:
pip install dist/my_ner_model-0.1.0.tar.gz