Training API
In this tutorial, we'll see how we can quickly train a deep learning model with EDS-NLP using the edsnlp.train function.
Hardware requirements
Training a modern deep learning model requires a lot of computational resources. We recommend using a machine with a GPU, ideally with at least 16GB of VRAM. If you don't have access to a GPU, you can use a cloud service like Google Colab, Kaggle, Paperspace or Vast.ai.
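If you are unsure what hardware PyTorch can actually see, here is a quick sanity check (this snippet is purely illustrative and not part of the tutorial's scripts):

import torch

# True if a CUDA GPU is visible to PyTorch
print(torch.cuda.is_available())
# True if the Apple Silicon (Metal) backend is available
print(torch.backends.mps.is_available())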
If you need a high level of control over the training procedure, we suggest you read the previous "Deep learning tutorial" to understand how to build a training loop from scratch with EDS-NLP.
Creating a project
If you have already installed edsnlp[ml] and do not want to set up a project, you can skip to the next section.
Create a new project:
mkdir my_ner_project
cd my_ner_project
touch README.md pyproject.toml
mkdir -p configs data/dataset
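For reference, the layout now looks like this. The train and test subfolders are not created by the commands above: add them and fill them with your own annotated documents in standoff (BRAT) format, since this is the split the configuration below expects:

my_ner_project/
├── README.md
├── pyproject.toml
├── configs/
└── data/
    └── dataset/
        ├── train/   # standoff (.txt + .ann) training documents
        └── test/    # standoff (.txt + .ann) evaluation documents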
Add a standard pyproject.toml file with the following content. This file will be used to manage the dependencies of the project and its versioning.
[project]
name = "my_ner_project"
version = "0.1.0"
description = ""
authors = [
    { name="Firstname Lastname", email="firstname.lastname@domain.com" }
]
readme = "README.md"
requires-python = ">3.7.1,<4.0"
dependencies = [
    "edsnlp[ml]>=0.14.0",
    "sentencepiece>=0.1.96"
]

[project.optional-dependencies]
dev = [
    "dvc>=2.37.0; python_version >= '3.8'",
    "pandas>=1.1.0,<2.0.0; python_version < '3.8'",
    "pandas>=1.4.0,<2.0.0; python_version >= '3.8'",
    "pre-commit>=2.18.1",
    "accelerate>=0.21.0; python_version >= '3.8'",
    "rich-logger>=0.3.0"
]
We recommend using a virtual environment ("venv") to isolate your project's dependencies, and using uv to install them:
pip install uv
# skip the next two lines if you do not want a venv
uv venv .venv
source .venv/bin/activate
uv pip install -e ".[dev]" -p $(uv python find)
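You can quickly check that the environment is functional:

python -c "import edsnlp; print(edsnlp.__version__)"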
Training the model
EDS-NLP supports training models either from the command line or from a Python script or notebook, and switching between the two is straightforward thanks to the use of Confit.
A word about Confit
EDS-NLP makes heavy use of Confit, a configuration library that allows you to call functions from Python or the CLI, and to validate and optionally cast their arguments.
The EDS-NLP function used in this script is the train function of the edsnlp.train module. When a dict is passed to a type-hinted argument (either from a config.yml file, or when calling the function in Python), Confit instantiates the correct class with the arguments provided in the dict. For instance, we pass a dict to the val_data parameter, which is actually type-hinted as a SampleGenerator: this dict will be used as keyword arguments to instantiate the SampleGenerator object. You can also instantiate a SampleGenerator object directly and pass it to the function.
You can also tell Confit specifically which class to instantiate by using the @register_name = "name_of_the_registered_class" key and value in a dict or config section. We make heavy use of this mechanism to build pipeline architectures.
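As an illustration, here is a minimal sketch of this casting behavior, assuming (per the description above) that a plain class with a type-hinted __init__ can be cast from a dict. SampleGenerator is a stand-in class written for this example; only the validate_arguments decorator is taken from Confit:

from confit import validate_arguments

class SampleGenerator:
    def __init__(self, path: str, shuffle: bool = False):
        self.path = path
        self.shuffle = shuffle

@validate_arguments
def make_samples(val_data: SampleGenerator):
    # val_data is a SampleGenerator instance here, whatever was passed in
    return val_data

# These two calls are equivalent: the dict is validated and used as
# keyword arguments to instantiate SampleGenerator
make_samples(val_data={"path": "./data/dataset/test", "shuffle": True})
make_samples(val_data=SampleGenerator(path="./data/dataset/test", shuffle=True))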
Create a config.yml file in the configs folder with the following content:
# Some variables are grouped here for convenience, but we could also
# put their values directly in the config in place of their reference
vars:
  train: './data/dataset/train'
  dev: './data/dataset/test'

# 🤖 PIPELINE DEFINITION
nlp:
  '@core': pipeline  #(1)!

  lang: eds  # Word-level tokenization: use the "eds" tokenizer

  # Our pipeline will contain a single NER pipe
  # The NER pipe will be a CRF model
  components:
    ner:
      '@factory': eds.ner_crf
      mode: 'joint'
      target_span_getter: 'gold_spans'
      # Set spans both in ents and in separate `ent.label` groups
      span_setter: [ "ents", "*" ]
      infer_span_setter: true

      # The CRF model will use a CNN to re-contextualize embeddings
      embedding:
        '@factory': eds.text_cnn
        kernel_sizes: [ 3 ]

        # The base embeddings will be computed by a transformer
        embedding:
          '@factory': eds.transformer
          model: 'camembert-base'
          window: 128
          stride: 96

# 📈 SCORERS
scorer:
  ner:
    '@metrics': eds.ner_exact
    span_getter: ${ nlp.components.ner.target_span_getter }

# 🎛️ OPTIMIZER
optimizer:
  "@core": optimizer
  optim: adamw
  groups:
    # Assign parameters starting with transformer (i.e. the parameters
    # of the transformer component) to a first group
    "^transformer":
      lr:
        '@schedules': linear
        "warmup_rate": 0.1
        "start_value": 0
        "max_value": 5e-5
    # And all other parameters to a second group
    "":
      lr:
        '@schedules': linear
        "warmup_rate": 0.1
        "start_value": 3e-4
        "max_value": 3e-4
  module: ${ nlp }
  total_steps: ${ train.max_steps }

# 📚 DATA
train_data:
  - data:
      # In what kind of files (i.e. with what extensions) is our
      # training data stored
      '@readers': standoff
      path: ${ vars.train }
      converter:
        # What schema is used in the data files
        - '@factory': eds.standoff_dict2doc
          span_setter: 'gold_spans'
        # How to preprocess each doc for training
        - '@factory': eds.split
          nlp: null
          max_length: 2000
          regex: '\n\n+'
    shuffle: dataset
    batch_size: 4096 tokens  # 32 * 128 tokens
    pipe_names: [ "ner" ]

val_data:
  '@readers': standoff
  path: ${ vars.dev }
  # What schema is used in the data files
  converter:
    - '@factory': eds.standoff_dict2doc
      span_setter: 'gold_spans'

# 🚀 TRAIN SCRIPT OPTIONS
# -> python -m edsnlp.train --config configs/config.yml
train:
  nlp: ${ nlp }
  output_dir: 'artifacts'
  train_data: ${ train_data }
  val_data: ${ val_data }
  max_steps: 2000
  validation_interval: ${ train.max_steps//10 }
  max_grad_norm: 1.0
  scorer: ${ scorer }
  optimizer: ${ optimizer }
  # Do preprocessing in parallel on 1 worker
  num_workers: 1
  # Enable on Mac OS X or if you don't want to use available GPUs
  # cpu: true

# 📦 PACKAGE SCRIPT OPTIONS
# -> python -m edsnlp.package --config configs/config.yml
package:
  pipeline: ${ train.output_dir }
  name: 'my_ner_model'
1.  Why do we use '@core': pipeline here? Because we need the reference used in optimizer.module = ${ nlp } to be the actual Pipeline object and not its keyword arguments: when Confit sees '@core': pipeline, it instantiates the Pipeline class with the arguments provided in the section. In fact, you could also use '@core': eds.pipeline in every config where you define a pipeline, but it is sometimes more convenient to let Confit infer the type of the nlp argument from the function's type hints. Not specifying '@core': pipeline is also more aligned with spacy's pipeline config API. However, in general, explicit is better than implicit, so feel free to explicitly write '@core': eds.pipeline when you define a pipeline.
To train the model, you can use the following command:
python -m edsnlp.train --config configs/config.yml --seed 42
Any option can also be set either via the CLI or in config.yml under the train section.
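For instance, assuming Confit's dotted-key override syntax, you could change the number of training steps without editing the config file (the flag below illustrates the mechanism; it is not a documented list of options):

python -m edsnlp.train --config configs/config.yml --seed 42 --train.max_steps 4000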
Alternatively, create a Python script or notebook with the following content:
import edsnlp
from edsnlp.training import train, ScheduledOptimizer, TrainingData
from edsnlp.metrics.ner import NerExactMetric
import edsnlp.pipes as eds
import torch

# 🤖 PIPELINE DEFINITION
nlp = edsnlp.blank("eds")
nlp.add_pipe(
    # The NER pipe will be a CRF model
    eds.ner_crf(
        mode="joint",
        target_span_getter="gold_spans",
        # Set spans both in ents and in separate `ent.label` groups
        span_setter=["ents", "*"],
        infer_span_setter=True,
        # The CRF model will use a CNN to re-contextualize embeddings
        embedding=eds.text_cnn(
            kernel_sizes=[3],
            # The base embeddings will be computed by a transformer
            embedding=eds.transformer(
                model="camembert-base",
                window=128,
                stride=96,
            ),
        ),
    )
)

# 📈 SCORERS
ner_metric = NerExactMetric(span_getter="gold_spans")

# 📚 DATA
train_data = (
    edsnlp.data
    .read_standoff("./data/dataset/train", span_setter="gold_spans")
    .map(eds.split(nlp=None, max_length=2000, regex="\n\n+"))
)
val_data = (
    edsnlp.data
    .read_standoff("./data/dataset/test", span_setter="gold_spans")
)

# 🎛️ OPTIMIZER
max_steps = 2000
optimizer = ScheduledOptimizer(
    optim=torch.optim.Adam,
    module=nlp,
    total_steps=max_steps,
    groups={
        "^transformer": {
            "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 0, "max_value": 5e-5},
        },
        "": {
            "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 3e-4, "max_value": 3e-4},
        },
    },
)

# 🚀 TRAIN
train(
    nlp=nlp,
    max_steps=max_steps,
    validation_interval=max_steps // 10,
    train_data=TrainingData(
        data=train_data,
        batch_size="4096 tokens",  # 32 * 128 tokens
        pipe_names=["ner"],
        shuffle="dataset",
    ),
    val_data=val_data,
    scorer={"ner": ner_metric},
    optimizer=optimizer,
    max_grad_norm=1.0,
    output_dir="artifacts",
    # Do preprocessing in parallel on 1 worker
    num_workers=1,
    # Enable on Mac OS X or if you don't want to use available GPUs
    # cpu=True,
)
Or use the config file:
from edsnlp.train import train
import edsnlp
import confit

cfg = confit.Config.from_disk(
    "configs/config.yml", resolve=True, registry=edsnlp.registry
)
nlp = train(**cfg["train"])
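Since the resolved config behaves like a nested dictionary, you can also adjust values in Python before launching the run. A small sketch (the override below is just an example):

from edsnlp.train import train
import edsnlp
import confit

cfg = confit.Config.from_disk(
    "configs/config.yml", resolve=True, registry=edsnlp.registry
)
# Override a resolved value before training, e.g. write to another folder
cfg["train"]["output_dir"] = "artifacts/run2"
nlp = train(**cfg["train"])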
Here are the parameters you can pass to the train function:
Parameters
PARAMETER | DESCRIPTION
---|---
nlp | The pipeline that will be trained in place.
train_data | The training data. Can be a single TrainingData object, a dict that will be cast, or a list of these objects (see the TrainingData parameters below). TYPE: AsList[TrainingData]
val_data | The validation data. Can be a single Stream object or a list of Streams. TYPE: AsList[Stream]
seed | The random seed. TYPE: int, DEFAULT: 42
max_steps | The maximum number of training steps. TYPE: int, DEFAULT: 1000
optimizer | The optimizer. If None, a default optimizer will be used. Expects a ScheduledOptimizer object or a dict (see the ScheduledOptimizer parameters below). TYPE: Union[ScheduledOptimizer, Optimizer], DEFAULT: None
validation_interval | The number of steps between each evaluation. Defaults to 1/10 of max_steps. TYPE: Optional[int], DEFAULT: None
checkpoint_interval | The number of steps between each model save. Defaults to validation_interval. TYPE: Optional[int], DEFAULT: None
max_grad_norm | The maximum gradient norm. TYPE: float, DEFAULT: 5.0
loss_scales | The loss scales for each component (useful for multi-task learning). TYPE: Dict[str, float], DEFAULT: {}
scorer | How to score the model. Expects a GenericScorer object or a dict mapping metric names to metric objects. TYPE: GenericScorer, DEFAULT: GenericScorer()
num_workers | The number of workers used to preprocess the data in parallel. Setting it to 0 means no parallelization: data is processed on the main thread, which may add latency and slow down training. To avoid this, a good practice consists in doing the preprocessing either before training or in parallel in a separate process. Because of how EDS-NLP handles stream multiprocessing, changing this value affects the order of the documents in the produced batches: a stream [1, 2, 3, 4, 5, 6] split in batches of size 3 will produce [1, 2, 3] and [4, 5, 6] with 1 worker, but [1, 3, 5] and [2, 4, 6] with 2 workers. TYPE: int, DEFAULT: 0
cpu | Whether to force training on CPU. On macOS, this might be necessary to get around some mps backend issues. TYPE: bool, DEFAULT: False
mixed_precision | The mixed precision mode. Can be "no", "fp16", "bf16" or "fp8". TYPE: Literal['no', 'fp16', 'bf16', 'fp8'], DEFAULT: 'no'
output_dir | The output directory, which will contain a model-last directory with the last model and a train_metrics.json file with the training metrics and stats. TYPE: Union[Path, str], DEFAULT: Path('artifacts')
kwargs | Additional keyword arguments. DEFAULT: {}

TrainingData parameters (for the train_data argument):

PARAMETER | DESCRIPTION
---|---
data | The stream of documents to train on. The documents will be preprocessed and collated according to the pipeline's components.
batch_size | The batch size. Can be a batching expression like "2000 words", an int (a number of documents), or a tuple (batch_size, batch_by). The batch_by argument should be a statistic produced by the pipes that will be trained.
shuffle | The shuffle strategy. Can be "dataset" to shuffle the entire dataset (this can be memory-intensive for large file-based datasets), "fragment" to shuffle fragment-based datasets like Parquet files, or a batching expression like "2000 words" to shuffle the dataset in chunks of 2000 words.
accumulation_batch_size | How to split each batch into sub-batches that will be fed to the model independently to accumulate gradients over.
pipe_names | The names of the pipes that should be trained on this data. If None, defaults to all trainable pipes.
post_init | Whether to call the pipeline's post_init method with the data before training.

ScheduledOptimizer parameters (for the optimizer argument):

PARAMETER | DESCRIPTION
---|---
optim | The optimizer to use, given as a string (like "adamw") or a type to instantiate.
module | The module to optimize, usually the nlp pipeline.
total_steps | The total number of steps, used for schedules.
groups | The groups to optimize. The key is a regex selector matched against parameter names, and the value is a dict of optimizer hyperparameters (possibly schedules). The matching is performed with a regex search on each parameter name, so "^transformer" selects all parameters whose name starts with "transformer".
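To make the accepted batch_size forms concrete, here is a short sketch reusing the train_data stream from the Python script above. The three calls show the three forms; the first and third are equivalent, while the second batches by documents instead of tokens:

from edsnlp.training import TrainingData

# A batching expression: batches of ~4096 tokens
td = TrainingData(data=train_data, batch_size="4096 tokens", pipe_names=["ner"], shuffle="dataset")
# An int: batches of 32 documents
td = TrainingData(data=train_data, batch_size=32, pipe_names=["ner"], shuffle="dataset")
# A (batch_size, batch_by) tuple, equivalent to the expression above
td = TrainingData(data=train_data, batch_size=(4096, "tokens"), pipe_names=["ner"], shuffle="dataset")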
Use the model
You can now load the model and use it to process some text:
import edsnlp

nlp = edsnlp.load("artifacts/model-last")
doc = nlp("Some sample text")
for ent in doc.ents:
    print(ent, ent.label_)
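To process several documents at once, you can also feed the loaded pipeline an iterable of texts with nlp.pipe (shown here on a couple of made-up strings):

import edsnlp

nlp = edsnlp.load("artifacts/model-last")
texts = ["Première note de patient...", "Deuxième note de patient..."]
# Documents are processed lazily and yielded in order
for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])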
Packaging the model
To package the model and share it with friends or family (if the model does not contain sensitive data), you can use the following command:
python -m edsnlp.package --pipeline artifacts/model-last/ --name my_ner_model --distributions sdist
You can parametrize the packaging either via the CLI or in config.yml under the package section.
The model saved at the train script output path (artifacts/model-last) will be packaged under the name my_ner_model and saved in the dist folder. You can upload it to a package registry or install it directly with:
pip install dist/my_ner_model-0.1.0.tar.gz
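Once installed, the packaged pipeline can be loaded by name like any other EDS-NLP model (a quick sanity check):

import edsnlp

# Load the installed package by its name
nlp = edsnlp.load("my_ner_model")
doc = nlp("Some sample text")
print(doc.ents)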