# edsnlp.training.trainer

## GenericScorer

A scorer to evaluate the model performance on various tasks.

### Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| batch_size | The batch size to use for scoring. Can be an int (number of documents) or a string (a batching expression like "2000 words"). TYPE: Union[int, str] |
| speed | Whether to compute the model speed (words/documents per second). TYPE: bool |
| autocast | Whether to use autocasting for mixed precision during the evaluation. Defaults to True. TYPE: bool |
| metrics | A keyword-arguments mapping of metric names to metric objects. See the metrics documentation for more info. DEFAULT: {} |
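For instance, a scorer that reports a NER metric along with throughput could look like the sketch below. The NerExactMetric import and its span_getter argument are illustrative assumptions; substitute whatever metric objects your project actually uses (see the metrics documentation).

```python
from edsnlp.training import GenericScorer

# NerExactMetric is used purely as an illustration here; any metric object
# from edsnlp.metrics (or a compatible callable) can be passed the same way.
from edsnlp.metrics.ner import NerExactMetric

scorer = GenericScorer(
    batch_size="2000 words",  # batching expression; an int (number of docs) also works
    speed=True,               # also report words/documents per second
    autocast=True,            # mixed-precision inference while scoring
    # every extra keyword argument becomes a named metric in the report
    ner=NerExactMetric(span_getter="entities"),  # span_getter value is an assumption
)
```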
## TrainingData

A training data object, bundling a stream of documents with the batching and shuffling options used to train one or more pipes.

### Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| data | The stream of documents to train on. The documents will be preprocessed and collated according to the pipeline's components. TYPE: Stream |
| batch_size | The batch size. Can be a batching expression like "2000 words", an int (number of documents), or a (batch_size, batch_by) tuple. The batch_by argument should be a statistic produced by the pipes that will be trained. TYPE: Union[int, str, Tuple[int, str]] |
| shuffle | The shuffle strategy. Can be "dataset" to shuffle the entire dataset (this can be memory-intensive for large file-based datasets), "fragment" to shuffle fragment-based datasets like Parquet files, or a batching expression like "2000 words" to shuffle the dataset in chunks of 2000 words. TYPE: str |
| sub_batch_size | How to split each batch into sub-batches that will be fed to the model independently to accumulate gradients over. To split a batch of 8000 tokens into sub-batches of 1000 tokens each, set this to "1000 tokens". You can also request a number of splits, like "4 splits", to split the batch into N parts, each close to (but smaller than) batch_size / N. |
| pipe_names | The names of the pipes that should be trained on this data. If None, defaults to all trainable pipes. |
| post_init | Whether to call the pipeline's post_init method with the data before training. TYPE: bool |
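A minimal sketch of a training data object, assuming a standoff/BRAT corpus under corpus/train (the path is a placeholder; any edsnlp.data reader producing a stream of Doc objects works):

```python
import edsnlp
from edsnlp.training import TrainingData

nlp = edsnlp.blank("eds")  # the pipeline the data will be trained against

# Illustrative corpus path and reader; adapt to your dataset.
stream = edsnlp.data.read_standoff("corpus/train", tokenizer=nlp.tokenizer)

train_data = TrainingData(
    data=stream,
    batch_size="2000 words",      # or an int, or a (size, statistic) tuple
    shuffle="dataset",            # or "fragment", or e.g. "2000 words"
    sub_batch_size="1000 words",  # accumulate gradients over sub-batches
    pipe_names=None,              # None = train all trainable pipes
    post_init=True,               # let pipes inspect the data before training
)
```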
## train

Train a pipeline.

### Parameters

| PARAMETER | DESCRIPTION |
|---|---|
| nlp | The pipeline that will be trained in place. TYPE: Pipeline |
| train_data | The training data. Can be a single TrainingData object, a dict that will be cast to one, or a list of these objects (see the TrainingData parameters above). TYPE: AsList[TrainingData] |
| val_data | The validation data. Can be a single Stream object or a list of Streams. TYPE: AsList[Stream] DEFAULT: [] |
| seed | The random seed. TYPE: int DEFAULT: 42 |
| max_steps | The maximum number of training steps. TYPE: int DEFAULT: 1000 |
| optimizer | The optimizer. If None, a default optimizer will be used. Can be an Optimizer instance, or a ScheduledOptimizer object/dictionary with the parameters below. TYPE: Union[Draft[ScheduledOptimizer], ScheduledOptimizer, Optimizer] DEFAULT: None |
ScheduledOptimizer object/dictionary:

| PARAMETER | DESCRIPTION |
|---|---|
| optim | The optimizer to use. Can be an optimizer instance, a string (like "adamw"), or a type to instantiate. |
| module | The module to optimize. Usually the pipeline being trained. |
| total_steps | The total number of steps, used for schedules. TYPE: int |
| groups | The groups to optimize. Each group maps a selector to a dictionary of optimizer options (such as the learning rate and its schedule); the matching is performed by running a pattern search of each selector against the parameter names. |
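A sketch of how such an optimizer might be declared. The import path, group selectors, and learning rates below are assumptions based on the parameters described above, not a definitive recipe:

```python
import edsnlp
from edsnlp.training import ScheduledOptimizer

nlp = edsnlp.blank("eds")  # stand-in for a pipeline with trainable pipes

optimizer = ScheduledOptimizer(
    optim="adamw",      # a string name, a type, or an optimizer instance
    module=nlp,         # the module whose parameters are optimized
    total_steps=2000,   # drives the learning-rate schedules
    groups={
        # Selectors are matched against parameter names; values hold the
        # optimizer options for the matched parameters (assumed syntax).
        "^transformer": {"lr": 5e-5},  # e.g. smaller LR for pretrained weights
        "": {"lr": 3e-4},              # catch-all group for everything else
    },
)
```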
| PARAMETER | DESCRIPTION |
|---|---|
| validation_interval | The number of steps between each evaluation. Defaults to 1/10 of max_steps. TYPE: Optional[int] DEFAULT: None |
| checkpoint_interval | The number of steps between each model save. Defaults to validation_interval. TYPE: Optional[int] DEFAULT: None |
| grad_max_norm | The maximum gradient norm. TYPE: float DEFAULT: 5.0 |
| grad_dev_policy | The policy to apply when a gradient spike is detected, i.e. when the gradient norm is higher than mean + std * grad_max_dev (see the sketch below the parameter table). Can be "clip_mean" (clip the gradients to the mean gradient norm), "clip_threshold" (clip the gradients to mean + std * grad_max_dev), or "skip" (skip the step). These policies do not apply to grad_max_norm, which is always enforced when it is not None, since grad_max_norm is not adaptive and would most likely prevent the model from learning during the early stages of training, when gradients are expected to be high. TYPE: Optional[Literal['clip_mean', 'clip_threshold', 'skip']] DEFAULT: None |
| grad_ewm_window | Approximately how many steps to look back when computing the exponentially weighted moving average and variance of the gradient norm used to detect spikes. TYPE: int DEFAULT: 100 |
| grad_max_dev | The threshold used to detect gradient spikes: a spike is detected when the gradient norm is higher than mean + std * grad_max_dev. TYPE: float DEFAULT: 7.0 |
| loss_scales | The loss scales for each component (useful for multi-task learning). TYPE: Dict[str, float] DEFAULT: {} |
| scorer | How to score the model. Expects a GenericScorer object, or a dict containing a mapping of metric names to metric objects (see the GenericScorer parameters above). TYPE: GenericScorer DEFAULT: GenericScorer() |
| num_workers | The number of workers to use for preprocessing the data in parallel. Setting it to 0 means no parallelization: data is processed on the main thread, which may induce latency and slow down the training. To avoid this, a good practice consists in doing the preprocessing either before training or in parallel in a separate process. Because of how EDS-NLP handles stream multiprocessing, changing this value will affect the order of the documents in the produced batches: a stream [1, 2, 3, 4, 5, 6] split in batches of size 3 will produce [1, 2, 3] and [4, 5, 6] with 1 worker, but [1, 3, 5] and [2, 4, 6] with 2 workers. TYPE: int DEFAULT: 0 |
| cpu | Whether to force training on the CPU. On macOS, this might be necessary to work around some MPS backend issues. TYPE: bool DEFAULT: False |
| mixed_precision | The mixed precision mode. Can be "no", "fp16", "bf16" or "fp8". TYPE: Literal['no', 'fp16', 'bf16', 'fp8'] DEFAULT: 'no' |
| output_dir | The output directory, which will contain a model-last directory with the last model and a train_metrics.json file with the training metrics and stats. TYPE: Union[Path, str] DEFAULT: Path('artifacts') |
| output_model_dir | The directory where to save the model. If None, defaults to output_dir / "model-last". TYPE: Optional[Union[Path, str]] DEFAULT: None |
| save_model | Whether to save the model. This can be useful if you are only interested in the metrics, not the model, and want to avoid spending time dumping the model weights to disk. TYPE: bool DEFAULT: True |
| logger | The logger to use. Can be a boolean to enable or disable the default loggers (rich and json), a list of logger names, or a list of logger objects. You can use the Hugging Face accelerate integrated loggers (tensorboard, wandb, comet_ml, aim, mlflow, clearml, dvclive), the EDS-NLP simple loggers, or a combination of both. The simple loggers are csv (logs to a CSV file in output_dir, e.g. artifacts/metrics.csv), json (logs to a JSON file in output_dir, e.g. artifacts/metrics.json), and rich (logs to a rich table in the terminal). TYPE: Union[bool, AsList[Union[str, GeneralTracker, Draft[GeneralTracker]]]] DEFAULT: True |
| log_weight_grads | Whether to log the weight gradients during training. TYPE: bool DEFAULT: False |
| on_validation_callback | A callback function invoked during validation steps to handle custom logic. TYPE: Optional[Callable[[Dict], None]] DEFAULT: None |
| project_name | The project name, used to group experiments in some loggers. If None, defaults to the path of the config file, relative to the home directory, with slashes replaced by double underscores. TYPE: str DEFAULT: None |
| kwargs | Additional keyword arguments. DEFAULT: {} |
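To make the gradient-spike machinery concrete, here is a standalone sketch of the detection rule described for grad_dev_policy, grad_ewm_window and grad_max_dev. This is not the trainer's internal code, just an illustration of the exponentially weighted statistics and the mean + std * grad_max_dev threshold:

```python
class GradSpikeDetector:
    """Illustration of the spike rule, not the trainer's internal code."""

    def __init__(self, window: int = 100, max_dev: float = 7.0):
        # EWM smoothing factor looking back over roughly `window` steps
        self.alpha = 2.0 / (window + 1)
        self.max_dev = max_dev
        self.mean = None
        self.var = 0.0

    def step(self, grad_norm: float) -> bool:
        if self.mean is None:  # warm-up: seed the statistics
            self.mean = grad_norm
            return False
        # A spike is any norm above mean + std * max_dev
        spike = grad_norm > self.mean + self.var ** 0.5 * self.max_dev
        # Update the exponentially weighted mean and variance
        delta = grad_norm - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        return spike
```

Depending on grad_dev_policy, the trainer then clips the gradients to the mean norm ("clip_mean"), to the threshold ("clip_threshold"), or skips the step entirely ("skip").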
| RETURNS | DESCRIPTION |
|---|---|
| Pipeline | The trained pipeline. |
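Putting it together, a typical call might look like the following sketch. The corpus paths, the metric, and the pipeline contents are placeholders to adapt to your project (see the training tutorials for complete pipelines):

```python
import edsnlp
from edsnlp.training import GenericScorer, TrainingData, train
from edsnlp.metrics.ner import NerExactMetric  # illustrative metric

nlp = edsnlp.blank("eds")
# ... add trainable pipes (e.g. a NER component) to `nlp` here ...

# Illustrative corpus paths; any edsnlp.data readers can be used.
train_stream = edsnlp.data.read_standoff("corpus/train", tokenizer=nlp.tokenizer)
val_stream = edsnlp.data.read_standoff("corpus/dev", tokenizer=nlp.tokenizer)

nlp = train(
    nlp=nlp,
    train_data=TrainingData(data=train_stream, batch_size="2000 words", shuffle="dataset"),
    val_data=val_stream,
    scorer=GenericScorer(ner=NerExactMetric(span_getter="entities")),
    max_steps=2000,
    validation_interval=200,  # evaluate every 200 steps
    grad_max_norm=5.0,
    logger=["csv", "rich"],   # simple loggers; accelerate trackers also work
    output_dir="artifacts",   # model-last/ and the metrics files land here
)
```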