Skip to content

Training a custom model

If neither the rule-based model nor the public model are sufficient for your needs, you can train your own model. This section will guide you through the process.

Requirements

To train a model, you will need to provide:

  • A labelled dataset
  • A HuggingFace transformers model, or a publicly available model like camembert-base
  • Ideally, a GPU to accelerate training

In any case, you will need to modify the configs/config.cfg file to reflect these changes. This configuration already contains the rule-based components of the previous section, feel free to add or remove them as you see fit. The configs/config.cfg file also contains the name of the package model in the [package] section (defaults to eds-pseudo-public).

DVC

We use DVC to manage the training pipeline. DVC is a version control system for data science and machine learning projects. We recommend you use it too. First, import some data (this basically copies the data to data/dataset, but in a version-controlled fashion):

dvc import-url url/or/path/to/your/dataset data/dataset

and execute the following command to (re)train the model and package it

dvc repro
Content of the dvc.yaml file

The above command runs the dvc.yaml config file to sequentially execute :

# Trains the model, and outputs it to artifacts/model-last
python scripts/train.py --config configs/config.cfg

# Evaluates the model, and outputs the results to artifacts
python scripts/evaluate.py --config configs/config.cfg

# Packages the model
python scripts/package.py

You should now be able to install and use it:

pip install dist/eds_pseudo_your_eds-0.3.0-*

Use it

To test it, execute

import eds_pseudo_your_eds

# Load the model
nlp = eds_pseudo_your_eds.load()
import edsnlp

# Load the model
nlp = edsnlp.load("artifacts/model-last")
# Apply it to a text
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis le 2 janveir 2006."
)
for e in doc.ents:
    print(f"{e.text: <30}{e.label_: <10}{str(e._.date): <15}{e._.date_format}")

# Text                           Label      Date            Format
# -----------------------------  ---------  --------------  ---------
# 2015                           DATE       2015-??-??      %Y
# Charles-François-Bienvenu      PRENOM     None            None
# Myriel                         NOM        None            None
# Digne                          VILLE      None            None
# Digne                          VILLE      None            None
# 2 janveir 2006                 DATE       2006-01-02      %-d %B %Y

You can also add the NER component to an existing model (this is only compatible with edsnlp, not spaCy)

# Given an existing model
existing_nlp = ...

existing_nlp.add_pipe(nlp.get_pipe("ner"), name="ner")

To apply the model in parallel on many documents using one or more GPUs, refer to the Inference page.