Quickstart

Deployment

This project trains our pseudonymisation pipeline, and make it pip-installable.

Requirements

To use this repository, you will need to supply:

A labelled dataset
A HuggingFace transformers model, or use a publicly available model like camembert-base

In any case, you will need to modify the configuration to reflect these changes.

Installation

Install the requirements by running the following command at the root of the repo

poetry install

Training a model

EDS-Pseudonymisation is a spaCy project. We created a single workflow that:

Converts the datasets to spaCy format
Trains the pipeline
Evaluates the pipeline using the test set
Packages the resulting model to make it pip-installable

To add a new dataset, run

dvc import-url url/or/path/to/your/dataset data/dataset

To (re-)train a model and package it, just run:

dvc repro

You should now be able to install and publish it:

pip install dist/eds_pseudonymisation-0.2.0-*

Use it

To use it, execute

import eds_pseudonymisation

nlp = eds_pseudonymisation.load()
doc = nlp(
    """En 1815, M. Charles-François-Bienvenu
Myriel était évêque de Digne. C’était un vieillard
d’environ soixante-quinze ans ; il occupait le
siège de Digne depuis 1806. """
)
for ent in doc.ents:
    print(ent, ent.label)

# 1815 DATE
# Charles-François-Bienvenu NOM
# Myriel PRENOM
# Digne VILLE
# 1806 DATE