Overview

EDS-Pseudonymisation is a spaCy-based project used at AP-HP to extract and replace identifying entities in medical documents.

Getting started

EDS-Pseudonymisation is a spaCy project. We created a single workflow that:

  • Converts the datasets to spaCy format
  • Trains the pipeline
  • Evaluates the pipeline using the test set
  • Packages the resulting model to make it pip-installable

To use it, you will need to supply:

  • A labelled dataset
  • A HuggingFace transformers model of your own, or the public camembert-base model

In either case, you will need to modify the configuration so that it reflects the dataset and the model you chose.

Entities

Label           Description
ADRESSE         Street address, e.g. 33 boulevard de Picpus
DATE            Any absolute date other than a birthdate
DATE_NAISSANCE  Birthdate
HOPITAL         Hospital name, e.g. Hôpital Rothschild
IPP             Internal AP-HP identifier for patients, displayed as a number
MAIL            Email address
NDA             Internal AP-HP identifier for visits, displayed as a number
NOM             Any last name (patients, doctors, third parties)
PRENOM          Any first name (patients, doctors, etc.)
SECU            Social security number
TEL             Any phone number
VILLE           Any city
ZIP             Any zip code
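
To make the "extract and replace" idea concrete, here is a minimal sketch of the replacement step only, using plain Python and the labels listed above. It assumes entities arrive as character-offset spans; the helpers `find_span` and `replace_entities`, the sample sentence, and the `<LABEL>` placeholder format are all hypothetical and are not taken from the project, whose actual replacement strategy may differ.

```python
from typing import List, Tuple

# Labels taken from the entity table above.
LABELS = {
    "ADRESSE", "DATE", "DATE_NAISSANCE", "HOPITAL", "IPP", "MAIL", "NDA",
    "NOM", "PRENOM", "SECU", "TEL", "VILLE", "ZIP",
}

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def find_span(text: str, value: str, label: str) -> Span:
    """Hypothetical helper: locate the first occurrence of `value` in `text`."""
    start = text.index(value)
    return (start, start + len(value), label)


def replace_entities(text: str, spans: List[Span]) -> str:
    """Replace each labelled span with a "<LABEL>" placeholder (assumed format)."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        if start < cursor or start >= end or end > len(text):
            raise ValueError(f"overlapping or invalid span: {(start, end, label)}")
        pieces.append(text[cursor:start])  # keep the text before the entity
        pieces.append(f"<{label}>")        # drop the entity, keep its label
        cursor = end
    pieces.append(text[cursor:])           # keep the text after the last entity
    return "".join(pieces)


# Made-up sentence reusing the examples from the table.
text = "Mme Dupont, vue à l'Hôpital Rothschild, 33 boulevard de Picpus."
spans = [
    find_span(text, "Dupont", "NOM"),
    find_span(text, "Hôpital Rothschild", "HOPITAL"),
    find_span(text, "33 boulevard de Picpus", "ADRESSE"),
]
pseudonymised = replace_entities(text, spans)
print(pseudonymised)
```

In a real pipeline the spans would come from the trained NER model rather than from string search; the string search here only keeps the sketch self-contained.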

Commands

Command          Description
convert          Convert the data to spaCy's binary format
train            Train the NER model
evaluate         Evaluate the model and export metrics
package          Package the trained model as a pip package
visualize-model  Visualize the model's output interactively using Streamlit

Run a command with:

spacy project run [command] [options]
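
If you prefer to drive the steps from a script, the following sketch chains the four workflow commands in the order given in "Getting started". The function names `project_command` and `run_workflow` and the `dry_run` flag are assumptions made for illustration; only the `spacy project run [command]` form and the command names come from this page.

```python
import subprocess
from typing import List, Sequence

# Step order taken from the workflow described in "Getting started".
WORKFLOW_STEPS = ("convert", "train", "evaluate", "package")


def project_command(step: str) -> List[str]:
    """Build the argument list for `spacy project run <step>`."""
    return ["spacy", "project", "run", step]


def run_workflow(
    steps: Sequence[str] = WORKFLOW_STEPS, dry_run: bool = True
) -> List[List[str]]:
    """Run each step in order, stopping at the first failure.

    With dry_run=True (the default) the commands are only printed.
    """
    commands = [project_command(step) for step in steps]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if the step fails,
            # so later steps never run on top of a failed one.
            subprocess.run(command, check=True)
    return commands
```

Calling `run_workflow()` prints the four commands without executing them; `run_workflow(dry_run=False)` executes them and must be launched from the project directory with spaCy installed.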