Dataset

Disclaimer

We do not provide our internal dataset due to privacy and regulatory constraints. You will, however, find a description of the dataset below. We also release the code for the rule-based annotation system.

The synthetic dataset generation process is described in the synthetic dataset section below.

Format

We expect the annotations to be a jsonlines file with the following format:

{ "note_id": "any-id-1", "note_text": "Jacques Chirac a été maire de Paris", "entities": [{"start": 0, "end": 7, ...] }
{ "note_id": "any-id-2", "note_text": "Elle est née en 2006", "entities": [{"start": 16, "end": 20, ...] }
...

but you can change the format by modifying the config file (in particular its "datasets" section) or the code of the adapter that is responsible for loading the data during training and evaluation. A minimal reader for the default format is sketched below.
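As a rough illustration, here is a minimal sketch of a reader for this jsonlines format. The file name is a placeholder, and the field names follow the example above; your adapter configuration may use different keys:

```python
import json

def read_annotations(path):
    """Yield (note_id, note_text, entities) triples from a JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            doc = json.loads(line)
            yield doc["note_id"], doc["note_text"], doc["entities"]

# Print the text covered by each annotated span
for note_id, text, entities in read_annotations("annotations.jsonl"):
    for ent in entities:
        print(note_id, ent["start"], ent["end"], text[ent["start"]:ent["end"]])
```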

Internal Data Selection

We annotated around 4000 documents, selected according to the distribution of AP-HP's Clinical Data Warehouse (CDW), to obtain a sample that is representative of the actual documents present within the CDW.

Training data are selected among notes that were edited after August 2017, in order to skew the model towards more recent clinical notes. The test set, however, is sampled without any time constraints, to make sure the model performs well overall.

To ensure the robustness of the model, the training and test documents were produced with two different PDF extraction methods:

  • the legacy method, based on PDFBox with a fixed mask
  • our new method EDS-PDF with an adaptive (machine-learned) mask

Annotated Entities

We annotated clinical documents with the following entities:

| Label          | Description                                              |
|----------------|----------------------------------------------------------|
| ADRESSE        | Street address, e.g. 33 boulevard de Picpus              |
| DATE           | Any absolute date other than a birthdate                 |
| DATE_NAISSANCE | Birthdate                                                |
| HOPITAL        | Hospital name, e.g. Hôpital Rothschild                   |
| IPP            | Internal AP-HP patient identifier, displayed as a number |
| MAIL           | Email address                                            |
| NDA            | Internal AP-HP visit identifier, displayed as a number   |
| NOM            | Any last name (patients, doctors, third parties)         |
| PRENOM         | Any first name (patients, doctors, etc.)                 |
| SECU           | Social security number                                   |
| TEL            | Any phone number                                         |
| VILLE          | Any city                                                 |
| ZIP            | Any zip code                                             |
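For reference, a single annotated entity could look like the following, here marking "Paris" as a VILLE in the first example note above. The `label` key is an assumption; the exact field names depend on your adapter configuration:

```json
{"start": 30, "end": 35, "label": "VILLE"}
```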

Statistics

To inspect the statistics for the latest version of our dataset, please refer to the v0.2.0 release.

Software

The software tools used to annotate the documents with personal identification entities were:

  • LabelStudio for the first annotation campaign
  • Metanno for the second annotation campaign

Any annotation software will do, however.

Synthetic dataset

We will now describe the synthetic dataset generation process, used to produce the public pseudonymisation model.

Augmentation

Each synthetic training document is generated by augmenting a base fictitious template, replacing each annotated entity with a random value that is either generated from scratch or picked from a predefined public list (a minimal sketch of this replacement step is given after the list):

  • PRENOM: INSEE deceased list and INSEE natality list
  • NOM: INSEE deceased list
  • VILLE: INSEE deceased list
  • HOPITAL: Handcrafted list
  • DATE: Random dates, formatted as the original value in the template
  • ADRESSE: No augmentation for now
  • MAIL: Generated from fake first names, last names and handcrafted domains
  • TEL: Random phone number
  • ZIP: Random zip code
  • IPP: Random number
  • NDA: Random number
  • SECU: Random number (following the French NSS format constraints)
  • DATE_NAISSANCE: Random date
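To make the replacement step concrete, here is a minimal sketch of such an augmenter. The value pools, the `label`/`start`/`end` key names and the individual generators are illustrative assumptions, not the project's actual implementation:

```python
import random
import string

# Placeholder value pools; the real pipeline draws from INSEE lists and
# handcrafted hospital/domain lists.
FIRST_NAMES = ["Camille", "Jean", "Marie"]
LAST_NAMES = ["Martin", "Bernard", "Dubois"]
DOMAINS = ["example.org", "example.fr"]

def random_digits(n):
    return "".join(random.choices(string.digits, k=n))

def random_secu():
    # 13-digit French NSS followed by its 2-digit control key
    # (key = 97 - number mod 97); Corsica department codes are ignored here.
    number = random.randint(10**12, 10**13 - 1)
    return f"{number}{97 - number % 97:02d}"

GENERATORS = {
    "PRENOM": lambda: random.choice(FIRST_NAMES),
    "NOM": lambda: random.choice(LAST_NAMES),
    "MAIL": lambda: (random.choice(FIRST_NAMES) + "." + random.choice(LAST_NAMES)
                     + "@" + random.choice(DOMAINS)).lower(),
    "TEL": lambda: "0" + random_digits(9),
    "IPP": lambda: random_digits(10),
    "NDA": lambda: random_digits(9),
    "ZIP": lambda: random_digits(5),
    "SECU": random_secu,
    # DATE / DATE_NAISSANCE: the real pipeline renders the random date in the
    # same format as the original span; a fixed format is used here for brevity.
    "DATE": lambda: f"{random.randint(1, 28):02d}/{random.randint(1, 12):02d}/{random.randint(1950, 2020)}",
}

def augment(text, entities):
    """Replace each annotated span with a random value, returning the new
    text and the shifted annotations. Assumes non-overlapping spans."""
    out, new_entities, cursor = [], [], 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        gen = GENERATORS.get(ent["label"])
        # Entities without a generator (e.g. ADRESSE) are kept verbatim
        value = gen() if gen else text[ent["start"]:ent["end"]]
        out.append(text[cursor:ent["start"]])
        start = sum(len(part) for part in out)
        out.append(value)
        new_entities.append({**ent, "start": start, "end": start + len(value)})
        cursor = ent["end"]
    out.append(text[cursor:])
    return "".join(out), new_entities
```

Applied to the first example note above, this would replace "Jacques" with a random first name while keeping the annotation offsets consistent with the rewritten text.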

[Figure: template augmentation, showing a base template whose entities are replaced by random values]

Template writing

The template writing process was done iteratively:

  1. We wrote a few starting annotated samples that we added to the base template list.
  2. We augmented the base templates following the process of the previous section.
  3. We trained the model on this augmented dataset.
  4. We evaluated the model on the internal training set (acting as a validation set).
  5. We picked the examples with the worst performance (see the sketch after this list), wrote fictitious snippets with similar grammatical and syntactic structures, and added them to the base template list.
  6. At the same time, we improved the augmentation process to account for these errors.
  7. We repeated the process starting from step 2 until we reached satisfactory performance.
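As an illustration of step 5, one might rank validation documents by a per-document entity metric and inspect the lowest-scoring ones. The `predict` callable and the choice of metric are assumptions; the selection actually used internally may differ:

```python
def entity_f1(gold, pred):
    """Micro F1 over (start, end, label) triples for a single document."""
    gold, pred = set(gold), set(pred)
    if not gold and not pred:
        return 1.0
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def worst_examples(docs, predict, k=20):
    """Return the k documents with the lowest entity F1.

    `predict(note_text)` is a hypothetical callable returning the model's
    predicted (start, end, label) triples for a note.
    """
    def gold_triples(doc):
        return [(e["start"], e["end"], e["label"]) for e in doc["entities"]]

    return sorted(docs, key=lambda d: entity_f1(gold_triples(d), predict(d["note_text"])))[:k]
```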