Dataset

Disclaimer

We cannot release the dataset due to privacy and regulatory constraints. However, the dataset is described below, and we release the code for the rule-based annotation system.

Data Selection

We annotated around 4000 documents, selected according to the distribution of AP-HP's Clinical Data Warehouse (CDW), to obtain a sample that is representative of the actual documents present within the CDW.

Training data are selected among notes that were edited after August 2017, in order to skew the model towards more recent clinical notes. The test set, however, is sampled without any time constraints, to make sure the model performs well overall.

To ensure the robustness of the model, the documents in the training and test sets were generated with two different PDF extraction methods:

  • the legacy method, based on PDFBox with a fixed mask
  • our new method, EDS-PDF, with an adaptive (machine-learned) mask

Annotated Entities

We annotated clinical documents with the following entities:

Label           Description
ADRESSE         Street address, e.g. 33 boulevard de Picpus
DATE            Any absolute date other than a birthdate
DATE_NAISSANCE  Birthdate
HOPITAL         Hospital name, e.g. Hôpital Rothschild
IPP             Internal AP-HP identifier for patients, displayed as a number
MAIL            Email address
NDA             Internal AP-HP identifier for visits, displayed as a number
NOM             Any last name (patients, doctors, third parties)
PRENOM          Any first name (patients, doctors, etc.)
SECU            Social security number
TEL             Any phone number
VILLE           Any city
ZIP             Any zip code

Statistics

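The following table presents the counts of annotated entities per split (train, dev, test), per PDF extraction method (edspdf, pdfbox) and per label. Most labels appear under English names here (for example LASTNAME for NOM, PATIENT ID for IPP, NSS for SECU). The ENTS and DOCS rows give the total number of entities and documents in each column.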

             train           dev             test
             edspdf  pdfbox  edspdf  pdfbox  edspdf  pdfbox
DATE          14711    2360     878     113    1973    2831
LASTNAME       4910    4299     292     236     625    4150
FIRSTNAME      3468    3826     215     216     478    3739
HOPITAL        1451     758      87      47     162     796
PHONE           397    1589      23     148      77    1851
BIRTHDATE       916     519      52      31      87     484
PATIENT ID      121     339       8      18       8     392
CITY            592     742      27      47      44     810
VISIT ID         49     283       7      17       6     282
ADDRESS         212     543      10      35      12     625
EMAIL            20     182       0      17       1     166
ZIP             215     552      10      35      14     635
NSS              73      79       6       7       4      32
ENTS          27135   16071    1615     967    3491   16793
DOCS           3025     348     200      22     348     348
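
As a quick sanity check, the ENTS row can be recomputed from the per-label counts. The snippet below is only an illustrative sketch: the numbers are copied from the table above, while the names `COLUMNS`, `COUNTS`, `ENTS` and `column_totals` are hypothetical and not part of the released code.

```python
# Columns of the statistics table, in order: (split, PDF extraction method).
COLUMNS = [
    ("train", "edspdf"), ("train", "pdfbox"),
    ("dev", "edspdf"), ("dev", "pdfbox"),
    ("test", "edspdf"), ("test", "pdfbox"),
]

# Per-label entity counts, copied from the table (one value per column).
COUNTS = {
    "DATE":       [14711, 2360, 878, 113, 1973, 2831],
    "LASTNAME":   [4910, 4299, 292, 236, 625, 4150],
    "FIRSTNAME":  [3468, 3826, 215, 216, 478, 3739],
    "HOPITAL":    [1451, 758, 87, 47, 162, 796],
    "PHONE":      [397, 1589, 23, 148, 77, 1851],
    "BIRTHDATE":  [916, 519, 52, 31, 87, 484],
    "PATIENT ID": [121, 339, 8, 18, 8, 392],
    "CITY":       [592, 742, 27, 47, 44, 810],
    "VISIT ID":   [49, 283, 7, 17, 6, 282],
    "ADDRESS":    [212, 543, 10, 35, 12, 625],
    "EMAIL":      [20, 182, 0, 17, 1, 166],
    "ZIP":        [215, 552, 10, 35, 14, 635],
    "NSS":        [73, 79, 6, 7, 4, 32],
}

# ENTS row as reported in the table.
ENTS = [27135, 16071, 1615, 967, 3491, 16793]


def column_totals(counts):
    """Sum the per-label counts column by column."""
    return [sum(column) for column in zip(*counts.values())]


totals = column_totals(COUNTS)
for (split, method), total, reported in zip(COLUMNS, totals, ENTS):
    status = "ok" if total == reported else "MISMATCH"
    print(f"{split}/{method}: computed={total} reported={reported} {status}")
```

Each computed total matches the reported ENTS value, so the per-label rows and the totals row are consistent with each other.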

Software

The following software tools were used to annotate the documents with personal identification entities:

  • LabelStudio for the first annotation campaign
  • Metanno for the second annotation campaign

Any other annotation software will do.

The convert step takes as input either a JSON Lines file (.jsonl) or a folder containing standoff files (.ann) produced by annotating with Brat.
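
To give an idea of what these two formats look like on disk, here is a minimal Python sketch that reads them using only the standard library. It is not the project's convert step: the function names `read_jsonl` and `parse_brat_entities` are hypothetical, no assumption is made about the keys of the JSON records, and only Brat text-bound annotations (lines whose identifier starts with `T`) are handled.

```python
import json


def read_jsonl(path):
    """Read a JSON Lines file: one JSON document per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def parse_brat_entities(ann_text):
    """Parse text-bound annotations from the content of a Brat .ann file.

    Each such line has the form "T1<TAB>LABEL start end<TAB>text".
    Discontinuous spans are written "start end;start end".
    Other annotation types (relations, attributes, notes) are skipped.
    """
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue
        ent_id, info, text = line.split("\t", 2)
        label, offsets = info.split(" ", 1)
        fragments = [
            tuple(int(bound) for bound in fragment.split())
            for fragment in offsets.split(";")
        ]
        entities.append(
            {"id": ent_id, "label": label, "fragments": fragments, "text": text}
        )
    return entities


# Example with a made-up annotation file content.
example = "T1\tNOM 0 6\tDupont\nT2\tDATE 10 20\t12/03/2020\n"
for entity in parse_brat_entities(example):
    print(entity["label"], entity["fragments"], entity["text"])
```

Character offsets in standoff files refer to the accompanying text file, so each .ann file should be read together with the document it annotates.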

Feel free to submit a pull request if these formats do not suit you!