Dataset
Disclaimer
We do not provide the dataset due to privacy and regulatory constraints. You will however find the description of the dataset below. We also release the code for the rule-based annotation system.
Data Selection
We annotated around 4000 documents, selected according to the distribution of AP-HP's Clinical Data Warehouse (CDW), to obtain a sample that is representative of the actual documents present within the CDW.
Training data are selected among notes that were edited after August 2017, in order to skew the model towards more recent clinical notes. The test set, however, is sampled without any time constraints, to make sure the model performs well overall.
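The selection rule above can be sketched as follows. This is a minimal illustration, not the actual sampling code: the record keys `note_id` and `edited_on` are hypothetical, and the exact cutoff date (here 1 September 2017) is an assumption, since the text only says "after August 2017".

```python
from datetime import date

# Assumed cutoff: notes edited on or after this date count as "after August 2017".
TRAIN_CUTOFF = date(2017, 9, 1)


def split_candidates(notes):
    """Return (train_pool, test_pool) from an iterable of note records.

    Each record is a dict with the hypothetical keys "note_id" and "edited_on".
    """
    notes = list(notes)
    # Training candidates: recent notes only.
    train_pool = [n for n in notes if n["edited_on"] >= TRAIN_CUTOFF]
    # Test candidates: every note, with no time constraint.
    test_pool = notes
    return train_pool, test_pool
```

The two pools overlap by construction; a real pipeline would also need to make sure that no document ends up in both the training and the test set.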
To ensure the robustness of the model, the documents in the training and test sets were extracted with two different PDF extraction methods:
- the legacy method, based on PDFBox with a fixed mask
- our new method, EDS-PDF, with an adaptive (machine-learned) mask
Annotated Entities
We annotated clinical documents with the following entities:
Label | Description |
---|---|
ADRESSE | Street address, e.g. 33 boulevard de Picpus |
DATE | Any absolute date other than a birthdate |
DATE_NAISSANCE | Birthdate |
HOPITAL | Hospital name, e.g. Hôpital Rothschild |
IPP | Internal AP-HP identifier for patients, displayed as a number |
MAIL | Email address |
NDA | Internal AP-HP identifier for visits, displayed as a number |
NOM | Any last name (patients, doctors, third parties) |
PRENOM | Any first name (patients, doctors, etc.) |
SECU | Social security number |
TEL | Any phone number |
VILLE | Any city |
ZIP | Any zip code |
Statistics
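The following table presents the counts of annotated entities per split, per PDF extraction method, and per label. The last two rows give the total number of entities (ENTS) and of documents (DOCS).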
Label | train / edspdf | train / pdfbox | dev / edspdf | dev / pdfbox | test / edspdf | test / pdfbox |
---|---|---|---|---|---|---|
DATE | 14711 | 2360 | 878 | 113 | 1973 | 2831 |
LASTNAME | 4910 | 4299 | 292 | 236 | 625 | 4150 |
FIRSTNAME | 3468 | 3826 | 215 | 216 | 478 | 3739 |
HOPITAL | 1451 | 758 | 87 | 47 | 162 | 796 |
PHONE | 397 | 1589 | 23 | 148 | 77 | 1851 |
BIRTHDATE | 916 | 519 | 52 | 31 | 87 | 484 |
PATIENT ID | 121 | 339 | 8 | 18 | 8 | 392 |
CITY | 592 | 742 | 27 | 47 | 44 | 810 |
VISIT ID | 49 | 283 | 7 | 17 | 6 | 282 |
ADDRESS | 212 | 543 | 10 | 35 | 12 | 625 |
MAIL | 20 | 182 | 0 | 17 | 1 | 166 |
ZIP | 215 | 552 | 10 | 35 | 14 | 635 |
NSS | 73 | 79 | 6 | 7 | 4 | 32 |
ENTS | 27135 | 16071 | 1615 | 967 | 3491 | 16793 |
DOCS | 3025 | 348 | 200 | 22 | 348 | 348 |
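Counts of this kind can be computed with a few lines of Python. The sketch below is illustrative only: the document structure (keys `split`, `source`, `entities`, `label`) is a hypothetical schema, and `toy_docs` is made-up toy data, not an excerpt from the dataset.

```python
from collections import Counter


def count_entities(docs):
    """Count entities per (split, source, label) and documents per (split, source)."""
    ent_counts = Counter()
    doc_counts = Counter()
    for doc in docs:
        column = (doc["split"], doc["source"])
        doc_counts[column] += 1
        for ent in doc["entities"]:
            ent_counts[(*column, ent["label"])] += 1
    return ent_counts, doc_counts


# Toy input, for illustration only (hypothetical schema, invented values).
toy_docs = [
    {"split": "train", "source": "edspdf",
     "entities": [{"label": "DATE"}, {"label": "DATE"}, {"label": "NOM"}]},
    {"split": "train", "source": "pdfbox",
     "entities": [{"label": "TEL"}]},
    {"split": "test", "source": "edspdf", "entities": []},
]

ent_counts, doc_counts = count_entities(toy_docs)
```

Summing `ent_counts` over all labels of a given (split, source) pair gives the equivalent of the ENTS row, and `doc_counts` gives the equivalent of the DOCS row.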
Software
The software tools used to annotate the documents with personal identification entities were:
- LabelStudio for the first annotation campaign
- Metanno for the second annotation campaign
Any other annotation software will do.
The `convert` step takes as input either a jsonlines file (`.jsonl`) or a folder containing Standoff files (`.ann`) from an annotation with Brat.
Feel free to submit a pull request if these formats do not suit you!
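To make the two input formats concrete, here is a minimal sketch of how they could be read. It is not the project's `convert` implementation: the function names are hypothetical, the schema of the jsonlines records is left untouched because it is not specified here, and only Brat text-bound annotations (lines starting with `T`) are handled.

```python
import json
from pathlib import Path


def parse_ann(ann_text):
    """Parse the text-bound annotations of a Brat standoff (.ann) file."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes and empty lines
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous spans are written as "start end;start end".
        fragments = [tuple(int(i) for i in frag.split()) for frag in offsets.split(";")]
        entities.append({"id": ann_id, "label": label, "fragments": fragments, "text": text})
    return entities


def parse_jsonl(text):
    """Parse jsonlines content: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def load_annotations(path):
    """Dispatch on the input type: a .jsonl file or a folder of .ann files."""
    path = Path(path)
    if path.is_dir():
        docs = []
        for ann_path in sorted(path.glob("*.ann")):
            txt_path = ann_path.with_suffix(".txt")  # Brat keeps the text next to the .ann
            docs.append({
                "doc_id": ann_path.stem,
                "text": txt_path.read_text(encoding="utf-8") if txt_path.exists() else None,
                "entities": parse_ann(ann_path.read_text(encoding="utf-8")),
            })
        return docs
    return parse_jsonl(path.read_text(encoding="utf-8"))
```

For example, the line `T1<TAB>ADRESSE 0 22<TAB>33 boulevard de Picpus` is parsed into one entity with the label `ADRESSE` and a single fragment covering characters 0 to 22.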