PDF Annotation
In this section, we will cover one methodology to annotate PDF documents.
Data annotation at AP-HP's CDW
At AP-HP's CDW1, we recently moved away from a rule- and Java-based PDF extraction pipeline (using PDFBox) to one using EDS-PDF. Hence, EDS-PDF is used in production, helping extract text from around 100k PDF documents every day.
To train our pipeline presently in production, we annotated around 270 documents, and reached a f1-score of 0.98 on the body classification.
Preparing the data for annotation
We will frame the annotation phase as an image segmentation task,
where annotators are asked to draw bounding boxes around the different sections.
Hence, the very first step is to convert PDF documents to images. We suggest using the
library pdf2image
for that step.
The following script will convert the PDF documents located in a data/pdfs
directory
to PNG images inside the data/images
folder.
import pdf2image
from pathlib import Path
DATA_DIR = Path("data")
PDF_DIR = DATA_DIR / "pdfs"
IMAGE_DIR = DATA_DIR / "images"
for pdf in PDF_DIR.glob("*.pdf"):
imgs = pdf2image.convert_from_bytes(pdf)
for page, img in enumerate(imgs):
path = IMAGE_DIR / f"{pdf.stem}_{page}.png"
img.save(path)
You can use any annotation tool to annotate the images. If you're looking for a simple way to annotate from within a Jupyter Notebook, ipyannotations might be a good fit.
You will need to post-process the output to convert the annotations to the following format:
Key | Description |
---|---|
page |
Page within the PDF (0-indexed) |
x0 |
Horizontal position of the top-left corner of the bounding box |
x1 |
Horizontal position of the bottom-right corner of the bounding box |
y0 |
Vertical position of the top-left corner of the bounding box |
y1 |
Vertical position of the bottom-right corner of the bounding box |
label |
Class of the bounding box (eg body , header ...) |
All dimensions should be normalised by the height and width of the page.
Saving the dataset
Once the annotation phase is complete, make sure the train/test split is performed once and for all when you create the dataset.
We suggest the following structure:
dataset/
├── train/
│ ├── <note_id_1>.pdf
│ ├── <note_id_1>.json
│ ├── <note_id_2>.pdf
│ ├── <note_id_2>.json
│ └── ...
└── test/
├── <note_id_n>.pdf
├── <note_id_n>.json
└── ...
Where the normalised annotation resides in a JSON file living next to the related PDF, and uses the following schema:
Key | Description |
---|---|
note_id |
Reference to the document |
<properties> |
Optional property of the document itself |
annotations |
List of annotations, following the schema above |
This structure presents the advantage of being machine- and human-friendly. The JSON file contains annotated regions as well as any document property that could be useful to adapt the pipeline (typically for the classification step).
Extracting annotations
The following snippet extracts the annotations into a workable format:
from pathlib import Path
import pandas as pd
def get_annotations(
directory: Path,
) -> pd.DataFrame:
"""
Read annotations from the dataset directory.
Parameters
----------
directory : Path
Dataset directory
Returns
-------
pd.DataFrame
Pandas DataFrame containing the annotations.
"""
dfs = []
iterator = tqdm(list(directory.glob("*.json")))
for path in iterator:
meta = json.loads(path.read_text())
df = pd.DataFrame.from_records(meta.pop("annotations"))
for k, v in meta.items(): # (1)
df[k] = v
dfs.append(df)
return pd.concat(dfs)
train_path = Path("dataset/train")
annotations = get_annotations(train_path)
- Add a column for each additional property saved in the dataset.
The annotations compiled this way can be used to train a pipeline. See the trained pipeline recipe for more detail.
-
Greater Paris University Hospital's Clinical Data Warehouse ↩