
Extraction

The extraction phase consists of reading the PDF document and gathering text blocks, along with their dimensions and position within the document. These blocks are then passed to the classification phase, which separates the body from the rest.
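To make the output of this phase concrete, here is a minimal sketch of what an extracted text block could carry. The `TextBlock` class, its field names, and the use of page-relative coordinates are assumptions made for illustration; they are not EDS-PDF's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    """Hypothetical container for one extracted block (not the EDS-PDF class)."""

    text: str   # textual content of the block
    page: int   # zero-based page index
    x0: float   # left edge, as a fraction of the page width (assumed convention)
    y0: float   # top edge, as a fraction of the page height (assumed convention)
    x1: float   # right edge
    y1: float   # bottom edge

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


# Two made-up blocks: a title near the top and a page footer near the bottom.
blocks = [
    TextBlock("Patient report", page=0, x0=0.10, y0=0.05, x1=0.60, y1=0.08),
    TextBlock("Page 1/3", page=0, x0=0.80, y0=0.95, x1=0.95, y1=0.97),
]

# Position alone already hints at which blocks are unlikely to be body text.
footer_candidates = [b for b in blocks if b.y0 > 0.9]
```

A classifier can then rely on features such as `width`, `height` and the coordinates to decide whether a block belongs to the body.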

Text-based PDF

We provide multiple extractor architectures for text-based PDFs:

Factory name       | Description
pdfminer-extractor | Extracts text lines with the pdfminer library
mupdf-extractor    | Extracts text lines with the pymupdf library
poppler-extractor  | Extracts text lines with the poppler library

Image-based PDF

Image-based PDF documents require an OCR [1] step, which is not natively supported by EDS-PDF. However, you can easily extend EDS-PDF by adding such a method to the registry.

We plan on adding such an OCR extractor component in the future.
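As an illustration of the registry-based extension mentioned above, the sketch below shows the general pattern of registering an extractor factory under a name and retrieving it later. Everything here is hypothetical: the `EXTRACTORS` dictionary, the `register` decorator, the `ocr-extractor` name and the `recognize` callable are stand-ins written in plain Python, not EDS-PDF's registry API. Refer to the EDS-PDF registry documentation for the real mechanism.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-in for a component registry: factory name -> factory function.
EXTRACTORS: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Return a decorator that stores a factory under `name`."""

    def decorator(factory: Callable[..., Any]) -> Callable[..., Any]:
        EXTRACTORS[name] = factory
        return factory

    return decorator


@register("ocr-extractor")  # hypothetical factory name
def create_ocr_extractor(recognize: Callable[[bytes], List[str]]):
    """Build an extractor from `recognize`, which maps one page image to text lines."""

    def extract(page_images: List[bytes]) -> List[Tuple[int, str]]:
        lines = []
        for page_number, image in enumerate(page_images):
            for text in recognize(image):
                lines.append((page_number, text))
        return lines

    return extract


def fake_recognize(image: bytes) -> List[str]:
    """Stand-in for a real OCR engine: treats the bytes as ASCII text."""
    return image.decode("ascii").splitlines()


# Look the factory up by name, as a registry-driven pipeline would.
extractor = EXTRACTORS["ocr-extractor"](fake_recognize)
lines = extractor([b"Hello\nWorld", b"Page two"])
```

In a real component, `recognize` would wrap an actual OCR engine and the extractor would also return the position and size of each recognized line, so that the classification phase receives the same kind of blocks as for text-based PDFs.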


  1. Optical Character Recognition, or OCR, is the process of extracting characters and words from an image.