Extraction

The extraction phase consists of reading the PDF document and gathering text blocks, along with their dimensions and position within the document. These blocks are then passed to the classification phase, which separates the body from the rest.
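
To make the output of this phase concrete, here is a minimal sketch of what an extracted block could look like. The `TextBlock` class, its field names and the `sort_reading_order` helper are hypothetical and chosen for illustration only; they are not the data structures EDS-PDF actually uses. The coordinates assume an origin at the top-left corner of the page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    """Hypothetical container for one extracted block (not an EDS-PDF class)."""

    page: int   # zero-based page index
    x0: float   # left edge
    y0: float   # top edge (origin assumed at the top-left of the page)
    x1: float   # right edge
    y1: float   # bottom edge
    text: str

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def sort_reading_order(blocks: List[TextBlock]) -> List[TextBlock]:
    # Order by page, then top-to-bottom, then left-to-right.
    return sorted(blocks, key=lambda b: (b.page, b.y0, b.x0))


blocks = [
    TextBlock(page=0, x0=50.0, y0=700.0, x1=300.0, y1=712.0, text="Page 1 of 3"),
    TextBlock(page=0, x0=50.0, y0=100.0, x1=500.0, y1=130.0, text="Patient was admitted ..."),
]
ordered = sort_reading_order(blocks)
```

Keeping position and size next to the text is what lets a later classification step tell a footer near the bottom of the page apart from body text.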

Text-based PDF

We provide multiple extractor architectures for text-based PDFs:

Component	Description
pdfminer-extractor	Text-based PDF extraction using PDFMiner
mupdf-extractor	Text-based PDF extraction using MuPDF
poppler-extractor	Text-based PDF extraction using Poppler

Image-based PDF

Image-based PDF documents require an OCR¹ step, which is not natively supported by EDS-PDF. However, you can extend EDS-PDF by adding an OCR-based extractor to the registry.

We plan on adding such an OCR extractor component in the future.
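
Until then, the general idea of registering a custom extractor can be sketched as follows. This is a generic, self-contained registry pattern, not EDS-PDF's actual registry API: the `EXTRACTORS` dictionary, the `register` decorator, the `my-ocr-extractor` name and the returned block format are all assumptions made for illustration, and the OCR call itself is left as a placeholder.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an extractor registry.
# EDS-PDF's real registry and its signatures may differ.
Extractor = Callable[[bytes], List[dict]]
EXTRACTORS: Dict[str, Extractor] = {}


def register(name: str) -> Callable[[Extractor], Extractor]:
    """Return a decorator that stores an extractor under the given name."""

    def decorator(func: Extractor) -> Extractor:
        EXTRACTORS[name] = func
        return func

    return decorator


@register("my-ocr-extractor")
def ocr_extractor(pdf: bytes) -> List[dict]:
    # Placeholder body: a real implementation would render each page to an
    # image and run an OCR engine on it, then return one entry per block.
    return [
        {"page": 0, "x0": 0.0, "y0": 0.0, "x1": 1.0, "y1": 1.0, "text": "<ocr output>"}
    ]


# Look the extractor up by name, the way a configurable pipeline would.
extract = EXTRACTORS["my-ocr-extractor"]
result = extract(b"%PDF-1.7 ...")
```

The point of the pattern is that an OCR extractor returning blocks in the same shape as the text-based extractors can be swapped in by name, without changing the downstream classification step.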

¹ Optical Character Recognition, or OCR, is the process of extracting characters and words from an image.