Extraction

The extraction phase consists of reading the PDF document and gathering text blocks, along with their dimensions and position within the document. These blocks are then passed to the classification phase, which separates the body from the rest.
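To make "text blocks with their dimensions and position" concrete, here is a minimal sketch of what one extracted block could look like. The `TextBlock` class and its field names are assumptions made for illustration, not the schema EDS-PDF actually returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    """One extracted block of text and where it sits on the page (hypothetical schema)."""

    page: int   # zero-based page index
    x0: float   # left edge
    y0: float   # top edge
    x1: float   # right edge
    y1: float   # bottom edge
    text: str

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


# Made-up example values, in page units measured from the top-left corner.
blocks = [
    TextBlock(page=0, x0=72.0, y0=120.0, x1=520.0, y1=134.0, text="The patient was admitted for..."),
    TextBlock(page=0, x0=72.0, y0=40.0, x1=300.0, y1=52.0, text="Hospital letterhead"),
]

# Reading order: page first, then top to bottom, then left to right.
ordered = sorted(blocks, key=lambda b: (b.page, b.y0, b.x0))
```

Keeping the position alongside the text is what lets a later classification step tell a header near the top of the page from body text further down.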

Text-based PDF

We provide multiple extractor architectures for text-based PDFs:

Factory name	Description
`pdfminer-extractor`	Extracts text lines with the `pdfminer` library
`mupdf-extractor`	Extracts text lines with the `pymupdf` library
`poppler-extractor`	Extracts text lines with the `poppler` library

Image-based PDF

Image-based PDF documents require an OCR¹ step, which is not natively supported by EDS-PDF. However, you can easily extend EDS-PDF by adding such a method to the registry.

We plan on adding such an OCR extractor component in the future.
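As a rough illustration of the registry idea, the sketch below shows the general pattern of registering a custom extractor factory under a name and looking it up later. It deliberately does not use the EDS-PDF registry: `EXTRACTORS`, `register`, the factory name `my-ocr-extractor`, and `fake_ocr` are all hypothetical stand-ins, and the real decorator and component interface should be taken from the EDS-PDF documentation.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a factory registry: maps a factory name to a callable.
EXTRACTORS: Dict[str, Callable[..., Callable[[bytes], List[str]]]] = {}


def register(name: str):
    """Decorator that stores a factory under `name` (illustrative only)."""

    def decorator(factory):
        EXTRACTORS[name] = factory
        return factory

    return decorator


@register("my-ocr-extractor")  # hypothetical factory name
def make_ocr_extractor(ocr: Callable[[bytes], str]):
    """Build an extractor that delegates recognition to `ocr`, a user-supplied engine."""

    def extract(image: bytes) -> List[str]:
        # Split the recognised text into non-empty, stripped lines.
        return [line.strip() for line in ocr(image).splitlines() if line.strip()]

    return extract


def fake_ocr(image: bytes) -> str:
    # Placeholder engine so the sketch runs without any OCR dependency.
    return "Discharge summary\n\nPatient admitted on Monday\n"


# Look the factory up by name, as a configuration-driven pipeline would.
extractor = EXTRACTORS["my-ocr-extractor"](ocr=fake_ocr)
lines = extractor(b"")
```

In practice `fake_ocr` would be replaced by a call into an actual OCR engine, and the extractor would return positioned text blocks rather than bare strings.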

¹ Optical Character Recognition, or OCR, is the process of extracting characters and words from an image.