Extraction
The extraction phase consists of reading the PDF document and gathering text blocks, along with their dimensions and position within the document. These blocks then go on to the classification phase, which separates the body from the rest.
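To make the notion of a text block concrete, here is a minimal sketch of what one extracted block might carry. The `TextBlock` class, its field names, and the choice of coordinates expressed as fractions of the page size are assumptions made for illustration, not the EDS-PDF data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    """Hypothetical container for one extracted block (not the EDS-PDF class)."""

    text: str
    page: int  # zero-based page index
    x0: float  # left edge, as a fraction of the page width (assumption)
    y0: float  # top edge, as a fraction of the page height (assumption)
    x1: float  # right edge
    y1: float  # bottom edge

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def in_region(block: TextBlock, y_min: float, y_max: float) -> bool:
    """Return True if the block lies entirely inside a vertical band of the page."""
    return y_min <= block.y0 and block.y1 <= y_max


# Two made-up blocks: a header near the top of the page and a body line below it.
blocks = [
    TextBlock("Hospital letterhead", page=0, x0=0.1, y0=0.02, x1=0.9, y1=0.06),
    TextBlock("The patient was admitted for observation.", page=0, x0=0.1, y0=0.30, x1=0.9, y1=0.34),
]

# A later classification step could use position to keep only plausible body blocks.
body_candidates = [b for b in blocks if in_region(b, 0.2, 0.9)]
```

Keeping position and size alongside the text is what lets a downstream classifier tell headers, footers, and margins apart from the body.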
Text-based PDF
We provide multiple extractor architectures for text-based PDFs:
| Factory name | Description |
|---|---|
| `pdfminer-extractor` | Extracts text lines with the `pdfminer` library |
| `mupdf-extractor` | Extracts text lines with the `pymupdf` library |
| `poppler-extractor` | Extracts text lines with the `poppler` library |
Image-based PDF
Image-based PDF documents require an OCR[^1] step, which is not natively supported by EDS-PDF. However, you can easily extend EDS-PDF by adding such a method to the registry.
We plan on adding such an OCR extractor component in the future.
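The general idea of registering a component under a factory name can be sketched as follows. This is a self-contained illustration of the pattern only: the `EXTRACTORS` dictionary, the `register` decorator, and the `dummy-ocr-extractor` name are hypothetical and do not reproduce the EDS-PDF registry API, and the extractor body is a placeholder rather than a real OCR call.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a component registry; not the EDS-PDF registry API.
EXTRACTORS: Dict[str, Callable[[bytes], List[str]]] = {}


def register(name: str):
    """Return a decorator that stores an extractor under a factory name."""

    def decorator(func: Callable[[bytes], List[str]]):
        EXTRACTORS[name] = func
        return func

    return decorator


@register("dummy-ocr-extractor")
def dummy_ocr_extractor(pdf: bytes) -> List[str]:
    # A real implementation would render each page to an image and run an
    # OCR engine on it. This placeholder returns a fixed line instead.
    return ["<text recognized by OCR>"]


def extract(name: str, pdf: bytes) -> List[str]:
    """Look up an extractor by its factory name and apply it to the PDF bytes."""
    return EXTRACTORS[name](pdf)


lines = extract("dummy-ocr-extractor", b"%PDF-1.7 ...")
```

Looking components up by name is what allows a custom extractor to be swapped in for one of the built-in ones without changing the rest of the pipeline.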
[^1]: Optical Character Recognition, or OCR, is the process of extracting characters and words from an image.