Skip to content

Pipelines overview

EDS-PDF provides easy-to-use components for defining PDF processing pipelines.

Pipeline Description
pdfminer-extractor Extracts text lines with the pdfminer library
mupdf-extractor Extracts text lines with the pymupdf library
poppler-extractor Extracts text lines with the poppler software
Pipeline Description
deep-classifier Trainable box classification model
mask-classifier Simple rule-based classification
multi-mask-classifier Simple rule-based classification
dummy-classifier Dummy classifier, for testing purposes.
random-classifier To sow chaos
Method Description
simple-aggregator Returns a dictionary with one key for each detected class
styled-aggregator Returns the same dictionary, as well as the information on styles

You can add them to your EDS-PDF pipeline by simply calling add_pipe, for instance:

# ↑ Omitted code that defines the pipeline object ↑
pipeline.add_pipe("pdfminer-extractor", name="component-name", config=...)