Components overview

EDS-PDF provides easy-to-use components for defining PDF processing pipelines.

| Factory name | Description |
|---|---|
| pdfminer-extractor | Extracts text lines with the pdfminer library |
| mupdf-extractor | Extracts text lines with the PyMuPDF library |
| poppler-extractor | Extracts text lines with the Poppler library |

| Factory name | Description |
|---|---|
| mask-classifier | Simple rule-based classification |
| multi-mask-classifier | Simple rule-based classification |
| dummy-classifier | Dummy classifier, for testing purposes |
| random-classifier | To sow chaos |
| trainable-classifier | Trainable box classification model |

| Factory name | Description |
|---|---|
| simple-aggregator | Returns a dictionary with one key for each detected class |

| Factory name | Description |
|---|---|
| simple-text-embedding | Embeds the textual features of the blocks |
| embedding-combiner | Encodes boxes using a combination of multiple encoders |
| sub-box-cnn-pooler | Pools the output of a CNN over the elements of a box (such as words) |
| box-layout-embedding | Encodes the layout of the boxes |
| box-transformer | Contextualizes box representations using a transformer |
| huggingface-embedding | Box representations using a Hugging Face multi-modal model |

You can add any of these components to your EDS-PDF pipeline by calling add_pipe with its factory name, for instance:

# ↑ Omitted code that defines the pipeline object ↑
pipeline.add_pipe("pdfminer-extractor", name="component-name", config=...)
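
To make the factory-name pattern concrete, here is a minimal, self-contained sketch of how a pipeline might resolve a factory name into a component and chain the components together. It is a toy model for illustration only and does not reproduce EDS-PDF's implementation: `ToyPipeline`, `FACTORIES`, `register`, the `toy-*` factory names, and the `keyword`, `label` and `default` parameters are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Component = Callable[[dict], dict]

# Hypothetical registry: factory name -> function that builds a component.
FACTORIES: Dict[str, Callable[..., Component]] = {}


def register(factory_name: str):
    """Decorator that stores a component factory under a string name."""
    def decorator(factory):
        FACTORIES[factory_name] = factory
        return factory
    return decorator


@register("toy-extractor")
def make_extractor() -> Component:
    def extract(doc: dict) -> dict:
        # Split the raw content into non-empty text lines.
        doc["lines"] = [ln for ln in doc["content"].split("\n") if ln.strip()]
        return doc
    return extract


@register("toy-classifier")
def make_classifier(keyword: str, label: str = "body",
                    default: str = "pollution") -> Component:
    def classify(doc: dict) -> dict:
        # Rule-based labelling: lines containing the keyword get `label`.
        doc["labels"] = [
            label if keyword in ln.lower() else default for ln in doc["lines"]
        ]
        return doc
    return classify


@register("toy-aggregator")
def make_aggregator() -> Component:
    def aggregate(doc: dict) -> dict:
        # Build a dictionary with one key per detected class.
        grouped: Dict[str, List[str]] = {}
        for ln, lab in zip(doc["lines"], doc["labels"]):
            grouped.setdefault(lab, []).append(ln)
        doc["aggregated"] = {k: "\n".join(v) for k, v in grouped.items()}
        return doc
    return aggregate


class ToyPipeline:
    def __init__(self) -> None:
        self.pipes: List[Tuple[str, Component]] = []

    def add_pipe(self, factory_name: str, name: Optional[str] = None,
                 config: Optional[dict] = None) -> Component:
        # Look up the factory by name and instantiate it with the config.
        if factory_name not in FACTORIES:
            raise ValueError(f"Unknown factory: {factory_name!r}")
        component = FACTORIES[factory_name](**(config or {}))
        self.pipes.append((name or factory_name, component))
        return component

    def __call__(self, content: str) -> dict:
        # Run every component in insertion order on a shared document dict.
        doc = {"content": content}
        for _, component in self.pipes:
            doc = component(doc)
        return doc


pipeline = ToyPipeline()
pipeline.add_pipe("toy-extractor")
pipeline.add_pipe("toy-classifier", name="classifier",
                  config={"keyword": "invoice"})
pipeline.add_pipe("toy-aggregator")

result = pipeline("Invoice 42\nPage 1 of 3\nInvoice total: 10")
print(result["aggregated"])
```

The order of the add_pipe calls matters in this sketch: the extractor must run before the classifier, and the classifier before the aggregator, since each step reads what the previous one wrote.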