Components overview

EDS-PDF provides easy-to-use components for defining PDF processing pipelines.

Box extractors

| Factory name | Description |
| --- | --- |
| `pdfminer-extractor` | Extracts text lines with the `pdfminer` library |
| `mupdf-extractor` | Extracts text lines with the `pymupdf` library |
| `poppler-extractor` | Extracts text lines with the `poppler` library |

Box classifiers

| Factory name | Description |
| --- | --- |
| `mask-classifier` | Simple rule-based classification |
| `multi-mask-classifier` | Simple rule-based classification |
| `dummy-classifier` | Dummy classifier, for testing purposes. |
| `random-classifier` | To sow chaos |
| `trainable-classifier` | Trainable box classification model |

Aggregators

| Factory name | Description |
| --- | --- |
| `simple-aggregator` | Returns a dictionary with one key for each detected class |

Embeddings

| Factory name | Description |
| --- | --- |
| `simple-text-embedding` | A module that embeds the textual features of the blocks. |
| `embedding-combiner` | Encodes boxes using a combination of multiple encoders |
| `sub-box-cnn-pooler` | Pools the output of a CNN over the elements of a box (like words) |
| `box-layout-embedding` | Encodes the layout of the boxes |
| `box-transformer` | Contextualizes box representations using a transformer |
| `huggingface-embedding` | Box representations using a Huggingface multi-modal model. |

You can add any of these components to your EDS-PDF pipeline by calling `add_pipe` with the component's factory name, for instance:

# ↑ Omitted code that defines the pipeline object ↑
pipeline.add_pipe("pdfminer-extractor", name="component-name", config=...)
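
To illustrate the idea behind adding components by factory name, here is a minimal, self-contained sketch of a name-to-factory registry. It is not EDS-PDF's implementation: `ToyPipeline`, `FACTORIES`, `register`, the `label` option and the dictionary-based boxes are all hypothetical stand-ins, and only the factory names and the `add_pipe(factory_name, name=..., config=...)` call shape are borrowed from this page.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical registry mapping factory names to component factories.
FACTORIES: Dict[str, Callable[..., Any]] = {}


def register(factory_name: str):
    """Decorator that records a factory function under a factory name."""
    def decorator(fn):
        FACTORIES[factory_name] = fn
        return fn
    return decorator


class ToyPipeline:
    """Toy stand-in for a pipeline object; components run in insertion order."""

    def __init__(self) -> None:
        self.pipes: List[Tuple[str, Any]] = []

    @property
    def pipe_names(self) -> List[str]:
        return [name for name, _ in self.pipes]

    def add_pipe(self, factory_name: str, name: Optional[str] = None,
                 config: Optional[Dict[str, Any]] = None) -> Any:
        if factory_name not in FACTORIES:
            raise ValueError(f"Unknown factory: {factory_name!r}")
        name = name or factory_name  # default the component name to the factory name
        if name in self.pipe_names:
            raise ValueError(f"Component name already used: {name!r}")
        # Instantiate the component, passing the config as keyword arguments.
        component = FACTORIES[factory_name](**(config or {}))
        self.pipes.append((name, component))
        return component

    def __call__(self, data: Any) -> Any:
        for _, component in self.pipes:
            data = component(data)
        return data


@register("dummy-classifier")
def make_dummy_classifier(label: str = "body"):
    # Assumed behaviour for this sketch: give every box the same label.
    def classify(boxes):
        return [dict(box, label=label) for box in boxes]
    return classify


@register("simple-aggregator")
def make_simple_aggregator():
    # One key per detected class, with the texts of that class joined by newlines.
    def aggregate(boxes):
        grouped: Dict[str, List[str]] = {}
        for box in boxes:
            grouped.setdefault(box["label"], []).append(box["text"])
        return {label: "\n".join(texts) for label, texts in grouped.items()}
    return aggregate


pipeline = ToyPipeline()
pipeline.add_pipe("dummy-classifier", name="classifier", config={"label": "body"})
pipeline.add_pipe("simple-aggregator")
result = pipeline([{"text": "Hello"}, {"text": "world"}])
```

In this sketch an unknown factory name raises a `ValueError`, and the component name defaults to the factory name when `name` is omitted. The real library may handle both cases differently.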