Components overview
EDS-PDF provides easy-to-use components for defining PDF processing pipelines.
| Factory name | Description |
|---|---|
| `pdfminer-extractor` | Extracts text lines with the pdfminer library |
| `mupdf-extractor` | Extracts text lines with the pymupdf library |
| `poppler-extractor` | Extracts text lines with the poppler library |
| Factory name | Description |
|---|---|
| `mask-classifier` | Simple rule-based classification |
| `multi-mask-classifier` | Simple rule-based classification |
| `dummy-classifier` | Dummy classifier, for testing purposes. |
| `random-classifier` | To sow chaos |
| `trainable-classifier` | Trainable box classification model |
| Factory name | Description |
|---|---|
| `simple-aggregator` | Returns a dictionary with one key for each detected class |
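
To illustrate what "one key for each detected class" can mean in practice, here is a minimal, self-contained sketch in plain Python. It does not use EDS-PDF: the `TextLine` type and the `aggregate_by_label` function are hypothetical names introduced for this example, and the actual `simple-aggregator` may order, join, or post-process text differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TextLine:
    """Hypothetical stand-in for an extracted and classified text line."""
    text: str
    label: str


def aggregate_by_label(lines):
    """Group line texts by label, returning one key per detected class."""
    grouped = defaultdict(list)
    for line in lines:
        grouped[line.label].append(line.text)
    # Join the lines of each class with newlines, preserving input order.
    return {label: "\n".join(texts) for label, texts in grouped.items()}


lines = [
    TextLine("Report", "header"),
    TextLine("First line.", "body"),
    TextLine("Second line.", "body"),
]
result = aggregate_by_label(lines)
```

Here `result` has two keys, `header` and `body`, each holding the concatenated text of the lines assigned to that class.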
| Factory name | Description |
|---|---|
| `simple-text-embedding` | A module that embeds the textual features of the blocks |
| `embedding-combiner` | Encodes boxes using a combination of multiple encoders |
| `sub-box-cnn-pooler` | Pools the output of a CNN over the elements of a box (such as words) |
| `box-layout-embedding` | Encodes the layout of the boxes |
| `box-transformer` | Contextualizes box representations using a transformer |
| `huggingface-embedding` | Computes box representations using a Hugging Face multi-modal model |
You can add them to your EDS-PDF pipeline by calling `add_pipe`, for instance:
# ↑ Omitted code that defines the pipeline object ↑
pipeline.add_pipe("pdfminer-extractor", name="component-name", config=...)
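
The factory names listed in the tables are string keys that `add_pipe` resolves to a component. The sketch below mimics that lookup pattern in plain Python so it can run without EDS-PDF installed. Everything in it is an assumption made for illustration: `FACTORIES`, `register_factory`, `ToyPipeline`, and the `toy-extractor` and `toy-upper` factories are hypothetical and are not part of the EDS-PDF API.

```python
# Hypothetical registry mapping factory names to component constructors.
FACTORIES = {}


def register_factory(name):
    """Decorator that records a component constructor under a factory name."""
    def decorator(fn):
        FACTORIES[name] = fn
        return fn
    return decorator


@register_factory("toy-extractor")
def make_toy_extractor(sep="\n"):
    # Splits raw text on `sep`; stands in for a real extractor.
    return lambda doc: doc.split(sep)


@register_factory("toy-upper")
def make_toy_upper():
    # Upper-cases every line; stands in for a downstream component.
    return lambda lines: [line.upper() for line in lines]


class ToyPipeline:
    def __init__(self):
        self.pipes = []  # (name, component) pairs, in the order they were added

    def add_pipe(self, factory_name, name=None, config=None):
        """Build a component from its factory name and append it to the pipeline."""
        if factory_name not in FACTORIES:
            raise ValueError(f"Unknown factory: {factory_name!r}")
        component = FACTORIES[factory_name](**(config or {}))
        self.pipes.append((name or factory_name, component))
        return component

    def __call__(self, doc):
        # Run each component on the output of the previous one.
        for _, component in self.pipes:
            doc = component(doc)
        return doc


pipeline = ToyPipeline()
pipeline.add_pipe("toy-extractor", name="extractor", config={"sep": ";"})
pipeline.add_pipe("toy-upper")
output = pipeline("a;b")
```

In this toy version, `name` only labels the component inside the pipeline and `config` is passed as keyword arguments to the factory; an unknown factory name raises a `ValueError` instead of failing later at call time.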