Embeddings

We offer several embedding methods to encode the textual and layout information of PDF documents. The following components can be added to a pipeline or composed together; each contains the preprocessing and postprocessing logic needed to convert and batch documents. A composition sketch is given after the table below.

| Factory name | Description |
|---|---|
| `simple-text-embedding` | Embeds the textual features of the blocks |
| `embedding-combiner` | Encodes boxes using a combination of multiple encoders |
| `sub-box-cnn-pooler` | Pools the output of a CNN over the elements of a box (e.g. words) |
| `box-layout-embedding` | Encodes the layout of the boxes |
| `box-transformer` | Contextualizes box representations using a transformer |
| `huggingface-embedding` | Computes box representations using a Hugging Face multi-modal model |
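
To make the notion of composition concrete, here is a hedged sketch of how these factories could be nested in a pipeline: a `box-transformer` contextualizing the output of an `embedding-combiner` that merges a text encoder with a layout encoder. The `Pipeline` / `add_pipe` calls, the nested `@factory` configuration syntax, the extractor name, and the configuration keys (`encoders`, `embedding`, `size`) are assumptions made for illustration; refer to each factory's reference for the actual parameters.

```python
from edspdf import Pipeline

# Hedged sketch: nest the factories from the table above into one pipeline.
# The extractor name and every configuration key below ("embedding",
# "encoders", "size", ...) are assumptions made for illustration only.
model = Pipeline()
model.add_pipe("pdfminer-extractor")  # assumed text extractor
model.add_pipe(
    "box-transformer",  # contextualizes box representations
    name="embedding",
    config={
        "embedding": {
            "@factory": "embedding-combiner",  # merges several encoders
            "encoders": [
                {
                    # pools a CNN over the words of each box
                    "@factory": "sub-box-cnn-pooler",
                    "embedding": {"@factory": "simple-text-embedding", "size": 72},
                },
                {"@factory": "box-layout-embedding", "size": 72},
            ],
        },
    },
)
```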

Layers

These components are not to be confused with layers, which are standard PyTorch modules that can be used to build trainable components, such as the ones described here.
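
By contrast, a layer is a plain `torch.nn.Module` with no registered factory name and no document preprocessing or batching logic. The minimal, hypothetical pooling layer below only illustrates that distinction; it is not one of the layers shipped with the library.

```python
import torch
from torch import nn


class MaxWordPooler(nn.Module):
    """Hypothetical layer: reduces per-word embeddings to one vector per box."""

    def forward(self, words: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # words: (n_boxes, n_words, dim); mask: (n_boxes, n_words), True for real words.
        # Padding positions are masked out before taking the max over the word axis.
        masked = words.masked_fill(~mask.unsqueeze(-1), float("-inf"))
        return masked.max(dim=1).values
```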