Embeddings

We offer several embedding components that encode the text and layout information of PDF documents. Each of the following components can be added to a pipeline or composed with the others, and includes the preprocessing and postprocessing logic needed to convert and batch documents.

Factory name             Description
-----------------------  ------------------------------------------------------------------
simple-text-embedding    Embeds the textual features of the blocks
embedding-combiner       Encodes boxes using a combination of multiple encoders
sub-box-cnn-pooler       Pools the output of a CNN over the elements of a box (like words)
box-layout-embedding     Encodes the layout of the boxes
box-transformer          Contextualizes box representations using a transformer
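
To give a rough intuition of what composing embeddings means, here is a minimal, library-independent sketch in plain Python. It does not use the actual components listed above: the `Box` class, the two toy encoders, the `combine` helper and its `mode` argument are all hypothetical names chosen for illustration, and the real components operate on batched tensors rather than lists of floats.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Box:
    # Hypothetical stand-in for a text box extracted from a PDF page.
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


# An encoder maps a sequence of boxes to one feature vector per box.
Encoder = Callable[[Sequence[Box]], List[List[float]]]


def toy_text_encoder(boxes: Sequence[Box]) -> List[List[float]]:
    # Two crude textual features per box: character count and word count.
    return [[float(len(b.text)), float(len(b.text.split()))] for b in boxes]


def toy_layout_encoder(boxes: Sequence[Box]) -> List[List[float]]:
    # Two crude layout features per box: width and height.
    return [[b.x1 - b.x0, b.y1 - b.y0] for b in boxes]


def combine(encoders: Sequence[Encoder], mode: str = "cat") -> Encoder:
    # Build a new encoder that runs every sub-encoder on the same boxes
    # and merges their outputs box by box.
    def encode(boxes: Sequence[Box]) -> List[List[float]]:
        outputs = [enc(boxes) for enc in encoders]
        if mode == "cat":
            # Concatenate the vectors produced for each box.
            return [[v for vec in per_box for v in vec] for per_box in zip(*outputs)]
        if mode == "sum":
            # Element-wise sum; all encoders must share the same output size.
            return [[sum(vals) for vals in zip(*per_box)] for per_box in zip(*outputs)]
        raise ValueError(f"Unknown mode: {mode!r}")

    return encode


boxes = [
    Box("Hello world", 0.0, 0.0, 0.5, 0.25),
    Box("Page 1", 0.25, 0.75, 0.5, 1.0),
]
embed = combine([toy_text_encoder, toy_layout_encoder], mode="cat")
vectors = embed(boxes)  # one 4-dimensional vector per box
```

The combined encoder has the same interface as its parts, which is what makes this kind of composition convenient: a combined embedding can itself be passed to a further component, such as one that contextualizes the box representations.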

Layers

These components are not to be confused with layers, which are standard PyTorch modules that can be used to build trainable components, such as the ones described here.