Embeddings

We offer several embedding methods to encode the text and layout information of PDFs. The following components can be added to a pipeline or composed together. Each one contains the preprocessing and postprocessing logic needed to convert and batch documents.

| Factory name | Description |
|--------------|-------------|
| `simple-text-embedding` | A module that embeds the textual features of the blocks. |
| `embedding-combiner` | Encodes boxes using a combination of multiple encoders |
| `sub-box-cnn-pooler` | Pools the output of a CNN over the elements of a box (like words) |
| `box-layout-embedding` | Encodes the layout of the boxes |
| `box-transformer` | Contextualizes box representations using a transformer |
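
To make the idea of composing encoders more concrete, here is a minimal, dependency-free sketch of how a combiner might merge the outputs of a text encoder and a layout encoder for each box. Everything in it (`Box`, `text_encoder`, `layout_encoder`, `combine`, and the `mode` parameter) is hypothetical and chosen for illustration; it is not the library's API, and the real components are trainable modules rather than hand-written feature functions.

```python
# Illustrative sketch only: all names below are hypothetical, not the library's API.
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class Box:
    """A text box with its content and normalized page coordinates."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


Encoder = Callable[[Sequence[Box]], List[Vector]]


def text_encoder(boxes: Sequence[Box]) -> List[Vector]:
    # Toy textual features: character count and word count of each box.
    return [[float(len(b.text)), float(len(b.text.split()))] for b in boxes]


def layout_encoder(boxes: Sequence[Box]) -> List[Vector]:
    # Toy layout features: width and height of each box.
    return [[b.x1 - b.x0, b.y1 - b.y0] for b in boxes]


def combine(encoders: Sequence[Encoder], mode: str = "cat") -> Encoder:
    """Build an encoder that merges the outputs of several encoders per box."""
    if mode not in ("cat", "sum"):
        raise ValueError(f"unknown mode: {mode!r}")

    def encode(boxes: Sequence[Box]) -> List[Vector]:
        outputs = [encoder(boxes) for encoder in encoders]
        combined = []
        # zip(*outputs) groups the vectors that describe the same box.
        for per_box in zip(*outputs):
            if mode == "cat":
                # Concatenate the vectors end to end.
                combined.append([v for vec in per_box for v in vec])
            else:
                # Sum the vectors element-wise (they must share a dimension).
                combined.append([sum(vals) for vals in zip(*per_box)])
        return combined

    return encode


boxes = [
    Box("Hello world", 0.0, 0.0, 0.5, 0.25),
    Box("Title", 0.25, 0.5, 0.75, 0.75),
]
embed_cat = combine([text_encoder, layout_encoder], mode="cat")
embed_sum = combine([text_encoder, layout_encoder], mode="sum")
print(embed_cat(boxes))
print(embed_sum(boxes))
```

Concatenation keeps each encoder's features separate at the cost of a wider vector, while summation keeps the dimension fixed but requires all encoders to produce vectors of the same size.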

Layers

These components are not to be confused with layers, which are standard PyTorch modules that can be used to build trainable components, such as the ones described here.