
HuggingfaceEmbedding

The HuggingfaceEmbedding component is a wrapper around Huggingface multi-modal models. Such pre-trained models should offer better results than a model trained from scratch. Compared to using the raw Huggingface model, we offer a simple mechanism to split long documents into strided windows before feeding them to the model.

Windowing

The HuggingfaceEmbedding component splits long documents into smaller, overlapping windows before feeding them to the model. This avoids exceeding the maximum number of tokens that the model can process in a single sequence. The window size and stride can be configured using the window and stride parameters. The default values are 510 and 255 respectively, which means that the model will process windows of 510 tokens, each starting 255 tokens after the previous one. Whenever a token appears in multiple windows, the embedding of its "most contextualized" occurrence is used, i.e. the occurrence that is closest to the center of its window.
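The windowing scheme above can be sketched as follows. This is an illustrative toy implementation, not the actual edspdf code; make_windows and best_window are hypothetical helper names.

```python
# Illustrative sketch of the windowing scheme (not the actual edspdf
# implementation): split n_tokens into overlapping windows, then pick,
# for each token, the window in which it is the most contextualized.

def make_windows(n_tokens, window=510, stride=255):
    """Return (start, end) token spans, each starting `stride` tokens
    after the previous one, so consecutive windows overlap."""
    starts = range(0, max(n_tokens - window, 0) + 1, stride)
    spans = [(s, min(s + window, n_tokens)) for s in starts]
    if spans[-1][1] < n_tokens:  # make sure the tail of the document is covered
        spans.append((n_tokens - window, n_tokens))
    return spans

def best_window(token, spans):
    """Index of the window whose center is closest to `token`,
    i.e. the window in which the token is the most contextualized."""
    containing = [i for i, (s, e) in enumerate(spans) if s <= token < e]
    return min(
        containing,
        key=lambda i: abs(token - (spans[i][0] + spans[i][1] - 1) / 2),
    )

# With window=4 and stride=2, a 10-token document yields 4 windows;
# token 3 appears in two windows and is closest to the center of the second.
spans = make_windows(10, window=4, stride=2)
print(spans)                  # [(0, 4), (2, 6), (4, 8), (6, 10)]
print(best_window(3, spans))  # 1
```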

Here is an overview of how this works in a classifier model (figure: Transformer windowing).

Examples

Here is an example of how to define a pipeline with the HuggingfaceEmbedding component:

from edspdf import Pipeline

model = Pipeline()
model.add_pipe(
    "pdfminer-extractor",
    name="extractor",
    config={
        "render_pages": True,
    },
)
model.add_pipe(
    "huggingface-embedding",
    name="embedding",
    config={
        "model": "microsoft/layoutlmv3-base",
        "use_image": False,
        "window": 128,
        "stride": 64,
        "line_pooling": "mean",
    },
)
model.add_pipe(
    "trainable-classifier",
    name="classifier",
    config={
        "embedding": model.get_pipe("embedding"),
        "labels": [],
    },
)

This model can then be trained following the training recipe.

Parameters

PARAMETER DESCRIPTION
pipeline

The pipeline instance

TYPE: Pipeline DEFAULT: None

name

The component name

TYPE: str DEFAULT: 'huggingface-embedding'

model

The Huggingface model name or path

TYPE: str DEFAULT: None

use_image

Whether to use the page image as an input to the model

TYPE: bool DEFAULT: True

window

The window size to use when splitting long documents into smaller windows before feeding them to the Transformer model (default: 510 = 512 - 2)

TYPE: int DEFAULT: 510

stride

The stride (distance between the starts of consecutive windows) to use when splitting long documents into smaller windows (default: 510 / 2 = 255)

TYPE: int DEFAULT: 255

line_pooling

The pooling strategy to use when combining the embeddings of the tokens in a line into a single line embedding

TYPE: Literal['mean', 'max', 'sum'] DEFAULT: 'mean'
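To picture the three line-pooling strategies, here is a toy pool function (a hypothetical illustration, not the edspdf implementation) applied to the per-token embedding vectors of one line:

```python
# Toy illustration of line pooling (not the edspdf implementation):
# combine the per-token embedding vectors of one line into a single
# line embedding, dimension by dimension.

def pool(token_vectors, mode="mean"):
    columns = list(zip(*token_vectors))  # one tuple per embedding dimension
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    return [sum(col) / len(col) for col in columns]  # "mean"

line = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]  # 3 tokens, embedding dim 2
print(pool(line, "mean"))  # [3.0, 2.0]
print(pool(line, "max"))   # [5.0, 4.0]
print(pool(line, "sum"))   # [9.0, 6.0]
```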

max_tokens_per_device

The maximum number of tokens that can be processed by the model on a single device. This does not affect the results but can be used to reduce the memory usage of the model, at the cost of a longer processing time.

TYPE: int DEFAULT: maxsize
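The effect of max_tokens_per_device can be pictured as greedily packing windows into per-forward-pass batches under a token budget. This is an illustrative sketch under that assumption, not edspdf's actual scheduling logic:

```python
# Illustrative sketch (not edspdf's actual logic): greedily group windows
# into batches so that each forward pass stays under a token budget.
# A smaller budget yields more, smaller batches: less memory per pass,
# but more passes, hence a longer processing time.

def batch_by_token_budget(window_lengths, max_tokens):
    batches, current, current_tokens = [], [], 0
    for i, n_tokens in enumerate(window_lengths):
        if current and current_tokens + n_tokens > max_tokens:
            batches.append(current)  # budget exceeded: start a new batch
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches

# Five 510-token windows with a 1024-token budget: two windows per pass.
print(batch_by_token_budget([510] * 5, 1024))  # [[0, 1], [2, 3], [4]]
```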

quantization_config

The quantization configuration to use when loading the model

TYPE: Optional[BitsAndBytesConfig] DEFAULT: None

kwargs

Additional keyword arguments to pass to the Huggingface AutoModel.from_pretrained method

DEFAULT: {}