
HuggingfaceEmbedding

The HuggingfaceEmbedding component is a wrapper around Huggingface multi-modal models. Such pre-trained models should offer better results than a model trained from scratch. Compared to using the raw Huggingface model, we offer a simple mechanism to split long documents into strided windows before feeding them to the model.

Windowing

The HuggingfaceEmbedding component splits long documents into smaller, overlapping windows before feeding them to the model. This avoids exceeding the maximum number of tokens that the model can process in a single sequence. The window size and stride can be configured using the window and stride parameters. The default values are 510 and 255 respectively, which means that the model will process windows of 510 tokens, each starting 255 tokens after the previous one. Whenever a token appears in multiple windows, the embedding of its "most contextualized" occurrence is used, i.e. the occurrence that is closest to the center of its window.
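The windowing scheme above can be sketched as follows. This is an illustrative toy implementation, not the actual edspdf code; make_windows and best_window are hypothetical helper names.

```python
# Illustrative sketch of the windowing scheme (not the actual edspdf
# implementation): split n_tokens into overlapping windows, then pick,
# for each token, the window in which it is the most contextualized.

def make_windows(n_tokens, window=510, stride=255):
    """Return (start, end) token spans, each starting `stride` tokens
    after the previous one, so consecutive windows overlap."""
    starts = range(0, max(n_tokens - window, 0) + 1, stride)
    spans = [(s, min(s + window, n_tokens)) for s in starts]
    if spans[-1][1] < n_tokens:  # make sure the tail of the document is covered
        spans.append((n_tokens - window, n_tokens))
    return spans

def best_window(token, spans):
    """Index of the window whose center is closest to `token`,
    i.e. the window in which the token is the most contextualized."""
    containing = [i for i, (s, e) in enumerate(spans) if s <= token < e]
    return min(
        containing,
        key=lambda i: abs(token - (spans[i][0] + spans[i][1] - 1) / 2),
    )

# With window=4 and stride=2, a 10-token document yields 4 windows;
# token 3 appears in two windows and is closest to the center of the second.
spans = make_windows(10, window=4, stride=2)
print(spans)                  # [(0, 4), (2, 6), (4, 8), (6, 10)]
print(best_window(3, spans))  # 1
```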

Here is an overview of how this works in a classifier model (figure: Transformer windowing).

Examples

Here is an example of how to define a pipeline with the HuggingfaceEmbedding component:

from edspdf import Pipeline

model = Pipeline()
model.add_pipe(
    "pdfminer-extractor",
    name="extractor",
    config={
        "render_pages": True,
    },
)
model.add_pipe(
    "huggingface-embedding",
    name="embedding",
    config={
        "model": "microsoft/layoutlmv3-base",
        "use_image": False,
        "window": 128,
        "stride": 64,
        "line_pooling": "mean",
    },
)
model.add_pipe(
    "trainable-classifier",
    name="classifier",
    config={
        "embedding": model.get_pipe("embedding"),
        "labels": [],
    },
)

This model can then be trained following the training recipe.

Parameters

PARAMETER DESCRIPTION
pipeline

The pipeline instance

TYPE: Pipeline DEFAULT: None

name

The component name

TYPE: str DEFAULT: 'huggingface-embedding'

model

The Huggingface model name or path

TYPE: str DEFAULT: None

use_image

Whether to use the page image as an input to the model

TYPE: bool DEFAULT: True

window

The window size to use when splitting long documents into smaller windows before feeding them to the Transformer model (default: 510 = 512 - 2)

TYPE: int DEFAULT: 510

stride

The stride (distance between the starts of consecutive windows) to use when splitting long documents into smaller windows (default: 510 / 2 = 255)

TYPE: int DEFAULT: 255

line_pooling

The pooling strategy to use when combining the embeddings of the tokens in a line into a single line embedding

TYPE: Literal['mean', 'max', 'sum'] DEFAULT: 'mean'
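To picture the three line-pooling strategies, here is a toy pool function (a hypothetical illustration, not the edspdf implementation) applied to the per-token embedding vectors of one line:

```python
# Toy illustration of line pooling (not the edspdf implementation):
# combine the per-token embedding vectors of one line into a single
# line embedding, dimension by dimension.

def pool(token_vectors, mode="mean"):
    columns = list(zip(*token_vectors))  # one tuple per embedding dimension
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    return [sum(col) / len(col) for col in columns]  # "mean"

line = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]  # 3 tokens, embedding dim 2
print(pool(line, "mean"))  # [3.0, 2.0]
print(pool(line, "max"))   # [5.0, 4.0]
print(pool(line, "sum"))   # [9.0, 6.0]
```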

max_tokens_per_device

The maximum number of tokens that can be processed by the model on a single device. This does not affect the results but can be used to reduce the memory usage of the model, at the cost of a longer processing time.

TYPE: int DEFAULT: maxsize
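The effect of max_tokens_per_device can be pictured as greedily packing windows into per-forward-pass batches under a token budget. This is an illustrative sketch under that assumption, not edspdf's actual scheduling logic:

```python
# Illustrative sketch (not edspdf's actual logic): greedily group windows
# into batches so that each forward pass stays under a token budget.
# A smaller budget yields more, smaller batches: less memory per pass,
# but more passes, hence a longer processing time.

def batch_by_token_budget(window_lengths, max_tokens):
    batches, current, current_tokens = [], [], 0
    for i, n_tokens in enumerate(window_lengths):
        if current and current_tokens + n_tokens > max_tokens:
            batches.append(current)  # budget exceeded: start a new batch
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches

# Five 510-token windows with a 1024-token budget: two windows per pass.
print(batch_by_token_budget([510] * 5, 1024))  # [[0, 1], [2, 3], [4]]
```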

quantization_config

The quantization configuration to use when loading the model

TYPE: Optional[BitsAndBytesConfig] DEFAULT: None

kwargs

Additional keyword arguments to pass to the Huggingface AutoModel.from_pretrained method

DEFAULT: {}