Trainable classifier

This component predicts a label for each box over the whole document using machine learning.

Note

You must train your model before using this classifier. See Model training for more information.

Examples

The classifier is composed of the following blocks:

  • a configurable box embedding layer
  • a linear classification layer
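The two blocks above can be pictured with a minimal, dependency-free sketch. It is not the component's actual implementation: the function name `classify_boxes`, the `ACTIVATIONS` table, and the choice to apply the activation to the box embeddings right before the linear layer are all assumptions made for illustration. The box embeddings are taken as given, since producing them is the job of the configurable embedding layer.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical lookup table from activation name to function (illustration only).
ACTIVATIONS: Dict[str, Callable[[float], float]] = {
    "relu": lambda x: max(0.0, x),
    "gelu": lambda x: 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))),
}


def classify_boxes(
    embeddings: Sequence[Sequence[float]],  # one embedding vector per box
    weights: Sequence[Sequence[float]],     # one weight row per label
    biases: Sequence[float],                # one bias per label
    labels: Sequence[str],
    activation: str = "gelu",
) -> List[str]:
    """Assign to each box the label with the highest linear score."""
    act = ACTIVATIONS[activation]
    predictions = []
    for embedding in embeddings:
        # Assumed placement of the activation: on the embedding, before the linear layer.
        hidden = [act(x) for x in embedding]
        # Linear classification layer: one logit per label.
        logits = [
            sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, biases)
        ]
        best = max(range(len(labels)), key=logits.__getitem__)
        predictions.append(labels[best])
    return predictions


# Toy usage with hand-picked weights instead of trained ones.
predicted = classify_boxes(
    embeddings=[[1.0, -2.0], [-1.0, 3.0]],
    weights=[[1.0, 0.0], [0.0, 1.0]],
    biases=[0.0, 0.0],
    labels=["body", "pollution"],
    activation="relu",
)
print(predicted)
```

In the real component the weights are learned during training, which is why the classifier cannot be used before the model has been trained.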

In this example, the box embeddings are computed by a sub-box-cnn-pooler layer, which pools the embeddings of the words in each box. The word embeddings themselves come from a nested simple-text-embedding layer. The resulting box embeddings are then passed to the linear classification layer.

pipeline.add_pipe(
    "trainable-classifier",
    name="classifier",
    config={
        # simple embedding computed by pooling embeddings of words in each box
        "embedding": {
            "@factory": "sub-box-cnn-pooler",
            "out_channels": 64,
            "kernel_sizes": (3, 4, 5),
            "embedding": {
                "@factory": "simple-text-embedding",
                "size": 72,
            },
        },
        "labels": ["body", "pollution"],
        "activation": "relu",
    },
)

The same component, declared in a configuration file:

[components.classifier]
@factory = "trainable-classifier"
labels = ["body", "pollution"]
activation = "relu"

[components.classifier.embedding]
@factory = "sub-box-cnn-pooler"
out_channels = 64
kernel_sizes = (3, 4, 5)

[components.classifier.embedding.embedding]
@factory = "simple-text-embedding"
size = 72
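The correspondence between the nested `config` dictionary and the dotted section headers of the configuration file can be sketched as follows. The helper `to_sections` is hypothetical and only illustrates the naming scheme; it is not how the library parses or writes its configuration files.

```python
from typing import Any, Dict, List, Tuple


def to_sections(name: str, config: Dict[str, Any]) -> List[Tuple[str, Dict[str, Any]]]:
    """Split a nested config dict into (section header, flat entries) pairs."""
    # Non-dict values stay in the current section.
    flat = {key: value for key, value in config.items() if not isinstance(value, dict)}
    sections = [(name, flat)]
    # Each nested dict becomes its own section, named "<parent>.<key>".
    for key, value in config.items():
        if isinstance(value, dict):
            sections.extend(to_sections(f"{name}.{key}", value))
    return sections


config = {
    "@factory": "trainable-classifier",
    "embedding": {
        "@factory": "sub-box-cnn-pooler",
        "out_channels": 64,
        "kernel_sizes": (3, 4, 5),
        "embedding": {"@factory": "simple-text-embedding", "size": 72},
    },
    "labels": ["body", "pollution"],
    "activation": "relu",
}

sections = to_sections("components.classifier", config)
for header, entries in sections:
    # Print each section header followed by the keys it holds.
    print(f"[{header}]", ", ".join(entries))
```

Each level of nesting in the Python dictionary adds one dotted segment to the section name, so the inner `embedding` of the pooler ends up under `components.classifier.embedding.embedding`.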

Parameters

PARAMETER     DESCRIPTION

labels
    Initial labels of the classifier (will be completed during initialization)
    TYPE: Sequence[str] DEFAULT: ('pollution')

embedding
    Embedding module to encode the PDF boxes
    TYPE: TrainablePipe[EmbeddingOutput]

activation
    Name of the activation function
    TYPE: ActivationFunction DEFAULT: 'gelu'

dropout_p
    Dropout probability used on the output of the box and textual encoders
    TYPE: float DEFAULT: 0.0

scorer
    Scoring function
    TYPE: Scorer DEFAULT: classifier_scorer
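The description of `labels` says that the initial labels are completed during initialization. One plausible reading, sketched below, is that labels seen in the training data are appended to the initial ones. The function `complete_labels` and the way the observed labels are gathered are assumptions for illustration, not the component's actual code.

```python
from typing import Iterable, List, Sequence


def complete_labels(initial: Sequence[str], observed: Iterable[str]) -> List[str]:
    """Extend the initial labels with unseen labels, keeping first-seen order."""
    labels = list(initial)
    seen = set(labels)
    for label in observed:
        if label not in seen:
            seen.add(label)
            labels.append(label)
    return labels


# Hypothetical labels read from annotated training boxes.
training_labels = ["body", "pollution", "body", "title"]
completed = complete_labels(("pollution",), training_labels)
print(completed)
```

Keeping the initial labels first means their indices in the classification layer do not change when new labels are discovered.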