Mask Classification

We developed a simple classifier that uses roughly the same strategy as PDFBox, namely:

  • define a "mask" on the PDF documents;
  • keep every text block within that mask, and tag everything else as pollution.
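
The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's implementation: the `Box` class and the `contains` and `tag_block` helpers are hypothetical names, coordinates are assumed to be page-relative (between 0 and 1), and blocks kept by the mask are assumed to receive the label "body".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; coordinates are assumed to be page-relative (0 to 1)."""

    x0: float
    y0: float
    x1: float
    y1: float


def contains(mask: Box, block: Box) -> bool:
    """Return True when the block lies entirely within the mask."""
    return (
        mask.x0 <= block.x0
        and mask.y0 <= block.y0
        and block.x1 <= mask.x1
        and block.y1 <= mask.y1
    )


def tag_block(mask: Box, block: Box) -> str:
    """Tag a block "body" when it is inside the mask, "pollution" otherwise."""
    return "body" if contains(mask, block) else "pollution"


mask = Box(0.1, 0.1, 0.9, 0.9)
paragraph = Box(0.2, 0.3, 0.8, 0.4)  # well inside the mask
page_number = Box(0.45, 0.95, 0.55, 0.98)  # sits in the bottom margin

print(tag_block(mask, paragraph))  # body
print(tag_block(mask, page_number))  # pollution
```

Strict containment corresponds to the most conservative reading of the strategy; the threshold parameter described further down relaxes it.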

Factories

Two factories are available in the classifiers registry: mask-classifier and multi-mask-classifier.

mask-classifier

The simplest form of mask classification: you define a single mask, and everything outside it is tagged as pollution.

Parameters

  • pipeline (Pipeline, default: None): The pipeline object
  • name (str, default: 'mask-classifier'): The name of the component
  • x0 (float): The x0 coordinate of the mask
  • y0 (float): The y0 coordinate of the mask
  • x1 (float): The x1 coordinate of the mask
  • y1 (float): The y1 coordinate of the mask
  • threshold (float, default: 1.0): The threshold for the alignment

Examples

API-based:

pipeline.add_pipe(
    "mask-classifier",
    name="classifier",
    config={
        "threshold": 0.9,
        "x0": 0.1,
        "y0": 0.1,
        "x1": 0.9,
        "y1": 0.9,
    },
)

Configuration-based:

[components.classifier]
@factory = "mask-classifier"
x0 = 0.1
y0 = 0.1
x1 = 0.9
y1 = 0.9
threshold = 0.9
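
The documentation does not spell out what "the threshold for the alignment" measures. One plausible reading, stated here as an assumption rather than as the library's behaviour, is that a block is kept when the fraction of its area covered by the mask reaches the threshold. The sketch below illustrates that reading; `overlap_ratio` and `is_kept` are hypothetical helpers, and boxes are plain `(x0, y0, x1, y1)` tuples.

```python
def overlap_ratio(block, mask):
    """Fraction of the block's area that falls inside the mask.

    Both arguments are (x0, y0, x1, y1) tuples.
    """
    bx0, by0, bx1, by1 = block
    mx0, my0, mx1, my1 = mask
    # Width and height of the intersection rectangle (zero when disjoint).
    width = max(0.0, min(bx1, mx1) - max(bx0, mx0))
    height = max(0.0, min(by1, my1) - max(by0, my0))
    block_area = (bx1 - bx0) * (by1 - by0)
    return (width * height) / block_area if block_area > 0 else 0.0


def is_kept(block, mask, threshold=1.0):
    """Assumed rule: keep the block when enough of its area lies in the mask."""
    return overlap_ratio(block, mask) >= threshold


MASK = (0.1, 0.1, 0.9, 0.9)
straddling = (0.0, 0.2, 0.5, 0.3)  # about 80% of this block is inside the mask

print(is_kept(straddling, MASK, threshold=0.9))  # False
print(is_kept(straddling, MASK, threshold=0.75))  # True
```

Under this reading, the default threshold of 1.0 keeps only blocks that lie entirely within the mask, while lower values tolerate blocks that spill slightly over its edges.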

multi-mask-classifier

A generalisation of mask-classifier, in which the user defines any number of labelled regions.

The first example below produces exactly the same classifier as the mask-classifier example above.

Any block that does not fall within a mask is tagged as pollution.

Parameters

  • pipeline (Pipeline, default: None): The pipeline object
  • name (str, default: 'multi-mask-classifier'): The name of the component
  • threshold (float, default: 1.0): The threshold for the alignment
  • masks (Box, default: {}): The masks, each given as a named entry with x0, y0, x1, y1 and label

Examples

API-based:

pipeline.add_pipe(
    "multi-mask-classifier",
    name="classifier",
    config={
        "threshold": 0.9,
        "mymask": {"x0": 0.1, "y0": 0.1, "x1": 0.9, "y1": 0.9, "label": "body"},
    },
)

Configuration-based:

[components.classifier]
@factory = "multi-mask-classifier"
threshold = 0.9

[components.classifier.mymask]
label = "body"
x0 = 0.1
y0 = 0.1
x1 = 0.9
y1 = 0.9

The following configuration defines a header region in addition to the body.

API-based:

pipeline.add_pipe(
    "multi-mask-classifier",
    name="classifier",
    config={
        "threshold": 0.9,
        "header": {"x0": 0.1, "y0": 0.1, "x1": 0.9, "y1": 0.3, "label": "header"},
        "body": {"x0": 0.1, "y0": 0.3, "x1": 0.9, "y1": 0.9, "label": "body"},
    },
)

Configuration-based:

[components.classifier]
@factory = "multi-mask-classifier"
threshold = 0.9

[components.classifier.header]
label = "header"
x0 = 0.1
y0 = 0.1
x1 = 0.9
y1 = 0.3

[components.classifier.body]
label = "body"
x0 = 0.1
y0 = 0.3
x1 = 0.9
y1 = 0.9
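
To make the header/body configuration concrete, here is a rough sketch of how several labelled masks could be applied to a block. It is an illustration under assumptions, not the library's code: `covered_fraction` and `classify` are hypothetical helpers, blocks are plain dictionaries with page-relative coordinates, and a block is assumed to take the label of the first mask that covers at least `threshold` of its area.

```python
POLLUTION = "pollution"

# Same regions as the header/body configuration above.
MASKS = {
    "header": {"x0": 0.1, "y0": 0.1, "x1": 0.9, "y1": 0.3, "label": "header"},
    "body": {"x0": 0.1, "y0": 0.3, "x1": 0.9, "y1": 0.9, "label": "body"},
}


def covered_fraction(block, mask):
    """Fraction of the block's area that lies inside the mask."""
    width = max(0.0, min(block["x1"], mask["x1"]) - max(block["x0"], mask["x0"]))
    height = max(0.0, min(block["y1"], mask["y1"]) - max(block["y0"], mask["y0"]))
    area = (block["x1"] - block["x0"]) * (block["y1"] - block["y0"])
    return (width * height) / area if area > 0 else 0.0


def classify(block, masks, threshold=1.0):
    """Return the label of the first sufficiently overlapping mask, else pollution."""
    for mask in masks.values():
        if covered_fraction(block, mask) >= threshold:
            return mask["label"]
    return POLLUTION


title = {"x0": 0.2, "y0": 0.15, "x1": 0.8, "y1": 0.25}
paragraph = {"x0": 0.2, "y0": 0.4, "x1": 0.8, "y1": 0.6}
footer = {"x0": 0.2, "y0": 0.92, "x1": 0.8, "y1": 0.97}

print(classify(title, MASKS, threshold=0.9))  # header
print(classify(paragraph, MASKS, threshold=0.9))  # body
print(classify(footer, MASKS, threshold=0.9))  # pollution
```

In this sketch, a block that straddles the header/body boundary without reaching the threshold in either region ends up as pollution, which is worth keeping in mind when choosing where adjacent masks meet.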