Mask Classification
We developed a simple classifier that uses roughly the same strategy as PDFBox, namely:
- define a "mask" on the PDF documents;
- keep every text block within that mask, and tag everything else as pollution.
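The two steps above can be sketched in a few lines of plain Python. This is only an illustration and not the library's implementation: the `Box` class and the `overlap_ratio` and `classify` helpers are hypothetical names, coordinates are assumed to be relative to the page (between 0 and 1), and `threshold` is assumed to be the minimum fraction of a block's area that must lie inside the mask.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in relative page coordinates (hypothetical helper)."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        # Degenerate or inverted rectangles have zero area.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_ratio(block: Box, mask: Box) -> float:
    """Fraction of the block's area that falls inside the mask."""
    intersection = Box(
        max(block.x0, mask.x0),
        max(block.y0, mask.y0),
        min(block.x1, mask.x1),
        min(block.y1, mask.y1),
    )
    return intersection.area / block.area if block.area > 0 else 0.0


def classify(block: Box, mask: Box, threshold: float = 0.9) -> str:
    """Keep the block as "body" if enough of it lies in the mask (assumed semantics)."""
    return "body" if overlap_ratio(block, mask) >= threshold else "pollution"


mask = Box(0.1, 0.1, 0.9, 0.9)
inside = classify(Box(0.2, 0.2, 0.5, 0.3), mask)      # fully inside the mask
margin = classify(Box(0.0, 0.0, 0.5, 0.05), mask)     # entirely in the top margin
straddle = classify(Box(0.0, 0.2, 0.2, 0.3), mask)    # only half inside the mask
```

Under these assumptions, a block that only partly overlaps the mask is kept or discarded depending on `threshold`, which is why the configurations below expose it as a parameter.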
Factories
Two factories are available in the classifiers registry: mask-classifier and multi-mask-classifier.
mask-classifier
The simplest form. You define the mask; everything else is tagged as pollution.
Example configuration:
pipeline.add_pipe(
    "mask-classifier",
    name="classifier",
    config={
        "threshold": 0.9,
        "x0": 0.1,
        "y0": 0.1,
        "x1": 0.9,
        "y1": 0.9,
    },
)
[components.classifier]
@factory = "mask-classifier"
x0 = 0.1
y0 = 0.1
x1 = 0.9
y1 = 0.9
threshold = 0.9
multi-mask-classifier
A generalisation in which the user defines several labelled regions.
The following configuration produces exactly the same classifier as the mask-classifier example above.
pipeline.add_pipe(
    "multi-mask-classifier",
    name="classifier",
    config={
        "threshold": 0.9,
        "body": {"x0": 0.1, "y0": 0.1, "x1": 0.9, "y1": 0.9, "label": "body"},
    },
)
[components.classifier]
@factory = "multi-mask-classifier"
threshold = 0.9
[components.classifier.body]
label = "body"
x0 = 0.1
y0 = 0.1
x1 = 0.9
y1 = 0.9
The following configuration defines a header region in addition to the body.
pipeline.add_pipe(
    "multi-mask-classifier",
    name="classifier",
    config={
        "threshold": 0.9,
        "header": {"x0": 0.1, "y0": 0.1, "x1": 0.9, "y1": 0.3, "label": "header"},
        "body": {"x0": 0.1, "y0": 0.3, "x1": 0.9, "y1": 0.9, "label": "body"},
    },
)
[components.classifier]
@factory = "multi-mask-classifier"
threshold = 0.9
[components.classifier.header]
label = "header"
x0 = 0.1
y0 = 0.1
x1 = 0.9
y1 = 0.3
[components.classifier.body]
label = "body"
x0 = 0.1
y0 = 0.3
x1 = 0.9
y1 = 0.9
Any block that is not part of a mask is tagged as pollution.
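As a rough illustration of this fallback rule, the sketch below labels a block with the first region that covers enough of it and returns "pollution" otherwise. It is not the library's code: `REGIONS`, `coverage` and `label_block` are hypothetical names, the region coordinates mirror the header/body configuration above, and both the meaning of `threshold` (minimum covered fraction of the block) and the first-match tie-breaking are assumptions.

```python
from typing import Dict, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), relative to the page

# Regions taken from the header/body configuration; the dict itself is hypothetical.
REGIONS: Dict[str, Rect] = {
    "header": (0.1, 0.1, 0.9, 0.3),
    "body": (0.1, 0.3, 0.9, 0.9),
}


def area(rect: Rect) -> float:
    x0, y0, x1, y1 = rect
    # Inverted rectangles (empty intersections) count as zero area.
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def coverage(block: Rect, region: Rect) -> float:
    """Fraction of the block's area that lies inside the region."""
    intersection = (
        max(block[0], region[0]),
        max(block[1], region[1]),
        min(block[2], region[2]),
        min(block[3], region[3]),
    )
    block_area = area(block)
    return area(intersection) / block_area if block_area > 0 else 0.0


def label_block(
    block: Rect,
    regions: Dict[str, Rect] = REGIONS,
    threshold: float = 0.9,
) -> str:
    """Return the label of the first sufficiently covering region, else "pollution"."""
    for label, region in regions.items():
        if coverage(block, region) >= threshold:
            return label
    return "pollution"


title = label_block((0.2, 0.15, 0.8, 0.25))    # inside the header region
paragraph = label_block((0.2, 0.4, 0.8, 0.6))  # inside the body region
footer = label_block((0.0, 0.95, 1.0, 1.0))    # outside every region
```

With a threshold of 0.9, a block that straddles the header/body boundary evenly would match neither region in this sketch and would therefore also end up as pollution.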