Mask Classification
We developed a simple classifier that uses roughly the same strategy as PDFBox, namely:
- define a "mask" on the PDF documents;
- keep every text block within that mask, and tag everything else as pollution.
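The strategy above can be illustrated with a minimal sketch. It is not the library's implementation: `Box` and `classify` are hypothetical names, and it assumes that coordinates are relative to the page (between 0 and 1) and that the threshold is the minimum fraction of a block's area that must lie inside the mask.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; relative page coordinates are an assumption."""

    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        # Degenerate or inverted rectangles count as empty.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def intersection(self, other: "Box") -> "Box":
        # May yield an inverted rectangle when the boxes are disjoint;
        # area() then returns 0.
        return Box(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )


def classify(block: Box, mask: Box, threshold: float = 0.9, label: str = "body") -> str:
    """Return `label` if enough of the block lies inside the mask, else "pollution"."""
    if block.area() == 0:
        return "pollution"
    overlap = block.intersection(mask).area() / block.area()
    return label if overlap >= threshold else "pollution"


mask = Box(0.1, 0.1, 0.9, 0.9)
inside = classify(Box(0.2, 0.2, 0.5, 0.3), mask)       # fully inside the mask
margin = classify(Box(0.0, 0.0, 0.5, 0.05), mask)      # sits in the top margin
straddling = classify(Box(0.05, 0.2, 0.25, 0.3), mask)  # about 75% inside
```

Under these assumptions, a block that straddles the mask border is kept only if the threshold is low enough, which is why the threshold is exposed as a parameter.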
Factories
Two factories are available in the classifiers registry: mask-classifier and multi-mask-classifier.
mask-classifier
The simplest form of mask classification: you define a single mask, and everything outside it is tagged as pollution.
PARAMETER | DESCRIPTION |
---|---|
pipeline | The pipeline object |
name | The name of the component |
x0 | The x0 coordinate of the mask |
y0 | The y0 coordinate of the mask |
x1 | The x1 coordinate of the mask |
y1 | The y1 coordinate of the mask |
threshold | The threshold for the alignment |
Examples
pipeline.add_pipe(
    "mask-classifier",
    name="classifier",
    config={
        "threshold": 0.9,
        "x0": 0.1,
        "y0": 0.1,
        "x1": 0.9,
        "y1": 0.9,
    },
)
[components.classifier]
@classifiers = "mask-classifier"
x0 = 0.1
y0 = 0.1
x1 = 0.9
y1 = 0.9
threshold = 0.9
multi-mask-classifier
A generalisation in which the user defines any number of labelled regions.
Any block that is not part of a mask is tagged as pollution.
The first configuration below produces exactly the same classifier as the mask-classifier example above.
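To make the generalisation concrete, here is a hedged sketch of how several labelled regions could be applied to one block. The names `coverage` and `classify_multi` are hypothetical, the "first matching mask wins" rule is an assumption, and the threshold is again read as the minimum fraction of the block's area inside a mask; the actual component may behave differently.

```python
from typing import Dict, Tuple

# (x0, y0, x1, y1); relative page coordinates are an assumption.
Rect = Tuple[float, float, float, float]


def coverage(block: Rect, mask: Rect) -> float:
    """Fraction of the block's area that falls inside the mask."""
    bx0, by0, bx1, by1 = block
    mx0, my0, mx1, my1 = mask
    width = max(0.0, min(bx1, mx1) - max(bx0, mx0))
    height = max(0.0, min(by1, my1) - max(by0, my0))
    block_area = (bx1 - bx0) * (by1 - by0)
    return (width * height) / block_area if block_area > 0 else 0.0


def classify_multi(block: Rect, masks: Dict[str, dict], threshold: float = 0.9) -> str:
    """Return the label of the first mask that covers the block, else "pollution"."""
    for mask in masks.values():
        rect = (mask["x0"], mask["y0"], mask["x1"], mask["y1"])
        if coverage(block, rect) >= threshold:
            return mask["label"]
    return "pollution"


# Two illustrative regions: a header strip and a body area below it.
masks = {
    "header": {"x0": 0.1, "y0": 0.1, "x1": 0.9, "y1": 0.3, "label": "header"},
    "body": {"x0": 0.1, "y0": 0.3, "x1": 0.9, "y1": 0.9, "label": "body"},
}

title_label = classify_multi((0.2, 0.15, 0.8, 0.25), masks)
paragraph_label = classify_multi((0.2, 0.5, 0.8, 0.6), masks)
footer_label = classify_multi((0.0, 0.95, 0.5, 1.0), masks)
```

With a single region labelled "body", this sketch reduces to the single-mask case.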
PARAMETER | DESCRIPTION |
---|---|
pipeline | The pipeline object |
name | The name of the component |
threshold | The threshold for the alignment |
masks | The masks, each a named region with x0, y0, x1, y1 and a label |
Examples
pipeline.add_pipe(
    "multi-mask-classifier",
    name="classifier",
    config={
        "threshold": 0.9,
        # Same region as the mask-classifier example: y1 is 0.9, not 0.3
        "mymask": {"x0": 0.1, "y0": 0.1, "x1": 0.9, "y1": 0.9, "label": "body"},
    },
)
[components.classifier]
@factory = "multi-mask-classifier"
threshold = 0.9
[components.classifier.mymask]
label = "body"
x0 = 0.1
y0 = 0.1
x1 = 0.9
y1 = 0.9
The following configuration defines a header region.
pipeline.add_pipe(
    "multi-mask-classifier",
    name="classifier",
    config={
        "threshold": 0.9,
        # Mask names match their labels, as in the configuration file below
        "header": {"x0": 0.1, "y0": 0.1, "x1": 0.9, "y1": 0.3, "label": "header"},
        "body": {"x0": 0.1, "y0": 0.3, "x1": 0.9, "y1": 0.9, "label": "body"},
    },
)
[components.classifier]
@factory = "multi-mask-classifier"
threshold = 0.9
[components.classifier.header]
label = "header"
x0 = 0.1
y0 = 0.1
x1 = 0.9
y1 = 0.3
[components.classifier.body]
label = "body"
x0 = 0.1
y0 = 0.3
x1 = 0.9
y1 = 0.9