SimpleAggregator
Aggregator that returns texts and styles. It groups all text boxes that share a label under that label in doc.aggregated_texts, and additionally aggregates the styles of the text boxes.
Examples
Create a pipeline, either through the Python API or through a configuration file:
pipeline = ...
pipeline.add_pipe(
    "simple-aggregator",
    name="aggregator",
    config={
        "new_line_threshold": 0.2,
        "new_paragraph_threshold": 1.5,
        # label_map goes from new labels to old labels:
        # "body" and "table" boxes are both output as "text".
        "label_map": {
            "text": ["body", "table"],
        },
    },
)
...
[components.aggregator]
@factory = "simple-aggregator"
new_line_threshold = 0.2
new_paragraph_threshold = 1.5
# To build the "text" label, we will aggregate lines from
# "title", "body" and "table" and output "title" lines in a
# separate field "title" as well.
label_map = {
    "text" : [ "title", "body", "table" ],
    "title" : "title",
}
...
and run it on a document:
doc = pipeline(doc)
print(doc.aggregated_texts)
# {
# "text": "This is the body of the document, followed by a table | A | B |"
# }
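The grouping behind such an output can be illustrated with a small standalone sketch. It does not use edspdf: the tuple layout `(label, page, y, x, text)`, the `aggregate` helper, and the single-space separator are all assumptions made for illustration, not the library's implementation.

```python
# Hypothetical stand-in for labelled text boxes: (label, page, y, x, text).
boxes = [
    ("table", 0, 0.60, 0.10, "| A | B |"),
    ("body", 0, 0.20, 0.10, "This is the body of the document,"),
    ("body", 0, 0.25, 0.10, "followed by a table"),
    ("title", 0, 0.05, 0.10, "Report"),
]


def aggregate(boxes, label_map, sort=True):
    """Join the text of every box whose old label feeds a new label."""
    if sort:
        # Order boxes by (page, y, x) before merging them.
        boxes = sorted(boxes, key=lambda b: (b[1], b[2], b[3]))
    texts = {}
    for new_label, old_labels in label_map.items():
        # Accept a single old label as well as a list of old labels.
        if isinstance(old_labels, str):
            old_labels = [old_labels]
        texts[new_label] = " ".join(b[4] for b in boxes if b[0] in old_labels)
    return texts


result = aggregate(
    boxes,
    {"text": ["title", "body", "table"], "title": "title"},
)
print(result["title"])  # Report
print(result["text"])
```

Here "title" lines end up both in the "text" field and in their own "title" field, mirroring the configuration-based example above.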
Parameters
PARAMETER | DESCRIPTION |
---|---|
pipeline | The pipeline object |
name | The name of the component |
sort | Whether to sort text boxes inside each label group by (page, y, x) position before merging them. |
new_line_threshold | Minimum ratio of the distance between two lines to the median height of lines to consider them as being on separate lines. |
new_paragraph_threshold | Minimum ratio of the distance between two lines to the median height of lines to consider them as being in separate paragraphs, and thus add a newline character between them. |
label_map | A dictionary mapping from new labels to old labels. This is useful to group labels together, for instance, to output both "body" and "table" as "text". |
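To make the two thresholds concrete, here is a minimal sketch of how a ratio-based rule of this kind could pick the separator between consecutive lines. It is not the edspdf source: the `join_lines` helper, the `(y_top, height, text)` tuples, the use of the vertical offset between line tops as the "distance", and the choice of `" "`, `"\n"` and `"\n\n"` as separators are all assumptions.

```python
from statistics import median


def join_lines(lines, new_line_threshold=0.2, new_paragraph_threshold=1.5):
    """Join (y_top, height, text) tuples, already sorted top to bottom."""
    if not lines:
        return ""
    # Normalise distances by the median line height.
    med = median(height for _, height, _ in lines)
    out = lines[0][2]
    for (prev_y, _, _), (y, _, text) in zip(lines, lines[1:]):
        ratio = (y - prev_y) / med
        if ratio >= new_paragraph_threshold:
            sep = "\n\n"  # assumed paragraph separator
        elif ratio >= new_line_threshold:
            sep = "\n"  # assumed line separator
        else:
            sep = " "  # same visual line
        out += sep + text
    return out


# Coordinates in arbitrary units; every line is 10 units high.
lines = [
    (100, 10, "Hello"),
    (100, 10, "world"),  # ratio 0.0: same line
    (112, 10, "second line"),  # ratio 1.2: new line
    (150, 10, "new paragraph"),  # ratio 3.8: new paragraph
]
joined = join_lines(lines)
print(repr(joined))
```

With the default values shown in the examples (0.2 and 1.5), raising `new_line_threshold` in this sketch merges more boxes onto one line, while raising `new_paragraph_threshold` yields fewer paragraph breaks.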
Source code in edspdf/pipes/aggregators/simple.py