SimpleAggregator

Aggregator that returns texts and styles. It groups all text boxes that share a label into a single entry of doc.aggregated_texts, and also aggregates the styles of those text boxes.

Examples

Create a pipeline with the Python API:

pipeline = ...
pipeline.add_pipe(
    "simple-aggregator",
    name="aggregator",
    config={
        "new_line_threshold": 0.2,
        "new_paragraph_threshold": 1.5,
        # label_map goes from new labels to old labels:
        # "body" and "table" lines are both output as "text".
        "label_map": {
            "text": ["body", "table"],
        },
    },
)
...

or declare it in a configuration file:

[components.aggregator]
@factory = "simple-aggregator"
new_line_threshold = 0.2
new_paragraph_threshold = 1.5
# To build the "text" label, we will aggregate lines from
# "title", "body" and "table" and output "title" lines in a
# separate field "title" as well.
label_map = {
    "text": ["title", "body", "table"],
    "title": "title",
}
...

and run it on a document:

doc = pipeline(doc)
print(doc.aggregated_texts)
# {
#     "text": "This is the body of the document, followed by a table | A | B |"
# }
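
The grouping performed by label_map can be illustrated with a small standalone sketch. It does not use EDS-PDF itself: the (label, text) pairs and the group_lines helper are hypothetical stand-ins for labelled text boxes and for the component's merging step, and joining with a single space is an assumption. Only the normalization of string values into lists mirrors the constructor shown in the source code of this page.

```python
from typing import Dict, List, Tuple, Union


def normalize_label_map(
    label_map: Dict[str, Union[str, List[str]]]
) -> Dict[str, List[str]]:
    # Wrap bare strings in a list, as the constructor does.
    return {
        new: old if isinstance(old, list) else [old]
        for new, old in label_map.items()
    }


def group_lines(
    lines: List[Tuple[str, str]],
    label_map: Dict[str, Union[str, List[str]]],
) -> Dict[str, str]:
    # Hypothetical helper: for each new label, join the text of every
    # line whose original label is listed under it (order preserved).
    grouped = {}
    for new_label, old_labels in normalize_label_map(label_map).items():
        grouped[new_label] = " ".join(
            text for label, text in lines if label in old_labels
        )
    return grouped


# Made-up lines, purely for illustration.
lines = [
    ("title", "Report"),
    ("body", "This is the body of the document,"),
    ("body", "followed by a table"),
    ("table", "| A | B |"),
]
label_map = {"text": ["title", "body", "table"], "title": "title"}
result = group_lines(lines, label_map)
```

With this map, "title" lines appear both in the "text" entry and in their own "title" entry, which matches the intent described in the configuration comment above.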

Parameters

PARAMETER DESCRIPTION
pipeline

The pipeline object

TYPE: Pipeline DEFAULT: None

name

The name of the component

TYPE: str DEFAULT: 'simple-aggregator'

sort

Whether to sort text boxes inside each label group by (page, y, x) position before merging them.

TYPE: bool DEFAULT: False

new_line_threshold

Minimum ratio of the distance between two lines to the median line height for them to be treated as separate lines.

TYPE: float DEFAULT: 0.2

new_paragraph_threshold

Minimum ratio of the distance between two lines to the median line height for them to be treated as separate paragraphs, in which case a newline character is added between them.

TYPE: float DEFAULT: 1.5

label_map

A dictionary mapping from new labels to old labels. This is useful to group labels together, for instance, to output both "body" and "table" as "text".

TYPE: Dict[str, Union[str, List[str]]] DEFAULT: {}
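
As a rough illustration of how new_line_threshold and new_paragraph_threshold interact, the sketch below classifies the gap between consecutive lines by its ratio to the median line height. This is not the library's implementation. The (top, bottom, text) tuples, the join_lines helper, measuring distance between line tops, and the choice of separators (a space, "\n" and "\n\n") are all assumptions made for the example.

```python
from statistics import median
from typing import List, Tuple


def join_lines(
    lines: List[Tuple[float, float, str]],
    new_line_threshold: float = 0.2,
    new_paragraph_threshold: float = 1.5,
) -> str:
    """Join (top, bottom, text) lines given in reading order (hypothetical)."""
    if not lines:
        return ""
    # Median height of all lines, used to normalize vertical distances.
    med_height = median(bottom - top for top, bottom, _ in lines)
    out = lines[0][2]
    for (prev_top, _, _), (top, _, text) in zip(lines, lines[1:]):
        # Assumption: distance is measured between the tops of two lines.
        ratio = (top - prev_top) / med_height
        if ratio >= new_paragraph_threshold:
            sep = "\n\n"  # assumed paragraph separator
        elif ratio >= new_line_threshold:
            sep = "\n"  # assumed line separator
        else:
            sep = " "  # same visual line
        out += sep + text
    return out


# Made-up coordinates: every line is 10 units high.
example = [
    (0.0, 10.0, "Hello"),
    (1.0, 11.0, "world"),            # ratio 0.1: same line
    (12.0, 22.0, "second line"),     # ratio 1.1: new line
    (40.0, 50.0, "new paragraph"),   # ratio 2.8: new paragraph
]
joined = join_lines(example)
```

Raising new_line_threshold makes the sketch more tolerant of slightly misaligned boxes on the same visual line, while raising new_paragraph_threshold requires a larger gap before a paragraph break is emitted.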

Source code in edspdf/pipes/aggregators/simple.py
def __init__(
    self,
    pipeline: Pipeline = None,
    name: str = "simple-aggregator",
    sort: bool = False,
    new_line_threshold: float = 0.2,
    new_paragraph_threshold: float = 1.5,
    label_map: Dict[str, Union[str, List[str]]] = {},
) -> None:
    self.name = name
    self.sort = sort
    self.label_map = {
        label: [old_labels] if not isinstance(old_labels, list) else old_labels
        for label, old_labels in label_map.items()
    }
    self.new_line_threshold = new_line_threshold
    self.new_paragraph_threshold = new_paragraph_threshold
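
The constructor only stores the sort flag. According to the parameter description, enabling it orders text boxes by (page, y, x) before merging. A minimal sketch of that ordering, using a hypothetical Box dataclass rather than EDS-PDF's own box type, and assuming y grows from the top of the page downward:

```python
from dataclasses import dataclass


@dataclass
class Box:
    # Hypothetical stand-in for a text box; field names are assumptions.
    page: int
    x0: float
    y0: float
    text: str


boxes = [
    Box(page=1, x0=0.1, y0=0.2, text="page 2, first line"),
    Box(page=0, x0=0.5, y0=0.1, text="page 1, top right"),
    Box(page=0, x0=0.1, y0=0.3, text="page 1, lower line"),
    Box(page=0, x0=0.1, y0=0.1, text="page 1, top left"),
]

# Order by page first, then vertical position, then horizontal position.
ordered = sorted(boxes, key=lambda b: (b.page, b.y0, b.x0))
ordered_texts = [b.text for b in ordered]
```

With sort left at its default of False, the boxes would instead be merged in whatever order they were extracted.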