`SimpleAggregator`

Aggregator that returns texts and styles. It groups all text boxes that share a label under `doc.aggregated_texts`, and additionally aggregates the styles of these text boxes.
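The grouping idea can be illustrated with a minimal, library-free sketch. It is not the component's implementation: `group_by_label` and the `(label, text)` tuples are hypothetical stand-ins for the real text box objects, and styles are left out entirely.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_label(boxes: List[Tuple[str, str]]) -> Dict[str, str]:
    """Concatenate the text of boxes that share a label (hypothetical helper)."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for label, text in boxes:
        grouped[label].append(text)
    # Join the texts of each group with a single space, in input order.
    return {label: " ".join(texts) for label, texts in grouped.items()}


# Toy text boxes: (label, text)
boxes = [
    ("title", "Report"),
    ("body", "This is the body of the document,"),
    ("body", "followed by a table"),
    ("table", "| A | B |"),
]
aggregated_texts = group_by_label(boxes)
print(aggregated_texts["body"])
```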

Examples

Create a pipeline

API-based, followed by configuration-based:

pipeline = ...
pipeline.add_pipe(
    "simple-aggregator",
    name="aggregator",
    config={
        "new_line_threshold": 0.2,
        "new_paragraph_threshold": 1.5,
        # Keys are the new labels, values are the old labels to merge
        "label_map": {
            "text": ["body", "table"],
        },
    },
)

...

[components.aggregator]
@factory = "simple-aggregator"
new_line_threshold = 0.2
new_paragraph_threshold = 1.5
# To build the "text" label, we will aggregate lines from
# "title", "body" and "table" and output "title" lines in a
# separate field "title" as well.
label_map = {
    "text" : [ "title", "body", "table" ],
    "title" : "title",
    }
...

and run it on a document:

doc = pipeline(doc)
print(doc.aggregated_texts)
# {
#     "text": "This is the body of the document, followed by a table | A | B |"
# }

Parameters

PARAMETER	DESCRIPTION
`pipeline`	The pipeline object. TYPE: `Pipeline` DEFAULT: `None`
`name`	The name of the component. TYPE: `str` DEFAULT: `'simple-aggregator'`
`sort`	Whether to sort text boxes inside each label group by (page, y, x) position before merging them. TYPE: `bool` DEFAULT: `False`
`new_line_threshold`	Minimum ratio of the distance between two lines to the median line height for them to be considered separate lines. TYPE: `float` DEFAULT: `0.2`
`new_paragraph_threshold`	Minimum ratio of the distance between two lines to the median line height for them to be considered separate paragraphs, in which case a newline character is added between them. TYPE: `float` DEFAULT: `1.5`
`label_map`	A dictionary mapping each new label to one or more old labels. This is useful for grouping labels, for instance to output both "body" and "table" as "text". TYPE: `Dict[str, Union[str, List[str]]]` DEFAULT: `{}`
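To make the two thresholds more concrete, here is a rough sketch of how such ratios could drive the choice of separator. It rests on several assumptions that the parameter table does not confirm: the "distance" is taken as the difference between the top coordinates of consecutive lines, a plain newline is used for a new line, and a blank line for a new paragraph. `join_lines` and the `(text, y0, y1)` tuples are hypothetical.

```python
from statistics import median
from typing import List, Tuple

# Hypothetical line representation: (text, top y, bottom y), y growing downwards.
Line = Tuple[str, float, float]


def join_lines(
    lines: List[Line],
    new_line_threshold: float = 0.2,
    new_paragraph_threshold: float = 1.5,
) -> str:
    """Join lines with a separator chosen from the gap / median height ratio."""
    if not lines:
        return ""
    # Fall back to 1.0 to avoid dividing by zero on degenerate boxes.
    median_height = median(y1 - y0 for _, y0, y1 in lines) or 1.0
    parts = [lines[0][0]]
    for (_, prev_y0, _), (text, y0, _) in zip(lines, lines[1:]):
        # Assumption: distance = difference between the tops of the two lines.
        ratio = (y0 - prev_y0) / median_height
        if ratio >= new_paragraph_threshold:
            separator = "\n\n"  # far apart: new paragraph
        elif ratio >= new_line_threshold:
            separator = "\n"  # moderately apart: new line
        else:
            separator = " "  # almost aligned: same line
        parts.append(separator + text)
    return "".join(parts)


lines = [
    ("Hello", 0, 10),
    ("world", 1, 11),  # ratio 0.1: same line
    ("Next line", 12, 22),  # ratio 1.1: new line
    ("New paragraph", 40, 50),  # ratio 2.8: new paragraph
]
joined = join_lines(lines)
print(joined)
```

With the default thresholds, only gaps of at least 1.5 median line heights open a new paragraph in this sketch; raising `new_paragraph_threshold` would merge more lines into the same paragraph.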

Source code in edspdf/pipes/aggregators/simple.py

def __init__(
    self,
    pipeline: Pipeline = None,
    name: str = "simple-aggregator",
    sort: bool = False,
    new_line_threshold: float = 0.2,
    new_paragraph_threshold: float = 1.5,
    label_map: Dict[str, Union[str, List[str]]] = {},
) -> None:
    self.name = name
    self.sort = sort
    self.label_map = {
        label: [old_labels] if not isinstance(old_labels, list) else old_labels
        for label, old_labels in label_map.items()
    }
    self.new_line_threshold = new_line_threshold
    self.new_paragraph_threshold = new_paragraph_threshold
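The `label_map` normalization performed in the constructor can be tried in isolation. The sketch below reproduces only that dictionary comprehension; `normalize_label_map` and `targets_for` are hypothetical helper names, and how the component treats labels that are missing from the map is not covered here.

```python
from typing import Dict, List, Union


def normalize_label_map(
    label_map: Dict[str, Union[str, List[str]]],
) -> Dict[str, List[str]]:
    """Wrap single old labels in a list so every value is a list of old labels."""
    return {
        new_label: old_labels if isinstance(old_labels, list) else [old_labels]
        for new_label, old_labels in label_map.items()
    }


def targets_for(old_label: str, normalized: Dict[str, List[str]]) -> List[str]:
    """List the new labels that would receive lines carrying `old_label`."""
    return [new for new, olds in normalized.items() if old_label in olds]


normalized = normalize_label_map(
    {"text": ["title", "body", "table"], "title": "title"}
)
print(normalized)
print(targets_for("title", normalized))
```

Since each new label lists its old labels, a single old label such as "title" can feed several output fields at once, which matches the configuration example above.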