Visualisation

EDS-PDF provides utilities to help you visualise the output of the pipeline.

Visualising a pipeline's output

You can use EDS-PDF to overlay labelled bounding boxes on top of a PDF document.

import edspdf
from confit import Config
from pathlib import Path
from edspdf.visualization import show_annotations

config = """
[pipeline]
pipeline = ["extractor", "classifier"]

[components]

[components.extractor]
@factory = "pdfminer-extractor"
extract_style = true

[components.classifier]
@factory = "mask-classifier"
x0 = 0.25
x1 = 0.95
y0 = 0.3
y1 = 0.9
threshold = 0.1
"""

model = edspdf.load(Config.from_str(config))

# Get a PDF
pdf = Path("/Users/perceval/Development/edspdf/tests/resources/letter.pdf").read_bytes()

# Construct the DataFrame of blocs
doc = model(pdf)

# Compute an image representation of each page of the PDF
# overlaid with the predicted bounding boxes
imgs = show_annotations(pdf=pdf, annotations=doc.text_boxes)

imgs[0]

If you run this code in a Jupyter notebook, you'll see the following:

Merging blocs together

To help debug a pipeline (or a labelled dataset), you might want to merge blocs together according to their labels. EDS-PDF provides a merge_lines method that does just that.

# ↑ Omitted code above ↑
from edspdf.visualization import merge_boxes, show_annotations

merged = merge_boxes(doc.text_boxes)

imgs = show_annotations(pdf=pdf, annotations=merged)
imgs[0]

See the difference:

OriginalMerged

The merge_boxes method uses the notion of maximal cliques to compute merges. It forbids the combined blocs from overlapping with any bloc from another label.