Configuration

EDS-PDF is built on top of the confit configuration system.

The following catalogue registries are included within EDS-PDF:

Section	Description
`factory`	Component factories (most often classes)
`adapter`	Raw data preprocessing functions
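
As a rough illustration of what a registry does, the sketch below maps string names to callables and looks them up again. It is a self-contained toy and not EDS-PDF's actual implementation: the `Registry` class, the `toy-classifier` name and the `ToyClassifier` class are all hypothetical.

```python
from typing import Callable, Dict


class Registry:
    """Hypothetical stand-in for a catalogue registry: maps names to callables."""

    def __init__(self) -> None:
        self._entries: Dict[str, Callable] = {}

    def register(self, name: str) -> Callable[[Callable], Callable]:
        # Used as a decorator: stores the function or class under `name`.
        def decorator(obj: Callable) -> Callable:
            self._entries[name] = obj
            return obj

        return decorator

    def get(self, name: str) -> Callable:
        # Raises KeyError if nothing was registered under `name`.
        return self._entries[name]


# A toy "factory" registry (assumption: the real one behaves similarly)
factory = Registry()


@factory.register("toy-classifier")
class ToyClassifier:
    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold


# Look the class up by name and instantiate it with keyword arguments,
# which is conceptually what a configuration entry asks for.
component = factory.get("toy-classifier")(threshold=0.1)
```

Registering components by name is what lets a configuration file refer to them with a plain string.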

EDS-PDF pipelines are meant to be reproducible and serializable, such that you can always define a pipeline through the configuration system.

To wit, compare the API-based approach to the configuration-based approach (the two are strictly equivalent):

API-based approach:

import edspdf
from pathlib import Path

model = edspdf.Pipeline()
model.add_pipe("pdfminer-extractor", name="extractor")
model.add_pipe("mask-classifier", name="classifier", config=dict(
    x0=0.2,
    x1=0.9,
    y0=0.3,
    y1=0.6,
    threshold=0.1,
))
model.add_pipe("simple-aggregator", name="aggregator")

# Get a PDF
pdf = Path("letter.pdf").read_bytes()

pdf = model(pdf)

str(pdf.aggregated_texts["body"])
# Out: Cher Pr ABC, Cher DEF,\n...

Configuration-based approach, with the pipeline defined in config.cfg:

[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"

import edspdf
from pathlib import Path

pipeline = edspdf.load("config.cfg")

# Get a PDF
pdf = Path("letter.pdf").read_bytes()

pdf = pipeline(pdf)

str(pdf.aggregated_texts["body"])
# Out: Cher Pr ABC, Cher DEF,\n...

The configuration-based approach strictly separates the definition of the pipeline from its application and avoids tucking away important configuration details. Changes to the pipeline are transparent as there is a single source of truth: the configuration file.
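
To make the layout of such a file concrete, here is a minimal sketch that reads the configuration shown above using only the Python standard library. It does not use confit, which performs its own parsing and instantiates the components. The `read_config` helper and the JSON-based value decoding are assumptions made for illustration.

```python
import configparser
import json

CONFIG_TEXT = """
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"
"""


def read_config(text):
    """Hypothetical helper: parse the text into {section: {key: value}}."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {
        section: {
            # Values in this file happen to be valid JSON (strings, numbers, lists)
            key: json.loads(value)
            for key, value in parser[section].items()
        }
        for section in parser.sections()
    }


config = read_config(CONFIG_TEXT)

# Walk the pipeline in order and split each component section into
# the factory name and the remaining keyword arguments.
resolved = []
for name in config["pipeline"]["pipeline"]:
    params = dict(config[f"components.{name}"])
    factory_name = params.pop("@factory")
    resolved.append((name, factory_name, params))
```

Each entry of `resolved` then holds what is needed to rebuild one component: its name in the pipeline, the registered factory to call, and the parameters to pass to it.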