Key Concepts
The goal of EDS-PDF is to provide a framework for text extraction from PDF documents, along with some utilities and a few pipelines, stitched together by a robust configuration system powered by Thinc.
Organisation
The core object within EDS-PDF is the reader, which organises the extraction along four
well-defined steps:
- The extraction step extracts text blocs from the PDF and compiles them into a pandas DataFrame object, where each row relates to a single bloc.
- The transformation step is optional. It computes user-defined transformation on the data, to provide the classification algorithm with additional features.
- The classification step categorises each bloc, typically between
body,header,footer... - The aggregation step compiles the blocs together, exploiting the classification to re-create the original text.
Data Structure
EDS-PDF parses the PDF into a pandas DataFrame object where each row represents a text bloc. The DataFrame is carried all the way down to the aggregation step.
The following columns are reserved:
| Column | Description |
|---|---|
text |
Bloc text content |
page |
Page within the PDF (starting at 0) |
x0 |
Horizontal position of the top-left corner of the bloc bounding box |
x1 |
Horizontal position of the bottom-right corner of the bloc bounding box |
y0 |
Vertical position of the top-left corner of the bloc bounding box |
y1 |
Vertical position of the bottom-right corner of the bloc bounding box |
label |
Class of the bloc (eg body, header...) |
Position of bloc bounding boxes
The positional information (columns x0/x1/y0/y1) is normalised, and takes the top-left corner of
the page as reference.
Note that this contrasts with the PDF convention, which takes the bottom left corner as origin instead.
Some transformations may create their own columns. It's your responsibility to verify that the column names do not override each other.
We can review the different stages of the pipeline:
| Step | Input | Output | Description |
|---|---|---|---|
| Extraction | PDF (bytes) | DataFrame | Extracts text blocs from the PDF |
| Transformation | DataFrame | DataFrame | Compute hand-defined transformations on the blocs |
| Classification | DataFrame | DataFrame | Categorises each bloc |
| Aggregation | DataFrame | Dict | Re-creates the original text |
Configuration
Following the example of spaCy, EDS-PDF is organised around Explosion's
catalogue library, enabling a powerful configuration
system based on an extendable registry.
The following catalogues are included within EDS-PDF:
| Section | Description |
|---|---|
readers |
Top-level object, encapsulating a full EDS-PDF pipeline |
extractors |
Text bloc extraction models |
transforms |
Transformations that can be applied to each bloc before classification |
classifiers |
Classification routines (eg rule- or ml-based) |
misc |
Some miscellaneous utility functions |
Much like spaCy pipelines, EDS-PDF pipelines are meant to be reproducible and serialisable, such that the primary way to define a pipeline is through the configuration system.
To wit, compare the API-based approach to the configuration-based approach (the two are strictly equivalent):
from edspdf import aggregation, reading, extraction, classification
from pathlib import Path
reader = reading.PdfReader(
extractor=extraction.PdfMinerExtractor(),
classifier=classification.simple_mask_classifier_factory(
x0=0.2,
x1=0.9,
y0=0.3,
y1=0.6,
threshold=0.1,
),
aggregator=aggregation.SimpleAggregation(),
)
# Get a PDF
pdf = Path("letter.pdf").read_bytes()
texts = reader(pdf)
texts["body"]
# Out: Cher Pr ABC, Cher DEF,\n...
[reader]
@readers = "pdf-reader.v1"
[reader.extractor]
@extractors = "pdfminer.v1"
[reader.classifier]
@classifiers = "mask.v1"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1
[reader.aggregator]
@aggregators = "simple.v1"
from edspdf import registry, Config
from pathlib import Path
config = Config().from_disk("config.cfg")
reader = registry.resolve(config)["reader"]
# Get a PDF
pdf = Path("letter.pdf").read_bytes()
texts = reader(pdf)
texts["body"]
# Out: Cher Pr ABC, Cher DEF,\n...
The configuration-based approach strictly separates the definition of the pipeline to its application and avoids tucking away important configuration details. Changes to the pipeline are transparent as there is a single source of truth: the configuration file.
For more information on the configuration system, refer to the documentations of Thinc and spaCy.
Modularity and Extensibility
EDS-PDF includes everything you need to get started on text extraction, and ships with a number of trainable classifiers. But it also makes it extremely easy to extend its functionalities by designing new pipelines.