Changelog

v0.6.3 - 2023-01-23

Fixed

Corrupted PDFs no longer raise an error by default (they are treated as empty PDFs)
Fix classification and aggregation for empty PDFs

v0.6.2 - 2022-12-07

Cast bytes-like extractor inputs as bytes
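
As a rough illustration of what such a cast involves (a minimal sketch, not the library's actual code; `to_bytes` is a hypothetical helper name), bytes-like objects such as `bytearray` or `memoryview` can be normalised through the buffer protocol:

```python
def to_bytes(data):
    """Return `data` as an immutable bytes object.

    Hypothetical helper, for illustration only. Accepts any object that
    supports the buffer protocol (bytes, bytearray, memoryview, ...).
    memoryview() raises TypeError for str or int inputs, so invalid
    inputs fail loudly instead of being silently converted.
    """
    if isinstance(data, bytes):
        return data  # already the expected type, nothing to copy
    return memoryview(data).tobytes()
```

Going through `memoryview` rather than calling `bytes(data)` directly avoids a classic pitfall: `bytes(5)` does not fail, it returns five zero bytes.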

v0.6.1 - 2022-12-07

Performance and CUDA-related fixes.

v0.6.0 - 2022-12-05

Many, many changes:

Added torch as the main deep learning framework instead of spaCy and thinc
Added poppler and mupdf as alternatives to pdfminer
New pipeline / config / registry system to facilitate consistency between training and inference
Standardization of the exchange format between components with dataclass models (attrs, more specifically) instead of pandas dataframes

v0.5.3 - 2022-08-31

Added

Add label mapping parameter to aggregators (to merge different types of blocks such as title and body)
Improved line aggregation formula
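
The idea behind a label mapping can be sketched as follows. This is a minimal illustration under assumptions, not the aggregators' actual implementation: `merge_labels` and its arguments are hypothetical names, and labels are assumed to be plain strings.

```python
def merge_labels(labels, label_map):
    """Map each label through `label_map`, leaving unmapped labels as-is.

    Hypothetical helper, for illustration only: merging "title" into
    "body" lets both kinds of blocks be aggregated together.
    """
    return [label_map.get(label, label) for label in labels]


merged = merge_labels(["title", "body", "footer"], {"title": "body"})
# "title" is merged into "body"; "footer" has no mapping and is kept.
```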

v0.5.2 - 2022-08-30

Fixed

Fix aggregation for empty documents

v0.5.1 - 2022-07-26

Changed

Drop the pdf2image dependency, replacing it with pypdfium2 (easier installation)

v0.5.0 - 2022-07-25

Changed

Major refactoring of the library. Moved from concepts (aggregation) to plural names (aggregators).

v0.4.3 - 2022-07-20

Fixed

Alignment of multi-page boxes

v0.4.2 - 2022-07-06

Added

package-resource.v1 in the misc registry

v0.4.1 - 2022-06-14

Fixed

Remove importlib.metadata dependency, which led to issues with Python 3.7

v0.4.0 - 2022-06-14

Added

Python 3.7 support, by relaxing dependency constraints
Support for package-resource pipeline for sklearn-pipeline.v1

v0.3.2 - 2022-06-03

Added

compare_results in visualisation

v0.3.1 - 2022-06-02

Fixed

Rescale transform now keeps origin on top-left corner
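
To make the convention concrete, here is a minimal sketch of a rescaling step that keeps the origin in the top-left corner. It is written under assumptions and is not the library's code: `rescale_box` is a hypothetical name, and boxes are assumed to be `(x0, y0, x1, y1)` tuples normalised to the `[0, 1]` range.

```python
def rescale_box(box, width, height):
    """Scale a normalised (x0, y0, x1, y1) box to page dimensions.

    Hypothetical helper, for illustration only. The y axis is scaled
    but not flipped, so (0, 0) stays at the top-left corner of the page.
    """
    x0, y0, x1, y1 = box
    return (x0 * width, y0 * height, x1 * width, y1 * height)


scaled = rescale_box((0.0, 0.0, 0.5, 0.25), width=600, height=800)
```

A transform that instead computed `height - y * height` would move the origin to the bottom-left corner, which is the convention used by the PDF coordinate system itself.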

v0.3.0 - 2022-06-01

Added

Styles management within the extractor
styled.v1 aggregator, to handle styles
rescale.v1 transform, to go back to the original height and width

Changed

Style and text extraction are handled by the extractor directly
The PDFMiner line object is not carried around any more

Removed

Outdated params entry in the EDS-PDF registry.

v0.2.2 - 2022-05-12

Changed

Fixed merge_lines bug when lines were empty
Updated the demo accordingly

v0.2.1 - 2022-05-09

Changed

The extractor always returns a pandas DataFrame, even when it is empty. This improves robustness and stability.

v0.2.0 - 2022-05-09

Added

aggregation submodule to handle the specifics of aggregating text blocks
Base classes for better-defined modules
Uniformise the columns to labels
Add arbitrary contextual information

Removed

typer legacy dependency
models submodule, which handled the configurations for Spark distribution (deferred to another package)
specific orbis context, which was APHP-specific

v0.1.0 - 2022-05-06

Inception!

Features

spaCy-like configuration system
Available classifiers:
dummy.v1, that classifies everything to body
mask.v1, for simple rule-based classification
sklearn.v1, that uses a Scikit-Learn pipeline
random.v1, to better sow chaos
Merge different blocks together for easier visualisation
Streamlit demo with visualisation