Changelog

v0.8.0

Added

Add multi-modal transformers (huggingface-embedding) with windowing options
Add render_page option to pdfminer extractor, for multi-modal PDF features
Add inference utilities (accelerators), with simple mono process support and multi gpu / cpu support
Packaging utils (pipeline.package(...)) to make a pip installable package from a pipeline

Changed

Updated API to follow EDS-NLP's refactoring
Updated confit to 0.4.2 (better errors) and foldedtensor to 0.3.0 (better multiprocess support)
Removed pipeline.score. You should use pipeline.pipe, a custom scorer and pipeline.select_pipes instead.
Better test coverage
Use hatch instead of setuptools to build the package / docs and run the tests

Fixed

Fixed attrs dependency only being installed in dev mode

v0.7.0

Major refactoring of the library:

Core features

new pipeline system whose API is inspired by spaCy
first-class support for pytorch
hybrid model inference and training (rules + deep learning)
moved from pandas DataFrame to attrs dataclasses (PDFDoc, Page, Box, ...) for representing PDF documents
new configuration system based on [config][https://github.com/aphp/config], with support for instantiation of complex deep learning models, off-the-shelf CLI, ...

Functional features

new extractors: pymupdf and poppler (separate packages for licensing reasons)
many deep learning layers (box-transformer, 2d attention with relative position information, ...)
trainable deep learning classifier
training recipes for deep learning models

v0.6.3 - 2023-01-23

Fixed

Allow corrupted PDF to not raise an error by default (they are treated as empty PDFs)
Fix classification and aggregation for empty PDFs

v0.6.2 - 2022-12-07

Cast bytes-like extractor inputs as bytes

v0.6.1 - 2022-12-07

Performance and cuda related fixes.

v0.6.0 - 2022-12-05

Many, many changes: - added torch as the main deep learning framework instead of spaCy and thinc - added poppler and mupdf as alternatives to pdfminer - new pipeline / config / registry system to facilitate consistency between training and inference - standardization of the exchange format between components with dataclass models (attrs more specifically) instead of pandas dataframes

v0.5.3 - 2022-08-31

Added

Add label mapping parameter to aggregators (to merge different types of blocks such as title and body)
Improved line aggregation formula

v0.5.2 - 2022-08-30

Fixed

Fix aggregation for empty documents

v0.5.1 - 2022-07-26

Changed

Drop the pdf2image dependency, replacing it with pypdfium2 (easier installation)

v0.5.0 - 2022-07-25

Changed

Major refactoring of the library. Moved from concepts (aggregation) to plural names (aggregators).

v0.4.3 - 2022-07-20

Fixed

Multi page boxes alignment

v0.4.2 - 2022-07-06

Added

package-resource.v1 in the misc registry

v0.4.1 - 2022-06-14

Fixed

Remove importlib.metadata dependency, which led to issues with Python 3.7

v0.4.0 - 2022-06-14

Added

Python 3.7 support, by relaxing dependency constraints
Support for package-resource pipeline for sklearn-pipeline.v1

v0.3.2 - 2022-06-03

Added

compare_results in visualisation

v0.3.1 - 2022-06-02

Fixed

Rescale transform now keeps origin on top-left corner

v0.3.0 - 2022-06-01

Added

Styles management within the extractor
styled.v1 aggregator, to handle styles
rescale.v1 transform, to go back to the original height and width

Changed

Styles and text extraction is handled by the extractor directly
The PDFMiner line object is not carried around any more

Removed

Outdated params entry in the EDS-PDF registry.

v0.2.2 - 2022-05-12

Changed

Fixed merge_lines bug when lines were empty
Modified the demo consequently

v0.2.1 - 2022-05-09

Changed

The extractor always returns a pandas DataFrame, be it empty. It enhances robustness and stability.

v0.2.0 - 2022-05-09

Added

aggregation submodule to handle the specifics of aggregating text blocs
Base classes for better-defined modules
Uniformise the columns to labels
Add arbitrary contextual information

Removed

typer legacy dependency
models submodule, which handled the configurations for Spark distribution (deferred to another package)
specific orbis context, which was APHP-specific

v0.1.0 - 2022-05-06

Inception !

Features

spaCy-like configuration system
Available classifiers :
dummy.v1, that classifies everything to body
mask.v1, for simple rule-based classification
sklearn.v1, that uses a Scikit-Learn pipeline
random.v1, to better sow chaos
Merge different blocs together for easier visualisation
Streamlit demo with visualisation