Changelog

v0.8.0

Added

  • Add multi-modal transformers (huggingface-embedding) with windowing options
  • Add render_page option to pdfminer extractor, for multi-modal PDF features
  • Add inference utilities (accelerators), with simple single-process support and multi-GPU / CPU support
  • Add packaging utils (pipeline.package(...)) to make a pip-installable package from a pipeline

Changed

  • Updated API to follow EDS-NLP's refactoring
  • Updated confit to 0.4.2 (better errors) and foldedtensor to 0.3.0 (better multiprocess support)
  • Removed pipeline.score. You should use pipeline.pipe, a custom scorer and pipeline.select_pipes instead.
  • Better test coverage
  • Use hatch instead of setuptools to build the package / docs and run the tests
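
As an illustration of the "custom scorer" mentioned in the pipeline.score removal note, here is a minimal sketch of a scoring function. The LabelledBox and LabelledDoc classes and the accuracy_scorer name are hypothetical stand-ins, not part of the library; real documents would come from pipeline.pipe.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class LabelledBox:
    """Hypothetical stand-in for a box carrying a predicted or gold label."""
    label: str


@dataclass
class LabelledDoc:
    """Hypothetical stand-in for a document: just a list of labelled boxes."""
    boxes: List[LabelledBox]


def accuracy_scorer(
    predicted: Iterable[LabelledDoc],
    gold: Iterable[LabelledDoc],
) -> Dict[str, float]:
    """Compare box labels pairwise and return the fraction that match."""
    correct = 0
    total = 0
    for pred_doc, gold_doc in zip(predicted, gold):
        for pred_box, gold_box in zip(pred_doc.boxes, gold_doc.boxes):
            correct += int(pred_box.label == gold_box.label)
            total += 1
    # Avoid a division by zero when there is nothing to score
    return {"accuracy": correct / total if total else 0.0}
```

In practice the predicted documents would presumably be the output of pipeline.pipe, with pipeline.select_pipes used to restrict which components run; check the library documentation for the exact signatures.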

Fixed

  • Fixed attrs dependency only being installed in dev mode

v0.7.0

Major refactoring of the library:

Core features

  • new pipeline system whose API is inspired by spaCy
  • first-class support for PyTorch
  • hybrid model inference and training (rules + deep learning)
  • moved from pandas DataFrame to attrs dataclasses (PDFDoc, Page, Box, ...) for representing PDF documents
  • new configuration system based on confit (https://github.com/aphp/confit), with support for instantiation of complex deep learning models, off-the-shelf CLI, ...

Functional features

  • new extractors: pymupdf and poppler (separate packages for licensing reasons)
  • many deep learning layers (box-transformer, 2d attention with relative position information, ...)
  • trainable deep learning classifier
  • training recipes for deep learning models

v0.6.3 - 2023-01-23

Fixed

  • Corrupted PDFs no longer raise an error by default (they are treated as empty PDFs)
  • Fix classification and aggregation for empty PDFs

v0.6.2 - 2022-12-07

Cast bytes-like extractor inputs as bytes

v0.6.1 - 2022-12-07

Performance and CUDA-related fixes.

v0.6.0 - 2022-12-05

Many, many changes:

  • added torch as the main deep learning framework instead of spaCy and thinc 🎉
  • added poppler and mupdf as alternatives to pdfminer
  • new pipeline / config / registry system to facilitate consistency between training and inference
  • standardization of the exchange format between components with dataclass models (attrs more specifically) instead of pandas dataframes

v0.5.3 - 2022-08-31

Added

  • Add label mapping parameter to aggregators (to merge different types of blocks such as title and body)
  • Improved line aggregation formula
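
The label mapping idea can be sketched as follows. This is only an illustration under assumed names: aggregate_by_label and label_map are hypothetical and do not reflect the aggregators' actual signature. Lines whose label appears in the mapping are merged into the target label before their text is joined.

```python
from typing import Dict, List, Tuple


def aggregate_by_label(
    lines: List[Tuple[str, str]],
    label_map: Dict[str, str],
) -> Dict[str, str]:
    """Join the text of (label, text) lines, after remapping labels."""
    merged: Dict[str, List[str]] = {}
    for label, text in lines:
        # Labels absent from the mapping are kept unchanged
        target = label_map.get(label, label)
        merged.setdefault(target, []).append(text)
    return {label: "\n".join(texts) for label, texts in merged.items()}
```

For example, passing {"title": "body"} as the mapping would fold title lines into the body text while leaving other labels untouched.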

v0.5.2 - 2022-08-30

Fixed

  • Fix aggregation for empty documents

v0.5.1 - 2022-07-26

Changed

  • Drop the pdf2image dependency, replacing it with pypdfium2 (easier installation)

v0.5.0 - 2022-07-25

Changed

  • Major refactoring of the library. Moved from concepts (aggregation) to plural names (aggregators).

v0.4.3 - 2022-07-20

Fixed

  • Multi-page boxes alignment

v0.4.2 - 2022-07-06

Added

  • package-resource.v1 in the misc registry

v0.4.1 - 2022-06-14

Fixed

  • Remove importlib.metadata dependency, which led to issues with Python 3.7

v0.4.0 - 2022-06-14

Added

  • Python 3.7 support, by relaxing dependency constraints
  • Support for package-resource pipeline for sklearn-pipeline.v1

v0.3.2 - 2022-06-03

Added

  • compare_results in visualisation

v0.3.1 - 2022-06-02

Fixed

  • Rescale transform now keeps origin on top-left corner
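
A minimal sketch of what a rescale that keeps the origin in the top-left corner might look like, assuming boxes are stored with coordinates normalised to [0, 1] and y measured downward from the top of the page. The Box tuple and rescale function are hypothetical illustrations, not the library's implementation.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical box: (x0, y0) is the top-left corner, (x1, y1) the bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float


def rescale(box: Box, page_width: float, page_height: float) -> Box:
    """Scale normalised coordinates to page units without flipping the y axis."""
    return Box(
        x0=box.x0 * page_width,
        y0=box.y0 * page_height,  # y still grows downward: origin stays top-left
        x1=box.x1 * page_width,
        y1=box.y1 * page_height,
    )
```

Because no axis is flipped, a box at the normalised origin stays at the origin after rescaling.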

v0.3.0 - 2022-06-01

Added

  • Styles management within the extractor
  • styled.v1 aggregator, to handle styles
  • rescale.v1 transform, to go back to the original height and width

Changed

  • Styles and text extraction are handled by the extractor directly
  • The PDFMiner line object is not carried around any more

Removed

  • Outdated params entry in the EDS-PDF registry.

v0.2.2 - 2022-05-12

Changed

  • Fixed merge_lines bug when lines were empty
  • Modified the demo accordingly

v0.2.1 - 2022-05-09

Changed

  • The extractor always returns a pandas DataFrame, even if it is empty, which improves robustness and stability.

v0.2.0 - 2022-05-09

Added

  • aggregation submodule to handle the specifics of aggregating text blocks
  • Base classes for better-defined modules
  • Uniformise the columns to labels
  • Add arbitrary contextual information

Removed

  • typer legacy dependency
  • models submodule, which handled the configurations for Spark distribution (deferred to another package)
  • specific orbis context, which was APHP-specific

v0.1.0 - 2022-05-06

Inception! 🎉

Features

  • spaCy-like configuration system
  • Available classifiers:
      • dummy.v1, that classifies everything to body
      • mask.v1, for simple rule-based classification
      • sklearn.v1, that uses a Scikit-Learn pipeline
      • random.v1, to better sow chaos
  • Merge different blocks together for easier visualisation
  • Streamlit demo with visualisation