Changelog
v0.6.3 - 2023-01-23
Fixed
- Corrupted PDFs no longer raise an error by default (they are treated as empty PDFs)
- Fix classification and aggregation for empty PDFs
v0.6.2 - 2022-12-07
Cast bytes-like extractor inputs as bytes
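As an illustration only, the cast described in this entry can be sketched in plain Python. The helper name `to_bytes` is hypothetical and not part of the library's API; it only shows how bytes-like objects such as `bytearray` or `memoryview` can be normalised to `bytes`.

```python
def to_bytes(data):
    """Hypothetical helper: normalise a bytes-like input to an immutable bytes object."""
    if isinstance(data, bytes):
        return data
    # bytearray, memoryview and other buffer-protocol objects are copied into bytes;
    # non bytes-like inputs (e.g. str) raise a TypeError here
    return bytes(memoryview(data))
```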
v0.6.1 - 2022-12-07
Performance and CUDA-related fixes.
v0.6.0 - 2022-12-05
Many, many changes:
- added torch as the main deep learning framework instead of spaCy and thinc
- added poppler and mupdf as alternatives to pdfminer
- new pipeline / config / registry system to facilitate consistency between training and inference
- standardization of the exchange format between components with dataclass models (attrs more specifically) instead of pandas dataframes
v0.5.3 - 2022-08-31
Added
- Add label mapping parameter to aggregators (to merge different types of blocks such as title and body)
- Improved line aggregation formula
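The label mapping introduced in this release can be pictured as a dictionary lookup applied to block labels before aggregation. A minimal sketch, assuming a hypothetical `merge_labels` function and a plain `dict` mapping; the library's actual parameter name and behaviour may differ.

```python
def merge_labels(labels, label_map):
    """Hypothetical helper: rename labels so that several block types share one label."""
    # Labels missing from the mapping are kept unchanged
    return [label_map.get(label, label) for label in labels]


# Treat "title" blocks as "body" blocks; other labels pass through untouched
merged = merge_labels(["title", "body", "footer"], {"title": "body"})
```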
v0.5.2 - 2022-08-30
Fixed
- Fix aggregation for empty documents
v0.5.1 - 2022-07-26
Changed
- Drop the pdf2image dependency, replacing it with pypdfium2 (easier installation)
v0.5.0 - 2022-07-25
Changed
- Major refactoring of the library. Moved from concepts (aggregation) to plural names (aggregators).
v0.4.3 - 2022-07-20
Fixed
- Multi page boxes alignment
v0.4.2 - 2022-07-06
Added
- package-resource.v1 in the misc registry
v0.4.1 - 2022-06-14
Fixed
- Remove importlib.metadata dependency, which led to issues with Python 3.7
v0.4.0 - 2022-06-14
Added
- Python 3.7 support, by relaxing dependency constraints
- Support for package-resource pipeline for sklearn-pipeline.v1
v0.3.2 - 2022-06-03
Added
- compare_results in visualisation
v0.3.1 - 2022-06-02
Fixed
- Rescale transform now keeps origin on top-left corner
v0.3.0 - 2022-06-01
Added
- Styles management within the extractor
- styled.v1 aggregator, to handle styles
- rescale.v1 transform, to go back to the original height and width
Changed
- Styles and text extraction are handled by the extractor directly
- The PDFMiner line object is not carried around any more
Removed
- Outdated params entry in the EDS-PDF registry.
v0.2.2 - 2022-05-12
Changed
- Fixed merge_lines bug when lines were empty
- Modified the demo accordingly
v0.2.1 - 2022-05-09
Changed
- The extractor always returns a pandas DataFrame, even if it is empty. This improves robustness and stability.
v0.2.0 - 2022-05-09
Added
- aggregation submodule to handle the specifics of aggregating text blocks
- Base classes for better-defined modules
- Uniformise the columns to labels
- Add arbitrary contextual information
Removed
- typer legacy dependency
- models submodule, which handled the configurations for Spark distribution (deferred to another package)
- specific orbis context, which was APHP-specific
v0.1.0 - 2022-05-06
Inception!
Features
- spaCy-like configuration system
- Available classifiers:
  - dummy.v1, that classifies everything to body
  - mask.v1, for simple rule-based classification
  - sklearn.v1, that uses a Scikit-Learn pipeline
  - random.v1, to better sow chaos
- Merge different blocks together for easier visualisation
- Streamlit demo with visualisation