Changelog
v0.10.0
Added
- Support packaging models made in setuptools-based projects
Fixed
- Support packaging with poetry 2.0
Changed
- Handle cases like a distant superscript ("³ something") where the superscript and the rest of the text are parsed as two lines, one above the other, when they should be on the same line.
v0.9.3
- Support pydantic v2
v0.9.2
Changed
- Default to fp16 when inferring with gpu
- Support inputs parameter in TrainablePipe.postprocess(...) method (as in edsnlp)
- We now check that the user isn't trying to write a single file in a split fashion (when write_in_worker is True or num_rows_per_file is not None) and raise an error if they do
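The split-write check above can be pictured with a small sketch. This is not the library's implementation: the function name check_single_file_write, its signature, and the rule that a path with a file suffix designates a single file are assumptions made for illustration. Only the two triggering conditions come from the entry above.

```python
from pathlib import Path
from typing import Optional


def check_single_file_write(
    path: str,
    write_in_worker: bool = False,
    num_rows_per_file: Optional[int] = None,
) -> None:
    """Raise if a single-file target is combined with a split-writing option."""
    # Assumption: a path that has a suffix (e.g. ".parquet") is a single file,
    # while a path without one is a directory that may hold several files.
    is_single_file = Path(path).suffix != ""
    split_requested = write_in_worker or num_rows_per_file is not None
    if is_single_file and split_requested:
        raise ValueError(
            f"Cannot write the single file {path!r} in a split fashion: "
            "pass a directory, or disable write_in_worker / num_rows_per_file."
        )
```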
Fixed
- Batches full of empty content boxes no longer crash the huggingface-embedding component
- Ensure models are always loaded in non training mode
- Improved performance of edspdf.data methods over a filesystem (fs parameter)
v0.9.1
Fixed
- It is now possible to recursively retrieve pdf files in a directory using edspdf.data.read_files
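For readers unfamiliar with recursive retrieval, the behaviour can be illustrated with the standard library alone. The helper below, list_pdf_files, is a hypothetical name and does not reflect the signature or internals of edspdf.data.read_files. It only shows what "recursive" means for a directory tree.

```python
from pathlib import Path
from typing import List


def list_pdf_files(root: str, recursive: bool = True) -> List[Path]:
    """Return the PDF files under root, optionally descending into subfolders."""
    base = Path(root)
    # "**/" matches zero or more directory levels, so top-level files are kept.
    pattern = "**/*.pdf" if recursive else "*.pdf"
    return [p for p in base.glob(pattern) if p.is_file()]
```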
v0.9.0
Added
- New unified edspdf.data API (pdf files, pandas, parquet) and LazyCollection object to efficiently read / write data from / to different formats & sources. This API has been heavily inspired by the edsnlp.data API.
- New unified processing API to select the execution backend via data.set_processing(...) to replace the old accelerators API (which is now deprecated, but still available).
- huggingface-embedding now supports quantization and other AutoModel.from_pretrained kwargs
- It is now possible to convert a label to multiple labels in the simple-aggregator component:
# To build the "text" field, we will aggregate "title", "body" and "table" lines,
# and output "title" lines in a separate field as well.
label_map = {
"text" : [ "title", "body", "table" ],
"title": "title",
}
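To make the mapping semantics concrete, here is a minimal sketch of how such a label_map could be applied to labelled lines. The function aggregate_by_label and the (label, text) line representation are assumptions for illustration, not the simple-aggregator implementation.

```python
from typing import Dict, List, Sequence, Tuple, Union

LabelMap = Dict[str, Union[str, Sequence[str]]]


def aggregate_by_label(
    lines: List[Tuple[str, str]], label_map: LabelMap
) -> Dict[str, str]:
    """Build one text field per label_map key from (label, text) lines."""
    fields = {}
    for field, sources in label_map.items():
        # A single label and a list of labels are handled the same way.
        if isinstance(sources, str):
            sources = [sources]
        # Keep reading order; a line may feed several output fields.
        fields[field] = "\n".join(
            text for label, text in lines if label in sources
        )
    return fields


label_map = {"text": ["title", "body", "table"], "title": "title"}
lines = [
    ("title", "Report"),
    ("body", "First paragraph."),
    ("footer", "Page 1"),
]
fields = aggregate_by_label(lines, label_map)
```

In this sketch the "title" line ends up in both the "text" and "title" fields, while the "footer" line, which no field lists, is dropped.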
Fixed
- huggingface-embedding now resizes bbox features for large PDFs, instead of making the model crash
- huggingface-embedding and sub-box-cnn-pooler now handle empty PDFs correctly
v0.8.1
Fixed
- Fix typing to allow passing an accelerator dict to Pipeline.pipe(...)
- Removed multiprocessing accelerator debug output
- Fixed absolute links in github-pages docs (e.g. image assets)
Changed
- Added auto-links to components in the docs (by comparing span contents with entry points)
v0.8.0
Added
- Add multi-modal transformers (huggingface-embedding) with windowing options
- Add render_page option to pdfminer extractor, for multi-modal PDF features
- Add inference utilities (accelerators), with simple mono process support and multi gpu / cpu support
- Packaging utils (pipeline.package(...)) to make a pip installable package from a pipeline
Changed
- Updated API to follow EDS-NLP's refactoring
- Updated confit to 0.4.2 (better errors) and foldedtensor to 0.3.0 (better multiprocess support)
- Removed pipeline.score. You should use pipeline.pipe, a custom scorer and pipeline.select_pipes instead.
- Better test coverage
- Use hatch instead of setuptools to build the package / docs and run the tests
Fixed
- Fixed attrs dependency only being installed in dev mode
v0.7.0
Major refactoring of the library:
Core features
- new pipeline system whose API is inspired by spaCy
- first-class support for pytorch
- hybrid model inference and training (rules + deep learning)
- moved from pandas DataFrame to attrs dataclasses (PDFDoc, Page, Box, ...) for representing PDF documents
- new configuration system based on [config](https://github.com/aphp/config), with support for instantiation of complex deep learning models, off-the-shelf CLI, ...
Functional features
- new extractors: pymupdf and poppler (separate packages for licensing reasons)
- many deep learning layers (box-transformer, 2d attention with relative position information, ...)
- trainable deep learning classifier
- training recipes for deep learning models
v0.6.3 - 2023-01-23
Fixed
- Allow corrupted PDF to not raise an error by default (they are treated as empty PDFs)
- Fix classification and aggregation for empty PDFs
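The "corrupted PDFs are treated as empty" behaviour amounts to a guarded extraction call. The sketch below is illustrative only: extract_lines_safely, the raise_on_error flag and the list-of-strings return type are hypothetical and are not taken from the library.

```python
from typing import Callable, List


def extract_lines_safely(
    pdf: bytes,
    extractor: Callable[[bytes], List[str]],
    raise_on_error: bool = False,
) -> List[str]:
    """Run extractor on pdf, falling back to an empty result on failure."""
    try:
        return extractor(pdf)
    except Exception:
        # Opt-in strict mode re-raises; the default treats the PDF as empty.
        if raise_on_error:
            raise
        return []
```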
v0.6.2 - 2022-12-07
Cast bytes-like extractor inputs as bytes
v0.6.1 - 2022-12-07
Performance and cuda related fixes.
v0.6.0 - 2022-12-05
Many, many changes:
- added torch as the main deep learning framework instead of spaCy and thinc
- added poppler and mupdf as alternatives to pdfminer
- new pipeline / config / registry system to facilitate consistency between training and inference
- standardization of the exchange format between components with dataclass models (attrs more specifically) instead of pandas dataframes
v0.5.3 - 2022-08-31
Added
- Add label mapping parameter to aggregators (to merge different types of blocks such as title and body)
- Improved line aggregation formula
v0.5.2 - 2022-08-30
Fixed
- Fix aggregation for empty documents
v0.5.1 - 2022-07-26
Changed
- Drop the pdf2image dependency, replacing it with pypdfium2 (easier installation)
v0.5.0 - 2022-07-25
Changed
- Major refactoring of the library. Moved from concepts (aggregation) to plural names (aggregators).
v0.4.3 - 2022-07-20
Fixed
- Multi page boxes alignment
v0.4.2 - 2022-07-06
Added
- package-resource.v1 in the misc registry
v0.4.1 - 2022-06-14
Fixed
- Remove importlib.metadata dependency, which led to issues with Python 3.7
v0.4.0 - 2022-06-14
Added
- Python 3.7 support, by relaxing dependency constraints
- Support for package-resource pipeline for sklearn-pipeline.v1
v0.3.2 - 2022-06-03
Added
- compare_results in visualisation
v0.3.1 - 2022-06-02
Fixed
- Rescale transform now keeps origin on top-left corner
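As an illustration of what keeping the origin on the top-left corner means, the sketch below rescales a box from relative coordinates to page units without flipping the vertical axis. The BBox type, the rescale function and the relative-coordinate convention are assumptions for this example, not the library's data model.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Box corners, with (0, 0) at the top-left corner of the page."""

    x0: float
    y0: float
    x1: float
    y1: float


def rescale(box: BBox, width: float, height: float) -> BBox:
    """Map relative coordinates in [0, 1] to page units, origin unchanged."""
    # y still grows downwards after rescaling: no "height - y" flip is applied.
    return BBox(
        box.x0 * width,
        box.y0 * height,
        box.x1 * width,
        box.y1 * height,
    )
```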
v0.3.0 - 2022-06-01
Added
- Styles management within the extractor
- styled.v1 aggregator, to handle styles
- rescale.v1 transform, to go back to the original height and width
Changed
- Style and text extraction are handled directly by the extractor
- The PDFMiner line object is not carried around any more
Removed
- Outdated params entry in the EDS-PDF registry.
v0.2.2 - 2022-05-12
Changed
- Fixed merge_lines bug when lines were empty
- Modified the demo accordingly
v0.2.1 - 2022-05-09
Changed
- The extractor always returns a pandas DataFrame, even if it is empty. This improves robustness and stability.
v0.2.0 - 2022-05-09
Added
- aggregation submodule to handle the specifics of aggregating text blocks
- Base classes for better-defined modules
- Uniformise the columns to labels
- Add arbitrary contextual information
Removed
- typer legacy dependency
- models submodule, which handled the configurations for Spark distribution (deferred to another package)
- specific orbis context, which was APHP-specific
v0.1.0 - 2022-05-06
Inception!
Features
- spaCy-like configuration system
- Available classifiers:
  - dummy.v1, that classifies everything to body
  - mask.v1, for simple rule-based classification
  - sklearn.v1, that uses a Scikit-Learn pipeline
  - random.v1, to better sow chaos
- Merge different blocks together for easier visualisation
- Streamlit demo with visualisation