Roadmap
- Style extraction
- Custom hybrid torch-based pipeline & configuration system
- Drop pandas DataFrame in favour of a ~~Cython~~ attr wrapper around PDF documents?
- Add training capabilities with a CLI to automate the annotation/preparation/training loop.
  Again, draw inspiration from spaCy, and maybe add the notion of a `TrainableClassifier`...
- Add complete serialisation capabilities, to save a full pipeline to disk.
  Draw inspiration from spaCy, which took great care to solve these issues:
  add `save` and `load` methods to every pipeline component
- Multiple-column extraction
- Table detector
- Integrate third-party OCR module
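
To make the serialisation item more concrete, here is a minimal sketch of what `save` and `load` methods on every pipeline component could look like. Nothing below exists in the library yet: the `Component` and `Pipeline` classes, the `config.json` and `meta.json` file names and the one-directory-per-component layout are all assumptions made for illustration.

```python
import json
from pathlib import Path


class Component:
    """Hypothetical base class: each component can save and load itself."""

    def __init__(self, **config):
        self.config = config

    def save(self, path: Path) -> None:
        # Each component owns one directory and writes its own state there.
        path.mkdir(parents=True, exist_ok=True)
        (path / "config.json").write_text(json.dumps(self.config))

    @classmethod
    def load(cls, path: Path) -> "Component":
        config = json.loads((path / "config.json").read_text())
        return cls(**config)


class Pipeline:
    """Hypothetical container that delegates serialisation to its components."""

    def __init__(self):
        self.components = {}  # name -> Component, kept in insertion order

    def add(self, name: str, component: Component) -> None:
        self.components[name] = component

    def save(self, path: Path) -> None:
        path.mkdir(parents=True, exist_ok=True)
        # Record the component order so that load() rebuilds the same pipeline.
        (path / "meta.json").write_text(json.dumps(list(self.components)))
        for name, component in self.components.items():
            component.save(path / name)

    @classmethod
    def load(cls, path: Path) -> "Pipeline":
        pipeline = cls()
        for name in json.loads((path / "meta.json").read_text()):
            # Simplification: every component is restored as a plain Component.
            pipeline.add(name, Component.load(path / name))
        return pipeline
```

A real implementation would also need a registry that maps each saved component back to its concrete class, and trainable components would have to write their weights next to their configuration.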