Roadmap

Style extraction
Custom hybrid torch-based pipeline & configuration system
Drop pandas DataFrame in favour of a ~~Cython~~ attr wrapper around PDF documents?
Add training capabilities with a CLI to automate the annotation/preparation/training loop. Again, draw inspiration from spaCy, and maybe add the notion of a TrainableClassifier...
Add complete serialisation capabilities, to save a full pipeline to disk. Draw inspiration from spaCy, which took great care to solve these issues: add save and load methods to every pipeline component
Multiple-column extraction
Table detector
Integrate third-party OCR module