Poppler Extractor

We provide a PDF line extractor built on top of Poppler.

Poppler is more difficult to install than the libraries behind the pdfminer-extractor and mupdf-extractor components. In particular, the bindings we provide have not been tested on Windows.

License

Beware: Poppler is distributed under the GPL license. This component is therefore GPL-licensed as well, and so must be any model that depends on it.

Installation

For the licensing reason mentioned above, the poppler-extractor component is distributed in a separate package, edspdf-poppler. To install it, use your favorite Python package manager:

poetry add edspdf-poppler
# or
pip install edspdf-poppler
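
Since the package is optional and installation can fail on some platforms, it may be useful to check for it before building a pipeline. The snippet below is a minimal sketch that relies only on the standard library. The import name `edspdf_poppler` is an assumption derived from the package name and should be checked against your installation.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "edspdf_poppler" is an assumed import name (not confirmed by this page).
if not is_installed("edspdf_poppler"):
    print("edspdf-poppler not found: run `pip install edspdf-poppler`")
```

A check like this lets an application fall back to another extractor instead of failing when the component is added.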

Example

API-based:

# `pipeline` is an existing Pipeline object
pipeline.add_pipe(
    "poppler-extractor",
    config=dict(
        extract_style=False,
    ),
)

Configuration-based:

[components.poppler-extractor]
@factory = "poppler-extractor"
extract_style = false

Once the component is added, use the pipeline as follows:

from pathlib import Path

# Apply on a new document
pipeline(Path("path/to/your/pdf/document").read_bytes())
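
To process several documents, the same call can be wrapped in a small helper. This is a sketch only: `apply_to_folder` is a hypothetical name, and the one assumption it makes about the pipeline is that it can be called on the raw bytes of a PDF, as in the snippet above.

```python
from pathlib import Path
from typing import Any, Callable, Iterator, Tuple


def apply_to_folder(
    pipeline: Callable[[bytes], Any], folder: Path
) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, result) for each *.pdf file in `folder`, in name order."""
    for path in sorted(folder.glob("*.pdf")):
        # Same call as for a single document: pass the file content as bytes
        yield path, pipeline(path.read_bytes())
```

The helper is a generator, so documents are read and processed one at a time rather than loaded all at once.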

PARAMETER	DESCRIPTION	TYPE	DEFAULT
`pipeline`	The pipeline object	`Pipeline`	`None`
`name`	The name of the component	`str`	`'poppler-extractor'`
`extract_style`	Whether to extract style information	`bool`	`False`
`raise_on_error`	Whether to raise an error when parsing a corrupted PDF	`bool`	`False`
`sort_mode`	Box sorting mode. "lines": sort by lines, without preserving the order of lines inside blocks; "none": do not sort boxes	`Literal['lines', 'none']`	`'none'`
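
To give an intuition for the two `sort_mode` values, here is a self-contained illustration on toy data. It is not the component's implementation: the `Box` class, the `sort_boxes` function and the `tol` tolerance are all hypothetical, and the grouping rule (boxes whose top coordinates differ by at most `tol` belong to the same line) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical text box: its text and the coordinates of its top-left corner."""
    text: str
    x0: float
    y0: float


def sort_boxes(boxes: List[Box], sort_mode: str = "none", tol: float = 2.0) -> List[Box]:
    if sort_mode == "none":
        # Keep the boxes in the order they were given
        return list(boxes)
    if sort_mode != "lines":
        raise ValueError(f"Unknown sort_mode: {sort_mode!r}")
    # Group boxes into lines, scanning from top to bottom
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b.y0):
        if lines and abs(box.y0 - lines[-1][0].y0) <= tol:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Within each line, order boxes from left to right
    return [b for line in lines for b in sorted(line, key=lambda b: b.x0)]


boxes = [
    Box("world", 50, 10.5),
    Box("Second", 0, 30),
    Box("Hello", 0, 10),
    Box("line", 60, 30.2),
]
print([b.text for b in sort_boxes(boxes, "none")])   # ['world', 'Second', 'Hello', 'line']
print([b.text for b in sort_boxes(boxes, "lines")])  # ['Hello', 'world', 'Second', 'line']
```

With "lines", the reading order follows the page from top to bottom and left to right, regardless of which block each box came from.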