Poppler Extractor

We provide a PDF line extractor built on top of Poppler.

The Poppler software is more difficult to install than the dependencies of the pdfminer-extractor and mupdf-extractor components. In particular, the bindings we provide have not been tested on Windows.

License

Beware: Poppler is distributed under the GPL license. This component is therefore GPL-licensed as well, and so must be any model that depends on it.

Installation

For the licensing reason mentioned above, the poppler-extractor component is distributed in a separate package, edspdf-poppler. To install it, use your favorite Python package manager:

poetry add edspdf-poppler
# or
pip install edspdf-poppler
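
After installing, it can be useful to confirm that the distribution is visible to the current Python environment. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper, not part of edspdf, and it only assumes the distribution name edspdf-poppler given above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of `distribution`, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("edspdf-poppler"))
```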

Example

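Add the component to a pipeline, either through the Python API:

# `pipeline` is an existing Pipeline object
pipeline.add_pipe(
    "poppler-extractor",
    config=dict(
        extract_style=False,
    ),
)

or through a configuration file:

[components.poppler-extractor]
@factory = "poppler-extractor"
extract_style = false

and use it as follows: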

from pathlib import Path

# Apply on a new document
pipeline(Path("path/to/your/pdf/document").read_bytes())
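
When several documents need processing, the single call above extends naturally to a loop. The following is a minimal sketch, not part of edspdf: `apply_to_directory` is a hypothetical helper, and it only assumes that `pipeline` can be called on the raw bytes of a PDF, as in the example above. Failures are collected rather than propagated, so one unreadable file does not stop the batch; this matters mostly when the extractor is configured with raise_on_error=True.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Tuple


def apply_to_directory(
    pipeline: Callable[[bytes], Any],
    directory: Path,
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Apply `pipeline` to every *.pdf file in `directory` (hypothetical helper).

    Returns (results, errors), both keyed by file name.
    """
    results: Dict[str, Any] = {}
    errors: Dict[str, Exception] = {}
    # Sort the paths so the processing order is deterministic.
    for path in sorted(Path(directory).glob("*.pdf")):
        try:
            results[path.name] = pipeline(path.read_bytes())
        except Exception as exc:
            # Keep going, but remember which file failed and why.
            errors[path.name] = exc
    return results, errors
```

Called as `results, errors = apply_to_directory(pipeline, Path("path/to/your/pdf/folder"))`, it leaves the decision of what to do with failed files to the caller.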

Parameters

pipeline

The pipeline object

TYPE: Pipeline DEFAULT: None

name

The name of the component

TYPE: str DEFAULT: 'poppler-extractor'

extract_style

Whether to extract style information along with the text

TYPE: bool DEFAULT: False

raise_on_error

Whether to raise an error when parsing a corrupted PDF

TYPE: bool DEFAULT: False

sort_mode

Box sorting mode

  • "lines": sort by lines, without preserving the order of lines inside blocks
  • "none": do not sort boxes

TYPE: Literal['lines', 'none'] DEFAULT: 'none'
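
To make the difference between the two sorting modes concrete, here is an illustrative sketch of one plausible reading of "lines" sorting. It is not the component's actual implementation: the `Box` class, the `sort_boxes` function, and the `tol` tolerance are all hypothetical, and coordinates are assumed to grow rightward and downward. Boxes are grouped into visual lines by vertical position and read left to right, regardless of the block they were extracted from.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical text box: top-left corner (x0, y0) and its text."""

    x0: float
    y0: float
    text: str


def sort_boxes(boxes: List[Box], sort_mode: str = "none", tol: float = 0.01) -> List[Box]:
    """Return boxes in reading order according to `sort_mode` (illustrative only)."""
    if sort_mode == "none":
        # Keep the extraction order untouched.
        return list(boxes)
    if sort_mode != "lines":
        raise ValueError(f"Unknown sort_mode: {sort_mode!r}")

    # Group boxes into visual lines: a box joins the current line when its
    # top coordinate is within `tol` of the first box of that line.
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b.y0):
        if lines and abs(box.y0 - lines[-1][0].y0) <= tol:
            lines[-1].append(box)
        else:
            lines.append([box])

    # Within each line, read from left to right.
    return [box for line in lines for box in sorted(line, key=lambda b: b.x0)]


# Two side-by-side blocks, extracted block by block.
boxes = [
    Box(0.1, 0.1, "left 1"),
    Box(0.1, 0.2, "left 2"),
    Box(0.6, 0.1, "right 1"),
    Box(0.6, 0.2, "right 2"),
]
print([b.text for b in sort_boxes(boxes, "none")])
# ['left 1', 'left 2', 'right 1', 'right 2']
print([b.text for b in sort_boxes(boxes, "lines")])
# ['left 1', 'right 1', 'left 2', 'right 2']
```

In this toy layout, "lines" interleaves the two blocks, which is what "without preserving the order of lines inside blocks" suggests; "none" returns the boxes exactly as extracted.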