Poppler Extractor
We provide a PDF line extractor built on top of Poppler.
The Poppler software is more difficult to install than its pdfminer-extractor and mupdf-extractor counterparts.
In particular, the bindings we provide have not been tested on Windows.
License
Beware: Poppler is distributed under the GPL license. This component is therefore GPL-licensed as well, and any model that depends on it must be too.
Installation
For the licensing reason mentioned above, the poppler-extractor component is distributed in a separate package, edspdf-poppler. To install it, use your favorite Python package manager:
poetry add edspdf-poppler
# or
pip install edspdf-poppler
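As a quick sanity check after installation, the snippet below looks up the installed distribution using only the standard library. It is a minimal sketch: the helper name `installed_version` is hypothetical, and only the distribution name edspdf-poppler comes from the text above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "edspdf-poppler") -> Optional[str]:
    """Return the installed version of `distribution`, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found is not None else "edspdf-poppler is not installed")
```

A `None` result suggests that the install step did not target the environment you are running in.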
Example
pipeline.add_pipe(
"poppler-extractor",
config=dict(
extract_style=False,
),
)
or, equivalently, in a configuration file:
[components.poppler-extractor]
@factory = "poppler-extractor"
extract_style = false
and use it as follows:
from pathlib import Path
# Apply on a new document
pipeline(Path("path/to/your/pdf/document").read_bytes())
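When processing many files, it can help to check that a file looks like a PDF before handing its bytes to the pipeline. The helper below is an illustrative sketch, not part of the library: `read_pdf_bytes` is a hypothetical name, and the check relies only on the `%PDF-` marker found at the start of PDF files (searched within the first 1024 bytes, a tolerance chosen for this sketch).

```python
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    """Read a file and check that it carries a PDF header marker."""
    data = path.read_bytes()
    # Some files carry a few leading bytes before the header,
    # so this sketch searches the first 1024 bytes rather than offset 0 only.
    if b"%PDF-" not in data[:1024]:
        raise ValueError(f"{path} does not look like a PDF file")
    return data
```

The returned bytes can then be passed to the pipeline exactly as in the example above, in place of the direct `read_bytes()` call.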
PARAMETER | DESCRIPTION |
---|---|
pipeline | The pipeline object |
name | The name of the component |
extract_style | Whether to extract style information |
raise_on_error | Whether to raise an error when parsing a corrupted PDF (defaults to False) |
sort_mode | Box sorting mode |
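The raise_on_error parameter reflects a common pattern: either propagate a parsing failure or swallow it and continue. The sketch below illustrates that pattern only and is not the component's implementation. `safe_extract`, `extract_lines`, and the `None` fallback are all hypothetical; what the real component returns for a corrupted PDF is not specified here.

```python
from typing import Callable, List, Optional


def safe_extract(
    extract_lines: Callable[[bytes], List[str]],
    pdf_bytes: bytes,
    raise_on_error: bool = False,
) -> Optional[List[str]]:
    """Call a (hypothetical) extractor, optionally swallowing parse errors."""
    try:
        return extract_lines(pdf_bytes)
    except Exception:
        if raise_on_error:
            # Strict mode: let the caller see the original failure.
            raise
        # Lenient mode: signal "nothing extracted" (an assumption of this sketch).
        return None
```

In a batch job, the lenient default keeps one corrupted file from stopping the whole run, while the strict mode is more convenient when debugging a single document.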