PdfMiner Extractor

We provide a PDF line extractor built on top of PdfMiner.

This is the most portable extractor: it is pure Python and therefore runs on any platform. Be sure to have a look at the PdfMiner documentation, especially the part providing a bird's-eye view of the PDF extraction process.

Examples

Add the extractor to a pipeline through the Python API:

pipeline.add_pipe(
    "pdfminer-extractor",
    config=dict(
        extract_style=False,
    ),
)

Or declare it in a configuration file:

[components.extractor]
@factory = "pdfminer-extractor"
extract_style = false

And use the pipeline on a PDF document:

from pathlib import Path

# Apply on a new document
pipeline(Path("path/to/your/pdf/document").read_bytes())
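
To process several files, the call above can be wrapped in a small helper. The following is a minimal sketch, assuming only that `pipeline` is a callable accepting the raw bytes of a PDF, as in the example above. The name `apply_to_directory` is hypothetical and not part of the library.

```python
from pathlib import Path
from typing import Callable, Iterator, Tuple


def apply_to_directory(
    pipeline: Callable[[bytes], object],
    directory: Path,
) -> Iterator[Tuple[Path, object]]:
    """Yield (path, result) for every *.pdf file directly inside `directory`.

    `pipeline` stands in for the pipeline object built earlier: anything
    that can be called with the bytes of a PDF document.
    """
    # Sort the paths so that the processing order is deterministic.
    for pdf_path in sorted(directory.glob("*.pdf")):
        yield pdf_path, pipeline(pdf_path.read_bytes())
```

Because the helper is a generator, documents are read and processed one at a time rather than loaded into memory all at once.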

Parameters

PARAMETER DESCRIPTION
line_overlap

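Minimum vertical overlap, relative to the height of the smaller character, for two characters to be considered part of the same line. See PDFMiner documentation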

TYPE: float DEFAULT: 0.5

char_margin

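Maximum horizontal gap, relative to the character width, for two characters to be grouped into the same line. See PDFMiner documentation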

TYPE: float DEFAULT: 2.05

line_margin

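Maximum vertical gap, relative to the line height, for two lines to be grouped into the same paragraph. See PDFMiner documentation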

TYPE: float DEFAULT: 0.5

word_margin

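Minimum gap, relative to the character width, above which two characters on the same line are treated as separate words and a space is inserted between them. See PDFMiner documentation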

TYPE: float DEFAULT: 0.1

boxes_flow

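Relative weight of horizontal versus vertical position when ordering text boxes, from -1.0 (only horizontal position matters) to +1.0 (only vertical position matters). None disables advanced layout analysis. See PDFMiner documentation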

TYPE: Optional[float] DEFAULT: 0.5

detect_vertical

Whether vertical text should be considered during layout analysis. See PDFMiner documentation

TYPE: bool DEFAULT: False

all_texts

Whether layout analysis should also be performed on text inside figures. See PDFMiner documentation

TYPE: bool DEFAULT: False

extract_style

Whether to extract style information (font, size, ...) for each line of the document.

TYPE: bool DEFAULT: False

render_pages

Whether to render each page and store it as a numpy array in the page.image attribute.

TYPE: bool DEFAULT: False

render_dpi

Resolution, in dots per inch, to use when rendering the page.

TYPE: int DEFAULT: 200

raise_on_error

Whether to raise an error if the PDF cannot be parsed.

TYPE: bool DEFAULT: False
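
To give an intuition for how line_overlap and char_margin interact, here is a simplified sketch of the kind of test a layout analyser applies when deciding whether two characters belong to the same line. It is an illustration written for this page, not PdfMiner's actual code: the `CharBox` class and the `same_line` function are hypothetical names, and real layout analysis handles more cases (vertical text, font compatibility, and so on).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharBox:
    """Bounding box of one character (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def vertical_overlap(a: CharBox, b: CharBox) -> float:
    """Length of the shared vertical extent, 0 if the boxes do not overlap."""
    return max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))


def horizontal_gap(a: CharBox, b: CharBox) -> float:
    """Horizontal distance between the boxes, 0 if they overlap."""
    return max(0.0, max(a.x0, b.x0) - min(a.x1, b.x1))


def same_line(
    a: CharBox,
    b: CharBox,
    line_overlap: float = 0.5,
    char_margin: float = 2.05,
) -> bool:
    """Simplified grouping rule: enough vertical overlap and a small enough gap."""
    # line_overlap is relative to the height of the smaller character.
    enough_overlap = vertical_overlap(a, b) > min(a.height, b.height) * line_overlap
    # char_margin is relative to the width of the wider character.
    close_enough = horizontal_gap(a, b) < max(a.width, b.width) * char_margin
    return enough_overlap and close_enough
```

Under this sketch, raising char_margin merges characters that are further apart into the same line, while raising line_overlap requires characters to be better aligned vertically before they are grouped.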