PdfMiner Extractor

We provide a PDF line extractor built on top of PdfMiner.

This is the most portable extractor: it is pure Python and can therefore run on any platform. Be sure to have a look at the PDFMiner documentation, especially the part providing a bird's-eye view of the PDF extraction process.

Examples

API-based:

pipeline.add_pipe(
    "pdfminer-extractor",
    config=dict(
        extract_style=False,
    ),
)

Configuration-based:

[components.extractor]
@factory = "pdfminer-extractor"
extract_style = false

Then apply the pipeline to a PDF document:

from pathlib import Path

# Apply on a new document
pipeline(Path("path/to/your/pdf/document").read_bytes())
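Since the pipeline is called on the raw bytes of the file, it can be convenient to check that a file looks like a PDF before handing it over. The following is a minimal sketch using only the standard library. `read_pdf_bytes` is a hypothetical helper and not part of the extractor, and the check only looks for the `%PDF-` marker at the very start of the file, so it would reject files with leading bytes that some readers tolerate.

```python
from pathlib import Path


def read_pdf_bytes(path):
    """Hypothetical helper: return the file's bytes if it looks like a PDF."""
    data = Path(path).read_bytes()
    # PDF files begin with the "%PDF-" marker followed by the version number.
    if not data.startswith(b"%PDF-"):
        raise ValueError(f"{path} does not start with a PDF header")
    return data
```

The returned bytes can then be passed to the pipeline in place of the direct `read_bytes()` call shown above.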

PARAMETER	DESCRIPTION	TYPE	DEFAULT
`line_overlap`	See PDFMiner documentation	`float`	`0.5`
`char_margin`	See PDFMiner documentation	`float`	`2.05`
`line_margin`	See PDFMiner documentation	`float`	`0.5`
`word_margin`	See PDFMiner documentation	`float`	`0.1`
`boxes_flow`	See PDFMiner documentation	`Optional[float]`	`0.5`
`detect_vertical`	See PDFMiner documentation	`bool`	`False`
`all_texts`	See PDFMiner documentation	`bool`	`False`
`extract_style`	Whether to extract style (font, size, ...) information for each line of the document	`bool`	`False`
`render_pages`	Whether to extract the rendered page as a NumPy array in the `page.image` attribute	`bool`	`False`
`render_dpi`	DPI to use when rendering the page	`int`	`200`
`raise_on_error`	Whether to raise an error if the PDF cannot be parsed	`bool`	`False`
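To illustrate how the defaults listed above combine with overrides, here is a small sketch that builds a dictionary suitable for the `config=` argument. `DEFAULTS` and `make_extractor_config` are hypothetical names introduced for this example only. The values are copied from the parameter table, and rejecting unknown keys is a design choice of the sketch, not documented behavior of the extractor.

```python
# Defaults copied from the parameter table above.
DEFAULTS = {
    "line_overlap": 0.5,
    "char_margin": 2.05,
    "line_margin": 0.5,
    "word_margin": 0.1,
    "boxes_flow": 0.5,
    "detect_vertical": False,
    "all_texts": False,
    "extract_style": False,
    "render_pages": False,
    "render_dpi": 200,
    "raise_on_error": False,
}


def make_extractor_config(**overrides):
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Catch typos such as "extract_styles" before they reach the pipeline.
        raise TypeError(f"Unknown extractor parameter(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


# Example: keep every default except style extraction and rendering DPI.
config = make_extractor_config(extract_style=True, render_dpi=150)
```

The resulting `config` dictionary could be passed as `config=config` when adding the `"pdfminer-extractor"` pipe.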