PdfMiner Extractor
We provide a PDF line extractor built on top of PdfMiner.
This is the most portable extractor: it is pure Python and can therefore run on any platform. Be sure to have a look at the PdfMiner documentation, especially the part that gives a bird's-eye view of the PDF extraction process.
Examples
Add the component to a pipeline from Python:
pipeline.add_pipe(
    "pdfminer-extractor",
    config=dict(
        extract_style=False,
    ),
)
Or declare it in a configuration file:
[components.extractor]
@factory = "pdfminer-extractor"
extract_style = false
And use the pipeline on a PDF document:
from pathlib import Path
# Apply on a new document
pipeline(Path("path/to/your/pdf/document").read_bytes())
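To process several files, the call above can be wrapped in a loop. The sketch below is only an illustration: `apply_to_folder` is a hypothetical helper, not part of the library, and it assumes nothing more than that `pipeline` is a callable accepting the raw bytes of a PDF, as in the snippet above.

```python
from pathlib import Path
from typing import Any, Callable, Dict


def apply_to_folder(pipeline: Callable[[bytes], Any], folder: str) -> Dict[str, Any]:
    """Run `pipeline` on every `*.pdf` file located directly inside `folder`.

    Hypothetical helper: `pipeline` is assumed to be any callable that
    takes the raw bytes of a PDF and returns a processed document.
    """
    results: Dict[str, Any] = {}
    # Sort the paths so that the processing order is reproducible
    for path in sorted(Path(folder).glob("*.pdf")):
        results[path.name] = pipeline(path.read_bytes())
    return results
```

The results are keyed by file name, so a document that needs a closer look can be traced back to its source file.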
Parameters
PARAMETER | DESCRIPTION |
---|---|
line_overlap | See PDFMiner documentation |
char_margin | See PDFMiner documentation |
line_margin | See PDFMiner documentation |
word_margin | See PDFMiner documentation |
boxes_flow | See PDFMiner documentation |
detect_vertical | See PDFMiner documentation |
all_texts | See PDFMiner documentation |
extract_style | Whether to extract style (font, size, ...) information for each line of the document. Default: False |
render_pages | Whether to extract the rendered page as a numpy array in the |
render_dpi | DPI to use when rendering the page (defaults to 200) |
raise_on_error | Whether to raise an error if the PDF cannot be parsed. Default: False |
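The first seven parameters of the table share their names with the arguments of `LAParams` in pdfminer.six, which is where their meaning is documented. As a rough illustration of what they control, the sketch below calls pdfminer.six directly (not this extractor) and lists the text lines of a document. The names `layout_kwargs` and `iter_text_lines` are hypothetical, and the default values shown are those of pdfminer.six's `LAParams`; the extractor's own defaults may differ.

```python
from typing import Any, Dict, Iterator, Tuple

# Defaults of pdfminer.six's LAParams (not necessarily those of the extractor)
LAPARAMS_DEFAULTS: Dict[str, Any] = {
    "line_overlap": 0.5,
    "char_margin": 2.0,
    "line_margin": 0.5,
    "word_margin": 0.1,
    "boxes_flow": 0.5,
    "detect_vertical": False,
    "all_texts": False,
}


def layout_kwargs(**overrides: Any) -> Dict[str, Any]:
    """Merge user overrides into the layout defaults, rejecting unknown names."""
    unknown = set(overrides) - set(LAPARAMS_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown layout parameter(s): {sorted(unknown)}")
    return {**LAPARAMS_DEFAULTS, **overrides}


def iter_text_lines(pdf_path: str, **overrides: Any) -> Iterator[Tuple[int, str]]:
    """Yield (page number, text) for each text line found by pdfminer.six."""
    # Imported lazily so that layout_kwargs works without pdfminer.six installed
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    laparams = LAParams(**layout_kwargs(**overrides))
    pages = extract_pages(pdf_path, laparams=laparams)
    for page_no, page in enumerate(pages, start=1):
        for element in page:
            # Text boxes group the lines that the layout analysis merged together
            if isinstance(element, LTTextBox):
                for line in element:
                    if isinstance(line, LTTextLine):
                        yield page_no, line.get_text().rstrip("\n")
```

Changing a value such as `char_margin` or `line_margin` and comparing the lines that come back is a quick way to see how a parameter affects line grouping on a given document.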