PdfMinerExtractor
We provide a PDF line extractor built on top of PdfMiner.
This is the most portable extractor, since it is pure-python and can therefore be run on any platform. Be sure to have a look at their documentation, especially the part providing a bird's eye view of the PDF extraction process.
Usage
from edspdf import Pipeline
from pathlib import Path
# Add the component to a new pipeline
model = Pipeline()
model.add_pipe(
"pdfminer-extractor",
config=dict(
extract_style=False,
),
)
# Apply on a new document
model(Path("path/to/your/pdf/document").read_bytes())
Configuration
| Parameter | Description | Default |
|---|---|---|
| line_overlap | See PDFMiner documentation | 0.5 |
| char_margin | See PDFMiner documentation | 2.05 |
| line_margin | See PDFMiner documentation | 0.5 |
| word_margin | See PDFMiner documentation | 0.1 |
| boxes_flow | See PDFMiner documentation | 0.5 |
| detect_vertical | See PDFMiner documentation | False |
| all_texts | See PDFMiner documentation | False |
| extract_style | Whether to extract style (font, size, ...) information for each line of the document. | False |