MuPdfExtractor

We provide a PDF line extractor built on top of PyMuPDF.

This extractor is the fastest, but it may not be as portable as the PdfMiner one. That said, it should be relatively easy to install on a wide range of platforms, including Linux, macOS and Windows.

License

Beware: PyMuPDF is distributed under the AGPL license. This component is therefore distributed under the AGPL as well, and any model that depends on it must be too.

Installation

For the licensing reasons mentioned above, the mupdf component is distributed in a separate package, edspdf-mupdf. To install it, use your favorite Python package manager:

poetry add edspdf-mupdf
# or
pip install edspdf-mupdf
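
Because the component lives in a separate package, it can be useful to check that it is installed before building a pipeline. The following is a minimal sketch using only the standard library. The helper is_module_available is hypothetical, and the module name edspdf_mupdf is an assumption (the usual underscore form of the distribution name), so adjust it if your installation differs.

```python
from importlib.util import find_spec


def is_module_available(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


# Assumption: the edspdf-mupdf distribution exposes a top-level module
# named `edspdf_mupdf`. Adjust the name if your installation differs.
HAS_MUPDF = is_module_available("edspdf_mupdf")

if __name__ == "__main__":
    if HAS_MUPDF:
        print("edspdf-mupdf is installed")
    else:
        print("edspdf-mupdf is missing, run: pip install edspdf-mupdf")
```

A check like this lets a script fail early with a clear message, or fall back to another extractor, instead of failing when the pipe is added.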

Usage

from edspdf import Pipeline
from pathlib import Path

# Add the component to a new pipeline
model = Pipeline()
model.add_pipe(
    "mupdf-extractor",
    config=dict(
        extract_style=False,
    ),
)

# Apply on a new document
model(Path("path/to/your/pdf/document").read_bytes())
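
To process several files, the same call can be wrapped in a small loop. The sketch below is not part of the library: extract_from_directory is a hypothetical helper, and it relies only on the fact shown above that the pipeline is called on the raw bytes of a PDF. Any callable that accepts bytes can be passed as model.

```python
from pathlib import Path
from typing import Any, Callable, Iterator, Tuple


def extract_from_directory(
    model: Callable[[bytes], Any],
    directory: Path,
) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, result) for every *.pdf file directly inside `directory`."""
    # Sort the paths so that the processing order is reproducible
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        # The pipeline is applied to the file content, not to the path
        yield pdf_path, model(pdf_path.read_bytes())
```

With the pipeline defined above, this could be used as `for path, doc in extract_from_directory(model, Path("path/to/your/pdf/folder")): ...`. The generator reads one file at a time, so memory use stays bounded by the largest document rather than by the whole folder.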

Configuration

Parameter	Description	Default
extract_style	Whether to extract style (font, size, ...) information for each line of the document.	False