MuPDF Extractor

We provide a PDF line extractor built on top of PyMuPDF.

This extractor is the fastest, but it may not be as portable as the PdfMinerExtractor. However, it should still be relatively easy to install on a wide range of architectures and on Linux, OS X and Windows.

License

Beware: PyMuPDF is distributed under the AGPL license. This component is therefore distributed under the AGPL as well, and so must any model that depends on it.

Installation

For the licensing reason mentioned above, the mupdf component is distributed in a separate package, edspdf-mupdf. To install it, use your favorite Python package manager:

poetry add edspdf-mupdf
# or
pip install edspdf-mupdf

Example

pipeline.add_pipe(
    "mupdf-extractor",
    config=dict(
        extract_style=False,
    ),
)

or, equivalently, in a configuration file:

[components.mupdf-extractor]
@factory = "mupdf-extractor"
extract_style = false

and use it as follows:

from pathlib import Path

# Apply on a new document
pipeline(Path("path/to/your/pdf/document").read_bytes())

Parameters

PARAMETER DESCRIPTION
pipeline

The pipeline object

TYPE: Pipeline DEFAULT: None

name

Name of the component

TYPE: str DEFAULT: 'mupdf_extractor'

extract_style

Whether to extract style information for each text box (defaults to False)

TYPE: bool DEFAULT: False

raise_on_error

Whether to raise an error when parsing a corrupted PDF (defaults to False)

TYPE: bool DEFAULT: False

use_cropbox

Whether to use the cropbox instead of the mediabox (defaults to True)

TYPE: bool DEFAULT: True

render_pages

Whether to extract the rendered page as a numpy array in the page.image attribute (defaults to False)

TYPE: bool DEFAULT: False

render_dpi

DPI to use when rendering the page (defaults to 200)

TYPE: int DEFAULT: 200
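
To get a feel for what render_dpi implies for the size of the rendered image, the sketch below converts a page size in PDF points (1/72 inch) to pixels. The helper rendered_size is hypothetical and not part of this component; the dimensions actually produced by the renderer may differ by a pixel depending on how it rounds.

```python
from typing import Tuple

# PDF page sizes are expressed in points; one point is 1/72 inch.
POINTS_PER_INCH = 72


def rendered_size(width_pt: float, height_pt: float, dpi: int = 200) -> Tuple[int, int]:
    """Approximate (width, height) in pixels of a page rendered at `dpi`.

    Hypothetical helper for illustration only, not part of edspdf-mupdf.
    """
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# An A4 page measures about 595 x 842 points.
print(rendered_size(595, 842))          # (1653, 2339) at the default 200 DPI
print(rendered_size(595, 842, dpi=72))  # (595, 842): one pixel per point
```

Since the pixel count grows with the square of the DPI, lowering render_dpi is a simple way to reduce memory use when render_pages is enabled on long documents.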

sort_mode

Box sorting mode

  • "blocks": sort while keeping blocks of boxes intact. Use this mode if you trust the PDF to have been generated by a tool that produces blocks of text.
  • "lines": sort by lines, without preserving the order of lines inside blocks
  • "none": do not sort boxes

TYPE: Literal['blocks', 'lines', 'none'] DEFAULT: 'none'
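
The difference between the three sorting modes is easiest to see on a two-column page. The sketch below is only an illustration of the idea, under simplifying assumptions: the Box class and the sort_boxes function are hypothetical stand-ins, a single page is considered, and the component's actual sorting logic may differ (for instance in how it compares coordinates).

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class Box:
    """Hypothetical stand-in for a text box (not the edspdf class)."""
    text: str
    block: int  # index of the block the box belongs to
    y0: float   # top coordinate
    x0: float   # left coordinate


def sort_boxes(boxes: List[Box], mode: str = "none") -> List[Box]:
    """Illustrative re-implementation of the three modes (an assumption)."""
    if mode == "none":
        # Keep the order in which the boxes were extracted.
        return list(boxes)
    if mode == "lines":
        # Top to bottom, then left to right, ignoring block membership.
        return sorted(boxes, key=lambda b: (b.y0, b.x0))
    if mode == "blocks":
        # Group consecutive boxes of the same block, order the blocks by
        # their top-left corner, and keep the order of boxes inside each block.
        blocks = [list(g) for _, g in groupby(boxes, key=lambda b: b.block)]
        blocks.sort(key=lambda blk: (min(b.y0 for b in blk), min(b.x0 for b in blk)))
        return [b for blk in blocks for b in blk]
    raise ValueError(f"Unknown sort mode: {mode!r}")


# Two columns; the right column happens to be stored first in the file.
boxes = [
    Box("R1", block=0, y0=10, x0=300),
    Box("R2", block=0, y0=20, x0=300),
    Box("L1", block=1, y0=10, x0=50),
    Box("L2", block=1, y0=20, x0=50),
]

for mode in ("none", "lines", "blocks"):
    print(mode, [b.text for b in sort_boxes(boxes, mode)])
# none   ['R1', 'R2', 'L1', 'L2']
# lines  ['L1', 'R1', 'L2', 'R2']
# blocks ['L1', 'L2', 'R1', 'R2']
```

In this toy layout, "lines" interleaves the two columns, while "blocks" reads each column in full, which is why the latter only helps when the blocks in the PDF are trustworthy.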