MuPDF Extractor

We provide a PDF line extractor built on top of PyMuPDF.

This extractor is the fastest, but it may not be as portable as the PdfMinerExtractor. It should nevertheless be relatively easy to install on a wide range of architectures and on Linux, OS X and Windows.

License

Beware: PyMuPDF is distributed under the AGPL license. Therefore, so is this component, and any model that depends on this component must be too.

Installation

For the licensing reason mentioned above, the mupdf component is distributed in a separate package, edspdf-mupdf. To install it, use your favorite Python package manager:

poetry add edspdf-mupdf
# or
pip install edspdf-mupdf

Example

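API-based:

from edspdf import Pipeline

# Create an empty pipeline, then register the extractor
pipeline = Pipeline()
pipeline.add_pipe(
    "mupdf-extractor",
    config=dict(
        extract_style=False,
    ),
)

Configuration-based:

[components.mupdf-extractor]
@factory = "mupdf-extractor"
extract_style = false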

Then use the pipeline as follows:

from pathlib import Path

# Apply on a new document
pipeline(Path("path/to/your/pdf/document").read_bytes())
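
Since the pipeline is called on raw bytes, applying it to a whole folder mostly comes down to reading each file first. The sketch below shows one way to do that; `iter_pdf_bytes` is a hypothetical helper written for this illustration and is not part of edspdf.

```python
from pathlib import Path
from typing import Iterator, Tuple, Union


def iter_pdf_bytes(folder: Union[str, Path]) -> Iterator[Tuple[Path, bytes]]:
    """Yield (path, raw bytes) for every *.pdf file directly inside `folder`.

    Hypothetical helper, not part of edspdf. Files are visited in sorted
    order so that repeated runs process documents in the same sequence.
    """
    for path in sorted(Path(folder).glob("*.pdf")):
        yield path, path.read_bytes()
```

Each yielded content can then be passed to the pipeline, for instance with `for path, content in iter_pdf_bytes("path/to/folder"): doc = pipeline(content)`.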

Parameters

PARAMETER	DESCRIPTION
`pipeline`	The pipeline object TYPE: `Pipeline` DEFAULT: `None`
`name`	Name of the component TYPE: `str` DEFAULT: `'mupdf_extractor'`
`extract_style`	Whether to extract style information for each text box TYPE: `bool` DEFAULT: `False`
`raise_on_error`	Whether to raise an error when parsing a corrupted PDF (defaults to False) TYPE: `bool` DEFAULT: `False`
`use_cropbox`	Whether to use the cropbox instead of the mediabox (defaults to True) TYPE: `bool` DEFAULT: `True`
`render_pages`	Whether to extract the rendered page as a numpy array in the `page.image` attribute (defaults to False) TYPE: `bool` DEFAULT: `False`
`render_dpi`	DPI to use when rendering the page (defaults to 200) TYPE: `int` DEFAULT: `200`
`sort_mode`	Box sorting mode. "blocks": sort while keeping blocks of boxes intact; use this mode if you trust the PDF to have been generated by a tool that produces blocks of text. "lines": sort by lines, without preserving the order of lines inside blocks. "none": do not sort boxes. TYPE: `Literal['blocks', 'lines', 'none']` DEFAULT: `'none'`
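
To make the difference between the three `sort_mode` values more concrete, here is a small self-contained sketch on a two-column layout. It is only an illustration of the idea described above, not the extractor's actual implementation: the `Box` type, the `sort_boxes` function and the sort keys (top-left corner of each box or block) are all assumptions made for this example.

```python
from typing import Dict, List, NamedTuple


class Box(NamedTuple):
    """Hypothetical text box: block index, top-left corner and text."""
    block: int
    x0: float
    y0: float
    text: str


def sort_boxes(boxes: List[Box], sort_mode: str = "none") -> List[Box]:
    """Illustrative re-ordering of boxes (assumed logic, not the library's)."""
    if sort_mode == "none":
        # Keep the order in which the boxes were read from the PDF
        return list(boxes)
    if sort_mode == "lines":
        # Sort top to bottom, then left to right, ignoring blocks
        return sorted(boxes, key=lambda b: (b.y0, b.x0))
    if sort_mode == "blocks":
        # Group boxes by block, keeping their original order inside a block
        groups: Dict[int, List[Box]] = {}
        for box in boxes:
            groups.setdefault(box.block, []).append(box)
        # Order whole blocks by their top-left corner
        ordered = sorted(
            groups.values(),
            key=lambda g: (min(b.y0 for b in g), min(b.x0 for b in g)),
        )
        return [box for group in ordered for box in group]
    raise ValueError(f"Unknown sort_mode: {sort_mode!r}")


# Two columns (block 0 on the left, block 1 on the right), read out of order
boxes = [
    Box(1, 300, 10, "right 1"),
    Box(0, 50, 10, "left 1"),
    Box(1, 300, 30, "right 2"),
    Box(0, 50, 30, "left 2"),
]
by_lines = [b.text for b in sort_boxes(boxes, "lines")]
by_blocks = [b.text for b in sort_boxes(boxes, "blocks")]
```

In this toy layout, "lines" interleaves the two columns row by row, while "blocks" reads the left column in full before the right one, which is usually what is wanted when the blocks reported by the PDF can be trusted.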