MuPDF Extractor
We provide a PDF line extractor built on top of PyMuPDF.
This extractor is the fastest, but it may not be as portable as the PdfMinerExtractor. It should nonetheless be relatively easy to install on a wide range of architectures, on Linux, OS X and Windows.
License
Beware: PyMuPDF is distributed under the AGPL license. This component is therefore AGPL-licensed too, and so must be any model that depends on it.
Installation
For the licensing reason mentioned above, the mupdf component is distributed in a separate package, edspdf-mupdf. To install it, use your favorite Python package manager:
poetry add edspdf-mupdf
# or
pip install edspdf-mupdf
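After installing, it can be handy to confirm that the package is actually visible to the current Python environment. The sketch below relies only on the standard library; the helper name `installed_version` is hypothetical, and the only thing taken from this page is the distribution name edspdf-mupdf.

```python
from importlib import metadata


def installed_version(distribution: str = "edspdf-mupdf"):
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is an illustrative helper, not part of edspdf.
    """
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("edspdf-mupdf is not installed in this environment")
    else:
        print(f"edspdf-mupdf {version} is installed")
```

Querying the distribution metadata avoids importing the package itself, so the check stays cheap and does not depend on the module name.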
Example
Add the component to a pipeline, either from Python:
pipeline.add_pipe(
    "mupdf-extractor",
    config=dict(
        extract_style=False,
    ),
)
or from a configuration file:
[components.mupdf-extractor]
@factory = "mupdf-extractor"
extract_style = false
and use it as follows:
from pathlib import Path
# Apply on a new document
pipeline(Path("path/to/your/pdf/document").read_bytes())
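The same call pattern extends to a folder of documents. The following is a minimal sketch, assuming only that `pipeline` is a callable that accepts the raw bytes of a PDF, as in the snippet above; the helper name `extract_folder` is hypothetical and not part of the library.

```python
from pathlib import Path
from typing import Any, Callable, Dict


def extract_folder(pipeline: Callable[[bytes], Any], folder: Path) -> Dict[str, Any]:
    """Apply `pipeline` to every *.pdf file directly inside `folder`.

    Returns a mapping from file name to whatever the pipeline returned.
    `extract_folder` is an illustrative helper, not an edspdf function.
    """
    results: Dict[str, Any] = {}
    # Sort the paths so the processing order is deterministic.
    for pdf_path in sorted(folder.glob("*.pdf")):
        results[pdf_path.name] = pipeline(pdf_path.read_bytes())
    return results
```

Passing the pipeline in as an argument keeps the helper independent of how the pipeline was built, which also makes it easy to try out with a stand-in callable.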
Parameters
PARAMETER | DESCRIPTION |
---|---|
pipeline | The pipeline object |
name | Name of the component |
extract_style | Extract style |
raise_on_error | Whether to raise an error when parsing a corrupted PDF (defaults to False) |
use_cropbox | Whether to use the cropbox instead of the mediabox (defaults to True) |
render_pages | Whether to extract the rendered page as a numpy array in the … |
render_dpi | DPI to use when rendering the page (defaults to 200) |
sort_mode | Box sorting mode |
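To make the documented defaults concrete, here is a small sketch that assembles a `config` dictionary for the component. It assumes that every configurable parameter from the table above can be passed through `config`, as `extract_style` is in the example; the helper `build_extractor_config` and the two constants are hypothetical. Only the three defaults stated in the table are filled in; parameters with no documented default are left unset unless given explicitly.

```python
# Defaults explicitly stated in the parameter table; nothing else is assumed.
DOCUMENTED_DEFAULTS = {
    "raise_on_error": False,
    "use_cropbox": True,
    "render_dpi": 200,
}

# Configurable parameter names listed in the table (pipeline and name excluded).
KNOWN_PARAMETERS = {
    "extract_style",
    "raise_on_error",
    "use_cropbox",
    "render_pages",
    "render_dpi",
    "sort_mode",
}


def build_extractor_config(**overrides):
    """Merge overrides into the documented defaults, rejecting unknown names.

    Illustrative helper only; it is not part of edspdf.
    """
    unknown = set(overrides) - KNOWN_PARAMETERS
    if unknown:
        raise ValueError(f"Unknown parameter(s): {sorted(unknown)}")
    return {**DOCUMENTED_DEFAULTS, **overrides}


config = build_extractor_config(extract_style=False, render_dpi=300)
# The result could then be passed to the pipeline, for instance:
# pipeline.add_pipe("mupdf-extractor", config=config)
```

Rejecting unknown names early catches typos such as `render_dip` before the component is instantiated.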