edspdf
loading
load(path)
Load a complete pipeline.
TODO: implement other ways to load a pipeline.
| PARAMETER | DESCRIPTION |
|---|---|
| path | Path to the pipeline. |

| RETURNS | DESCRIPTION |
|---|---|
| PdfReader | A PdfReader object. |
Source code in edspdf/loading.py
from_str(config)
Load a complete pipeline from a string config.
| PARAMETER | DESCRIPTION |
|---|---|
| config | Configuration. |

| RETURNS | DESCRIPTION |
|---|---|
| PdfReader | A PdfReader object. |
Source code in edspdf/loading.py
classifiers
align
align_labels(lines, labels, threshold=0.0001)
Align lines with possibly overlapping (and non-exhaustive) labels.
Possible matches are sorted by covered area. Lines with no overlap at all
| PARAMETER | DESCRIPTION |
|---|---|
| lines | DataFrame containing the lines |
| labels | DataFrame containing the labels |
| threshold | Threshold to use for discounting a label. Used if the |

| RETURNS | DESCRIPTION |
|---|---|
| pd.DataFrame | A copy of the lines table, with the labels added. |
Source code in edspdf/classifiers/align.py
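The overlap-based alignment described for align_labels can be illustrated with a minimal plain-Python sketch. It is not the library's implementation: the box layout (dicts with `x0`, `x1`, `y0`, `y1`), the name `align_labels_sketch`, the use of the covered fraction of the line as the score, and the `None` label for unmatched lines are all assumptions made for illustration, and plain lists stand in for the DataFrames.

```python
from typing import Dict, List, Optional

Box = Dict[str, float]  # hypothetical layout: keys x0, x1, y0, y1


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes (0.0 if disjoint)."""
    width = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    height = min(a["y1"], b["y1"]) - max(a["y0"], b["y0"])
    return max(width, 0.0) * max(height, 0.0)


def align_labels_sketch(
    lines: List[dict], labels: List[dict], threshold: float = 0.0001
) -> List[dict]:
    """Return copies of the lines, each with the best-covering label attached."""
    aligned = []
    for line in lines:
        line_area = (line["x1"] - line["x0"]) * (line["y1"] - line["y0"])
        best_label: Optional[str] = None
        best_cover = 0.0
        for label in labels:
            # Fraction of the line covered by this label box (assumed score).
            cover = overlap_area(line, label) / line_area if line_area > 0 else 0.0
            if cover >= threshold and cover > best_cover:
                best_label, best_cover = label["label"], cover
        # Copy the line so the input is left untouched.
        aligned.append({**line, "label": best_label})
    return aligned
```

Because each line is copied before the label is attached, the inputs stay unchanged, which mirrors the documented behaviour of returning a copy of the lines table.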
random
RandomClassifier
Bases: BaseClassifier
Random classifier, for chaos purposes. Classifies each line to a random element.
Source code in edspdf/classifiers/random.py
mask
MaskClassifier
Bases: BaseClassifier
Mask classifier that reproduces the PdfBox behaviour.
Source code in edspdf/classifiers/mask.py
dummy
DummyClassifier
Bases: BaseClassifier
"Dummy" classifier, for testing purposes. Classifies every line to body.
Source code in edspdf/classifiers/dummy.py
base
BaseClassifier
Bases: ABC
Source code in edspdf/classifiers/base.py
predict(lines)
abstractmethod
Handles the classification.
Source code in edspdf/classifiers/base.py
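The relationship between an abstract base classifier and a concrete one can be sketched with the standard `abc` module. The class names below are hypothetical stand-ins, a plain list replaces the DataFrame of lines, and the return type (one label per line) is an assumption rather than the library's documented contract.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Line = Dict[str, object]  # hypothetical stand-in for one row of the lines table


class SketchBaseClassifier(ABC):
    """Abstract interface: subclasses must implement predict."""

    @abstractmethod
    def predict(self, lines: List[Line]) -> List[str]:
        """Return one label per line."""


class SketchDummyClassifier(SketchBaseClassifier):
    """Labels every line as body, in the spirit of the dummy classifier."""

    def predict(self, lines: List[Line]) -> List[str]:
        return ["body"] * len(lines)
```

Instantiating `SketchBaseClassifier` directly raises a `TypeError`, which is how `ABC` enforces that every concrete classifier provides its own `predict`.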
extractors
functional
get_blocs(layout)
Extract text blocs from a PDFMiner layout generator.
| PARAMETER | DESCRIPTION |
|---|---|
| layout | PDFMiner layout generator. |

| YIELDS | DESCRIPTION |
|---|---|
| bloc | Text bloc |
Source code in edspdf/extractors/functional.py
get_lines(layout)
Extract lines from a PDFMiner layout object.
The line is reframed such that the origin is the top left corner.
| PARAMETER | DESCRIPTION |
|---|---|
| layout | PDFMiner layout object. |

| YIELDS | DESCRIPTION |
|---|---|
| Iterator[Line] | Single line object. |
Source code in edspdf/extractors/functional.py
remove_outside_lines(lines, strict_mode=False)
Filter out lines that are outside the canvas.
| PARAMETER | DESCRIPTION |
|---|---|
| lines | DataFrame of extracted lines. |
| strict_mode | Whether to remove a line if any part of it is outside the canvas. Defaults to False. |

| RETURNS | DESCRIPTION |
|---|---|
| pd.DataFrame | Filtered lines. |
Source code in edspdf/extractors/functional.py
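The difference between the default and strict filtering modes can be shown with a small sketch. It assumes that coordinates are normalised so that the canvas is the unit square, which the text above does not state, and it uses a list of dicts in place of a DataFrame. The names `is_outside` and `remove_outside_lines_sketch` are hypothetical.

```python
from typing import Dict, List

Box = Dict[str, float]  # hypothetical layout: keys x0, x1, y0, y1


def is_outside(line: Box, strict_mode: bool = False) -> bool:
    """Decide whether a line counts as outside the (assumed) unit canvas."""
    if strict_mode:
        # Strict: any part of the box sticking out is enough.
        return (
            line["x0"] < 0 or line["y0"] < 0 or line["x1"] > 1 or line["y1"] > 1
        )
    # Default: only boxes lying entirely outside the canvas are dropped.
    return (
        line["x1"] <= 0 or line["y1"] <= 0 or line["x0"] >= 1 or line["y0"] >= 1
    )


def remove_outside_lines_sketch(
    lines: List[Box], strict_mode: bool = False
) -> List[Box]:
    """Keep only the lines that are not considered outside the canvas."""
    return [line for line in lines if not is_outside(line, strict_mode)]
```

With these definitions, a line that straddles the canvas border survives the default mode but is removed in strict mode.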
base
BaseExtractor
Bases: ABC
Source code in edspdf/extractors/base.py
extract(pdf)
abstractmethod
Handles the extraction.
Source code in edspdf/extractors/base.py
pdfminer
PdfMinerExtractor
Bases: BaseExtractor
Extractor object. Given a PDF byte stream, produces a list of blocs.
| PARAMETER | DESCRIPTION |
|---|---|
| line_overlap | See PDFMiner documentation |
| char_margin | See PDFMiner documentation |
| line_margin | See PDFMiner documentation |
| word_margin | See PDFMiner documentation |
| boxes_flow | See PDFMiner documentation |
| detect_vertical | See PDFMiner documentation |
| all_texts | See PDFMiner documentation |
Source code in edspdf/extractors/pdfminer.py
generate_lines(pdf)
Generates a DataFrame from all blocs in the PDF.
| PARAMETER | DESCRIPTION |
|---|---|
| pdf | Byte stream representing the PDF. |

| RETURNS | DESCRIPTION |
|---|---|
| pd.DataFrame | DataFrame representing the blocs. |
Source code in edspdf/extractors/pdfminer.py
extract(pdf)
Process a single PDF document.
| PARAMETER | DESCRIPTION |
|---|---|
| pdf | Raw byte representation of the PDF document. |

| RETURNS | DESCRIPTION |
|---|---|
| pd.DataFrame | DataFrame containing one row for each line extracted using PDFMiner. |
Source code in edspdf/extractors/pdfminer.py
style
models
BaseStyle
Bases: BaseModel
Model acting as an abstraction for a style.
Source code in edspdf/extractors/style/models.py
Style
Bases: BaseStyle
Model acting as an abstraction for a style.
Source code in edspdf/extractors/style/models.py
from_fontname(fontname, size, upright, x0, x1, y0, y1)
classmethod
Constructor using the compound fontname representation.
| PARAMETER | DESCRIPTION |
|---|---|
| fontname | Compound description of the font. Often |
| size | Character size. |
| upright | Whether the character is upright. |

| RETURNS | DESCRIPTION |
|---|---|
| Style | Style representation. |
Source code in edspdf/extractors/style/models.py
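As an illustration of building a style from a compound font name, the sketch below parses names of the form `ABCDEF+Helvetica-BoldOblique`. The format handled here is an assumption: PDF subset fonts commonly carry a six-letter tag followed by `+`, and many font names append a variant such as `Bold` or `Italic` after a hyphen, but the library may parse names differently. `SketchStyle` and its fields are hypothetical, and the `x0`, `x1`, `y0`, `y1` arguments of the documented signature are left out for brevity.

```python
import re
from dataclasses import dataclass


@dataclass
class SketchStyle:
    """Hypothetical, simplified style record."""

    font: str
    bold: bool
    italic: bool
    size: float
    upright: bool

    @classmethod
    def from_fontname(cls, fontname: str, size: float, upright: bool) -> "SketchStyle":
        # Drop a subset tag such as "ABCDEF+" if one is present.
        name = re.sub(r"^[A-Z]{6}\+", "", fontname)
        # Split "Family-Variant"; variant is empty when there is no hyphen.
        family, _, variant = name.partition("-")
        variant = variant.lower()
        return cls(
            font=family,
            bold="bold" in variant,
            italic="italic" in variant or "oblique" in variant,
            size=size,
            upright=upright,
        )
```

Using a dataclass gives field-by-field equality for free, which is one simple way to compare two styles.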
__eq__(other)
Computes equality between two styles.
| PARAMETER | DESCRIPTION |
|---|---|
| other | Style object to compare. |

| RETURNS | DESCRIPTION |
|---|---|
| bool | Whether the two styles are equal. |
Source code in edspdf/extractors/style/models.py
StyledText
Bases: BaseModel
Abstraction of a word, containing the style and the text.
Source code in edspdf/extractors/style/models.py
transforms
base
BaseTransform
Bases: ABC
Source code in edspdf/transforms/base.py
transform(lines)
abstractmethod
Handles the transformation.
Source code in edspdf/transforms/base.py
readers
reader
PdfReader
Source code in edspdf/readers/reader.py
__init__(extractor=None, classifier=None, aggregator=None, transform=None, meta_labels=dict())
Reads a text-based PDF document,
| PARAMETER | DESCRIPTION |
|---|---|
| extractor | Text bloc extractor. |
| classifier | Classifier model, to assign a section (eg |
| aggregator | Aggregator model, to compile labelled text blocs together. |
| transform | Transformation to apply before classification. |
| meta_labels | Dictionary of hierarchical labels (eg |
Source code in edspdf/readers/reader.py
predict(lines)
Predict the label of each text bloc.
| PARAMETER | DESCRIPTION |
|---|---|
| lines | Text blocs to label. |

| RETURNS | DESCRIPTION |
|---|---|
| pd.DataFrame | Labelled text blocs. |
Source code in edspdf/readers/reader.py
prepare_data(pdf, **context)
Prepare data before classification. Can also be used to generate the training dataset for the classifier.
| PARAMETER | DESCRIPTION |
|---|---|
| pdf | PDF document, as bytes. |

| RETURNS | DESCRIPTION |
|---|---|
| pd.DataFrame | Text blocs as a pandas DataFrame. |
Source code in edspdf/readers/reader.py
__call__(pdf, **context)
Process the PDF document.
| PARAMETER | DESCRIPTION |
|---|---|
| pdf | Byte representation of the PDF document. |
| context | Any contextual information that is used by the classifier (e.g. document type or source). TYPE: Any |

| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, str] | Dictionary containing the aggregated text. |
Source code in edspdf/readers/reader.py
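The way the reader chains its components (extract, optionally transform, classify, aggregate) can be sketched with plain callables. This is a toy model under stated assumptions, not the library's code: `SketchReader` and the three toy components are hypothetical, lists of dicts replace DataFrames, the toy extractor simply decodes the bytes as text instead of parsing a PDF, and the `context` keyword arguments are omitted.

```python
from typing import Callable, Dict, List, Optional

Line = Dict[str, str]


class SketchReader:
    """Toy pipeline: extractor -> (transform) -> classifier -> aggregator."""

    def __init__(
        self,
        extractor: Callable[[bytes], List[Line]],
        classifier: Callable[[List[Line]], List[str]],
        aggregator: Callable[[List[Line]], Dict[str, str]],
        transform: Optional[Callable[[List[Line]], List[Line]]] = None,
    ) -> None:
        self.extractor = extractor
        self.classifier = classifier
        self.aggregator = aggregator
        self.transform = transform

    def prepare_data(self, pdf: bytes) -> List[Line]:
        """Extract lines, then apply the optional transform before classification."""
        lines = self.extractor(pdf)
        if self.transform is not None:
            lines = self.transform(lines)
        return lines

    def predict(self, lines: List[Line]) -> List[Line]:
        """Attach one predicted label to a copy of each line."""
        labels = self.classifier(lines)
        return [{**line, "label": label} for line, label in zip(lines, labels)]

    def __call__(self, pdf: bytes) -> Dict[str, str]:
        return self.aggregator(self.predict(self.prepare_data(pdf)))


def toy_extractor(pdf: bytes) -> List[Line]:
    # Stand-in only: treats the bytes as UTF-8 text, one line per row.
    return [{"text": text} for text in pdf.decode("utf-8").splitlines()]


def toy_classifier(lines: List[Line]) -> List[str]:
    return ["body"] * len(lines)


def toy_aggregator(lines: List[Line]) -> Dict[str, str]:
    # Join the text of all lines sharing a label, in reading order.
    grouped: Dict[str, List[str]] = {}
    for line in lines:
        grouped.setdefault(line["label"], []).append(line["text"])
    return {label: "\n".join(texts) for label, texts in grouped.items()}
```

Keeping `prepare_data` separate from `predict` matches the note above that the prepared data can also serve as a training set for the classifier.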
aggregators
styled
StyledAggregator
Bases: SimpleAggregator
Aggregator that returns text and styles.
Source code in edspdf/aggregators/styled.py
base
BaseAggregator
Bases: ABC
Source code in edspdf/aggregators/base.py
aggregate(lines)
abstractmethod
Handles the text aggregation.
Source code in edspdf/aggregators/base.py