edspdf.extractors
functional
get_blocs(layout)
Extract text blocs from a PDFMiner layout generator.
| PARAMETER | DESCRIPTION |
|---|---|
| layout | PDFMiner layout generator. |

| YIELDS | DESCRIPTION |
|---|---|
| bloc | Text bloc. |
Source code in edspdf/extractors/functional.py
get_lines(layout)
Extract lines from a PDFMiner layout object.
Line coordinates are reframed so that the origin is the top-left corner.
| PARAMETER | DESCRIPTION |
|---|---|
| layout | PDFMiner layout object. |

| YIELDS | DESCRIPTION |
|---|---|
| Iterator[Line] | Single line object. |
Source code in edspdf/extractors/functional.py
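PDF coordinates place the origin at the bottom-left of the page, so reframing to a top-left origin amounts to flipping the vertical axis. A minimal sketch of that transformation, assuming each line is described by its `y0`/`y1` bounds and that the page height is known; `flip_vertical` is a hypothetical helper, not part of edspdf:

```python
from typing import Tuple


def flip_vertical(y0: float, y1: float, page_height: float) -> Tuple[float, float]:
    """Convert bottom-left-origin bounds to top-left-origin bounds.

    Hypothetical helper for illustration; edspdf's own code may differ.
    """
    # The old top edge (y1) becomes the new, smaller coordinate.
    return page_height - y1, page_height - y0


# A line spanning y=700..720 on an 800-unit-tall page
# sits 80..100 units below the top edge.
print(flip_vertical(700.0, 720.0, 800.0))
```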
remove_outside_lines(lines, strict_mode=False)
Filter out lines that are outside the canvas.
| PARAMETER | DESCRIPTION |
|---|---|
| lines | Dataframe of extracted lines. |
| strict_mode | Whether to remove the line if any part of it is outside the canvas, by default False. |

| RETURNS | DESCRIPTION |
|---|---|
| pd.DataFrame | Filtered lines. |
Source code in edspdf/extractors/functional.py
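The effect of `strict_mode` can be sketched on plain dictionaries rather than a DataFrame. This assumes coordinates are normalised so that the canvas spans [0, 1] on both axes, which this page does not state; `is_kept` is a hypothetical helper, not the library's implementation:

```python
from typing import Dict, List


def is_kept(line: Dict[str, float], strict_mode: bool = False) -> bool:
    """Decide whether a line survives the filter (illustrative only).

    Assumes x0 <= x1 and y0 <= y1, with the canvas spanning [0, 1] on both axes.
    """
    if strict_mode:
        # Keep only lines that lie entirely inside the canvas.
        return min(line["x0"], line["y0"]) >= 0 and max(line["x1"], line["y1"]) <= 1
    # Otherwise drop a line only when it lies entirely outside the canvas.
    fully_before = line["x1"] < 0 or line["y1"] < 0
    fully_after = line["x0"] > 1 or line["y0"] > 1
    return not (fully_before or fully_after)


lines: List[Dict[str, float]] = [
    {"x0": 0.1, "x1": 0.5, "y0": 0.1, "y1": 0.2},  # fully inside
    {"x0": 0.9, "x1": 1.1, "y0": 0.1, "y1": 0.2},  # straddles the right edge
    {"x0": 1.2, "x1": 1.4, "y0": 0.1, "y1": 0.2},  # fully outside
]

lenient = [line for line in lines if is_kept(line)]
strict = [line for line in lines if is_kept(line, strict_mode=True)]
print(len(lenient), len(strict))
```

In this sketch the line that straddles the edge is kept by default and dropped only in strict mode.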
base
BaseExtractor
Bases: ABC
Source code in edspdf/extractors/base.py
extract(pdf)
abstractmethod
Handles the extraction
Source code in edspdf/extractors/base.py
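Because `extract` is abstract, a concrete extractor has to override it before it can be instantiated. A minimal sketch of that contract, using a stand-in `BaseExtractor` so the snippet runs without edspdf installed; `ByteCountExtractor` and its return value are purely illustrative:

```python
from abc import ABC, abstractmethod


class BaseExtractor(ABC):
    """Stand-in mirroring the documented interface (not the edspdf class itself)."""

    @abstractmethod
    def extract(self, pdf: bytes):
        """Handles the extraction."""


class ByteCountExtractor(BaseExtractor):
    """Hypothetical subclass: reports the size of the input instead of parsing it."""

    def extract(self, pdf: bytes) -> int:
        return len(pdf)


extractor = ByteCountExtractor()
print(extractor.extract(b"%PDF-1.7"))

# Instantiating the abstract base directly raises TypeError.
try:
    BaseExtractor()
except TypeError:
    print("BaseExtractor cannot be instantiated")
```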
pdfminer
PdfMinerExtractor
Bases: BaseExtractor
Extractor object. Given a PDF byte stream, produces a list of blocs.
| PARAMETER | DESCRIPTION |
|---|---|
| line_overlap | See PDFMiner documentation. |
| char_margin | See PDFMiner documentation. |
| line_margin | See PDFMiner documentation. |
| word_margin | See PDFMiner documentation. |
| boxes_flow | See PDFMiner documentation. |
| detect_vertical | See PDFMiner documentation. |
| all_texts | See PDFMiner documentation. |
Source code in edspdf/extractors/pdfminer.py
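The seven parameters share their names with the fields of pdfminer.six's `LAParams`. For orientation, the sketch below lists the defaults that `LAParams` itself uses. Whether `PdfMinerExtractor` has the same defaults, or forwards the values unchanged, is an assumption this page does not confirm, and `layout_kwargs` is a hypothetical helper:

```python
# Defaults of pdfminer.six's LAParams, listed for orientation.
# PdfMinerExtractor may define different defaults.
LAPARAMS_DEFAULTS = {
    "line_overlap": 0.5,
    "char_margin": 2.0,
    "line_margin": 0.5,
    "word_margin": 0.1,
    "boxes_flow": 0.5,
    "detect_vertical": False,
    "all_texts": False,
}


def layout_kwargs(**overrides):
    """Merge overrides into the defaults, rejecting unknown parameter names."""
    unknown = set(overrides) - set(LAPARAMS_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown layout parameters: {sorted(unknown)}")
    return {**LAPARAMS_DEFAULTS, **overrides}


# Tighter line grouping and vertical text detection.
params = layout_kwargs(line_margin=0.3, detect_vertical=True)
print(params["line_margin"], params["detect_vertical"])
```

If pdfminer.six is installed, the resulting dictionary can be passed on as `LAParams(**params)`.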
generate_lines(pdf)
Generates a DataFrame from all blocs in the PDF.

| PARAMETER | DESCRIPTION |
|---|---|
| pdf | Byte stream representing the PDF. |

| RETURNS | DESCRIPTION |
|---|---|
| pd.DataFrame | DataFrame representing the blocs. |
Source code in edspdf/extractors/pdfminer.py
extract(pdf)
Process a single PDF document.
| PARAMETER | DESCRIPTION |
|---|---|
| pdf | Raw byte representation of the PDF document. |

| RETURNS | DESCRIPTION |
|---|---|
| pd.DataFrame | DataFrame containing one row for each line extracted using PDFMiner. |
Source code in edspdf/extractors/pdfminer.py
style
models
BaseStyle
Bases: BaseModel
Model acting as an abstraction for a style.
Source code in edspdf/extractors/style/models.py
Style
Bases: BaseStyle
Model acting as an abstraction for a style.
Source code in edspdf/extractors/style/models.py
from_fontname(fontname, size, upright, x0, x1, y0, y1)
classmethod
Constructor using the compound fontname representation.
| PARAMETER | DESCRIPTION |
|---|---|
| fontname | Compound description of the font. Often |
| size | Character size. |
| upright | Whether the character is upright. |

| RETURNS | DESCRIPTION |
|---|---|
| Style | Style representation. |
Source code in edspdf/extractors/style/models.py
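PDF font names are often compound strings that pack the family and the variant together, sometimes after a six-letter subset prefix (for example `ABCDEF+Arial-BoldItalic`). A rough sketch of how such a name could be decomposed; the separators, the regular expressions and the `parse_fontname` helper are assumptions for illustration, not the logic of `Style.from_fontname`:

```python
import re
from typing import NamedTuple


class FontInfo(NamedTuple):
    font: str
    bold: bool
    italic: bool


def parse_fontname(fontname: str) -> FontInfo:
    """Split a compound font name into family and bold/italic flags (illustrative)."""
    # Drop a subset prefix such as "ABCDEF+" when present.
    name = re.sub(r"^[A-Z]{6}\+", "", fontname)
    # The family is whatever precedes the first "-" or "," separator.
    font = re.split(r"[-,]", name, maxsplit=1)[0]
    bold = re.search(r"bold", name, flags=re.IGNORECASE) is not None
    italic = re.search(r"italic|oblique", name, flags=re.IGNORECASE) is not None
    return FontInfo(font=font, bold=bold, italic=italic)


print(parse_fontname("ABCDEF+Arial-BoldItalic"))
print(parse_fontname("Times,Bold"))
```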
__eq__(other)
Computes equality between two styles.
| PARAMETER | DESCRIPTION |
|---|---|
| other | Style object to compare. |

| RETURNS | DESCRIPTION |
|---|---|
| bool | Whether the two styles are equal. |
Source code in edspdf/extractors/style/models.py
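One plausible design is for style equality to ignore where the text sits on the page, so that two words set in the same font compare equal. A sketch under that assumption, using a stand-in dataclass; which fields edspdf actually compares is not stated on this page:

```python
from dataclasses import dataclass, field


@dataclass
class SimpleStyle:
    """Stand-in for a style; position fields are excluded from comparison."""

    font: str
    size: float
    bold: bool = False
    italic: bool = False
    # compare=False keeps the coordinates out of the generated __eq__.
    x0: float = field(default=0.0, compare=False)
    x1: float = field(default=0.0, compare=False)


a = SimpleStyle("Arial", 12.0, bold=True, x0=0.1, x1=0.2)
b = SimpleStyle("Arial", 12.0, bold=True, x0=0.5, x1=0.6)
c = SimpleStyle("Arial", 12.0, bold=False, x0=0.1, x1=0.2)

print(a == b)  # same font attributes, different position
print(a == c)  # differs in boldness
```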
StyledText
Bases: BaseModel
Abstraction of a word, containing the style and the text.
Source code in edspdf/extractors/style/models.py