edspdf.readers
reader
PdfReader
Source code in edspdf/readers/reader.py
__init__(extractor=None, classifier=None, aggregator=None, transform=None, meta_labels=dict())
Reads a text-based PDF document,
| PARAMETER | DESCRIPTION |
|---|---|
| extractor | Text block extractor. |
| classifier | Classifier model, to assign a section (eg |
| aggregator | Aggregator model, to compile labelled text blocks together. |
| transform | Transformation to apply before classification. |
| meta_labels | Dictionary of hierarchical labels (eg |
predict(lines)
Predict the label of each text block.
| PARAMETER | DESCRIPTION |
|---|---|
| lines | Text blocks to label. |

| RETURNS | DESCRIPTION |
|---|---|
| pd.DataFrame | Labelled text blocks. |
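
As a rough illustration of the labelling step, the sketch below assigns a label to each block and then maps it through a `meta_labels` dictionary. It is not the edspdf implementation: plain lists of dicts stand in for the pandas DataFrame, and the names `predict_labels`, `toy_classifier`, `label`, `meta_label`, and `y0` are assumptions made for this example, as is the choice to leave unmapped labels unchanged.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]  # one text block; stands in for a DataFrame row


def predict_labels(
    lines: List[Row],
    classify: Callable[[Row], str],
    meta_labels: Dict[str, str],
) -> List[Row]:
    """Return copies of `lines` with `label` and `meta_label` keys added."""
    labelled = []
    for line in lines:
        label = classify(line)
        # Assumption: a label absent from the mapping is kept unchanged.
        meta_label = meta_labels.get(label, label)
        labelled.append({**line, "label": label, "meta_label": meta_label})
    return labelled


def toy_classifier(line: Row) -> str:
    """Hypothetical rule: blocks near the top of the page are headers."""
    return "header" if line["y0"] < 0.1 else "body"


lines = [
    {"text": "Hospital X", "y0": 0.02},
    {"text": "The patient was admitted.", "y0": 0.40},
]
result = predict_labels(lines, toy_classifier, meta_labels={"header": "pollution"})
```

Returning copies rather than mutating the input keeps the original blocks reusable, for instance when comparing several classifiers on the same extraction.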
prepare_data(pdf, **context)
Prepare data before classification. Can also be used to generate the training dataset for the classifier.
| PARAMETER | DESCRIPTION |
|---|---|
| pdf | PDF document, as bytes. |

| RETURNS | DESCRIPTION |
|---|---|
| pd.DataFrame | Text blocks as a pandas DataFrame. |
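
One way to picture the preparation step, and its use for building a training set, is sketched below. It assumes that context values are attached to every extracted block before the optional transformation runs; that ordering, the helper names (`prepare_rows`, `toy_extractor`, `add_length`), and the `source` and `n_chars` keys are all hypothetical. Lists of dicts stand in for the DataFrame, and the byte strings are placeholders, not real PDF files.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]  # one text block; stands in for a DataFrame row


def prepare_rows(
    pdf: bytes,
    extractor: Callable[[bytes], List[Row]],
    transform: Optional[Callable[[List[Row]], List[Row]]] = None,
    **context: Any,
) -> List[Row]:
    """Extract blocks, attach context to each one, then apply the transform."""
    rows = [{**row, **context} for row in extractor(pdf)]
    return transform(rows) if transform is not None else rows


def toy_extractor(pdf: bytes) -> List[Row]:
    """Hypothetical extractor: one block per non-empty line of the input."""
    return [{"text": t} for t in pdf.decode("utf-8").splitlines() if t.strip()]


def add_length(rows: List[Row]) -> List[Row]:
    """Hypothetical transform: add a character-count feature."""
    return [{**row, "n_chars": len(row["text"])} for row in rows]


# Build a training set by preparing several documents and stacking the rows.
documents = {"a.pdf": b"Title\nFirst paragraph", "b.pdf": b"Other title"}
training_rows: List[Row] = []
for name, content in documents.items():
    training_rows.extend(
        prepare_rows(content, toy_extractor, add_length, source=name)
    )
```

The resulting rows carry both the extracted features and the per-document context, so they can be annotated and fed to a classifier without re-running extraction.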
__call__(pdf, **context)
Process the PDF document.
| PARAMETER | DESCRIPTION |
|---|---|
| pdf | Byte representation of the PDF document. |
| context | Any contextual information that is used by the classifier (e.g. document type or source). TYPE: Any |

| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, str] | Dictionary containing the aggregated text. |
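
The shape of the returned dictionary can be illustrated with a minimal aggregation sketch: labelled blocks are grouped by label and their text is joined. This is an assumption about what a simple aggregator might do, not the library's own aggregator; the function name `aggregate`, the `label` and `text` keys, and the newline separator are chosen for the example.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]  # one labelled text block


def aggregate(lines: List[Row], label_key: str = "label") -> Dict[str, str]:
    """Group blocks by label and join their text in reading order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        grouped[line[label_key]].append(line["text"])
    # One string per label, matching the documented Dict[str, str] return type.
    return {label: "\n".join(texts) for label, texts in grouped.items()}


labelled = [
    {"text": "Hospital X", "label": "header"},
    {"text": "The patient was admitted.", "label": "body"},
    {"text": "No complications.", "label": "body"},
]
aggregated = aggregate(labelled)
```

Passing a different `label_key` (for example a coarser meta-label column) would yield one entry per group at that level instead.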