Rule-based Extraction
Let's create a rule-based extractor for PDF documents.
Note
This pipeline will likely perform poorly as soon as your PDF documents come in varied forms. In that case, even a very simple trained pipeline may give you a substantial performance boost (see next section).
First, download this example PDF.
We will use the following configuration:
config.cfg
[reader]
@readers = "pdf-reader.v1" # (1)
[reader.extractor]
@extractors = "pdfminer.v1" # (2)
[reader.classifier]
@classifiers = "mask.v1" # (3)
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1
[reader.aggregator]
@aggregators = "styled.v1" # (4)
- This is the top-level object, which organises the entire extraction process.
- Here we use the provided text-based extractor, based on the PDFMiner library
- This is where we define the rule-based classifier. Here, we use a "mask",
meaning that every text bloc that falls within the boundaries will be assigned
the
bodylabel, everything else will be tagged as pollution. - This aggregator returns a tuple of dictionaries. The first contains compiled text for each label, the second exports their style.
Save the configuration as config.cfg and run the following snippet:
import edspdf
from pathlib import Path
reader = edspdf.load("config.cfg") # (1)
# Get a PDF
pdf = Path("letter.pdf").read_bytes()
texts, styles = reader(pdf)
This code will output the following results:
Cher Pr ABC, Cher DEF,
Nous souhaitons remercier le CSE pour son avis favorable quant à l’accès aux données de
l’Entrepôt de Données de Santé du projet n° XXXX.
Nous avons bien pris connaissance des conditions requises pour cet avis favorable, c’est
pourquoi nous nous engageons par la présente à :
• Informer individuellement les patients concernés par la recherche, admis à l'AP-HP
avant juillet 2017, sortis vivants, et non réadmis depuis.
• Effectuer une demande d'autorisation à la CNIL en cas d'appariement avec d’autres
cohortes.
Bien cordialement,
The start and end columns refer to the character indices within the extracted text.
| fontname | font | style | size | upright | x0 | x1 | y0 | y1 | start | end |
|---|---|---|---|---|---|---|---|---|---|---|
| BCDFEE+Calibri | BCDFEE+Calibri | Normal | 9.9600 | True | 0.3389 | 0.4949 | 0.3012 | 0.3130 | 0 | 22 |
| BCDFEE+Calibri | BCDFEE+Calibri | Normal | 9.9600 | True | 0.3389 | 0.8024 | 0.3488 | 0.3606 | 24 | 90 |
| BCDHEE+Calibri | BCDHEE+Calibri | Normal | 9.9600 | True | 0.8024 | 0.8066 | 0.3488 | 0.3606 | 90 | 91 |
| BCDFEE+Calibri | BCDFEE+Calibri | Normal | 9.9600 | True | 0.8067 | 0.9572 | 0.3488 | 0.3606 | 91 | 111 |
| BCDFEE+Calibri | BCDFEE+Calibri | Normal | 9.9600 | True | 0.3030 | 0.3069 | 0.3655 | 0.3773 | 112 | 113 |
| BCDHEE+Calibri | BCDHEE+Calibri | Normal | 9.9600 | True | 0.3069 | 0.3111 | 0.3655 | 0.3773 | 113 | 114 |
| BCDFEE+Calibri | BCDFEE+Calibri | Normal | 9.9600 | True | 0.3111 | 0.6476 | 0.3655 | 0.3773 | 114 | 161 |
| BCDFEE+Calibri | BCDFEE+Calibri | Normal | 9.9600 | True | 0.3389 | 0.9327 | 0.3893 | 0.4011 | 163 | 247 |
| BCDHEE+Calibri | BCDHEE+Calibri | Normal | 9.9600 | True | 0.9327 | 0.9369 | 0.3893 | 0.4011 | 247 | 248 |
| BCDFEE+Calibri | BCDFEE+Calibri | Normal | 9.9600 | True | 0.9369 | 0.9572 | 0.3893 | 0.4011 | 248 | 251 |
| BCDFEE+Calibri | BCDFEE+Calibri | Normal | 9.9600 | True | 0.3030 | 0.6440 | 0.4060 | 0.4178 | 252 | 300 |
| SymbolMT | SymbolMT | Normal | 9.9600 | True | 0.3444 | 0.3521 | 0.4299 | 0.4418 | 302 | 303 |
| BCDFEE+Calibri | BCDFEE+Calibri | Normal | 9.9600 | True | 0.3746 | 0.9568 | 0.4303 | 0.4422 | 304 | 386 |
| BCDFEE+Calibri | BCDFEE+Calibri | Normal | 9.9600 | True | 0.3746 | 0.7544 | 0.4470 | 0.4588 | 387 | 445 |
| SymbolMT | SymbolMT | Normal | 9.9600 | True | 0.3444 | 0.3521 | 0.4710 | 0.4828 | 447 | 448 |
| BCDFEE+Calibri | BCDFEE+Calibri | Normal | 9.9600 | True | 0.3746 | 0.9096 | 0.4714 | 0.4832 | 449 | 523 |
| BCDHEE+Calibri | BCDHEE+Calibri | Normal | 9.9600 | True | 0.9097 | 0.9139 | 0.4714 | 0.4832 | 523 | 524 |
| BCDFEE+Calibri | BCDFEE+Calibri | Normal | 9.9600 | True | 0.9139 | 0.9572 | 0.4714 | 0.4832 | 524 | 530 |
| BCDFEE+Calibri | BCDFEE+Calibri | Normal | 9.9600 | True | 0.3746 | 0.4389 | 0.4882 | 0.5000 | 531 | 540 |
| BCDFEE+Calibri | BCDFEE+Calibri | Normal | 9.9600 | True | 0.3389 | 0.4678 | 0.5357 | 0.5475 | 542 | 560 |
