# Trained Pipeline with Scikit-Learn
In this section, we'll see how to train a machine learning classifier to improve performance. In this example, we will use a Scikit-Learn pipeline.
!!! warning

    Scikit-Learn is ill-equipped to deal with text data. As such, it is not the best candidate for an effective classification method. However, it can still perform quite well, and remains a good place to start tinkering with the inner workings of EDS-PDF.
## PDF annotation
See the PDF annotation recipe for one annotation methodology. For the rest of this recipe, we will assume that the dataset follows the same structure.
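Concretely, the `prepare_dataset` helper further down expects each PDF to sit next to a JSON sidecar file of the same name, holding the page metadata along with an `annotations` key. The file names below are purely illustrative:

```text
dataset/train/
├── letter_1.pdf
├── letter_1.json   # metadata + "annotations"
├── letter_2.pdf
└── letter_2.json
```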
## Pipeline definition
Let's use the following pipeline:
```toml
[reader]
@readers = "pdf-reader.v1"

[reader.extractor]
@extractors = "pdfminer-extractor.v1"

[reader.transform]
@transforms = "chain.v1"

[reader.transform.*.dates]
@transforms = "dates.v1"

[reader.transform.*.telephone]
@transforms = "telephone.v1"

[reader.transform.*.dimensions]
@transforms = "dimensions.v1"

# The model has not been trained yet.
# We still reference it to make sure we use the same configuration.
[reader.classifier]
@classifiers = "sklearn.v1"
path = "classifier.joblib"

[reader.aggregator]
@aggregators = "styled.v1"
```
## Data preparation
The reader object exposes a `prepare_data` method, which runs the pipeline up to the classification phase and returns the DataFrame exactly as the classifier would see it. We can therefore use it to produce a training dataset for the classification step.

This means we can use the same configuration to prepare the training data and to run the full pipeline, guaranteeing that the data will be pre-processed the same way at training time and at runtime.
```python
# ↑ Omitted code from the annotation recipe ↑

import json
from pathlib import Path

import pandas as pd

from edspdf import registry, Config
from edspdf.reading import PdfReader
from edspdf.classification.align import align_labels


def prepare_dataset(
    reader: PdfReader,
    directory: Path,
) -> pd.DataFrame:
    """
    Read and pre-process the PDFs from the dataset directory.

    Parameters
    ----------
    reader : PdfReader
        Reader used to pre-process the PDFs.
    directory : Path
        Dataset directory.

    Returns
    -------
    pd.DataFrame
        Pandas DataFrame containing the prepared lines.
    """
    dfs = []

    for path in directory.glob("*.pdf"):
        meta = json.loads(path.with_suffix(".json").read_text())

        # The annotations are handled separately, via get_annotations
        del meta["annotations"]

        # Run the pipeline up to the classification phase
        df = reader.prepare_data(path.read_bytes(), **meta)
        dfs.append(df)

    return pd.concat(dfs)


config = Config().from_disk("config.cfg")
del config["reader"]["classifier"]  # (1)

reader = registry.resolve(config)["reader"]

path = Path("dataset/train")

annotations = get_annotations(path)  # (2)
lines = prepare_dataset(reader, path)

annotated = align_labels(lines=lines, labels=annotations, threshold=0.8)  # (3)

annotated.to_csv("data.csv", index=False)
```
1. We remove the classifier from the pipeline definition, since it has not been trained at this point.
2. See the PDF annotation recipe.
3. The `annotated` object now contains every text block that was covered by an annotated region, along with its label.
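As a quick sanity check, you can inspect the aligned dataset before saving it. The snippet below only assumes what the code above already implies: the DataFrame carries a `label` column (it is dropped later to build the training features):

```python
# Inspect the aligned training data
print(annotated.shape)
print(annotated["label"].value_counts())
```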
## Training the machine learning pipeline
Now everything is ready to train a Scikit-Learn pipeline! Let's define a simple classifier:
```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

n_components = 20
max_features = 2000
seed = 0

# TF-IDF representation of the text, reduced with a truncated SVD
text_vectorizer = Pipeline(
    [
        ("vect", CountVectorizer(strip_accents="ascii", max_features=max_features)),
        ("tfidf", TfidfTransformer()),
        ("reduction", TruncatedSVD(n_components=n_components, random_state=seed)),
    ]
)

# Random forest on the standardised features
classifier = Pipeline(
    [
        ("norm", StandardScaler()),
        ("clf", RandomForestClassifier(random_state=seed)),
    ]
)

pipeline = Pipeline(
    [
        (
            "union",
            ColumnTransformer(
                [
                    ("text", text_vectorizer, "text"),
                    (
                        "others",
                        "passthrough",
                        [
                            "page",
                            "x0",
                            "x1",
                            "y0",
                            "y1",
                            "telephone",
                            "date",
                            "width",
                            "height",
                            "area",
                        ],
                    ),
                ]
            ),
        ),
        ("classifier", classifier),
    ]
)
```
And train it (assuming the pipeline definition above is saved as a `classifier.py` module, hence the import below):
```python
import pandas as pd
from joblib import dump

from classifier import pipeline

data = pd.read_csv("data.csv")

X_train, Y_train = data.drop(columns=["label"]), data["label"]

pipeline.fit(X_train, Y_train)

dump(pipeline, "classifier.joblib")
```
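Before wiring the trained model back into the pipeline, it can be worth measuring how it behaves on held-out documents. The following is a minimal sketch, assuming a `dataset/test` directory that mirrors `dataset/train` (that directory name is an assumption, not part of the recipe), and reusing the `reader`, `get_annotations`, `prepare_dataset` and `align_labels` helpers from the data-preparation step:

```python
# Hedged sketch: evaluate the trained classifier on a held-out split.
# "dataset/test" is an assumed directory mirroring "dataset/train".
from pathlib import Path

from joblib import load
from sklearn.metrics import classification_report

pipeline = load("classifier.joblib")

test_path = Path("dataset/test")

# Reuse the helpers defined in the data-preparation step
test_annotations = get_annotations(test_path)
test_lines = prepare_dataset(reader, test_path)
test_data = align_labels(lines=test_lines, labels=test_annotations, threshold=0.8)

X_test, y_test = test_data.drop(columns=["label"]), test_data["label"]

print(classification_report(y_test, pipeline.predict(X_test)))
```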
## Using the full pipeline
Now that the machine learning model is trained, we can use the full pipeline:
```python
from pathlib import Path

import edspdf

reader = edspdf.load("config.cfg")

# Get a PDF
pdf = Path("letter.pdf").read_bytes()

texts = reader(pdf)

texts["body"]
# Out: Cher Pr ABC, Cher DEF,\n...
```