Reproducibility
To make our models reproducible, we rely on virtual environments (Poetry) and version control tools (DVC).
Environments
We use Poetry to resolve the dependency constraints of our model and to lock
the versions that were used to generate a model in a poetry.lock file.
This file can be reused to reinstall a previous environment by running:
poetry install --with docs
Versioning
We use DVC to version the experiments, models and datasets used.
To add and version a new dataset, run:
dvc import-url url/or/path/to/your/dataset data/dataset
To (re-)train a model and package it, just run:
dvc repro
For more information about DVC, make sure to visit their documentation.
Article experiments
To reproduce the results of our article, run the experiments.py script to queue and
run all the experiments with DVC:
python scripts/experiments.py
dvc exp run --queue --run-all
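Queueing can be pictured as one `dvc exp run --queue` call per parameter combination, using DVC's `-S` (`--set-param`) option. The following is only an illustrative sketch: the parameter names (`train.seed`, `train.batch_size`), the grid and the helper functions are hypothetical and are not taken from `scripts/experiments.py`.

```python
import itertools
import subprocess


def queue_commands(grid: dict[str, list]) -> list[list[str]]:
    """Build one `dvc exp run --queue` command per combination in the grid."""
    keys = list(grid)
    commands = []
    for values in itertools.product(*(grid[key] for key in keys)):
        cmd = ["dvc", "exp", "run", "--queue"]
        for key, value in zip(keys, values):
            # -S overrides a single parameter for the queued experiment.
            cmd += ["-S", f"{key}={value}"]
        commands.append(cmd)
    return commands


def queue_all(grid: dict[str, list], dry_run: bool = True) -> None:
    """Print the commands (dry run) or hand them to DVC."""
    for cmd in queue_commands(grid):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical grid: 2 seeds x 2 batch sizes = 4 queued experiments.
    queue_all({"train.seed": [0, 1], "train.batch_size": [16, 32]})
```

The dry-run default only prints the commands, so the grid can be checked before anything is added to the DVC queue.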
Tip for Slurm environments
If your computing resources are managed with Slurm, you can run dvc exp queue-worker
from Slurm jobs instead of the last command to parallelize the experiments across multiple nodes.
#SBATCH ...
dvc exp queue-worker dvc-worker-$SLURM_JOB_ID -v
$ sbatch my_slurm_job.sh # first job
$ sbatch my_slurm_job.sh # launch as many jobs at once as needed
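Submitting several worker jobs at once can also be scripted. This is a minimal sketch under the assumption that `sbatch` is on the PATH and that the job script is named `my_slurm_job.sh` as above. The helper names and the worker count are hypothetical.

```python
import subprocess


def sbatch_commands(script: str, n_workers: int) -> list[list[str]]:
    """Return one `sbatch` invocation per requested worker job."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return [["sbatch", script] for _ in range(n_workers)]


def launch_workers(script: str, n_workers: int, dry_run: bool = True) -> None:
    """Print the submissions (dry run) or send them to Slurm."""
    for cmd in sbatch_commands(script, n_workers):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: shows the four submissions without contacting Slurm.
    launch_workers("my_slurm_job.sh", n_workers=4)
```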
To reproduce some of the figures of our article, run the analysis.py script to
generate the charts and tables in the docs/assets/figures folder.
python scripts/analysis.py
INFO:root:Loading experiments
INFO:root:Found 110 experiments
INFO:root:Building corpus statistics table
INFO:root:Computing results table, this can take a while...
INFO:root:Plotting BERT ablation experiments
INFO:root:Plotting results by labels
INFO:root:Plotting document type ablation experiments
INFO:root:Building comparison table of PDF extraction methods
INFO:root:Building comparison table of ML vs rule-based
Then visualize them by serving the documentation:
mkdocs serve