Reproducibility
To guarantee the reproducibility of our models, we rely on virtual environments and version-control tools.
Environments
We use Poetry to resolve the version constraints of our model's dependencies and to lock the exact versions used to build a model in a poetry.lock
file.
This file can later be reused to reinstall the same environment by running:
poetry install --with docs
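If you ever need to rebuild the environment from scratch, the standard Poetry CLI is sufficient. A minimal sketch (assuming Poetry ≥ 1.2 for the --all and --with flags; nothing here is specific to our project):

poetry env remove --all        # drop any existing virtual environments for this project
poetry install --with docs     # reinstall the exact versions pinned in poetry.lock
poetry run python --version    # sanity-check that the environment was created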
Versioning
We use DVC to version the experiments, models and datasets used.
To add and version a new dataset, run:
dvc import-url url/or/path/to/your/dataset data/dataset
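The import creates a pointer file (data/dataset.dvc by default) that is committed to Git in place of the data itself, and the import can be refreshed later if the source changes. A sketch using standard DVC and Git commands, with file names as DVC generates them by default:

git add data/dataset.dvc data/.gitignore    # track the pointer file, not the data
git commit -m "Add dataset"
dvc update data/dataset.dvc                 # re-fetch the dataset if the source changed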
To (re-)train a model and package it, run:
dvc repro
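Before a full reproduction, DVC can also report what would run; these are standard DVC commands, not project-specific ones:

dvc status    # list the stages whose dependencies have changed
dvc dag       # visualize the pipeline graph
dvc push      # upload the resulting artifacts to the configured remote, if any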
For more information about DVC, see their documentation.
Article experiments
To reproduce the results of our article, run the experiments.py
script, then queue and run all the experiments with DVC:
$ python scripts/experiments.py
$ dvc exp run --queue --run-all
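Once the runs have finished, the results can be compared and a chosen experiment restored with DVC's standard experiment commands (the experiment name below is a placeholder):

$ dvc exp show               # tabulate metrics and parameters across experiments
$ dvc exp apply <exp-name>   # restore a chosen experiment's results in the workspace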
Tip for Slurm environments
If your computing resources are managed with Slurm, you can run dvc exp queue-worker
from Slurm jobs instead of the last command above, to parallelize the experiments across multiple nodes:
#!/bin/bash
#SBATCH ...
dvc exp queue-worker dvc-worker-$SLURM_JOB_ID -v
$ sbatch my_slurm_job.sh # first job
$ sbatch my_slurm_job.sh # submit as many jobs as needed
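Progress can then be monitored with the usual Slurm and DVC commands (nothing here is project-specific):

$ squeue -u $USER   # check that the worker jobs are running
$ dvc exp show      # inspect results as the workers complete experiments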
To reproduce some of the figures of our article, run the analysis.py
script to generate the charts and tables in the docs/assets/figures
folder.
$ python scripts/analysis.py
INFO:root:Loading experiments
INFO:root:Found 110 experiments
INFO:root:Building corpus statistics table
INFO:root:Computing results table, this can take a while...
INFO:root:Plotting BERT ablation experiments
INFO:root:Plotting results by labels
INFO:root:Plotting document type ablation experiments
INFO:root:Building comparison table of PDF extraction methods
INFO:root:Building comparison table of ML vs rule-based
and visualize them by serving the documentation:
$ mkdocs serve
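The documentation is then served locally, by default at http://127.0.0.1:8000.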