To make the benchmarks reproducible, the versions of all Python libraries used in the benchmark should be pinned, for example via a Pipfile.lock. To make re-use even easier, a Docker image should be provided as well. (This may require a separate Makefile.)
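
As a rough illustration of what pinning via a Pipfile.lock could look like in practice, the sketch below reads the lock file's JSON and emits exact `name==version` lines that an image build could install. It assumes the usual Pipfile.lock layout (top-level `default` and `develop` sections, each mapping a package name to an entry with a `version` string such as `==1.26.4`). The function name `pinned_requirements` and the package versions in `EXAMPLE_LOCK` are hypothetical and not taken from this project.

```python
import json


def pinned_requirements(lock_text, sections=("default",)):
    """Return sorted 'name==version' lines from Pipfile.lock JSON text."""
    lock = json.loads(lock_text)
    lines = []
    for section in sections:
        for name, spec in lock.get(section, {}).items():
            version = spec.get("version")  # e.g. "==1.26.4"
            if version is None:
                # Entries without a "version" key (e.g. VCS or local
                # path dependencies) cannot be pinned this way; skip them.
                continue
            lines.append(f"{name}{version}")
    return sorted(lines)


# Hypothetical lock content, for illustration only.
EXAMPLE_LOCK = json.dumps({
    "_meta": {"requires": {"python_version": "3.11"}},
    "default": {
        "numpy": {"version": "==1.26.4"},
        "pandas": {"version": "==2.2.2"},
    },
    "develop": {
        "pytest": {"version": "==8.2.0"},
    },
})

if __name__ == "__main__":
    # Print only the runtime ("default") pins, one per line.
    print("\n".join(pinned_requirements(EXAMPLE_LOCK)))
```

The resulting lines could be written to a requirements file and installed in the image build, so the image and the lock file cannot drift apart. Whether that step lives in a Makefile target or elsewhere is left open here.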