Commit 9b7ff77

reproducible explained better (#20)
1 parent 8d30ef9 commit 9b7ff77

2 files changed: +142 -1 lines changed
Lines changed: 32 additions & 1 deletion
@@ -1,11 +1,42 @@
---
layout: post
title: What is expected exactly in terms of reproducibility?
-date: 2023-04-24 00:00:00
+date: 2023-07-04 00:00:00
tags: reproducibility
description: Discuss the different kinds of reproducibility at play in Computo, and what is expected from the authors.
---

Computo is not just about publishing a notebook and proving that it can be compiled with CI! This part of the process is what we call _"Editorial Reproducibility"_. _"Scientific"_ or _"numerical"_ reproducibility of the analyses is also mandatory, on top of classical peer-review evaluation.

We don't ask people to reproduce their data... yet! Nor do we ask for "bit-wise computational" reproducibility (i.e. obtaining exactly the same results bit for bit), but rather for "statistical" reproducibility, i.e. obtaining results that lead to the same conclusions, up to statistically non-significant variability.

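The distinction between bit-wise and statistical reproducibility can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the simulated two-group experiment, the function names `estimate_mean_difference` and `conclusion`, and the 1.96 threshold are hypothetical and are not part of Computo's review process.

```python
import random
import statistics


def estimate_mean_difference(seed, n=2000):
    """Simulate two groups and return (difference in means, standard error)."""
    rng = random.Random(seed)
    # Hypothetical data-generating process: group B is shifted by 0.5.
    group_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    group_b = [rng.gauss(0.5, 1.0) for _ in range(n)]
    diff = statistics.fmean(group_b) - statistics.fmean(group_a)
    std_err = (statistics.variance(group_a) / n
               + statistics.variance(group_b) / n) ** 0.5
    return diff, std_err


def conclusion(run, z=1.96):
    """Summarise a run as the sign of the effect, or 'none' if not detected."""
    diff, std_err = run
    if abs(diff) <= z * std_err:
        return "none"
    return "positive" if diff > 0 else "negative"


original = estimate_mean_difference(seed=1)
rerun = estimate_mean_difference(seed=2)  # e.g. another machine or RNG state

# Bit-wise reproducibility: the numbers themselves must match exactly.
bitwise_identical = original == rerun
# Statistical reproducibility: both runs must support the same conclusion.
statistically_consistent = conclusion(original) == conclusion(rerun)

print(f"bit-wise identical: {bitwise_identical}")
print(f"same conclusion: {statistically_consistent}")
```

Two runs with different seeds produce different numbers, yet both should support the same qualitative conclusion, which is the level of agreement described above.
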
![Reproducible Workflow](img/reproducible-sequence.svg)

Indeed, the global scientific workflow of a reproducible Computo submission may be split into two types of steps:

External
: This part of the process may be conducted outside of the notebook environment, for one or more of the following reasons:

- the process takes too long to run in a notebook
- the data to be processed is too big to be handled directly in the notebook
- it needs a specific environment (e.g. a cluster, GPUs, etc.)
- it needs specific languages (e.g. C, C++, Fortran, etc.) or build tools (e.g. make, CMake, etc.)

This is “Computational reproducibility”, where reproducibility is achieved by providing the code and the environment to run it, but not the results themselves.

Editorial
: This is where the notebook presents the results of the external process, and where everything is put together to produce the final document. This is “Direct reproducibility”, in the sense that the notebook is the only thing needed to reproduce the results.

Ultimately, the workflow must end with a direct reproducibility step which concludes the whole process.

When applicable, the switch from external to editorial reproducibility is done with a “data transfer” step, where the data produced by the external process is transferred to the notebook environment. Authors must provide not only the intermediate results, but also the code that transfers them into the notebook environment. There are a variety of software solutions to do so.

## Examples of data transfer solutions

### Intermediate results storage
- In a Python environment: the [`joblib.Memory`](https://joblib.readthedocs.io/en/latest/memory.html) class, which provides a caching mechanism for Python functions. It can save the result of a function call to disk and load it back later.
- In an R environment: the `.RData` file format, which can be loaded back into R with the `load()` function.

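As a minimal sketch of the Python option, the snippet below caches a function with `joblib.Memory`. The function `expensive_step`, the `call_log` list and the temporary cache directory are hypothetical stand-ins chosen for illustration. A real submission would more likely use a fixed directory kept next to the notebook.

```python
import tempfile

from joblib import Memory

# Hypothetical cache location; a real project would probably use a fixed
# directory stored alongside the notebook rather than a temporary one.
cache_dir = tempfile.mkdtemp(prefix="computo_cache_")
memory = Memory(location=cache_dir, verbose=0)

call_log = []  # records actual executions, to show when the cache is used


@memory.cache
def expensive_step(n_terms):
    """Stand-in for a long-running external computation."""
    call_log.append(n_terms)
    return sum(1.0 / (k * k) for k in range(1, n_terms + 1))


first = expensive_step(100_000)   # computed, then written to cache_dir
second = expensive_step(100_000)  # read back from cache_dir, not recomputed

print(f"function body executed {len(call_log)} time(s)")
```

With the same arguments, the second call is served from disk, so the function body runs only once.
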
### Transfer of the results to the notebook environment
- For both aforementioned solutions, the results (`.joblib` directory or `.RData` file) can be committed to the git repository and loaded directly in the notebook environment.
- Another solution is to host input data (when large) and intermediate results on a shared scientific data repository (we recommend [Zenodo](https://zenodo.org/) for this purpose), and download them into the notebook environment.
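
The download option could be sketched as follows, using only the Python standard library. The helper names `sha256_of` and `fetch_results` are hypothetical, and verifying a checksum is an assumption added here as a safeguard, not a stated Computo requirement. The URL and the expected checksum are supplied by the author.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_results(url, destination, expected_sha256):
    """Download `url` to `destination` unless a verified copy already exists."""
    destination = Path(destination)
    already_ok = (destination.exists()
                  and sha256_of(destination) == expected_sha256)
    if not already_ok:
        destination.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as response:
            destination.write_bytes(response.read())
    # Verify the file on disk, whether it was just downloaded or reused.
    actual = sha256_of(destination)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch for {destination}: {actual}")
    return destination
```

In the notebook, a call such as `fetch_results(url, "data/results.json", checksum)` would run before the results are loaded, where `url` is the file's download link on the repository and `checksum` is the value recorded when the file was uploaded. The same checksum check also works for files committed to the git repository, in which case no download takes place.
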
