Hi, I'm Carlos from the workshop on open science at EPFL/ETHZ (June 2022). First, congrats on the paper. It is always a lot of work, and it is very nice that you shared the code and provided a license.
I have some comments and feedback for you, split into three sections: first, some issues with running the scripts; second, the R package; and third, some suggestions on what you could do here, from my point of view.
Running the scripts
I was not able to run the scripts, for the following reasons:
- There are two folders without any description of what they contain; probably only someone with domain knowledge would know
- The files needed to run the analysis are not in the repo, and there is no indication of where they can be downloaded. One of the scripts loads an Rdata file without saying where it comes from. Specifically, it is this line:
  load("Input/InputReady.Mod2.Rdata")
- In the same script, the folder "Code" is referenced:
  source("Code/functions.needed.R")
  so it might be better to just change the Code to the current folder, as I see that the script is there.
- File paths are all absolute paths
- No mention about the libraries being used
- No mention of which R version is being used
- No header in the scripts describing what they do
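To make the points above concrete, here is a minimal sketch of how the top of a script could look. The file name, author details and the check_inputs helper are hypothetical placeholders, and the two base packages only stand in for the real dependencies, which I do not know. The folder and file names are the ones from the two lines quoted above.

```r
# ------------------------------------------------------------------
# Script:      run_model.R            (hypothetical file name)
# Author:      Jane Doe               (placeholder)
# Email:       jane.doe@example.org   (placeholder)
# Date:        2022-06-01             (placeholder)
# R version:   fill in the version the analysis was run with
# Description: one or two sentences on what the script does
# ------------------------------------------------------------------

# Load every package the script needs at the top, so readers can see
# the dependencies at a glance. 'stats' and 'utils' ship with R and
# stand in here for the real dependencies.
library(stats)
library(utils)

# Paths relative to the project root. The folder and file names come
# from the two lines quoted above.
input_file     <- file.path("Input", "InputReady.Mod2.Rdata")
functions_file <- file.path("Code", "functions.needed.R")

# Hypothetical helper: stop early with a readable message when a
# required file is missing.
check_inputs <- function(paths) {
  not_found <- paths[!file.exists(paths)]
  if (length(not_found) > 0) {
    stop("Missing file(s): ", paste(not_found, collapse = ", "),
         " (working directory: ", getwd(), ")")
  }
  invisible(TRUE)
}

# In the real script, these calls would follow:
# check_inputs(c(input_file, functions_file))
# source(functions_file)
# load(input_file)
```

With something like this, a person who downloads the repo sees right away which file is missing and which folder the script expects to run from, instead of getting a bare "cannot open file" error.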
R package - documentation
You mention that this is an R package for the paper. Judging by the scripts and folder structure, it is not one: the structure of an R package is missing. A good resource on how to build an R package is https://r-pkgs.org/.
Suggestions
I think your project would benefit from a folder structure for data science projects. For this you don't need to make an R package. My recommendations are the following:
- Create three top level folders: data, scripts and results
- In the scripts folder, add some subfolders and organize the scripts as you see fit
- Put all the raw data in the data folder
- The results folder is where you save the analysis results, such as images and even intermediate rds files
- Avoid Rdata for saving results; use rds files for intermediate states, so that you are explicit about what you are saving. An Rdata file can hold any number of objects, up to the whole environment, and load() restores them under their saved names, so it is impossible to know what you are loading by just looking at the script. This also increases the chance of bugs and unexpected results, since loaded objects can silently overwrite existing ones
- When working on an R project, start it as a project directly from RStudio. This makes it easier to work across different projects. Here is a link with more information: https://support.rstudio.com/hc/en-us/articles/200526207-Using-RStudio-Projects
- If your project is self-contained, try to use relative paths. This way anyone can download the files and run the scripts, provided they keep the folder structure.
- In each script, try to include headers with the following information: author, email, date (when you did the analysis) and a brief description of what the script is doing.
- Another option when using R is Quarto (https://quarto.org/). It lets you mix R code with markdown text, so you can explain what is happening and why, and it produces very nice reports. I have a repo here with an example: https://chronchi.github.io/transcriptomics
- When documenting the functions, try to follow the Roxygen guidelines; the r-pkgs website explains them in more detail (https://r-pkgs.org/man.html). The nice part is that the function documentation is generated automatically from the comments in the script and can be rendered as either PDF or HTML. No need to write anything beyond what is in the script.
- For the libraries you can have a requirements.txt file stating which libraries you are using
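As an illustration of the rds recommendation above, here is a small sketch contrasting the two formats. The data frame is a toy stand-in for a real intermediate result, and a temporary folder stands in for the proposed results folder; none of the names come from your scripts.

```r
# Toy intermediate result standing in for a real model output.
fit_summary <- data.frame(term = c("a", "b"), estimate = c(0.5, -1.2))

# A temporary folder stands in for the proposed results/ folder.
results_dir <- file.path(tempdir(), "results")
dir.create(results_dir, showWarnings = FALSE, recursive = TRUE)

# rds: one object per file, and the reading script chooses the name.
rds_file <- file.path(results_dir, "fit_summary.rds")
saveRDS(fit_summary, rds_file)
restored <- readRDS(rds_file)

# Rdata: load() recreates objects under their saved names, which the
# calling script never states. It returns those names as a character
# vector; loading into a scratch environment keeps them contained.
rdata_file <- file.path(results_dir, "fit_summary.Rdata")
save(fit_summary, file = rdata_file)
scratch <- new.env()
loaded_names <- load(rdata_file, envir = scratch)
```

With readRDS the assignment is visible in the script (restored <- ...), whereas with load the only way to find out what appeared is to inspect the returned names or the environment afterwards.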
Others
If you have any questions or want to discuss more, feel free to reply to this issue or drop me an email at carlos.ronchi@epfl.ch.
I wish you good luck with your research!