docs/reproducibility-guidance/howtos/input-data.qmd
Lines changed: 16 additions & 2 deletions
@@ -13,6 +13,20 @@ format:
 ::: callout-caution
 ## WIP
 
-This page is still in the works. Possible topics:
-* using `.gitignore` and precommit hooks to ensure privacy
+This page is still in the works. Overview of possible topics to summarize:
+
+- Save input data into the `/data/` folder and use `.gitignore` to ensure that the raw data is not committed, as per the deeper dive on structure:
+
+- If the data and analysis are simple, the analysis scripts in `/src/analysis/` will take the data and generate relevant outputs (data and visuals) in `/output/`
+
+- If data cleaning is more complex, you can create a `/data/raw`, a `/data/clean/`, and a `/src/data_cleaning.py` that converts from raw to clean (before analysis). This way anyone can reproduce the process and rerun the analysis with new data, because they can see exactly how the data is preprocessed before analysis.
+
+- Use `precommit` hooks to ensure that analysis notebooks are not committed with rendered output that may be sensitive.
+
+- Make sure you don't commit other sensitive information, such as access tokens or secrets, with the code and writeup. These can be set up in a way that others can repeat without committing them to git.
+
+- Of course, avoid mentioning sensitive things in the prose (say, the documentation)
+
+**NOTE**: Using these best practices is key if you use sensitive or confidential data. For public data, `.gitignore` is still a good practice so that you don't repost the raw data. This page should also touch upon how researchers should approach proprietary datasets ([#46](https://github.com/UN-Task-Team-for-Scanner-Data/reproducibility-project/issues/46))
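
The raw-to-clean step described in the list above could look roughly like the following. This is a minimal sketch rather than the project's actual script: the column names (`product_id`, `price`), the cleaning rule, and the helper names `clean_rows` and `clean_file` are all hypothetical and only illustrate the flow from `/data/raw` to `/data/clean/`.

```python
"""Hypothetical sketch of src/data_cleaning.py: converts raw input data to clean data."""
import csv
from pathlib import Path


def clean_rows(rows):
    """Yield cleaned rows, skipping rows whose price is missing, unparseable or not positive.

    The column names and the rule itself are assumptions made for illustration.
    """
    for row in rows:
        try:
            price = float(row.get("price"))
        except (TypeError, ValueError):
            continue
        if not price > 0:
            continue
        product_id = (row.get("product_id") or "").strip()
        yield {"product_id": product_id, "price": f"{price:.2f}"}


def clean_file(raw_path, clean_path):
    """Read one raw CSV, write the cleaned CSV, and return the number of rows kept."""
    raw_path, clean_path = Path(raw_path), Path(clean_path)
    clean_path.parent.mkdir(parents=True, exist_ok=True)
    kept = 0
    with raw_path.open(newline="") as src, clean_path.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=["product_id", "price"])
        writer.writeheader()
        for row in clean_rows(csv.DictReader(src)):
            writer.writerow(row)
            kept += 1
    return kept


if __name__ == "__main__":
    # data/raw/ is expected to be listed in .gitignore, so raw files never reach git.
    for raw_file in sorted(Path("data/raw").glob("*.csv")):
        n = clean_file(raw_file, Path("data/clean") / raw_file.name)
        print(f"{raw_file.name}: kept {n} rows")
```

Keeping the cleaning rules in one script means that someone with new raw data only needs to drop it into `data/raw/` and rerun the script before the analysis.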
docs/reproducibility-guidance/howtos/licences.qmd
Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ format:
 ::: callout-caution
 ## WIP
 
-This page is still in the works. It will refer to guidance, including:
+This page is still in the works. Possible topics (see [#52](https://github.com/UN-Task-Team-for-Scanner-Data/reproducibility-project/issues/52) for more info) could include:
 
 - Licenses for code (such as by [NHS RAP guide](https://nhsdigital.github.io/rap-community-of-practice/implementing_RAP/publishing_code/licensing-your-code/#what-is-a-software-licence))
 subtitle: "Summary of the basic metadata concepts and how they can help"
 draft: true
 sidebar: true
-date: 2025-06-13
-summary: "metadata summary..."
+date: 2025-12-01
 format:
   html:
     toc: true
@@ -14,5 +13,39 @@ format:
 ::: callout-caution
 ## WIP
 
-This page is still in the works
+This page is still in the works. The guide ([#51](https://github.com/UN-Task-Team-for-Scanner-Data/reproducibility-project/issues/51)) could cover such topics as:
+
+- Metadata exists to help others find, understand, and reuse everything. The [How-to-fair guide on the topic](https://www.howtofair.dk/how-to-fair/metadata/) provides an intro, and the [FAIR cookbook](https://faircookbook.elixir-europe.org/content/home.html) is also a useful resource.
+
+- Possible metadata that are key to know about:
+
+    - Metadata that helps findability: ideally, all objects of a research process should have persistent identifiers (PIDs) so that they can be easily found and cited. In price statistics, some already exist or are possible, and others are not yet set up:
+
+        - Researchers in the discipline can sign up to create an ORCID. This helps you be found and get fair recognition for your work.
+
+        - Publishing datasets to data repositories like Zenodo mints DOIs for them. How to handle proprietary datasets is still to be confirmed (see [#46](https://github.com/UN-Task-Team-for-Scanner-Data/reproducibility-project/issues/46))
+
+        - Papers in official journals have DOIs that the journal creates as part of the publication process. Ideally, [papers published as part of conference proceedings could also have DOIs](https://datascience.codata.org/articles/dsj-2022-011) (as many disciplines now do); however, this isn't yet done in price statistics.
+
+        - Code (i.e. the research compendium) is published in a way that mints a persistent identifier. Note that GitHub doesn't mint a DOI. That may be okay for interim code, and published code could be pushed to Zenodo (which does).
+
+    - Metadata that helps interoperability:
+
+        - The descriptive and structural metadata (i.e. info about each dataset) is outlined in the catalogue, so we aim to help solve some of this with the catalogue.
+
+        - While not exclusive, we are trying to follow the basic [Dublin Core](https://www.dublincore.org/resources/metadata-basics/)
+
+        - The way we define various things is as standard as possible so that it's easy to use
+
+        - The idea is that researchers (and their programs) can more easily understand the open datasets they use for their research.
+
+    - Accessibility:
+
+        - Ways to get the data are as simple as possible, say by using [download_zenodo (in R)](https://rdrr.io/github/inbo/inborutils/man/download_zenodo.html) to automate the downloading of data via its DOI
+
+    - Metadata that helps reusability:
+
+        - Knowing how datasets or code are licensed so that you know when and how you can reuse them. We document this in the limitations section of the dataset record
+
+        - Provenance is clear. Say a dataset is made available on Zenodo: where it came from and how it was created or modified should be clear.
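
To make the metadata categories above more concrete, here is a small sketch of a dataset record keyed by Dublin Core element names, with a helper that flags problems. The element names come from the 15-element Dublin Core Metadata Element Set. Everything else is an assumption for illustration: the `REQUIRED` list, the `check_record` helper, and all values in `example` (including the placeholder DOI) are hypothetical and do not reflect the project's catalogue schema.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})

# Hypothetical choice of elements a catalogue might insist on; not a project rule.
REQUIRED = ("title", "creator", "date", "identifier", "rights")


def check_record(record):
    """Return a list of problems found in a metadata record (empty list means OK)."""
    problems = [
        f"unknown element: {key}"
        for key in sorted(record)
        if key not in DUBLIN_CORE_ELEMENTS
    ]
    problems += [f"missing element: {key}" for key in REQUIRED if not record.get(key)]
    return problems


# Placeholder values only: identifier supports findability,
# rights and source support reusability and provenance.
example = {
    "title": "Example synthetic scanner dataset",
    "creator": "A. Researcher",
    "date": "2025-12-01",
    "identifier": "doi:10.0000/placeholder",
    "rights": "CC-BY-4.0",
    "source": "Derived from a hypothetical retailer extract",
}

print(check_record(example))
```

A check like this could run before a dataset record is added to a catalogue, so that records without an identifier or licence are caught early.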
 subtitle: "Creating and sharing synthetic (or real but modified) data"
 draft: true
 sidebar: true
-date: 2025-XX-XX
-summary: "summary..."
+date: 2025-12-01
 format:
   html:
     toc: true
@@ -14,5 +13,11 @@ format:
 ::: callout-caution
 ## WIP
 
-This page is still in the works
+This page is still in the works. Possible topics (see [#43](https://github.com/UN-Task-Team-for-Scanner-Data/reproducibility-project/issues/43) for more info) to include:
+
+- Common packages for making synthetic data. You could include these in your research compendium so that others can recreate your data.
+
+- Whether or not to publish synthetic data (say, to Zenodo)
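
As an illustration of the first topic, the sketch below generates a small synthetic price dataset that others could recreate exactly by rerunning the code with the same seed. It uses only Python's standard `random` module rather than a dedicated synthetic-data package. The function name `make_synthetic_prices`, the field names, and the price and quantity ranges are all hypothetical.

```python
import random


def make_synthetic_prices(n_products=5, n_periods=3, seed=42):
    """Generate synthetic price records; the same seed always gives the same data."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    records = []
    for p in range(n_products):
        price = round(rng.uniform(1.0, 20.0), 2)  # assumed starting price range
        for t in range(n_periods):
            records.append({
                "product_id": f"P{p:03d}",
                "period": t,
                "price": price,
                "quantity": rng.randint(1, 100),  # assumed quantity range
            })
            # Assumed dynamics: price moves by up to +/-5% between periods.
            price = round(price * (1 + rng.uniform(-0.05, 0.05)), 2)
    return records


if __name__ == "__main__":
    for record in make_synthetic_prices(n_products=2, n_periods=2):
        print(record)
```

Because the generator is seeded, committing this script to the research compendium is enough for others to rebuild the same synthetic data, whether or not the data file itself is published.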