Commit e85aa6a

Updates to cataloguing info (#73)
Updates to cataloguing processes:

* Initial list of standard labels for the catalogue for data and use
* Fleshed out and clarified the process for contributing to the catalogue, including a maturity model for datasets on the catalogue
* Other changes, such as to the FAQs, to create overarching cohesion with the other changes
1 parent 7b8c14d commit e85aa6a

8 files changed: +214 -32 lines changed

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ __temp__*
 env/*
 build/*
 .vscode
-
+
 # Quarto components
 /.quarto/
 /_site/

docs/about.qmd

Lines changed: 7 additions & 0 deletions
@@ -1,5 +1,7 @@
 ---
 title: "About this project"
+date-modified: last-modified
+date-format: YYYY-MM-DD
 format:
   html:
     toc: true
@@ -65,6 +67,11 @@ The following are current or historic members of Workstream 5 of the UN Task Tea

 - Role: Contributor (2024 - current)

+## Caroline White
+
+- Role: Contributor (2024 - current)
+- GitHub id: [carolinewsc](https://github.com/carolinewsc)
+
 ## Jens Mehrhoff

 - Role: Contributor (2024 - current)

docs/catalogue/about.qmd

Lines changed: 42 additions & 11 deletions
@@ -1,37 +1,68 @@
 ---
-title: "Understand how it works"
+title: "How does the catalogue work?"
 date: 2025-04-28
+date-modified: last-modified
 date-format: YYYY-MM-DD
 format:
   html:
     toc: true
     toc-expand: 2
 ---

-Researchers often struggle to use open data. They may find it hard to find open datasets for their research projects. If they find open datasets, they may struggle to easily understand them in order to judge that the datasets easily work for their research. Finally they may find it challenging to know how to work with or even how to cite the dataset.
+Researchers often struggle to use open data. They may find it hard to find open datasets for their research projects. If they find open datasets, they may struggle to understand them well enough to confirm which works best for their research. Finally, they may find it challenging to know how to work with or even how to cite the dataset.

-To help find, assess, and understand how to access open datasets, this project has developed a basic discipline specific catalogue.
+The Price Statistics Open Data Catalogue helps researchers find, assess, and understand how to access open datasets.

 ## How the Price Statistics Open Data Catalogue works in a nutshell

-The idea is simple. The catalogue lists open datasets relevant to the discipline and is searchable according to standard data types (such as scanner data). It also provides basic information about each dataset that will allow researchers to assess its relevance and know how to access it. Visually this is shown in @fig-catalogue.
+The idea is simple. The catalogue lists open datasets relevant to the discipline and is searchable according to standard tags (more on tags below). It also provides basic information about each dataset that allows researchers to assess its relevance and know how to access it. Visually this is shown in @fig-catalogue.

 ![Basic idea of how we see the data catalogue working.](/docs/images/data-catalogue-idea.svg){#fig-catalogue}

-## Where is the data catalogue?
+## What are the tags used in the catalogue?

-The [data catalogue can be found here](https://un-task-team-for-scanner-data.github.io/price-stats-data-catalogue/). It is basically a simple static site hosted on GitHub.
+The catalogue uses the following tag structure to help categorize the datasets.

-## How to register an open dataset on the catalogue?
+::: callout-note
+## Tags may change
+
+Tags will evolve and change over time, especially as new datasets are registered.
+:::
+
+### Data type
+
+The first set of tags categorizes the dataset by the common data types used in price statistics.
+
+- `scanner`
+- `web-scraped`
+- `administrative`
+- `field-or-sample`
+
+### Dataset Topic
+
+The second set of tags focuses on the uses of the data within price statistics. This relates most closely to the category of the area being measured, as this implies the use of specific features in the data and a specific set of methods.

-The [CONTRIBUTING guidance in the catalogue](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/blob/main/CONTRIBUTING.md) summarizes it all. Have a look and help register a dataset or recommend one be registered!
+- `electronics-and-applicances`
+- `housing`
+- `groceries-and-food`
+- `fuels`
+
+::: callout-note
+## Labels to be expanded over time
+
+As new datasets are catalogued, this list will be expanded.
+:::
+
+## Where is the standalone data catalogue?
+
+The standalone [data catalogue can be found here](https://un-task-team-for-scanner-data.github.io/price-stats-data-catalogue/). It is a static site hosted on a separate GitHub repository.

 ## What does the catalogue not do?

-As the catalogue does not store the dataset itself but simply describes it in detail. In other words, this catalogue is [not a data repository](https://book.the-turing-way.org/reproducible-research/rdm/rdm-repository#rr-rdm-repository-select).
+The catalogue does not store the dataset itself or make decisions on key aspects of the dataset; it simply describes the dataset in detail. In other words, this catalogue is [not a data repository](https://book.the-turing-way.org/reproducible-research/rdm/rdm-repository#rr-rdm-repository-select). Catalogue records point to wherever the dataset lives. If more than one version is available, the version that is easiest for researchers to use is referenced.

 ::: callout-note
 ## This is an interim catalogue only!

-This will very likely not be the long-term stable data catalogue used in the discpline. The idea, however, it to start with this interim (and very simple open-source) catalogue, while the project investigates a more viable longer term solution.
-:::
+This will likely not be the long-term stable data catalogue used in the discipline. The idea, however, is to start with this interim (and very simple open-source) catalogue while the project investigates a more viable longer-term solution.
+:::
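
The two tag sets introduced in this file amount to a small controlled vocabulary. As a minimal sketch (not part of the commit), the Python below shows how a record's tags could be checked against that vocabulary and how records could be filtered by tag. The function names and the shape of `records` are hypothetical; the tag strings are copied as listed in the diff, including their spelling.

```python
# Tag vocabularies copied from the lists in the diff above (spelling kept as listed).
DATA_TYPE_TAGS = frozenset({"scanner", "web-scraped", "administrative", "field-or-sample"})
TOPIC_TAGS = frozenset({"electronics-and-applicances", "housing", "groceries-and-food", "fuels"})


def unknown_tags(tags):
    """Return, sorted, any tags that are not in either vocabulary (hypothetical helper)."""
    known = DATA_TYPE_TAGS | TOPIC_TAGS
    return sorted(set(tags) - known)


def datasets_with_tag(records, tag):
    """Return the names of records carrying `tag`.

    `records` is an assumed mapping of dataset name -> list of tags;
    the catalogue itself does not define this structure.
    """
    return [name for name, tags in records.items() if tag in tags]
```

Because the callout notes that tags will change over time, a check like this would need its vocabularies kept in step with the catalogue.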

docs/catalogue/catalogue.qmd

Lines changed: 6 additions & 1 deletion
@@ -2,7 +2,12 @@
 title: "Browse the catalogue"
 ---

+::: callout-tip
+## The catalogue is actually a standalone site
+
+This page embeds the catalogue for simplicity. If you wish to browse the catalogue directly, outside this site, [check out its site here](https://un-task-team-for-scanner-data.github.io/price-stats-data-catalogue/).
+:::

 ```{=html}
 <iframe width="100%" height="100%" src="https://un-task-team-for-scanner-data.github.io/price-stats-data-catalogue/" title="Price Statistics Data Catalogue"></iframe>
-```
+```

docs/catalogue/contributing.qmd

Lines changed: 59 additions & 17 deletions
@@ -1,6 +1,7 @@
 ---
-title: "How to contribute"
-date: 2025-XX-XX
+title: "How can you contribute?"
+date: 2025-04-08
+date-modified: last-modified
 date-format: YYYY-MM-DD
 draft: true
 format:
@@ -9,44 +10,85 @@ format:
     toc-expand: 2
 ---

-If you have a dataset that you would like to register on the catalogue, the following process outlines how to do this.
+If you have a dataset that you would like to register on the catalogue, the following process outlines how to do this. @fig-process-flow outlines this at a high level, with details on each step below.
+
+![High level process flow to register a dataset](/docs/images/Registering-dataset-process-flow.drawio.svg){#fig-process-flow fig-align="center" width="100%"}

 # Before you start

 ## Requirements to contribute to the data catalogue

 In order to contribute to the catalogue, the following criteria must be met:

-- **The dataset must be publicly available for researchers**. There are proprietary datasets that could in theory also be listed, however until the price statistics reproducibility project figures out the process for this, we request that only fully open datasets are registered. We woulds till love to hear from you if you have a valuable proprietary dataset that should be registered, however we will not register it until we flush out this process.
-- T**he dataset must be related to the price statistics discipline**. Price statisticians most typically track change in prices, such as through price index methods - thus the dataset should support this use case. Other use cases, such as for machine learning applications when it comes to classification, can also be submitted, but should be as close to the needs of the discipline as possible.
-- **Be of value to the discipline**. Many data catalogues that are too loose with the registration process become filled with many datasets of incremental value. As a result, users start to struggle to find highly valuable datasets among the smaller and incremental ones, which eventually causes the catalogue to be unused and thus of little value. To avoid this, the value of the dataset to researchers in the discipline should be clearly and sufficiently outlined.
-- **The contributor must document the dataset in full when the dataset is to be registered**. Having partially documented datasets on the catalogue will take away from user experience and will thus takeaway from the push to be open.
-- **The dataset should be real, although some artificial datasets are possible if they are of value to reproducibility.** We recommend that synthetic not be registered in the dataset if it can be avoided but instead the code that generated it be made publicly available as part of that research projects' [research compendium](https://un-task-team-for-scanner-data.github.io/reproducibility-project/docs/reproducibility-guidance/intro.html).
+- **The dataset should be publicly available for researchers**. There are proprietary datasets that could in theory also be listed; however, until the price statistics reproducibility project figures out the process for this, we request that only fully open datasets are registered. We encourage requests on valuable proprietary datasets, but these will not be catalogued until the process is fleshed out.
+- **The dataset must be related to the price statistics discipline**. Price statisticians most typically track change in prices, such as through price index methods, so the dataset should support this use case. Other use cases, such as machine learning applications for classification, can also be submitted, but should be as close to the needs of the discipline as possible.
+- **Be of value to the discipline**. Many data catalogues that are too lax with the cataloguing process become filled with many datasets of incremental value. As a result, users struggle to find highly valuable datasets, which eventually causes a drop-off in use of the catalogue. To avoid this, the value of the dataset to researchers in the discipline should be clear. The reproducibility team will review proposed submissions for approval at each of its meetings.
+- **The contributor must document the dataset in full when the dataset is to be registered**. Having partially documented datasets on the catalogue will take away from user experience and thus from the push to be open. A [Maturity model of registered datasets](#maturity-model) is provided below to showcase how to document a dataset.
+- **The dataset should be real, although artificial and modified datasets are accepted if they are of value to reproducibility.** Specifically, synthetic datasets generated as part of a research process may not need to be registered if they can be reproduced through code published with the research, in which case we recommend that the code that generated them be made publicly available as part of that research project's [research compendium](https://un-task-team-for-scanner-data.github.io/reproducibility-project/docs/reproducibility-guidance/intro.html).
 - Some synthetic/artificial datasets may be proposed, such as if the artificial dataset has become of high value and is used everywhere **as if** it's a real dataset (such as the Turvey dataset).
-- **The dataset is already published somewhere easy to download and in a machine readable format**. Make sure that users can download and use the dataset easily. Several options for hosting a dataset are possible.
+- **The dataset is already published somewhere easy to download and in a machine readable format**. Make sure that users can download and use the dataset easily. Several options for hosting a dataset are possible (see next section on where to host datasets).

-## Where to host the dataset
+## Where to host the dataset itself

 As the price statistics data catalogue describes datasets already hosted elsewhere (in other words it is not a data repository), the first step is to host the dataset somewhere. This is fundamentally up to the researcher and the institution they work at. Ideally a data repository is used that mints a Digital Object Identifier (DOI) so that the dataset can be easily cited and found. Data repositories often allow a researcher to host private datasets and lock down access but still create a DOI. Regardless of where, the DOI should be ready when the dataset is registered in the catalogue so that researchers can find the dataset itself and know how to cite the dataset.

 Read more about [data repositories in the Turing Way](https://book.the-turing-way.org/reproducible-research/rdm/rdm-repository#rr-rdm-repository-select).

 # How to register a dataset on the catalogue?

-The catalogue uses the open-source python package [`datacontract-cli`](https://cli.datacontract.com/) to render the simple to use static site. Each dataset is stored as a `yaml` file following the structure of this package is saved in the `/datasets/` directory of this GitHub repository. Contributors have the following options:
+Once the above aspects are considered, a dataset can be registered to the catalogue.
+
+::: callout-tip
+## How is the catalogue rendered?
+
+The catalogue uses the open-source Python package [`datacontract-cli`](https://cli.datacontract.com/) to render each dataset record as a static HTML site from a structured `yaml` file that stores the relevant information. Basically, if there is a `dataset.yaml` in [the `/datasets/` folder](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/tree/main/datasets) of the data catalogue repository, [the GitHub workflow](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/blob/main/.github/workflows/deploy.yml) renders it as an HTML page alongside the other datasets.
+
+Read more about the configuration of the `yaml` file and the open standard it is based on [here](https://datacontract.com/).
+:::

-### Request to add a new dataset yourself (or modify a dataset that exists):
+There are two ways to request that a dataset be registered to the catalogue:

-The best way is to directly propose a dataset and start the process of registering it (such as by fully describing the dataset).
+### Option 1: Document the dataset yourself
+
+The best way is to directly propose a dataset and start the process of registering it (such as by fully describing the dataset):

 1. Fork [the catalogue repo](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue) and mock up a new dataset in the `/datasets/` directory of your fork.
-2. Submit a PR to request that we add it to the catalogue. Please tag it to the [dataset](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/labels/dataset) label.
-3. We will review your request and coordinate with you as appropriate (such as if we need more info) or help further flush out the metadata so that the dataset is well defined before it is published.
+2. Submit a PR from your fork to the main catalogue repository. Please tag the PR with the [dataset](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/labels/dataset) label.
+3. We will review your request and coordinate with you as appropriate (such as if we need more info) or help further flesh out the metadata so that the dataset is well defined before it is published. We may work with you to propose more detailed documentation based on the maturity level (see below).
+- For instance, if you see value in writing a data exploration blog to describe the dataset, we can work with you to do so and make it part of the reproducibility project site.

 When we are ready, we will either merge your PR and thus register your dataset or reject the dataset if it is not appropriate.

-### Request that we add a new dataset:
+### Option 2: Request that we consider a dataset

 You could also just give us a ping to let us know that a good dataset exists and we can review and register it when we get a chance.

-1. Create [a new issue](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/issues/new) and describe the dataset you wish that we register. Include the relevant details that we can use to find out more about the dataset. Please also tag the issue with the [dataset](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/labels/dataset) label.
+1. Create [a new issue](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/issues/new) and describe the dataset you wish that we register. Include the relevant details that we can use to find out more about the dataset. Please also tag the issue with the [dataset](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/labels/dataset) label.
+
+# Maturity model of registered datasets {#maturity-model}
+
+As not all datasets can be documented to the same level, we've broken up the structure into a maturity model to aspire to for each dataset record, starting from the benchmark to the top 'gold standard' level.
+
+## Level 1
+
+This level sets the bare minimum a dataset needs to have to be registered to the catalogue.
+
+- [ ] The dataset has a basic description to introduce it to any users in the discipline
+- [ ] The data model (structure of the data and each variable) is documented
+- [ ] The dataset is openly available and referenced, and its data file format is machine readable (for instance, a proprietary format like `.xlsx` or a language-specific format like `.Rdata` is fine; a PDF is not).
+- [ ] The license (or at minimum terms of the use of the dataset) is listed so that it is clear how the dataset can be used and how it cannot.
+- [ ] Information on how to cite the dataset is available.
+
+## Level 2
+
+Level 2 implies a higher level of maturity to simplify the process for data
+
+- [ ] The dataset is stored in an open file format
+- [ ] Dataset quality considerations and detailed nuances of the data are discussed. This is best done in a specific quality section for each variable or for the table/dataset as a whole.
+
+## Level 3
+
+This implies a 'gold standard' for a dataset.
+
+- [ ] The dataset is made available in a data repository (such as Zenodo) that mints a DOI. This DOI is listed as part of the dataset.
+- [ ] A data paper detailing the dataset is available and linked. Alternatively, a blog can also be written on the project site to introduce the dataset with the dataset owner.
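
The maturity model added above reads as a tiered checklist. As a rough sketch (not part of the commit), the Python below shows one way a reviewer could work out which level a record reaches. It assumes the levels are cumulative, which the text implies but does not state outright, and the short criterion keys are hypothetical labels for the checklist items rather than identifiers defined by the catalogue.

```python
from typing import Iterable

# Hypothetical short keys standing in for the checklist items of each level;
# the catalogue does not define these identifiers.
LEVEL_CRITERIA: dict[int, frozenset[str]] = {
    1: frozenset({"description", "data_model", "machine_readable", "license", "citation"}),
    2: frozenset({"open_file_format", "quality_notes"}),
    3: frozenset({"doi", "data_paper_or_blog"}),
}


def maturity_level(met: Iterable[str]) -> int:
    """Return the highest level whose criteria, and all lower levels' criteria, are met.

    Returns 0 when the Level 1 minimum is not satisfied. Treating the levels
    as cumulative is an assumption of this sketch.
    """
    met_set = set(met)
    level = 0
    for candidate in sorted(LEVEL_CRITERIA):
        if not LEVEL_CRITERIA[candidate] <= met_set:
            break  # a gap at this level caps the result at the level below
        level = candidate
    return level
```

Under this reading, a record that has a DOI but lacks the Level 2 items is still reported as Level 1, which is one possible interpretation of "aspire to" rather than a rule taken from the source.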
