Updates to cataloguing processes:
* Initial list of standard labels for the catalogue, covering data type and use
* Fleshed out and clarified the process for contributing to the catalogue, including a maturity model for datasets on the catalogue.
* Other changes, such as to the FAQs, to keep the documentation cohesive with the changes above
Researchers often struggle to use open data. They may find it hard to find open datasets for their research projects. If they find open datasets, they may struggle to easily understand them in order to judge that the datasets easily work for their research. Finally they may find it challenging to know how to work with or even how to cite the dataset.
12
+
Researchers often struggle to use open data. They may find it hard to find open datasets for their research projects. If they do find open datasets, they may struggle to understand them well enough to confirm which works best for their research. Finally, they may find it challenging to know how to work with, or even how to cite, the dataset.
12
13
13
-
To help find, assess, and understand how to access open datasets, this project has developed a basic discipline specific catalogue.
14
+
The Price Statistics Open Data Catalogue helps researchers find, assess, and understand how to access open datasets.
14
15
15
16
## How the Price Statistics Open Data Catalogue works in a nutshell
16
17
17
-
The idea is simple. The catalogue lists open datasets relevant to the discipline and is searchable according to standard data types (such as scanner data). It also provides basic information about each dataset that will allow researchers to assess its relevance and know how to access it. Visually this is shown in @fig-catalogue.
18
+
The idea is simple. The catalogue lists open datasets relevant to the discipline and is searchable according to standard tags (more on tags below). It also provides basic information about each dataset that allows researchers to assess its relevance and know how to access it. Visually this is shown in @fig-catalogue.
18
19
19
20
{#fig-catalogue}
20
21
21
-
## Where is the data catalogue?
22
+
## What are the tags used in the catalogue?
22
23
23
-
The [data catalogue can be found here](https://un-task-team-for-scanner-data.github.io/price-stats-data-catalogue/). It is basically a simple static site hosted on GitHub.
24
+
The catalogue uses the following tag structure to help categorize the datasets.
24
25
25
-
## How to register an open dataset on the catalogue?
26
+
::: callout-note
27
+
## Tags may change
28
+
29
+
Tags will evolve and change over time, especially as new datasets are registered.
30
+
:::
31
+
32
+
### Data type
33
+
34
+
The first set of tags categorizes the dataset by the common data types used in price statistics.
35
+
36
+
- `scanner`
37
+
- `web-scraped`
38
+
- `administrative`
39
+
- `field-or-sample`
40
+
41
+
### Dataset Topic
42
+
43
+
The second set of tags focuses on how the data is used within price statistics. This relates most closely to the category of the area being measured, as each category implies the use of specific features in the data and a specific set of methods.
26
44
27
-
The [CONTRIBUTING guidance in the catalogue](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/blob/main/CONTRIBUTING.md) summarizes it all. Have a look and help register a dataset or recommend one be registered!
45
+
- `electronics-and-applicances`
46
+
- `housing`
47
+
- `groceries-and-food`
48
+
- `fuels`
49
+
50
+
::: callout-note
51
+
## Labels to be expanded over time
52
+
53
+
As new datasets are catalogued, this list will be expanded.
54
+
:::
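
To illustrate how the two tag sets can be combined when searching, here is a minimal Python sketch. The record structure (a `name` plus a list of `tags`) and the dataset names are hypothetical and are not taken from the catalogue itself; only the tag values come from the lists above.

```python
# Hypothetical records: the names and the dict structure are illustrative only.
RECORDS = [
    {"name": "example-grocery-scanner", "tags": ["scanner", "groceries-and-food"]},
    {"name": "example-rental-listings", "tags": ["web-scraped", "housing"]},
    {"name": "example-fuel-survey", "tags": ["field-or-sample", "fuels"]},
]


def filter_by_tags(records, required_tags):
    """Return the records that carry every tag in required_tags."""
    required = set(required_tags)
    return [r for r in records if required <= set(r["tags"])]


# Combine one data type tag with one topic tag to narrow the search.
matches = filter_by_tags(RECORDS, ["scanner", "groceries-and-food"])
print([r["name"] for r in matches])
```

Requiring every tag (rather than any tag) mirrors the idea of narrowing first by data type and then by topic.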
55
+
56
+
## Where is the standalone data catalogue?
57
+
58
+
The standalone [data catalogue can be found here](https://un-task-team-for-scanner-data.github.io/price-stats-data-catalogue/). It is a static site hosted on a separate GitHub repository.
28
59
29
60
## What does the catalogue not do?
30
61
31
-
As the catalogue does not store the dataset itself but simply describes it in detail. In other words, this catalogue is [not a data repository](https://book.the-turing-way.org/reproducible-research/rdm/rdm-repository#rr-rdm-repository-select).
62
+
The catalogue does not store the dataset itself or make decisions on key aspects of the dataset; it simply describes the dataset in detail. In other words, this catalogue is [not a data repository](https://book.the-turing-way.org/reproducible-research/rdm/rdm-repository#rr-rdm-repository-select). Catalogue records point to wherever the dataset lives. If more than one version is available, the record references the version that is easiest for researchers to use.
32
63
33
64
::: callout-note
34
65
## This is an interim catalogue only!
35
66
36
-
This will very likely not be the long-term stable data catalogue used in the discpline. The idea, however, it to start with this interim (and very simple open-source) catalogue, while the project investigates a more viable longer term solution.
37
-
:::
67
+
This will likely not be the long-term stable data catalogue used in the discipline. The idea, however, is to start with this interim (and very simple open-source) catalogue while the project investigates a more viable longer-term solution.
File: docs/catalogue/catalogue.qmd (6 additions & 1 deletion)
@@ -2,7 +2,12 @@
2
2
title: "Browse the catalogue"
3
3
---
4
4
5
+
::: callout-tip
6
+
## The catalogue is actually a standalone site
7
+
8
+
This page embeds the catalogue for simplicity. If you wish to browse the catalogue directly outside this site, [check out its site here](https://un-task-team-for-scanner-data.github.io/price-stats-data-catalogue/).
9
+
:::
5
10
6
11
```{=html}
7
12
<iframe width="100%" height="100%" src="https://un-task-team-for-scanner-data.github.io/price-stats-data-catalogue/" title="Price Statistics Data Catalogue"></iframe>
If you have a dataset that you would like to register on the catalogue, the following process outlines how to do this.
13
+
If you have a dataset that you would like to register on the catalogue, follow the process described here. @fig-process-flow gives a high-level overview, with details on each step below.
14
+
15
+
{#fig-process-flow fig-align="center" width="100%"}
13
16
14
17
# Before you start
15
18
16
19
## Requirements to contribute to the data catalogue
17
20
18
21
In order to contribute to the catalogue, the following criteria must be met:
19
22
20
-
-**The dataset must be publicly available for researchers**. There are proprietary datasets that could in theory also be listed, however until the price statistics reproducibility project figures out the process for this, we request that only fully open datasets are registered. We woulds till love to hear from you if you have a valuable proprietary dataset that should be registered, however we will not register it until we flush out this process.
21
-
-T**he dataset must be related to the price statistics discipline**. Price statisticians most typically track change in prices, such as through price index methods - thus the dataset should support this use case. Other use cases, such as for machine learning applications when it comes to classification, can also be submitted, but should be as close to the needs of the discipline as possible.
22
-
-**Be of value to the discipline**. Many data catalogues that are too loose with the registration process become filled with many datasets of incremental value. As a result, users start to struggle to find highly valuable datasets among the smaller and incremental ones, which eventually causes the catalogue to be unused and thus of little value. To avoid this, the value of the dataset to researchers in the discipline should be clearly and sufficiently outlined.
23
-
-**The contributor must document the dataset in full when the dataset is to be registered**. Having partially documented datasets on the catalogue will take away from user experience and will thus takeaway from the push to be open.
24
-
-**The dataset should be real, although some artificial datasets are possible if they are of value to reproducibility.**We recommend that synthetic not be registered in the dataset if it can be avoided but instead the code that generated it be made publicly available as part of that research projects' [research compendium](https://un-task-team-for-scanner-data.github.io/reproducibility-project/docs/reproducibility-guidance/intro.html).
23
+
- **The dataset should be publicly available for researchers**. Proprietary datasets could in theory also be listed; however, until the price statistics reproducibility project works out the process for this, we request that only fully open datasets are registered. We encourage requests about valuable proprietary datasets, but these will not be catalogued until the process is fleshed out.
24
+
- **The dataset must be related to the price statistics discipline**. Price statisticians most typically track changes in prices, such as through price index methods, so the dataset should support this use case. Other use cases, such as machine learning applications for classification, can also be submitted, but should be as close to the needs of the discipline as possible.
25
+
- **Be of value to the discipline**. Data catalogues that are too lax with the cataloguing process fill up with datasets of only incremental value. As a result, users struggle to find highly valuable datasets, which eventually causes a drop-off in use of the catalogue. To avoid this, the value of the dataset to researchers in the discipline should be clear. The reproducibility team will review each proposed submission for approval at its meetings.
26
+
- **The contributor must document the dataset in full when the dataset is to be registered**. Partially documented datasets on the catalogue detract from the user experience and thus from the push to be open. The [maturity model of registered datasets](#maturity-model) below shows how to document a dataset.
27
+
- **The dataset should be real, although artificial and modified datasets are accepted if they are of value to reproducibility.** Specifically, synthetic datasets generated as part of a research process may not need to be registered if they can be reproduced through code published with the research. In that case, we recommend that the code that generated the dataset be made publicly available as part of that research project's [research compendium](https://un-task-team-for-scanner-data.github.io/reproducibility-project/docs/reproducibility-guidance/intro.html).
25
28
    - Some synthetic or artificial datasets may still be proposed, for instance if the dataset has become highly valuable and is widely used **as if** it were a real dataset (such as the Turvey dataset).
26
-
-**The dataset is already published somewhere easy to download and in a machine readable format**. Make sure that users can download and use the dataset easily. Several options for hosting a dataset are possible.
29
+
- **The dataset is already published somewhere easy to download and in a machine-readable format**. Make sure that users can download and use the dataset easily. Several options for hosting a dataset are possible (see the next section on where to host datasets).
27
30
28
-
## Where to host the dataset
31
+
## Where to host the dataset itself
29
32
30
33
As the price statistics data catalogue describes datasets already hosted elsewhere (in other words, it is not a data repository), the first step is to host the dataset somewhere. This is fundamentally up to the researcher and the institution they work at. Ideally, a data repository is used that mints a Digital Object Identifier (DOI) so that the dataset can be easily cited and found. Data repositories often allow a researcher to host private datasets and lock down access but still create a DOI. Regardless of where the dataset is hosted, the DOI should be ready when the dataset is registered in the catalogue so that researchers can find the dataset itself and know how to cite it.
31
34
32
35
Read more about [data repositories in the Turing Way](https://book.the-turing-way.org/reproducible-research/rdm/rdm-repository#rr-rdm-repository-select).
33
36
34
37
# How to register a dataset on the catalogue?
35
38
36
-
The catalogue uses the open-source python package [`datacontract-cli`](https://cli.datacontract.com/) to render the simple to use static site. Each dataset is stored as a `yaml` file following the structure of this package is saved in the `/datasets/` directory of this GitHub repository. Contributors have the following options:
39
+
Once the above aspects are considered, a dataset can be registered to the catalogue.
40
+
41
+
::: callout-tip
42
+
## How is the catalogue rendered?
43
+
44
+
The catalogue uses the open-source Python package [`datacontract-cli`](https://cli.datacontract.com/) to render the dataset record as a static HTML site from a structured `yaml` file that stores the relevant information. Basically, if there is a `dataset.yaml` in [the `/datasets/` folder](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/tree/main/datasets) of the data catalogue repository, [the GitHub workflow](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/blob/main/.github/workflows/deploy.yml) renders it as an HTML page alongside the other datasets.
45
+
46
+
Read more about the configuration of the `yaml` file and the open standard it is based on [here](https://datacontract.com/).
47
+
:::
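
As a rough illustration of the idea in the callout above, the following Python sketch lists the `yaml` files in a local `datasets/` folder, which is one way to check what a fork contains before opening a PR. It is only a sketch: the function name is hypothetical, and it assumes records use the `.yaml` extension and sit directly in the folder. The actual GitHub workflow may select files differently.

```python
from pathlib import Path


def list_dataset_records(datasets_dir):
    """Return the sorted .yaml files found directly inside datasets_dir."""
    return sorted(Path(datasets_dir).glob("*.yaml"))


if __name__ == "__main__":
    # Assumes this is run from the root of a local clone of the catalogue repository.
    for record in list_dataset_records("datasets"):
        print(record.name)
```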
37
48
38
-
### Request to add a new dataset yourself (or modify a dataset that exists):
49
+
There are two ways to request that a dataset be registered to the catalogue.
39
50
40
-
The best way is to directly propose a dataset and start the process of registering it (such as by fully describing the dataset).
51
+
### Option 1: Document the dataset yourself
52
+
53
+
The best way is to directly propose a dataset and start the process of registering it (such as by fully describing the dataset):
41
54
42
55
1. Fork [the catalogue repo](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue) and mock up a new dataset in the `/datasets/` directory of your fork.
43
-
2. Submit a PR to request that we add it to the catalogue. Please tag it to the [dataset](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/labels/dataset) label.
44
-
3. We will review your request and coordinate with you as appropriate (such as if we need more info) or help further flush out the metadata so that the dataset is well defined before it is published.
56
+
2. Submit a pull request (PR) from your fork to the main catalogue repository. Please tag the PR with the [dataset](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/labels/dataset) label.
57
+
3. We will review your request and coordinate with you as appropriate (such as if we need more info) or help further flesh out the metadata so that the dataset is well defined before it is published. We may work with you to propose more detailed documentation based on the maturity level (see below).
58
+
    - For instance, if you see value in writing a data exploration blog post to describe the dataset, we can work with you to do so and make it part of the reproducibility project site.
45
59
46
60
When we are ready, we will either merge your PR and thus register your dataset or reject the dataset if it is not appropriate.
47
61
48
-
### Request that we add a new dataset:
62
+
### Option 2: Request that we consider a dataset
49
63
50
64
You could also just give us a ping to let us know that a good dataset exists and we can review and register it when we get a chance.
51
65
52
-
1. Create [a new issue](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/issues/new) and describe the dataset you wish that we register. Include the relevant details that we can use to find out more about the dataset. Please also tag the issue with the [dataset](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/labels/dataset) label.
66
+
1. Create [a new issue](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/issues/new) and describe the dataset you wish that we register. Include the relevant details that we can use to find out more about the dataset. Please also tag the issue with the [dataset](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue/labels/dataset) label.
67
+
68
+
# Maturity model of registered datasets {#maturity-model}
69
+
70
+
As not all datasets can be documented to the same level, we have structured the documentation expectations for each dataset record as a maturity model to aspire to, ranging from a baseline level to the top 'gold standard' level.
71
+
72
+
## Level 1
73
+
74
+
This level sets the bare minimum a dataset needs in order to be registered to the catalogue:
75
+
76
+
- [ ] The dataset has a basic description to introduce it to any users in the discipline.
77
+
- [ ] The data model (structure of the data and each variable) is documented.
78
+
- [ ] The dataset is openly available and is referenced in a machine-readable file format (for instance, a proprietary format like `.xlsx` or a language-specific format like `.Rdata` is fine; a PDF is not).
79
+
- [ ] The license (or at minimum the terms of use of the dataset) is listed so that it is clear how the dataset can and cannot be used.
80
+
- [ ] Information on how to cite the dataset is available.
81
+
82
+
## Level 2
83
+
84
+
Level 2 implies a higher level of maturity to simplify the process for data
85
+
86
+
- [ ] The dataset is stored in an open file format.
87
+
- [ ] Dataset quality considerations and detailed nuances of the data are discussed. This is best done in a specific quality section for each variable or for the table/dataset as a whole.
88
+
89
+
## Level 3
90
+
91
+
This level represents the 'gold standard' for a dataset:
92
+
93
+
- [ ] The dataset is made available in a data repository (such as Zenodo) that mints a DOI. This DOI is listed as part of the dataset.
94
+
- [ ] A data paper detailing the dataset is available and linked. Alternatively, a blog post introducing the dataset can be written with the dataset owner on the project site.
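
The checklists above can be read as a cumulative ladder, which the following Python sketch makes concrete. This is an illustration only: the criterion names are hypothetical shorthand for the checklist items, and treating the levels as cumulative (Level 2 requires everything in Level 1, and so on) is an assumption rather than something the catalogue tooling enforces.

```python
# Hypothetical shorthand names for the checklist items at each level.
LEVEL_CRITERIA = {
    1: [
        "has_description",
        "data_model_documented",
        "open_and_machine_readable",
        "license_listed",
        "citation_available",
    ],
    2: ["open_file_format", "quality_discussed"],
    3: ["doi_minted", "data_paper_or_blog"],
}


def maturity_level(record):
    """Return the highest level whose criteria, and all lower ones, are met.

    `record` maps criterion names to booleans; missing names count as unmet.
    Returns 0 when the Level 1 minimum is not satisfied.
    """
    level = 0
    for candidate in sorted(LEVEL_CRITERIA):
        if all(record.get(name, False) for name in LEVEL_CRITERIA[candidate]):
            level = candidate
        else:
            break  # Assumed cumulative: a gap at this level blocks higher ones.
    return level


# A record that meets Level 1 and only part of Level 2 stays at Level 1.
example = {name: True for name in LEVEL_CRITERIA[1]}
example["open_file_format"] = True
print(maturity_level(example))
```

A result of 0 would indicate that the record does not yet meet the minimum for registration.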