Merge pull request #4 from UN-Task-Team-for-Scanner-Data/outline-project-structure-and-purpose

sergegoussev · web-flow · commit 50994c9ebe57 · 2025-02-11T21:43:37.000-05:00
Outline project structure and purpose
diff --git a/README.md b/README.md
@@ -1,2 +1,5 @@
 # reproducibility-project
-Project aiming to simplify the process of reproducibility in price statistics and use of FAIR data for research
+
+Welcome to the project aiming to simplify the process of reproducibility in price statistics and use of FAIR data for research. This project is led by workstream 5 of the [UN Task Team on Scanner data](https://unstats.un.org/bigdata/task-teams/scanner/index.cshtml), part of the [UN Committee of Experts on Big Data and Data Science for Official Statistics (UN-CEBD)](https://unstats.un.org/bigdata/index.cshtml).
+
+To find out more about the project, check out the [project charter](project-charter.md) or track its progress [on our GitHub project](https://github.com/orgs/UN-Task-Team-for-Scanner-Data/projects/1).
diff --git a/project-charter.md b/project-charter.md
@@ -0,0 +1,27 @@
+# Project Charter
+
+## Project overview
+
+Research in the price statistics discipline is not as reproducible as we feel it should be. Most researchers use proprietary datasets (for instance, internal datasets owned by their NSOs as part of their official work, or purchased datasets that require considerable financial investment for others to acquire). Research is also done with whatever software is available to each researcher, in a way that is custom to them, and the code and detailed processes are typically not made easily available as part of the research project. This outcome is far from what researchers intend; it results from how difficult open, reproducible work is within the discipline. Specifically, it is not easy to find or access open datasets that can be used for research purposes. Once data is found, the metadata differs from dataset to dataset, making it challenging to process and use the data in repeatable research processes. Finally, once a researcher has access to the data, it is not clear how code and processing logic should be shared as part of the research project to make the project reproducible. In other words, the process is not Findable, Accessible, Interoperable, or Reusable (FAIR).
+
+## *Raison d'être* of the project
+
+The project aims to simplify this situation in the price statistics discipline by tackling the challenges that researchers face. In other words, the project aims to lower the barrier for reproducibility, making it intuitively easy (with some practice) for researchers to work openly. From an open science point of view, making research more routinely reproducible will help accelerate the pace at which consensus is reached on various topics as [results can quickly become replicable and even generalizable](https://book.the-turing-way.org/reproducible-research/overview/overview-definitions).
+
+## Expected outcomes in a little more detail
+
+As the project tackles the main challenges facing the discipline, it aims to deliver the following:
+
+-   Objective A: Developing an interim data catalogue tool for several open datasets and publishing several open datasets as a proof of concept. The idea is to start helping researchers know how to reference open datasets for better replicability within the discipline.
+    -   A.1. Mock up an interim data catalogue on the UNGP GitLab static site that is easy to read through and understand. The catalogue will be made available [on its own GitHub repo](https://github.com/UN-Task-Team-for-Scanner-Data/price-stats-data-catalogue).
+    -   A.2. Investigate the applicable metadata for publishing/registering a dataset and outline an ingestion/registration process that can be leveraged to onboard datasets to the interim data catalogue.
+    -   A.3. Coordinate the registering of several open datasets that can act as a pilot for the interim data catalogue.
+-   Objective B: Developing a white paper to outline the why, the what, the where, the when (in the research process) and the how. The idea is to create a high-level guide for researchers that could build on the FAIR or reproducibility literature and apply it to our domain. The paper would have several subcomponents that should be investigated and discussed, such as:
+    -   B.1. Investigate processes around data – such as how to reference the data (internal, public, or synthetic), how to incentivize the use of benchmark datasets for specific tasks in the discipline, how to deal with complex use cases (such as confidentiality or privately owned but widely used data).
+    -   B.2. Investigate processes around code and related objects like clear code documentation – such as where to publish code, what should be included in the repository, how to clearly document the process, etc. If applicable, make template repositories or examples available for the community.
+    -   B.3. Investigate administrative topics that are useful for the discipline, such as how to coordinate with the two conferences we attend to make sure that we embed and incentivize the use of the processes we would like to develop.
+    -   B.4. Outline the white paper/guidance for the discipline on reproducibility.
+
+## Project management
+
+To manage the project, a [GitHub project](https://github.com/orgs/UN-Task-Team-for-Scanner-Data/projects/1/views/1) is used for coordination and transparency.