-# quantifying
+# Quantifying
 
-Quantifying the Commons
+Quantifying the Commons: measure the size and diversity of the commons--the
+collection of works that are openly licensed or in the public domain
 
 
 ## Overview
 
-This project seeks to quantify the size and diversity of the commons--the
-collection of works that are openly licensed or in the public domain.
-
-
-### Meaningful
-
-The reports generated by this project (and the data fetched and processed to
-support it) seeks to be meaningful. We hope this project will provide data and
-analysis that helps inform discussions about the commons--the collection of
-works that are openly licensed or in the public domain.
-
-The goal of this project is to help answer questions like:
-- How has the world's use of the commons changed over time?
-- How is the knowledge and culture of the commons distributed?
-- Who has access (and how much) to the commons?
-- What significant trends can be observed in the commons?
-- Which public domain dedication or licenses are the most popular?
-- What are the correlations between public domain dedication or licenses and
-  region, language, domain/endeavor, etc.?
+This project seeks to quantify the size and diversity of the Creative Commons
+legal tools. We aim to track the collection of works (articles, images,
+publications, etc.) that are openly licensed or in the public domain. The
+project automates data collection from multiple data sources, processes the
+data, and generates meaningful reports.
 
 
 ## Code of conduct
@@ -47,6 +34,93 @@ See [`CONTRIBUTING.md`][org-contrib].
 [org-contrib]: https://github.com/creativecommons/.github/blob/main/CONTRIBUTING.md
 
 
+### The three phases of generating a report
+
+1. **Fetch**: This phase involves collecting data from a particular source
+   using its API. Before writing any code, we plan the analyses we want to
+   perform by asking meaningful questions about the data. We also consider
+   API limitations (such as query limits) and design a query strategy to work
+   within these limitations. Then we write a Python script that gets the
+   data. It is important to follow the format of the existing scripts in the
+   project and to use the shared modules and functions where applicable. This
+   keeps the scripts consistent and makes it easier to debug issues that
+   might arise.
+   - **Meaningful questions**
+     - The reports generated by this project (and the data fetched and
+       processed to support it) seek to be meaningful. We hope this project
+       will provide data and analysis that help inform discussions about the
+       commons. The goal of this project is to help answer questions like:
+       - How has the world's use of the commons changed over time?
+       - How is the knowledge and culture of the commons distributed?
+       - Who has access (and how much) to the commons?
+       - What significant trends can be observed in the commons?
+       - Which public domain dedication or licenses are the most popular?
+       - What are the correlations between public domain dedication or
+         licenses and region, language, domain/endeavor, etc.?
+   - **Limitations of an API**
+     - Some data sources impose API query limits (daily or hourly, as
+       specified in their documentation). These limits restrict how many
+       requests can be made in a given period of time, so it is important to
+       plan a query strategy and schedule fetch jobs to stay within the
+       allowed limits.
+   - **Headings of data in 1-fetch**
+     - [Tool identifier][tool-identifier]: A unique identifier used to
+       distinguish each Creative Commons legal tool within the dataset. This
+       helps ensure consistency when tracking tools across different data
+       sources.
+     - [SPDX identifier][spdx-identifier]: A standardized identifier
+       maintained by the Software Package Data Exchange (SPDX) project. It
+       provides a consistent way to reference licenses in applications.
+2. **Process**: In this phase, the fetched data is transformed into a
+   structured and standardized format for analysis. The data is then analyzed
+   and categorized based on defined criteria to extract insights that answer
+   the meaningful questions identified during the 1-fetch phase.
+3. **Report**: This phase focuses on presenting the results of the analysis.
+   We generate graphs and summaries that clearly show trends, patterns, and
+   distributions in the data. These reports help communicate key insights
+   about the size, diversity, and characteristics of openly licensed and
+   public domain works.
+
+[tool-identifier]: https://creativecommons.org/share-your-work/cclicenses/
+[spdx-identifier]: https://spdx.org/licenses/
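As a rough illustration of planning around a query limit, the batching logic could be sketched as follows. This is a minimal, hypothetical example (the function name and the numeric limits are assumptions for illustration, not part of the project's actual scripts):

```python
import math


def plan_fetch_schedule(total_queries: int, daily_limit: int) -> list[int]:
    """Split a fetch job into per-day batches that stay within a data
    source's daily query limit (the real limit comes from the source's
    API documentation)."""
    if daily_limit <= 0:
        raise ValueError("daily_limit must be positive")
    days = math.ceil(total_queries / daily_limit)
    # Distribute the queries as evenly as possible across the days
    base, extra = divmod(total_queries, days)
    return [base + (1 if i < extra else 0) for i in range(days)]


# Example: 1,000 queries against a hypothetical limit of 300 per day
print(plan_fetch_schedule(1000, 300))  # → [250, 250, 250, 250]
```

Spreading the batches evenly, rather than front-loading them to the limit, leaves headroom for retries when individual requests fail.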
+
+
+### Automation phases
+
+To automate these phases, the project uses Python scripts to fetch, process,
+and report data. GitHub Actions is used to automatically run these scripts on
+a defined schedule and on code updates. It handles script execution, manages
+dependencies, and ensures the workflow runs consistently.
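A quarterly schedule like the one described below can be expressed with a GitHub Actions cron trigger. The sketch here is only illustrative: the workflow name, script path, and Python version are assumptions, not the project's actual configuration.

```yaml
# Hypothetical workflow sketch: run a fetch script on the first day of
# each quarter (cron times are UTC).
name: quarterly-fetch
on:
  schedule:
    - cron: "0 0 1 1,4,7,10 *"  # 1st of Jan, Apr, Jul, Oct
  workflow_dispatch: {}          # also allow manual runs
jobs:
  fetch:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python scripts/1-fetch/example_fetch.py  # illustrative path
```

The `workflow_dispatch` trigger is worth keeping alongside the cron schedule so a failed quarterly run can be retried by hand.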
+- **Script assumptions**
+  - Execution schedule for each quarter:
+    - 1-Fetch: first month and first half of the second month
+    - 2-Process: second half of the second month
+    - 3-Report: third month
+- **Script requirements**
+  - *Must be safe*
+    - Scripts must not make any changes with default options
+    - The easiest way to run a script should also be the safest
+    - Options should be spelled out explicitly
+  - *Must be timely*
+    - Scripts should complete within a maximum of 45 minutes
+    - Scripts shouldn't take longer than 3 minutes with default options
+      - That way there is a quick way to see what is happening when a script
+        is running (observe execution, confirm there are no errors, etc.).
+        Later, in production, it can be run with longer options.
+  - *Must be idempotent*
+    - [Idempotence - Wikipedia](https://en.wikipedia.org/wiki/Idempotence)
+    - This applies to both the data fetched and the data stored. If the data
+      changes randomly, we can't draw meaningful conclusions.
+  - *Balanced use of third-party libraries*
+    - Third-party libraries should be leveraged when they are:
+      - API specific (google-api-python-client, internetarchive, etc.)
+- **File formats**
+  - CSV: the format is well supported (rendered on GitHub, etc.), easy to
+    use, and the data used by the project is simple enough to avoid any
+    shortcomings.
+  - YAML: prioritizes human readability, which addresses the primary costs
+    and risks associated with configuration files.
+
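Idempotence in the storage step might look like the following sketch. It is a minimal illustration, not the project's actual code; the quarter key and CSV layout are assumptions for the example. Rerunning the script for the same quarter replaces that quarter's row instead of appending a duplicate, so repeated runs produce identical files.

```python
import csv
from pathlib import Path


def write_quarter_count(path: Path, quarter: str, count: int) -> None:
    """Record a per-quarter count idempotently: rerunning with the same
    inputs overwrites the quarter's row rather than adding another one."""
    rows: dict[str, str] = {}
    if path.exists():
        with path.open(newline="") as f:
            rows = {r["quarter"]: r["count"] for r in csv.DictReader(f)}
    rows[quarter] = str(count)  # replace or add this quarter's row
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["quarter", "count"])
        writer.writeheader()
        for q in sorted(rows):  # deterministic row order
            writer.writerow({"quarter": q, "count": rows[q]})
```

Keying rows by quarter and sorting on write keeps the output deterministic, which is what makes quarter-over-quarter comparisons trustworthy.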
+
 
 ### Project structure
 
 Please note that in the directory tree below, all instances of `fetch`,
@@ -91,8 +165,7 @@ Quantifying/
 ```
 
 
-## Development
-
+## How to set up
 
 ### Prerequisites
 