-# quantifying
+# Quantifying
 
-Quantifying the Commons
+Quantifying the Commons: measure the size and diversity of the commons--the
+collection of works that are openly licensed or in the public domain
 
 
 ## Overview
 
-This project seeks to quantify the size and diversity of the commons--the
-collection of works that are openly licensed or in the public domain.
-
-
-### Meaningful
-
-The reports generated by this project (and the data fetched and processed to
-support it) seeks to be meaningful. We hope this project will provide data and
-analysis that helps inform discussions about the commons--the collection of
-works that are openly licensed or in the public domain.
-
-The goal of this project is to help answer questions like:
-- How has the world's use of the commons changed over time?
-- How is the knowledge and culture of the commons distributed?
-- Who has access (and how much) to the commons?
-- What significant trends can be observed in the commons?
-- Which public domain dedication or licenses are the most popular?
-- What are the correlations between public domain dedication or licenses and
-  region, language, domain/endeavor, etc.?
+This project seeks to quantify the size and diversity of the Creative Commons
+legal tools. We aim to track the collection of works (articles, images,
+publications, etc.) that are openly licensed or in the public domain. The
+project automates data collection from multiple data sources, processes the
+data, and generates meaningful reports.
 
 
 ## Code of conduct
@@ -47,6 +34,93 @@ See [`CONTRIBUTING.md`][org-contrib].
 [org-contrib]: https://github.com/creativecommons/.github/blob/main/CONTRIBUTING.md
 
 
+### The three phases of generating a report
+
+1. **Fetch**: This phase involves collecting data from a particular source
+   using its API. Before writing any code, we plan the analyses we want to
+   perform by asking meaningful questions about the data. We also consider
+   API limitations (such as query limits) and design a query strategy to work
+   within these limitations. Then we write a Python script that gets the
+   data. It is important to follow the format of the existing scripts in the
+   project and to use the shared modules and functions where applicable. This
+   keeps the scripts consistent and makes it easier to debug issues that
+   might arise.
+   - **Meaningful questions**
+     - The reports generated by this project (and the data fetched and
+       processed to support it) seek to be meaningful. We hope this project
+       will provide data and analysis that help inform discussions about the
+       commons. The goal of this project is to help answer questions like:
+       - How has the world's use of the commons changed over time?
+       - How is the knowledge and culture of the commons distributed?
+       - Who has access (and how much) to the commons?
+       - What significant trends can be observed in the commons?
+       - Which public domain dedication or licenses are the most popular?
+       - What are the correlations between public domain dedication or
+         licenses and region, language, domain/endeavor, etc.?
+   - **Limitations of an API**
+     - Some data sources impose API query limits (daily or hourly, as
+       specified in their documentation). These limits restrict how many
+       requests can be made in a given period of time, so it is important to
+       plan a query strategy and schedule fetch jobs to stay within the
+       allowed limits.
+   - **Headings of data in 1-fetch**
+     - [Tool identifier][tool-identifier]: A unique identifier used to
+       distinguish each Creative Commons legal tool within the dataset. This
+       helps ensure consistency when tracking tools across different data
+       sources.
+     - [SPDX identifier][spdx-identifier]: A standardized identifier
+       maintained by the Software Package Data Exchange (SPDX) project. It
+       provides a consistent way to reference licenses in applications.
+2. **Process**: In this phase, the fetched data is transformed into a
+   structured and standardized format for analysis. The data is then analyzed
+   and categorized based on defined criteria to extract insights that answer
+   the meaningful questions identified during the 1-fetch phase.
+3. **Report**: This phase focuses on presenting the results of the analysis.
+   We generate graphs and summaries that clearly show trends, patterns, and
+   distributions in the data. These reports help communicate key insights
+   about the size, diversity, and characteristics of openly licensed and
+   public domain works.
+
+[tool-identifier]: https://creativecommons.org/share-your-work/cclicenses/
+[spdx-identifier]: https://spdx.org/licenses/
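As a rough illustration of planning around a query limit, the batching logic could be sketched as follows. This is a minimal, hypothetical example (the function name and the numeric limits are assumptions for illustration, not part of the project's actual scripts):

```python
import math


def plan_fetch_schedule(total_queries: int, daily_limit: int) -> list[int]:
    """Split a fetch job into per-day batches that stay within a data
    source's daily query limit (the real limit comes from the source's
    API documentation)."""
    if daily_limit <= 0:
        raise ValueError("daily_limit must be positive")
    days = math.ceil(total_queries / daily_limit)
    # Distribute the queries as evenly as possible across the days
    base, extra = divmod(total_queries, days)
    return [base + (1 if i < extra else 0) for i in range(days)]


# Example: 1,000 queries against a hypothetical limit of 300 per day
print(plan_fetch_schedule(1000, 300))  # → [250, 250, 250, 250]
```

Spreading the batches evenly, rather than front-loading them to the limit, leaves headroom for retries when individual requests fail.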
+
+
+### Automation phases
+
+To automate these phases, the project uses Python scripts to fetch, process,
+and report data. GitHub Actions is used to automatically run these scripts on
+a defined schedule and on code updates. It handles script execution, manages
+dependencies, and ensures the workflow runs consistently.
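A quarterly schedule like the one described below can be expressed with a GitHub Actions cron trigger. The sketch here is only illustrative: the workflow name, script path, and Python version are assumptions, not the project's actual configuration.

```yaml
# Hypothetical workflow sketch: run a fetch script on the first day of
# each quarter (cron times are UTC).
name: quarterly-fetch
on:
  schedule:
    - cron: "0 0 1 1,4,7,10 *"  # 1st of Jan, Apr, Jul, Oct
  workflow_dispatch: {}          # also allow manual runs
jobs:
  fetch:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python scripts/1-fetch/example_fetch.py  # illustrative path
```

The `workflow_dispatch` trigger is worth keeping alongside the cron schedule so a failed quarterly run can be retried by hand.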
+- **Script assumptions**
+  - Execution schedule for each quarter:
+    - 1-Fetch: first month and first half of the second month
+    - 2-Process: second half of the second month
+    - 3-Report: third month
+- **Script requirements**
+  - *Must be safe*
+    - Scripts must not make any changes with default options
+    - The easiest way to run a script should also be the safest
+    - Options should be spelled out explicitly
+  - *Must be timely*
+    - Scripts should complete within a maximum of 45 minutes
+    - Scripts shouldn't take longer than 3 minutes with default options
+      - That way there is a quick way to see what is happening when a script
+        is running (observe execution, confirm there are no errors, etc.).
+        Later, in production, it can be run with longer options.
+  - *Must be idempotent*
+    - [Idempotence - Wikipedia](https://en.wikipedia.org/wiki/Idempotence)
+    - This applies to both the data fetched and the data stored. If the data
+      changes randomly, we can't draw meaningful conclusions.
+  - *Balanced use of third-party libraries*
+    - Third-party libraries should be leveraged when they are:
+      - API specific (google-api-python-client, internetarchive, etc.)
+- **File formats**
+  - CSV: the format is well supported (rendered on GitHub, etc.), easy to
+    use, and the data used by the project is simple enough to avoid any
+    shortcomings.
+  - YAML: prioritizes human readability, which addresses the primary costs
+    and risks associated with configuration files.
+
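Idempotence in the storage step might look like the following sketch. It is a minimal illustration, not the project's actual code; the quarter key and CSV layout are assumptions for the example. Rerunning the script for the same quarter replaces that quarter's row instead of appending a duplicate, so repeated runs produce identical files.

```python
import csv
from pathlib import Path


def write_quarter_count(path: Path, quarter: str, count: int) -> None:
    """Record a per-quarter count idempotently: rerunning with the same
    inputs overwrites the quarter's row rather than adding another one."""
    rows: dict[str, str] = {}
    if path.exists():
        with path.open(newline="") as f:
            rows = {r["quarter"]: r["count"] for r in csv.DictReader(f)}
    rows[quarter] = str(count)  # replace or add this quarter's row
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["quarter", "count"])
        writer.writeheader()
        for q in sorted(rows):  # deterministic row order
            writer.writerow({"quarter": q, "count": rows[q]})
```

Keying rows by quarter and sorting on write keeps the output deterministic, which is what makes quarter-over-quarter comparisons trustworthy.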
+
 
 ### Project structure
 
 Please note that in the directory tree below, all instances of `fetch`,
@@ -91,8 +165,7 @@ Quantifying/
 ```
 
 
-## Development
-
+## How to set up
 
 ### Prerequisites
 