Commit e5c046f

Merge pull request #256 from oree-xx/quantifying/documentation-changes
Documentation update
2 parents 0266293 + 8cc0ea5 commit e5c046f

2 files changed: +100 -27 lines changed

README.md

Lines changed: 96 additions & 23 deletions
@@ -1,29 +1,16 @@
1-
# quantifying
1+
# Quantifying
22

3-
Quantifying the Commons
3+
Quantifying the Commons: measure the size and diversity of the commons--the
4+
collection of works that are openly licensed or in the public domain
45

56

67
## Overview
78

8-
This project seeks to quantify the size and diversity of the commons--the
9-
collection of works that are openly licensed or in the public domain.
10-
11-
12-
### Meaningful
13-
14-
The reports generated by this project (and the data fetched and processed to
15-
support it) seeks to be meaningful. We hope this project will provide data and
16-
analysis that helps inform discussions about the commons--the collection of
17-
works that are openly licensed or in the public domain.
18-
19-
The goal of this project is to help answer questions like:
20-
- How has the world's use of the commons changed over time?
21-
- How is the knowledge and culture of the commons distributed?
22-
- Who has access (and how much) to the commons?
23-
- What significant trends can be observed in the commons?
24-
- Which public domain dedication or licenses are the most popular?
25-
- What are the correlations between public domain dedication or licenses and
26-
region, language, domain/endeavor, etc.?
9+
This project seeks to quantify the size and diversity of the Creative Commons
10+
legal tools. We aim to track the collection of works (articles, images,
11+
publications, etc.) that are openly licensed or in the public domain. The
12+
project automates data collection from multiple data sources, processes the
13+
data, and generates meaningful reports.
2714

2815

2916
## Code of conduct
@@ -47,6 +34,93 @@ See [`CONTRIBUTING.md`][org-contrib].
4734
[org-contrib]: https://github.com/creativecommons/.github/blob/main/CONTRIBUTING.md
4835

4936

37+
### The three phases of generating a report
38+
39+
1. **Fetch**: This phase involves collecting data from a particular source
40+
using its API. Before writing any code, we plan the analyses we want to
41+
perform by asking meaningful questions about the data. We also consider API
42+
limitations (such as query limits) and design a query strategy to work
43+
within these limitations. Then we write a Python script that gets the data.
44+
It is important to follow the format of the existing scripts in the
45+
project and to reuse its modules and functions where applicable. This keeps
46+
the scripts consistent and makes issues that arise easier to debug.
47+
- **Meaningful questions**
48+
- The reports generated by this project (and the data fetched and
49+
processed to support it) seek to be meaningful. We hope this project
50+
will provide data and analysis that helps inform discussions about the
51+
commons. The goal of this project is to help answer questions like:
52+
- How has the world's use of the commons changed over time?
53+
- How is the knowledge and culture of the commons distributed?
54+
- Who has access (and how much) to the commons?
55+
- What significant trends can be observed in the commons?
56+
- Which public domain dedication or licenses are the most popular?
57+
- What are the correlations between public domain dedication or licenses
58+
and region, language, domain/endeavor, etc.?
59+
- **Limitations of an API**
60+
- Some data sources provide APIs with query limits (daily or hourly, as
61+
specified in their documentation). These limits restrict how many
62+
requests can be made in a given period of time. It
63+
is important to plan a query strategy and schedule fetch jobs to stay
64+
within the allowed limits.
65+
- **Headings of data in 1-fetch**
66+
- [Tool identifier][tool-identifier]: A unique identifier used to
67+
distinguish each Creative Commons legal tool within the dataset. This
68+
helps ensure consistency when tracking tools across different data
69+
sources.
70+
- [SPDX identifier][spdx-identifier]: A standardized identifier maintained
71+
by the Software Package Data Exchange (SPDX) project. It provides a
72+
consistent way to reference licenses in applications.
73+
2. **Process**: In this phase, the fetched data is transformed into a
74+
structured and standardized format for analysis. The data is then analyzed
75+
and categorized based on defined criteria to extract insights that answer
76+
the meaningful questions identified during the 1-fetch phase.
77+
3. **Report**: This phase focuses on presenting the results of the analysis.
78+
We generate graphs and summaries that clearly show trends, patterns, and
79+
distributions in the data. These reports help communicate key insights about
80+
the size, diversity, and characteristics of openly licensed and public
81+
domain works.
82+
83+
[tool-identifier]: https://creativecommons.org/share-your-work/cclicenses/
84+
[spdx-identifier]: https://spdx.org/licenses/
85+
86+
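The query-limit planning described under "Limitations of an API" can be sketched as below. This is only an illustration, not code from the project: `plan_queries`, `QueryPlan`, and the 90% safety margin are hypothetical, and the quota numbers are placeholders rather than the limits of any real data source.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class QueryPlan:
    requests_per_period: int  # requests we allow ourselves per quota period
    periods_needed: int       # quota periods required to finish all queries
    delay_seconds: float      # even spacing between consecutive requests


def plan_queries(total_queries, quota, period_seconds, safety_percent=90):
    """Plan a fetch job so that it stays within an API's query limit.

    safety_percent is an assumed margin: only that share of the documented
    quota is used, leaving headroom for retries and manual runs.
    """
    usable = quota * safety_percent // 100
    if usable < 1:
        raise ValueError("quota is too small to schedule any request")
    return QueryPlan(
        requests_per_period=usable,
        periods_needed=math.ceil(total_queries / usable),
        delay_seconds=period_seconds / usable,
    )


# Placeholder numbers: 2500 queries against a limit of 1000 requests per day.
plan = plan_queries(total_queries=2500, quota=1000, period_seconds=86400)
print(plan)
```

A plan like this makes it visible, before any fetch code is written, whether a query strategy fits in one scheduled run or has to be spread over several.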
87+
### Automation phases
88+
89+
For automating these phases, the project uses Python scripts to fetch, process,
90+
and report data. GitHub Actions is used to automatically run these scripts on a
91+
defined schedule and on code updates. It handles script execution, manages
92+
dependencies, and ensures the workflow runs consistently.
93+
- **Script assumptions**
94+
- Execution schedule for each quarter:
95+
- 1-Fetch: first month, 1st half of second month
96+
- 2-Process: 2nd half of second month
97+
- 3-Report: third month
98+
- **Script requirements**
99+
- *Must be safe*
100+
- Scripts must not make any changes with default options
101+
- Easiest way to run script should also be the safest
102+
- Have options spelled out
103+
- *Must be timely*
104+
- Scripts should complete within a maximum of 45 minutes
105+
- Scripts shouldn't take longer than 3 minutes with default options
106+
- That way there is a quick way to see what happens when a script runs
107+
(for example, that it executes without errors). Later, in production, it
108+
can be run with longer options.
109+
- *Must be idempotent*
110+
- [Idempotence - Wikipedia](https://en.wikipedia.org/wiki/Idempotence)
111+
- This applies to both the data fetched and the data stored. If the data
112+
changes randomly, we can't draw meaningful conclusions.
113+
- *Balanced use of third-party libraries*
114+
- Third-party libraries should be leveraged when they are:
115+
- API specific (google-api-python-client, internetarchive, etc.)
116+
- File formats
117+
- CSV: the format is well supported (rendered on GitHub, etc.), easy to use,
118+
and the data used by the project is simple enough to avoid any
119+
shortcomings.
120+
- YAML: prioritizes human readability which addresses the primary costs and
121+
risks associated with configuration files.
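The quarterly execution schedule listed under "Script assumptions" can be expressed as a small lookup. The sketch below rests on two assumptions that the text does not state: quarters are calendar quarters, and the "1st half" of a month ends on day 15. The function name `phase_for` is hypothetical.

```python
from datetime import date


def phase_for(day: date) -> str:
    """Return the phase scheduled for a date within its calendar quarter."""
    month_in_quarter = (day.month - 1) % 3 + 1
    if month_in_quarter == 1:
        return "1-fetch"
    if month_in_quarter == 2:
        # Assumption: the first half of the month ends on day 15.
        return "1-fetch" if day.day <= 15 else "2-process"
    return "3-report"


print(phase_for(date.today()))
```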
122+
123+
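The "must be safe" and "must be idempotent" requirements above can be illustrated with a minimal sketch. It assumes a hypothetical `--enable-save` flag, a hypothetical `--limit` option, placeholder rows, and made-up column names; none of these are taken from the project's scripts. With default options nothing is written, and saving overwrites the file with sorted rows so that repeated runs on the same data produce identical output.

```python
import argparse
import csv


def parse_arguments(argv=None):
    parser = argparse.ArgumentParser(description="Sketch of a safe fetch script")
    # Hypothetical flag: without it the script only reports what it would do.
    parser.add_argument(
        "--enable-save",
        action="store_true",
        help="write results to disk (default: dry run, no changes)",
    )
    # Hypothetical option: a small default keeps the default run short.
    parser.add_argument("--limit", type=int, default=10, help="maximum rows")
    return parser.parse_args(argv)


def write_rows(path, header, rows):
    """Overwrite the file with sorted rows so reruns give identical files."""
    with open(path, "w", newline="", encoding="utf-8") as file_obj:
        writer = csv.writer(file_obj, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(sorted(rows))


def main(argv=None, path="example_fetched.csv"):
    args = parse_arguments(argv)
    # Placeholder data standing in for values returned by an API.
    rows = [("CC0 1.0", 7), ("CC BY 4.0", 12)][: args.limit]
    if not args.enable_save:
        print(f"Dry run: would write {len(rows)} rows to {path}")
        return False
    write_rows(path, ("TOOL_IDENTIFIER", "COUNT"), rows)
    return True


# An explicit empty argument list demonstrates the default (dry-run) behavior.
main([])
```

Overwriting the file instead of appending to it is what makes the stored data idempotent in this sketch: running the script twice leaves the same file as running it once.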
50124
### Project structure
51125

52126
Please note that in the directory tree below, all instances of `fetch`,
@@ -91,8 +165,7 @@ Quantifying/
91165
```
92166

93167

94-
## Development
95-
168+
## How to set up
96169

97170
### Prerequisites
98171

sources.md

Lines changed: 4 additions & 4 deletions
@@ -159,10 +159,10 @@ provides access to information of all wikimedia projects including the different
159159
language edition of wikipedia. It runs on the Meta-Wiki API.
160160

161161
**API documentation link:**
162-
[WIKIPEDIA_BASE_URL documentation](https://en.wikipedia.org/w/api.php)
163-
[WIKIPEDIA_BASE_URL reference page](https://www.mediawiki.org/wiki/API:Main_page)
164-
[WIKIPEDIA_MATRIX_URL documentation](https://meta.wikimedia.org/w/api.php)
165-
[WIKIPEDIA_MATRIX_URL reference page](https://www.mediawiki.org/wiki/API:Sitematrix)
162+
- [WIKIPEDIA_BASE_URL documentation](https://en.wikipedia.org/w/api.php)
163+
- [WIKIPEDIA_BASE_URL reference page](https://www.mediawiki.org/wiki/API:Main_page)
164+
- [WIKIPEDIA_MATRIX_URL documentation](https://meta.wikimedia.org/w/api.php)
165+
- [WIKIPEDIA_MATRIX_URL reference page](https://www.mediawiki.org/wiki/API:Sitematrix)
166166

167167
**API information:**
168168
- No API key required
