Commit 94527e6

committed
update whitespace and formatting. reorg sections
1 parent 0ac3ae1 commit 94527e6

File tree

1 file changed: +95 −48 lines changed


README.md

Lines changed: 95 additions & 48 deletions
@@ -1,57 +1,17 @@
 # Quantifying
-Quantifying the Commons: measure the size and diversity of the commons--the collection of works that are openly licensed or in the public domain
+
+Quantifying the Commons: measure the size and diversity of the commons--the
+collection of works that are openly licensed or in the public domain
 
 
 ## Overview
-This project seeks to quantify the size and diversity of the creative commons legal tools. We aim to track the collection of works (articles, images, publications, etc.) that are openly licensed or in the public domain. The project automates data collection from multiple data sources, processes the data, and generates meaningful reports.
 
+This project seeks to quantify the size and diversity of the Creative Commons
+legal tools. We aim to track the collection of works (articles, images,
+publications, etc.) that are openly licensed or in the public domain. The
+project automates data collection from multiple data sources, processes the
+data, and generates meaningful reports.
 
-### The three phases of generating a report
-- **1-Fetch**: This phase involves collecting data from a particular source using its API. Before writing any code, we plan the analyses we want to perform by asking meaningful questions about the data. We also consider API limitations (such as query limits) and design a query strategy to work within these limitations. Then we write a python script that gets the data, it is quite important to follow the format of the scripts existing in the project and use the modules and functions where applicable. It ensures consistency in the scripts and we can easily debug issues might arise.
-  - **Meaningful questions**
-    - The reports generated by this project (and the data fetched and processed to support it) seeks to be meaningful. We hope this project will provide data and analysis that helps inform discussions about the commons--the collection of works that are openly licensed or in the public domain.
-      The goal of this project is to help answer questions like:
-      - How has the world's use of the commons changed over time?
-      - How is the knowledge and culture of the commons distributed?
-      - Who has access (and how much) to the commons?
-      - What significant trends can be observed in the commons?
-      - Which public domain dedication or licenses are the most popular?
-      - What are the correlations between public domain dedication or licenses and
-        region, language, domain/endeavor, etc.?
-  - **Limitations of an API**
-    - Some data sources provide APIs with query limits (it can be daily or hourly) depending on what is given in the documentation. This restricts how many requests that can be made in the specified period of time. It is important to plan a query strategy and schedule fetch jobs to stay within the allowed limits.
-  - **Headings of data in 1-fetch**
-    - [Tool identifier](https://creativecommons.org/share-your-work/cclicenses/): A unique identifier used to distinguish each Creative Commons legal tool within the dataset. This helps ensure consistency when tracking tools across different data sources.
-    - [SPDX identifier](https://spdx.org/licenses/): A standardized identifier maintained by the Software Package Data Exchange (SPDX) project. It provides a consistent way to reference licenses in applications.
-- **2-Process**: In this phase, the fetched data is transformed into a structured and standardized format for analysis. The data is then analyzed and categorized based on defined criteria to extract insights that answer the meaningful questions identified during the 1-fetch phase.
-- **3-report**: This phase focuses on presenting the results of the analysis. We generate graphs and summaries that clearly show trends, patterns, and distributions in the data. These reports help communicate key insights about the size, diversity, and characteristics of openly licensed and public-domain works.
-
-
-### Automation scripts
-For automating these phases, the project uses Python scripts to fetch, process, and report data. GitHub Actions is used to automatically run these scripts on a defined schedule and on code updates. It handles script execution, manages dependencies, and ensures the workflow runs consistently.
-- **Script assumptions**
-  - Execution schedule for each quarter:
-    - 1-Fetch: first month, 1st half of second month
-    - 2-Process: 2nd half of second month
-    - 3-Report: third month
-- **Script requirements**
-  - *Must be safe*
-    - Scripts must not make any changes with default options
-    - Easiest way to run script should also be the safest
-    - Have options spelled out
-  - Must be timely
-    - *Scripts should complete within a maximum of 45 minutes*
-    - *Scripts shouldn't take longer than 3 minutes with default options*
-      - That way there's a quicker way to see what is happening when it is running; see execution, without errors, etc. Then later in production it can be run with longer options
-  - *Must be idempotent (Idempotence: [Wikipedia](https://en.wikipedia.org/wiki/Idempotence))*
-    - This applies to both the data fetched and the data stored.
-      If the data changes randomly, we can't draw meaningful conclusions.
-  - *Balanced use of third-party libraries*
-    - Third-party libraries should be leveraged when they are:
-      - API specific (google-api-python-client, internetarchive, etc.)
-      - File formats
-        - CSV: the format is well supported (rendered on GitHub, etc.), easy to use, and the data used by the project is simple enough to avoid any shortcomings.
-        - YAML: prioritizes human readability which addresses the primary costs and risks associated with configuration files.
 
 ## Code of conduct
 

@@ -74,6 +34,93 @@ See [`CONTRIBUTING.md`][org-contrib].
 [org-contrib]: https://github.com/creativecommons/.github/blob/main/CONTRIBUTING.md
 
 
+### The three phases of generating a report
+
+- **1-Fetch**: This phase involves collecting data from a particular source
+  using its API. Before writing any code, we plan the analyses we want to
+  perform by asking meaningful questions about the data. We also consider API
+  limitations (such as query limits) and design a query strategy to work within
+  these limitations. Then we write a Python script that fetches the data. It
+  is important to follow the format of the existing scripts in the project and
+  to use the shared modules and functions where applicable; this keeps the
+  scripts consistent and makes it easier to debug any issues that arise.
+  - **Meaningful questions**
+    - The reports generated by this project (and the data fetched and
+      processed to support them) seek to be meaningful. We hope this project
+      will provide data and analysis that help inform discussions about the
+      commons--the collection of works that are openly licensed or in the
+      public domain.
+
+      The goal of this project is to help answer questions like:
+      - How has the world's use of the commons changed over time?
+      - How is the knowledge and culture of the commons distributed?
+      - Who has access (and how much) to the commons?
+      - What significant trends can be observed in the commons?
+      - Which public domain dedication or licenses are the most popular?
+      - What are the correlations between public domain dedication or licenses
+        and region, language, domain/endeavor, etc.?
+  - **Limitations of an API**
+    - Some data sources enforce API query limits (daily or hourly, as
+      specified in their documentation). These limits restrict how many
+      requests can be made in a given period of time, so it is important to
+      plan a query strategy and schedule fetch jobs to stay within the
+      allowed limits.
+  - **Headings of data in 1-Fetch**
+    - [Tool identifier][tool-identifier]: A unique identifier used to
+      distinguish each Creative Commons legal tool within the dataset. This
+      helps ensure consistency when tracking tools across different data
+      sources.
+    - [SPDX identifier][spdx-identifier]: A standardized identifier maintained
+      by the Software Package Data Exchange (SPDX) project. It provides a
+      consistent way to reference licenses in applications.
+- **2-Process**: In this phase, the fetched data is transformed into a
+  structured and standardized format for analysis. The data is then analyzed
+  and categorized based on defined criteria to extract insights that answer
+  the meaningful questions identified during the 1-Fetch phase.
+- **3-Report**: This phase focuses on presenting the results of the analysis.
+  We generate graphs and summaries that clearly show trends, patterns, and
+  distributions in the data. These reports help communicate key insights about
+  the size, diversity, and characteristics of openly licensed and
+  public-domain works.
+
+[tool-identifier]: https://creativecommons.org/share-your-work/cclicenses/
+[spdx-identifier]: https://spdx.org/licenses/
+
+
+### Automation phases
+
+For automating these phases, the project uses Python scripts to fetch,
+process, and report data. GitHub Actions automatically runs these scripts on a
+defined schedule and on code updates. It handles script execution, manages
+dependencies, and ensures the workflow runs consistently.
+- **Script assumptions**
+  - Execution schedule for each quarter:
+    - 1-Fetch: first month, 1st half of second month
+    - 2-Process: 2nd half of second month
+    - 3-Report: third month
+- **Script requirements**
+  - *Must be safe*
+    - Scripts must not make any changes with default options
+    - The easiest way to run a script should also be the safest
+    - Options should be spelled out explicitly
+  - *Must be timely*
+    - Scripts should complete within a maximum of 45 minutes
+    - Scripts shouldn't take longer than 3 minutes with default options
+      - That way there is a quick way to see what is happening while a script
+        runs (confirm execution, check for errors, etc.); later, in
+        production, it can be run with longer options
+  - *Must be idempotent*
+    - [Idempotence - Wikipedia](https://en.wikipedia.org/wiki/Idempotence)
+    - This applies to both the data fetched and the data stored. If the data
+      changes randomly, we can't draw meaningful conclusions.
+  - *Balanced use of third-party libraries*
+    - Third-party libraries should be leveraged when they are:
+      - API specific (google-api-python-client, internetarchive, etc.)
+      - File formats
+        - CSV: the format is well supported (rendered on GitHub, etc.), easy
+          to use, and the data used by the project is simple enough to avoid
+          any shortcomings.
+        - YAML: prioritizes human readability, which addresses the primary
+          costs and risks associated with configuration files.
+
+
 ### Project structure
 
 Please note that in the directory tree below, all instances of `fetch`,
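The 1-Fetch phase above plans fetch jobs around documented API query limits. A minimal sketch of that idea (the helper and its parameters are hypothetical, not part of the project's codebase): space requests evenly so an hourly budget is never exceeded.

```python
import time


def fetch_all(pages, request_fn, max_per_hour=3600):
    """Fetch every page via request_fn while staying under an hourly
    request budget by spacing calls evenly across the hour.

    Hypothetical helper for illustration; request_fn would wrap the
    real data source's API call.
    """
    delay = 3600.0 / max_per_hour  # seconds to wait between requests
    results = []
    for page in pages:
        results.extend(request_fn(page))  # one API request per page
        time.sleep(delay)  # stay within the documented query limit
    return results
```

A real fetch script would also handle HTTP errors and persist partial progress so a rerun stays idempotent.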

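The "must be safe" requirement (scripts make no changes with default options) is commonly implemented by making writes opt-in. A sketch under that assumption (the flag names here are invented for illustration):

```python
import argparse


def main(argv=None):
    """Parse options so the default invocation is a harmless dry run.

    The --enable-save flag name is hypothetical; the point is that
    writing data requires an explicit opt-in.
    """
    parser = argparse.ArgumentParser(
        description="Fetch data (dry run by default)")
    parser.add_argument("--enable-save", action="store_true",
                        help="actually write fetched data to disk")
    parser.add_argument("--limit", type=int, default=10,
                        help="small default so the default run finishes quickly")
    args = parser.parse_args(argv)
    if not args.enable_save:
        return f"dry run: would fetch {args.limit} records (nothing written)"
    return f"fetching and saving {args.limit} records"
```

The small default `--limit` also serves the timeliness requirement: the easiest invocation both changes nothing and finishes quickly.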
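The idempotence requirement applies to the stored data as well: rerunning a script over the same inputs should produce identical CSV output. One common way to get that (a sketch, with invented column names) is a fixed column order plus deterministic row sorting:

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize rows to CSV deterministically: a fixed column order and
    sorted rows mean a rerun over the same data yields byte-identical
    output, so results can be compared across runs."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    # Sort on the stringified field values so row order never depends on
    # the order the API happened to return them in.
    for row in sorted(rows, key=lambda r: tuple(str(r[f]) for f in fieldnames)):
        writer.writerow(row)
    return buf.getvalue()
```

With output like this, a rerun can be diffed against the previously committed CSV to confirm nothing changed unexpectedly.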
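The quarterly execution schedule under "Script assumptions" maps naturally onto GitHub Actions cron triggers. One possible schedule block (illustrative only, not the project's actual workflow; times are UTC):

```yaml
# Hypothetical workflow triggers; a job can inspect github.event.schedule
# to decide which phase a given cron firing corresponds to.
on:
  schedule:
    - cron: "0 0 1 1,4,7,10 *"   # 1-Fetch: start of the quarter's first month
    - cron: "0 0 16 2,5,8,11 *"  # 2-Process: 2nd half of the second month
    - cron: "0 0 1 3,6,9,12 *"   # 3-Report: start of the third month
```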