Commit 880be5e

Made changes to documentation
1 parent c5c12e1 commit 880be5e

1 file changed

+28
-28
lines changed

README.md

Lines changed: 28 additions & 28 deletions
@@ -1,20 +1,17 @@
11
# Quantifying
2-
3-
Quantifying the Commons - Measuring the diversity of Openly Licensed and Public Domain Works
2+
Quantifying the Commons: Measuring the diversity of Openly Licensed and Public Domain Works
43

54

65
## Overview
6+
This project seeks to quantify the size and diversity of the Creative Commons legal tools. We aim to track the collection of works (articles, images, publications, etc.) that are openly licensed or in the public domain. The project automates data collection from multiple data sources, processes the data, and generates reports.
77

8-
This project seeks to quantify the size and diversity of the creative commons legal tools. We aim to track the collection of works (articles, images, publications) that are openly licensed or in the public domain. The project automates data collection from multiple data sources, processes the data, and generates reports.
9-
10-
11-
#### The three phases of generating a report:
128

13-
- 1-Fetch - This phase involves collecting data from a specific source using its API. Before writing any code, we plan the analyses we want to perform by asking meaningful questions about the data. We also consider API limitations (such as query limits) and design a query strategy to work within those constraints.
9+
### The three phases of generating a report
10+
- **1-Fetch**: This phase involves collecting data from a specific source using its API. Before writing any code, we plan the analyses we want to perform by asking meaningful questions about the data. We also consider API limitations (such as query limits) and design a query strategy to work within those constraints.
1411

1512

16-
- Meaningful questions:
17-
The reports generated by this project (and the data fetched and processed to support it) seeks to be meaningful. We hope this project will provide data and analysis that helps inform discussions about the commons--the collection of works that are openly licensed or in the public domain.
13+
- **Meaningful questions**
14+
- The reports generated by this project (and the data fetched and processed to support them) seek to be meaningful. We hope this project will provide data and analysis that help inform discussions about the commons--the collection of works that are openly licensed or in the public domain.
1815
The goal of this project is to help answer questions like:
1916
- How has the world's use of the commons changed over time?
2017
- How is the knowledge and culture of the commons distributed?
@@ -24,49 +21,52 @@ This project seeks to quantify the size and diversity of the creative commons le
2421
- What are the correlations between public domain dedication or licenses and
2522
region, language, domain/endeavor, etc.?
2623

27-
28-
- Limitations of an API
24+
- **Limitations of an API**
2925
- Some data sources provide APIs with certain limitations. A common limitation is a daily or hourly query limit, which restricts how many requests can be made in a given time period. To work around this, we carefully plan our queries, batch requests where possible, and schedule fetch jobs to stay within the allowed limits.
30-
- Headings of data in 1-fetch
31-
- [Tool identifier](https://creativecommons.org/share-your-work/cclicenses/): A unique identifier used to distinguish each Creative Commons legal tool within the dataset. This helps ensure consistency when tracking tools across different data sources.
32-
- [SPDX identifier](https://spdx.org/licenses/): A standardized identifier maintained by the Software Package Data Exchange (SPDX) project. It provides a consistent way to reference licenses and improves interoperability across systems.
3326

27+
- **Headings of data in 1-Fetch**
28+
- [Tool identifier](https://creativecommons.org/share-your-work/cclicenses/): A unique identifier used to distinguish each Creative Commons legal tool within the dataset. This helps ensure consistency when tracking tools across different data sources.
29+
- [SPDX identifier](https://spdx.org/licenses/): A standardized identifier maintained by the Software Package Data Exchange (SPDX) project. It provides a consistent way to reference licenses.
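
The query planning described under "Limitations of an API" can be sketched as follows. This is a minimal illustration, not the project's actual fetch code: `fetch_within_limit`, `fetch_batch`, and every parameter name are hypothetical, and the example assumes a simple fixed-window quota (a set number of requests per period).

```python
import time
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def fetch_within_limit(
    queries: Iterable[str],
    fetch_batch: Callable[[List[str]], List[str]],  # hypothetical API call
    batch_size: int,
    max_requests_per_period: int,
    period_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Send queries in batches, pausing whenever the quota is used up."""
    results: List[str] = []
    window_start = clock()
    requests_made = 0
    for batch in batched(queries, batch_size):
        if requests_made >= max_requests_per_period:
            # Quota exhausted: wait for the rest of the period, then reset.
            elapsed = clock() - window_start
            if elapsed < period_seconds:
                sleep(period_seconds - elapsed)
            window_start = clock()
            requests_made = 0
        results.extend(fetch_batch(batch))
        requests_made += 1
    return results
```

Passing `sleep` and `clock` as parameters keeps the sketch testable without real waiting; a real data source may use a different quota model (for example, a daily cap that resets at a fixed time).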
3430

35-
- 2-Process: In this phase, the fetched data is transformed into a structured and standardized format for analysis. The data is then analyzed and categorized based on defined criteria to extract insights that answer the meaningful questions identified during the fetch stage.
3631

32+
- **2-Process**: In this phase, the fetched data is transformed into a structured and standardized format for analysis. The data is then analyzed and categorized based on defined criteria to extract insights that answer the meaningful questions identified during the fetch stage.
3733

38-
- 3-report: This phase focuses on presenting the results of the analysis. We generate graphs and summaries that clearly show trends, patterns, and distributions in the data. These reports help communicate key insights about the size, diversity, and characteristics of openly licensed and public-domain works.
3934

35+
- **3-Report**: This phase focuses on presenting the results of the analysis. We generate graphs and summaries that clearly show trends, patterns, and distributions in the data. These reports help communicate key insights about the size, diversity, and characteristics of openly licensed and public-domain works.
4036

41-
#### Automation scripts
42-
For automating these steps, the project uses Python scripts to fetch, process, and report data. GitHub Actions is used to automatically run these scripts on a defined schedule and on code updates. It handles task execution, manages dependencies, and ensures the workflow runs consistently.
4337

38+
### Automation scripts
39+
For automating these steps, the project uses Python scripts to fetch, process, and report data. GitHub Actions is used to automatically run these scripts on a defined schedule and on code updates. It handles script execution, manages dependencies, and ensures the workflow runs consistently.
4440

45-
- Script assumptions
41+
- **Script assumptions**
4642
- Execution schedule for each quarter:
4743
- 1-Fetch: first month, 1st half of second month
4844
- 2-Process: 2nd half of second month
4945
- 3-Report: third month
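
The quarterly schedule above can be expressed as a small lookup. This is a sketch under stated assumptions: `phase_for_date` is a hypothetical helper, quarters are assumed to follow the calendar (January to March, and so on), and the "first half" of a month is assumed to mean days 1 through 15.

```python
from datetime import date


def phase_for_date(day: date) -> str:
    """Return the phase scheduled for `day` within its calendar quarter."""
    # Position of the month inside its quarter: 1, 2, or 3.
    month_in_quarter = (day.month - 1) % 3 + 1
    if month_in_quarter == 1:
        return "1-Fetch"
    if month_in_quarter == 2:
        # Assumption: the month is split after the 15th.
        return "1-Fetch" if day.day <= 15 else "2-Process"
    return "3-Report"
```

For example, `phase_for_date(date(2025, 2, 20))` falls in the second half of the second month of the quarter and returns `"2-Process"`.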
5046

51-
- Script requirements
52-
- Must be safe
47+
- **Script requirements**
48+
- *Must be safe*
5349
- Scripts must not make any changes with default options
5450
- Easiest way to run script should also be the safest
5551
- Have options spelled out
5652
- Must be timely
57-
- Scripts should complete within a maximum of 45 minutes
58-
- Scripts shouldn't take longer than 3 minutes with default options
53+
54+
- *Scripts should complete within a maximum of 45 minutes*
55+
- *Scripts shouldn't take longer than 3 minutes with default options*
5956
- That way there is a quick way to see what a script does while it runs and to confirm that it executes without errors.
6057
- Then later in production it can be run with longer options
61-
- Must be idempotent (Idempotence - [Wikipedia](https://en.wikipedia.org/wiki/Idempotence))
58+
59+
- *Must be idempotent (Idempotence: [Wikipedia](https://en.wikipedia.org/wiki/Idempotence))*
6260
- This applies to both the data fetched and the data stored.
63-
If the data changes randomly, we can't draw meaningful conclusions
64-
- Balanced use of third-party libraries
61+
If the data changes randomly, we can't draw meaningful conclusions.
62+
63+
- *Balanced use of third-party libraries*
6564
- Third-party libraries should be leveraged when they are:
6665
- API specific (google-api-python-client, internetarchive, etc.)
66+
6767
- File formats
68-
- CSV - the format is well supported (rendered on GitHub, etc.), easy to use, and the data used by the project is simple enough to avoid any shortcomings.
69-
- YAML - prioritizes human readability which addresses the primary costs and risks associated with configuration files.
68+
- CSV: the format is well supported (rendered on GitHub, etc.), easy to use, and the data used by the project is simple enough to avoid any shortcomings.
69+
- YAML: the format prioritizes human readability, which addresses the primary costs and risks associated with configuration files.
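
The "safe", "idempotent", and CSV points above can be illustrated together. The following is a hedged sketch rather than the project's real script: the `--enable-save` and `--limit` options, the `TOOL_IDENTIFIER` and `COUNT` column names, and all function names are assumptions made for the example. It uses only the Python standard library.

```python
import argparse
import csv
import io
from pathlib import Path
from typing import Dict, List, Optional, Sequence

FIELDNAMES = ["TOOL_IDENTIFIER", "COUNT"]  # hypothetical column headings


def parse_arguments(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Spell out every option; the defaults change nothing on disk."""
    parser = argparse.ArgumentParser(description="Hypothetical fetch script")
    parser.add_argument(
        "--enable-save",
        action="store_true",
        help="Write results to disk (default: dry run, nothing is written)",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=10,
        help="Maximum rows to handle (default: 10, so a default run is quick)",
    )
    return parser.parse_args(argv)


def render_csv(rows: List[Dict[str, object]]) -> str:
    """Render rows in a fixed order so identical data gives identical text."""
    ordered = sorted(rows, key=lambda row: str(row["TOOL_IDENTIFIER"]))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(ordered)
    return buffer.getvalue()


def save(rows: List[Dict[str, object]], path: Path, enable_save: bool) -> bool:
    """Overwrite (never append to) the file, and only when explicitly enabled."""
    if not enable_save:
        return False
    with open(path, "w", encoding="utf-8", newline="") as file_obj:
        file_obj.write(render_csv(rows))
    return True
```

Because rows are sorted and the file is overwritten rather than appended to, running the script twice on the same data leaves the same file behind, and running it with no options writes nothing at all.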
7070

7171
## Code of conduct
7272
