## Overview
This project seeks to quantify the size and diversity of the Creative Commons legal tools. We aim to track the collection of works (articles, images, publications, etc.) that are openly licensed or in the public domain. The project automates data collection from multiple data sources, processes the data, and generates meaningful reports.
### The three phases of generating a report
- **1-Fetch**: This phase involves collecting data from a particular source using its API. Before writing any code, we plan the analyses we want to perform by asking meaningful questions about the data. We also consider API limitations (such as query limits) and design a query strategy to work within them. We then write a Python script that fetches the data. It is important to follow the format of the scripts that already exist in the project and to reuse their modules and functions where applicable; this keeps the scripts consistent and makes it easier to debug any issues that arise.
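A query strategy like the one described above can be sketched in a few lines of Python. This is a minimal illustration with made-up names, not the project's actual code:

```python
def plan_queries(queries, daily_limit):
    """Split a list of planned API queries into daily batches so that
    no batch exceeds the API's daily request limit.

    Hypothetical helper for illustration only.
    """
    if daily_limit < 1:
        raise ValueError("daily_limit must be at least 1")
    return [queries[i:i + daily_limit] for i in range(0, len(queries), daily_limit)]

# 250 planned queries against a 100-request/day limit -> 3 days of fetching
batches = plan_queries([f"q{i}" for i in range(250)], daily_limit=100)
print(len(batches))       # 3
print(len(batches[-1]))   # 50
```

Planning the batches up front, before any requests are sent, makes it easy to estimate how many days a full fetch will take.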
- **Meaningful questions**
region, language, domain/endeavor, etc.?
- **Limitations of an API**
- Some data sources provide APIs with query limits (daily or hourly, depending on what the documentation specifies). These limits restrict how many requests can be made in a given period of time, so it is important to plan a query strategy and schedule fetch jobs to stay within the allowed limits.
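One simple way to stay within such a limit is a small client-side throttle that sleeps when the request window is full. This is a minimal sketch, not the project's actual fetch code:

```python
import time

class QueryThrottle:
    """Allow at most `limit` requests per `period` seconds by sleeping
    when the window is full. Minimal illustrative sketch."""

    def __init__(self, limit, period):
        self.limit = limit
        self.period = period
        self.calls = []  # monotonic timestamps of recent requests

    def wait(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.limit:
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())

throttle = QueryThrottle(limit=2, period=0.2)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # the third call must wait for the window to open
elapsed = time.monotonic() - start
```

A real fetch script would call `wait()` before each API request; for daily limits, scheduling batches across workflow runs (as described above) is the more practical approach.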
- **Headings of data in 1-fetch**
- [Tool identifier](https://creativecommons.org/share-your-work/cclicenses/): A unique identifier used to distinguish each Creative Commons legal tool within the dataset. This helps ensure consistency when tracking tools across different data sources.
- [SPDX identifier](https://spdx.org/licenses/): A standardized identifier maintained by the Software Package Data Exchange (SPDX) project. It provides a consistent way to reference licenses in applications.
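A fetched record might carry both identifiers side by side. The field names and mapping below are assumptions for illustration (the SPDX identifiers themselves, such as `CC-BY-4.0`, come from the SPDX license list):

```python
# Hypothetical mapping for illustration; field names are assumptions,
# not the project's actual schema.
CC_TO_SPDX = {
    "CC BY 4.0": "CC-BY-4.0",
    "CC BY-SA 4.0": "CC-BY-SA-4.0",
    "CC0 1.0": "CC0-1.0",
}

def make_record(tool_identifier, count):
    return {
        "TOOL_IDENTIFIER": tool_identifier,
        "SPDX_IDENTIFIER": CC_TO_SPDX.get(tool_identifier, "unknown"),
        "COUNT": count,
    }

row = make_record("CC BY 4.0", 1234)
```

Keeping both identifiers on every row makes it possible to join data fetched from sources that report licenses in either form.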
- **2-Process**: In this phase, the fetched data is transformed into a structured and standardized format for analysis. The data is then analyzed and categorized based on defined criteria to extract insights that answer the meaningful questions identified during the 1-Fetch phase.
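As a small sketch of this step (the row schema is an assumption for illustration), aggregating fetched counts per legal tool could look like:

```python
from collections import Counter

# Hypothetical fetched rows; the schema is an assumption for illustration.
fetched = [
    {"tool": "CC BY 4.0", "count": 120},
    {"tool": "CC BY-SA 4.0", "count": 80},
    {"tool": "CC BY 4.0", "count": 40},
]

# Aggregate counts per tool so later phases can compare distributions.
totals = Counter()
for row in fetched:
    totals[row["tool"]] += row["count"]
```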
- **3-Report**: This phase focuses on presenting the results of the analysis. We generate graphs and summaries that clearly show trends, patterns, and distributions in the data. These reports help communicate key insights about the size, diversity, and characteristics of openly licensed and public-domain works.
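One simple form such a summary can take is a Markdown table generated from the processed totals. This is an illustrative sketch only; the project's actual reports may render charts with a plotting library instead:

```python
def render_markdown_table(totals):
    """Render per-tool totals as a small Markdown table for a report.
    Illustrative helper; names are assumptions."""
    lines = ["| Tool | Count |", "| --- | --- |"]
    for tool, count in sorted(totals.items()):
        lines.append(f"| {tool} | {count} |")
    return "\n".join(lines)

table = render_markdown_table({"CC BY 4.0": 160, "CC0 1.0": 90})
```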
### Automation scripts
For automating these phases, the project uses Python scripts to fetch, process, and report data. GitHub Actions is used to automatically run these scripts on a defined schedule and on code updates. It handles script execution, manages dependencies, and ensures the workflow runs consistently.
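A scheduled GitHub Actions workflow for the fetch phase might look roughly like the following. This is a hedged sketch: the file paths, script names, and cron schedule are assumptions, not the project's actual configuration:

```yaml
# Illustrative workflow sketch; names and paths are assumptions.
name: fetch
on:
  schedule:
    - cron: "0 0 1 1,4,7,10 *"   # first day of each quarter
  push:
    branches: [main]
jobs:
  fetch:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python scripts/example_fetch.py
```

The `schedule` trigger handles the periodic runs, while the `push` trigger re-runs the scripts on code updates.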
- **Script assumptions**
- Execution schedule for each quarter:
- *Scripts should complete within a maximum of 45 minutes*
- *Scripts shouldn't take longer than 3 minutes with default options*
  - This gives a quick way to see what is happening while a script runs (that it executes without errors, etc.); later, in production, it can be run with longer options.
- *Must be idempotent (Idempotence: [Wikipedia](https://en.wikipedia.org/wiki/Idempotence))*
- This applies to both the data fetched and the data stored.
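For data storage, idempotence typically means that rerunning a script overwrites a deterministic output rather than appending duplicates. A minimal sketch, assuming a per-quarter CSV output (file name and schema are assumptions, not the project's actual layout):

```python
import csv
import os
import tempfile

def write_quarter_data(path, rows):
    """Write fetched rows for a quarter to CSV.

    Idempotent sketch: rerunning with the same rows always produces the
    same file contents instead of appending duplicates.
    """
    rows = sorted(rows, key=lambda r: r["tool"])  # deterministic order
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["tool", "count"])
        writer.writeheader()
        writer.writerows(rows)

path = os.path.join(tempfile.gettempdir(), "2024Q1_example.csv")
data = [{"tool": "CC BY 4.0", "count": "120"}]
write_quarter_data(path, data)
first = open(path).read()
write_quarter_data(path, data)  # rerun: identical output, no duplicates
second = open(path).read()
```

Writing with mode `"w"` (truncate) instead of `"a"` (append), plus a deterministic sort, is what makes the rerun safe.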