Commit 5f8f6a9

Revert "Delete sources.md"
This reverts commit a227730.
1 parent a227730 commit 5f8f6a9

1 file changed

+208
-0
lines changed

sources.md

Lines changed: 208 additions & 0 deletions
@@ -0,0 +1,208 @@
1+
# Data Sources
2+
3+
This project uses data from various sources that are openly licensed or in the
4+
public domain. Below are the sources and their respective information:
5+
6+
7+
## CC Legal Tools
8+
9+
**Description:** A `.txt` file provided by Timid Robot containing all legal
10+
tool paths.
11+
12+
**API documentation link:**
13+
- [`data/legal-tool-paths.txt`][tools-paths]: a list of all
14+
current Creative Commons (CC) legal tool paths
15+
- [`data/prioritized-tool-urls.txt`][prioritized-tool-urls]: a prioritized list
16+
of all current CC legal tool URLs
17+
18+
**API information:**
19+
- No API key required
20+
- No query limits
21+
22+
[tools-paths]: data/legal-tool-paths.txt
23+
[prioritized-tool-urls]: data/prioritized-tool-urls.txt
24+
25+
26+
## Flickr
27+
28+
**Description:** _With over 5 billion photos (many with valuable metadata such
29+
as tags, geolocation, and Exif data), the Flickr community creates wonderfully
30+
rich data. The Flickr API is how you can access that data. In fact, almost all
31+
the functionality that runs flickr.com is available through the API._ ([Flickr:
32+
The Flickr Developer Guide](https://www.flickr.com/services/developer/))
33+
34+
**API documentation link:**
35+
- [API documentation - Flickr Services](https://www.flickr.com/services/api/)
36+
37+
**API information:**
38+
- API key required
39+
- Query limit: 3600 requests per hour
40+
- Data available through JSON or XML format
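
The hourly query limit above translates into a minimum spacing between requests. The following is a minimal sketch of that arithmetic, not code from this project; the helper name `min_interval_seconds` is hypothetical.

```python
def min_interval_seconds(max_requests: int, window_seconds: float) -> float:
    """Return the even spacing between requests that stays within a limit.

    Hypothetical helper for illustration; it is not part of this project.
    """
    if max_requests <= 0:
        raise ValueError("max_requests must be positive")
    return window_seconds / max_requests


# Flickr limit from the list above: 3600 requests per hour.
flickr_interval = min_interval_seconds(3600, 60 * 60)
print(f"Wait at least {flickr_interval:.1f} s between Flickr requests")
```

The same helper applies to the other limits in this document, for example `min_interval_seconds(5000, 24 * 60 * 60)` for a limit of 5000 requests per day.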
41+
42+
## GitHub
43+
44+
**Description:** A development platform for hosting and managing code.
45+
46+
**API documentation link:**
47+
- [GitHub REST API v3](https://docs.github.com/en/rest)
48+
49+
**API information:**
50+
- API key not required but recommended by GitHub
51+
- Query limit: 60 requests per hour if unauthenticated,
52+
5000 requests per hour if authenticated
53+
- Data available through JSON format
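
GitHub reports the remaining quota in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers, the latter as a UTC epoch timestamp in seconds. The sketch below shows one way a client might decide how long to pause once the quota is exhausted. The function name `seconds_until_reset` is hypothetical and the header values in the usage lines are made-up placeholders.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(
    headers: Mapping[str, str], now: Optional[float] = None
) -> float:
    """Return how long to pause before the next request (0.0 if quota remains).

    Hypothetical helper for illustration; it is not part of this project.
    """
    # Normalize header names so the lookup works with a plain dict too.
    lowered = {name.lower(): value for name, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_epoch = float(lowered.get("x-ratelimit-reset", "0"))
    current = time.time() if now is None else now
    return max(0.0, reset_epoch - current)


# Placeholder header values, not real API output.
exhausted = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700000060"}
print(seconds_until_reset(exhausted, now=1700000000.0))
```

Passing `now` explicitly keeps the helper easy to test; in real use it can be omitted so the current time is used.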
54+
55+
56+
## GCS (Google Custom Search) JSON API
57+
58+
**Description:** The Custom Search JSON API runs user-defined, detailed queries
59+
against a Programmable Search Engine and returns the results as JSON.
60+
61+
**Admin links:**
62+
- [Programmable Search - All search engines][gcs-admin]
63+
- [APIs & Services – APIs & Services – Google Cloud console][google-api-admin]
64+
65+
**API documentation links:**
66+
- [Custom Search JSON API Reference | Programmable Search Engine | Google
67+
Developers][google-json]
68+
- [Google API Python Client Library][google-api-python]
69+
  - [Google API Client Library for Python Docs |
70+
    google-api-python-client][google-api-python]
71+
    - _Reference documentation for the core library
72+
      [googleapiclient][googleapiclient]._
73+
      - See: googleapiclient.discovery > build
74+
    - _[Library reference documentation by API][gcs-library-ref]_
75+
      - See Custom Search v1 [cse()][gcs-cse]
76+
- [Method: cse.list | Custom Search JSON API | Google Developers][cse-list]
77+
- [XML API reference appendices][reference-appendix]
78+
79+
**API information:**
80+
- API key required
81+
- Query limit: 100 queries per day
82+
- Data available through JSON format
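
A Custom Search JSON API request is an HTTPS GET that carries the API key (`key`), the Programmable Search Engine ID (`cx`), and the query (`q`). The sketch below only builds such a URL and sends nothing. The helper name `build_cse_url` and the placeholder credentials are assumptions for illustration, and this is not necessarily how the project issues its queries.

```python
from urllib.parse import urlencode

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_cse_url(api_key: str, engine_id: str, query: str) -> str:
    """Build a Custom Search JSON API request URL.

    Hypothetical helper for illustration; it is not part of this project.
    """
    params = {"key": api_key, "cx": engine_id, "q": query}
    # urlencode escapes spaces and other reserved characters in the values.
    return f"{CSE_ENDPOINT}?{urlencode(params)}"


# Placeholder credentials; substitute a real API key and engine ID.
url = build_cse_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "creative commons")
print(url)
```

Because each call counts against the 100-queries-per-day limit, building and inspecting URLs before sending them can help avoid wasting quota.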
83+
84+
**Notes:**
85+
- Because of the daily query quota, data collection from Google Custom Search
86+
covers only the 50+ most significant, general categories of CC licenses.
87+
In addition, the first column of the collected data is sorted by license
88+
order of precedence, a result of intermediate data analysis.
89+
90+
[gcs-admin]: https://programmablesearchengine.google.com/controlpanel/all
91+
[google-api-admin]: https://console.cloud.google.com/apis/dashboard
92+
[google-json]: https://developers.google.com/custom-search/v1/reference/rest
93+
[google-api-python]: https://github.com/googleapis/google-api-python-client
94+
[googleapiclient]: http://googleapis.github.io/google-api-python-client/docs/epy/index.html
95+
[gcs-library-ref]: https://googleapis.github.io/google-api-python-client/docs/dyn/
96+
[gcs-cse]: https://googleapis.github.io/google-api-python-client/docs/dyn/customsearch_v1.cse.html
97+
[cse-list]: https://developers.google.com/custom-search/v1/reference/rest/v1/cse/list
98+
[reference-appendix]: https://developers.google.com/custom-search/docs/xml_results_appendices
99+
100+
101+
## Internet Archive Python Interface
102+
103+
**Description:** A Python interface to archive.org for making API requests to
104+
the Internet Archive.
105+
106+
**API documentation link:**
107+
- [internetarchive.Search - Internetarchive: A Python Interface to
108+
archive.org][ia-search]
109+
110+
**API information:**
111+
- No API key required
112+
- No query limits
113+
114+
[ia-search]: https://internetarchive.readthedocs.io/en/stable/internetarchive.html#internetarchive.Search
115+
116+
117+
## MediaWiki Action API
118+
119+
**Description:** _The MediaWiki Action API is a web service that allows access
120+
to some wiki features like authentication, page operations, and search. It can
121+
provide meta information about the wiki and the logged-in user._ ([API:Main
122+
page - MediaWiki](https://www.mediawiki.org/wiki/API:Main_page))
123+
124+
**API documentation link:**
125+
- [MediaWiki Action API](https://www.mediawiki.org/wiki/API:Main_page)
126+
127+
**API information:**
128+
- No API key required
129+
- Query limit: depends on user status and request type
130+
- Data available through XML or JSON format
131+
132+
133+
## The Metropolitan Museum of Art Collection API
134+
135+
**Description:** _The Met’s Open Access datasets are available through our API.
136+
The API (RESTful web service in JSON format) gives access to all of The Met’s
137+
Open Access data and to corresponding high resolution images (JPEG format) that
138+
are in the public domain._ ([The Metropolitan Museum of Art Collection
139+
API](https://metmuseum.github.io/))
140+
141+
**API documentation link:**
142+
- [Latest Updates | The Metropolitan Museum of Art Collection
143+
API](https://metmuseum.github.io/)
144+
145+
**API information:**
146+
- No API key required
147+
- Query limit: 80 requests per second
148+
149+
150+
## Vimeo API
151+
152+
**Description:** The Vimeo API allows users to perform filtered, advanced
153+
searches on Vimeo videos.
154+
155+
**API documentation link:**
156+
- [Getting Started with the Vimeo API](https://developer.vimeo.com/api/start)
157+
158+
**API information:**
159+
- API key required
160+
- Query limit: 5000 authenticated requests per day
161+
- Data available through JSON format
162+
163+
164+
## YouTube Data API
165+
166+
**Description:** An API from YouTube for platform users to upload videos,
167+
adjust video parameters, and obtain search results.
168+
169+
**API documentation link:**
170+
- [Search: list | YouTube Data API | Google
171+
Developers](https://developers.google.com/youtube/v3/docs/search/list)
172+
173+
**API information:**
174+
- API key required
175+
- Query limit: depends on the type and number of requests
176+
- Data available through JSON format
177+
178+
## 📖 Wikipedia Data Source
179+
180+
Quantifying now supports fetching data from Wikipedia as an additional source alongside GitHub and Google Custom Search.
181+
182+
### Available Statistics
183+
184+
- **Number of articles** – Total articles on Wikipedia.
185+
- **Number of pages** – Total pages, including non-article pages.
186+
- **Number of edits** – Total edits across Wikipedia.
187+
- **Number of users** – Total registered users.
188+
- **Number of images** – Total uploaded images.
189+
- **Keyword-based counts** – Number of articles referencing specific Creative Commons licenses or keywords.
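
The site-wide counts listed above correspond to fields returned by the MediaWiki Action API's `siteinfo` statistics query. The sketch below shows how such a request URL could be built and how the relevant fields could be pulled out of a response. It is not the implementation of `scripts.wikipedia_fetch`; the helper names are hypothetical, the English Wikipedia endpoint is an assumption, and the sample response contains made-up placeholder numbers.

```python
from urllib.parse import urlencode

# Assumption: English Wikipedia; other wikis expose the same API path.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_statistics_url(api_url: str = API_URL) -> str:
    """Build the Action API URL for site-wide statistics (hypothetical helper)."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    return f"{api_url}?{urlencode(params)}"


def extract_statistics(response: dict) -> dict:
    """Keep only the statistics listed in this section (hypothetical helper)."""
    stats = response["query"]["statistics"]
    wanted = ("articles", "pages", "edits", "users", "images")
    return {name: stats[name] for name in wanted}


# Placeholder response with made-up numbers, shaped like the API's JSON.
sample = {
    "query": {
        "statistics": {
            "pages": 10,
            "articles": 4,
            "edits": 25,
            "images": 2,
            "users": 7,
            "activeusers": 3,
        }
    }
}
print(build_statistics_url())
print(extract_statistics(sample))
```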
190+
191+
### Example Usage
192+
193+
```python
194+
from scripts.wikipedia_fetch import get_site_statistics, search_articles_count, fetch_cc_related_statistics
195+
196+
# General Wikipedia statistics
197+
stats = get_site_statistics()
198+
print("Wikipedia Site Stats:", stats)
199+
200+
# Count articles containing a specific keyword
201+
cc_articles = search_articles_count("Creative Commons")
202+
print("Articles with 'Creative Commons':", cc_articles)
203+
204+
# Fetch counts for various Creative Commons licenses
205+
cc_stats = fetch_cc_related_statistics()
206+
for license_name, count in cc_stats.items():
207+
    print(f"{license_name}: {count}")
208+
```
