Commit 51c0ece

Merge branch 'main' into Wikipedia/processing

2 parents 701fbfc + e84b028

File tree

10 files changed: 163 additions, 89 deletions

.cc-metadata.yml

Lines changed: 0 additions & 2 deletions
@@ -6,5 +6,3 @@ english_name: Quantifying the Commons
 technologies: Python
 # Whether this repository should be featured on the CC Open Source site
 featured: true
-# Slack channel name
-slack: "cc-dev-quantifying"
.github/workflows/test_scripts_help.yml

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+name: Test scripts' help
+
+on:
+  push:
+  workflow_dispatch:
+
+jobs:
+  job:
+    runs-on: ubuntu-latest
+
+    steps:
+
+      # https://github.com/actions/setup-python
+      - name: Install Python 3.11
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+
+      - name: Install pipenv
+        run: |
+          pip install --upgrade pip
+          pip install pipenv
+
+      # https://github.com/actions/checkout
+      - name: Checkout quantifying
+        uses: actions/checkout@v4
+
+      - name: Install Python dependencies
+        run: |
+          pipenv sync --system
+
+      - name: Test scripts' help
+        run: |
+          ./dev/test_scripts_help.sh

dev/test_scripts_help.sh

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+#
+# Ensure each script can display help message to ensure basic execution.
+#
+# This script must be run from within the pipenv shell or a properly configured
+# environment. For example:
+#
+# 1. Using pipenv run
+#        pipenv run ./dev/test_scripts_help.sh
+#
+# 2. Using pipenv shell
+#        pipenv shell
+#        ./dev/test_scripts_help.sh
+#
+# 3. A properly configured environment
+#    (see .github/workflows/test_scripts_help.yml)
+#
+#### SETUP ####################################################################
+
+set -o errexit
+set -o errtrace
+set -o nounset
+
+# shellcheck disable=SC2154
+trap '_es=${?};
+    printf "${0}: line ${LINENO}: \"${BASH_COMMAND}\"";
+    printf " exited with a status of ${_es}\n";
+    exit ${_es}' ERR
+
+DIR_REPO="$(cd -P -- "${0%/*}/.." && pwd -P)"
+EXIT_STATUS=0
+
+#### FUNCTIONS ################################################################
+
+test_help() {
+    local _es _script
+    for _script in $(find scripts/?-* -type f -name '*.py' | sort)
+    do
+        _es=0
+        ./"${_script}" --help &>/dev/null || _es=${?}
+        if (( _es == 0 ))
+        then
+            echo "${_script}"
+        else
+            echo "${_script}"
+            EXIT_STATUS=${_es}
+        fi
+    done
+}
+
+#### MAIN #####################################################################
+
+cd "${DIR_REPO}"
+test_help
+echo "exit status: ${EXIT_STATUS}"
+exit ${EXIT_STATUS}

scripts/1-fetch/arxiv_fetch.py

Lines changed: 2 additions & 17 deletions
@@ -22,8 +22,6 @@
 from pygments import highlight
 from pygments.formatters import TerminalFormatter
 from pygments.lexers import PythonTracebackLexer
-from requests.adapters import HTTPAdapter
-from urllib3.util.retry import Retry

 # Add parent directory so shared can be imported
 sys.path.append(os.path.join(os.path.dirname(__file__), ".."))
@@ -36,7 +34,7 @@

 # Constants
 # API Configuration
-BASE_URL = "http://export.arxiv.org/api/query?"
+BASE_URL = "https://export.arxiv.org/api/query?"
 DEFAULT_FETCH_LIMIT = 800  # Default total papers to fetch

 # CSV Headers
@@ -335,19 +333,6 @@ def initialize_all_data_files(args):
     initialize_data_file(FILE_ARXIV_AUTHOR_BUCKET, HEADER_AUTHOR_BUCKET)


-def get_requests_session():
-    """Create request session with retry logic"""
-    retry_strategy = Retry(
-        total=5,
-        backoff_factor=10,
-        status_forcelist=shared.STATUS_FORCELIST,
-    )
-    session = requests.Session()
-    session.headers.update({"User-Agent": shared.USER_AGENT})
-    session.mount("https://", HTTPAdapter(max_retries=retry_strategy))
-    return session
-
-
 def normalize_license_text(raw_text):
     """
     Convert raw license text to standardized CC license identifiers.
@@ -533,7 +518,7 @@ def query_arxiv(args):
     """

     LOGGER.info("Beginning to fetch results from ArXiv API")
-    session = get_requests_session()
+    session = shared.get_session()

     results_per_iteration = 50

scripts/1-fetch/europeana_fetch.py

Lines changed: 1 addition & 15 deletions
@@ -23,7 +23,6 @@
 from pygments import highlight
 from pygments.formatters import TerminalFormatter
 from pygments.lexers import PythonTracebackLexer
-from requests.adapters import HTTPAdapter, Retry

 # Add parent directory for shared imports
 sys.path.append(os.path.join(os.path.dirname(__file__), ".."))
@@ -103,19 +102,6 @@ def parse_arguments():
     return args


-def get_requests_session():
-    """Create a requests session with retry."""
-    max_retries = Retry(
-        total=5, backoff_factor=10, status_forcelist=shared.STATUS_FORCELIST
-    )
-    session = requests.Session()
-    session.mount("https://", HTTPAdapter(max_retries=max_retries))
-    session.headers.update(
-        {"accept": "application/json", "User-Agent": shared.USER_AGENT}
-    )
-    return session
-
-
 def simplify_legal_tool(legal_tool):
     """Simplify license URLs into human-readable labels
@@ -433,7 +419,7 @@ def main():
             "EUROPEANA_API_KEY not found in environment variables", 1
         )

-    session = get_requests_session()
+    session = shared.get_session(accept_header="application/json")

     # Fetch facet lists once, including counts
     providers_full = get_facet_list(session, "DATA_PROVIDER")

scripts/1-fetch/github_fetch.py

Lines changed: 6 additions & 22 deletions
@@ -17,8 +17,6 @@
 from pygments import highlight
 from pygments.formatters import TerminalFormatter
 from pygments.lexers import PythonTracebackLexer
-from requests.adapters import HTTPAdapter
-from urllib3.util.retry import Retry

 # Add parent directory so shared can be imported
 sys.path.append(os.path.join(os.path.dirname(__file__), ".."))
@@ -80,25 +78,6 @@ def check_for_completion():
         pass  # File may not be found without --enable-save, etc.


-def get_requests_session():
-    max_retries = Retry(
-        total=5,
-        backoff_factor=10,
-        status_forcelist=shared.STATUS_FORCELIST,
-    )
-    session = requests.Session()
-    session.mount("https://", HTTPAdapter(max_retries=max_retries))
-    headers = {
-        "accept": "application/vnd.github+json",
-        "User-Agent": shared.USER_AGENT,
-    }
-    if GH_TOKEN:
-        headers["authorization"] = f"Bearer {GH_TOKEN}"
-    session.headers.update(headers)
-
-    return session
-
-
 def write_data(args, tool_data):
     if not args.enable_save:
         return args
@@ -162,7 +141,12 @@ def main():
     args = parse_arguments()
     shared.paths_log(LOGGER, PATHS)
     check_for_completion()
-    session = get_requests_session()
+    session = shared.get_session(
+        accept_header="application/vnd.github+json",
+    )
+    if GH_TOKEN:
+        session.headers.update({"authorization": f"Bearer {GH_TOKEN}"})
+
     tool_data = query_github(args, session)
     args = write_data(args, tool_data)
     args = shared.git_add_and_commit(

scripts/1-fetch/openverse_fetch.py

Lines changed: 1 addition & 17 deletions
@@ -25,8 +25,6 @@
 from pygments import highlight
 from pygments.formatters import TerminalFormatter
 from pygments.lexers import PythonTracebackLexer
-from requests.adapters import HTTPAdapter
-from urllib3.util.retry import Retry

 # Add parent directory so shared can be imported
 sys.path.append(os.path.join(os.path.dirname(__file__), ".."))
@@ -83,20 +81,6 @@ def parse_arguments():
     return args


-def get_requests_session():
-    max_retries = Retry(
-        total=5,
-        backoff_factor=10,
-        status_forcelist=shared.STATUS_FORCELIST,
-    )
-    session = requests.Session()
-    session.mount("https://", HTTPAdapter(max_retries=max_retries))
-    session.headers.update(
-        {"accept": "application/json", "User-Agent": shared.USER_AGENT}
-    )
-    return session
-
-
 def get_all_sources_and_licenses(session, media_type):
     """
     Fetch all available sources for a given media_type.
@@ -225,8 +209,8 @@ def write_data(args, data):

 def main():
     args = parse_arguments()
-    session = get_requests_session()
     LOGGER.info("Starting Openverse Fetch Script...")
+    session = shared.get_session(accept_header="application/json")
     records = query_openverse(session)
     write_data(args, records)
     LOGGER.info(f"Fetched {len(records)} unique Openverse records.")

scripts/1-fetch/wikipedia_fetch.py

Lines changed: 2 additions & 16 deletions
@@ -13,12 +13,9 @@
 from operator import itemgetter

 # Third-party
-import requests
 from pygments import highlight
 from pygments.formatters import TerminalFormatter
 from pygments.lexers import PythonTracebackLexer
-from requests.adapters import HTTPAdapter
-from urllib3.util.retry import Retry

 # Add parent directory so shared can be imported
 sys.path.append(os.path.join(os.path.dirname(__file__), ".."))
@@ -66,18 +63,6 @@ def parse_arguments():
     return args


-def get_requests_session():
-    max_retries = Retry(
-        total=5,
-        backoff_factor=10,
-        status_forcelist=shared.STATUS_FORCELIST,
-    )
-    session = requests.Session()
-    session.mount("https://", HTTPAdapter(max_retries=max_retries))
-    session.headers.update({"User-Agent": shared.USER_AGENT})
-    return session
-
-
 def write_data(args, tool_data):
     if not args.enable_save:
         return args
@@ -173,7 +158,8 @@ def main():
     args = parse_arguments()
     shared.paths_log(LOGGER, PATHS)
     shared.git_fetch_and_merge(args, PATHS["repo"])
-    tool_data = query_wikipedia_languages(get_requests_session())
+    session = shared.get_session()
+    tool_data = query_wikipedia_languages(session)
     args = write_data(args, tool_data)
     args = shared.git_add_and_commit(
         args,

scripts/shared.py

Lines changed: 35 additions & 0 deletions
@@ -2,11 +2,15 @@
 import logging
 import os
 import sys
+from collections import OrderedDict
 from datetime import datetime, timezone

 # Third-party
 from git import InvalidGitRepositoryError, NoSuchPathError, Repo
 from pandas import PeriodIndex
+from requests import Session
+from requests.adapters import HTTPAdapter
+from urllib3.util import Retry

 # Constants
 STATUS_FORCELIST = [
@@ -31,6 +35,37 @@ def __init__(self, message, exit_code=None):
         super().__init__(self.message)


+def get_session(accept_header=None, session=None):
+    """
+    Create or configure a reusable HTTPS session with retry logic and
+    appropriate headers.
+    """
+    if session is None:
+        session = Session()
+
+    # Purge default and custom session connection adapters
+    # (With only a https:// adapter, below, unencrypted requests will fail.)
+    session.adapters = OrderedDict()
+
+    # Try again after 0s, 6s, 12s, 24s, 48s (total 90s) for the specified HTTP
+    # error codes (STATUS_FORCELIST)
+    retry_strategy = Retry(
+        total=5,
+        backoff_factor=3,
+        status_forcelist=STATUS_FORCELIST,
+        allowed_methods=["GET", "POST"],
+        raise_on_status=False,
+    )
+    session.mount("https://", HTTPAdapter(max_retries=retry_strategy))
+
+    headers = {"User-Agent": USER_AGENT}
+    if accept_header:
+        headers["accept"] = accept_header
+    session.headers.update(headers)
+
+    return session
+
+
 def git_fetch_and_merge(args, repo_path, branch=None):
     if not args.enable_git:
         return
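The shared.get_session() helper above replaces the per-script get_requests_session() functions removed in the fetch-script diffs. A minimal sketch of how a caller in scripts/1-fetch/ now builds a session, assembled from the call sites changed in this commit; the GH_TOKEN lookup at the end is an assumption added for illustration (github_fetch.py defines GH_TOKEN elsewhere, outside this diff):

# Illustrative sketch only; mirrors the call sites changed in this commit.
import os
import sys

# Add parent directory so shared can be imported (same pattern as the fetch scripts)
sys.path.append(os.path.join(os.path.dirname(__file__), ".."))
import shared  # noqa: E402

# Plain session with retries and the shared User-Agent
# (arxiv_fetch.py, wikipedia_fetch.py)
session = shared.get_session()

# JSON APIs pass an accept header (europeana_fetch.py, openverse_fetch.py)
session = shared.get_session(accept_header="application/json")

# github_fetch.py requests the GitHub media type, then adds its own
# authorization header after the session is created
session = shared.get_session(accept_header="application/vnd.github+json")
GH_TOKEN = os.environ.get("GH_TOKEN", "")  # assumption: defined elsewhere in github_fetch.py
if GH_TOKEN:
    session.headers.update({"authorization": f"Bearer {GH_TOKEN}"})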

sources.md

Lines changed: 26 additions & 0 deletions
@@ -121,6 +121,32 @@ and access towards related query data using a programmable search engine.
 - Data available through JSON format


+## Openverse
+
+**Description:** Openverse is a search engine for openly licensed media,
+including images and audio. It provides access to over 700 million works from
+more than 20 sources, all of which are under Creative Commons licenses or in
+the public domain. The API allows querying for media by source, license type,
+and other parameters. Because anonymous Openverse API access returns a maximum
+of ~240 results per source-license combination, the `openverse_fetch.py`
+script currently provides approximate counts. It does not include pagination or
+license_version breakdown.
+
+**API documentation link:**
+- [Openverse API Documentation](https://api.openverse.org/v1/)
+- [Openverse API Reference](https://wordpress.org/openverse/api/)
+- [Base URL](https://api.openverse.org/v1)
+- [Openverse Frontend](https://openverse.org/)
+
+**API information:**
+- No API key required for basic access
+- Query limit: Rate-limited to prevent abuse (anonymous access provides ~240 results per source-license combination)
+- Data available through JSON format
+- Supports filtering by source, license, media type (images, audio)
+- Media types: `images`, `audio`
+- Supported licenses: `by`, `by-nc`, `by-nc-nd`, `by-nc-sa`, `by-nd`, `by-sa`, `cc0`, `nc-sampling+`, `pdm`, `sampling+`
+
+
 ## Wikipedia

 **Description:** The Wikipedia API allows users to query statistics of pages,
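The new Openverse entry in sources.md describes counting works per source and license. A minimal sketch of such a query, independent of the fetch script; the endpoint path, query parameters, and the result_count response field are assumptions drawn from the Openverse API documentation linked in the entry and should be verified there:

# Illustrative sketch: approximate count of CC BY images from one source.
# Endpoint, parameters, and "result_count" are assumptions based on the
# Openverse API documentation linked above; verify before relying on them.
import requests

BASE_URL = "https://api.openverse.org/v1"

response = requests.get(
    f"{BASE_URL}/images/",
    params={"license": "by", "source": "flickr", "page_size": 1},
    headers={"accept": "application/json", "User-Agent": "quantifying-example"},
    timeout=30,
)
response.raise_for_status()
data = response.json()

# Anonymous access caps results around ~240 per source-license combination,
# so treat the count as approximate (see the description above).
print("approximate result_count:", data.get("result_count"))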
