Conversation

@Goziee-git
Contributor

@Goziee-git Goziee-git commented Oct 11, 2025

Fixes

Description

Implements comprehensive arXiv data collection system to quantify open access academic papers in the commons.

Type of Change

  • New feature implementing data collection from the arXiv open access academic paper repository
  • Data source addition/modification

Changes Made

  • Added arXiv API integration for fetching academic paper metadata, in line with the project requirement to automate fetching of new data sources
  • Implemented a data processing pipeline for arXiv submissions in scripts/1-fetch/arxiv_fetch.py
  • Created filtering logic for open access and CC-licensed papers
  • Added arXiv data to quarterly reporting system

Testing

  • Static analysis passes (./dev/check.sh)
  • arXiv API integration tested with sample queries
  • Data processing validated with test dataset

Data Impact

  • New data source added (arXiv academic papers)
  • Report generation affected (new academic commons metrics)

Related Documentation

  • Updated sources.md with arXiv API credentials setup
  • Added arXiv processing documentation

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI.
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@Goziee-git Goziee-git requested review from a team as code owners October 11, 2025 23:29
@Goziee-git Goziee-git requested review from TimidRobot and possumbilities and removed request for a team October 11, 2025 23:29
@cc-open-source-bot cc-open-source-bot moved this to In review in TimidRobot Oct 11, 2025
@Goziee-git Goziee-git changed the title Add arXiv data fetching and processing functionality Add arXiv data fetching functionality Oct 12, 2025
Member

@TimidRobot TimidRobot left a comment


This is a great start.

I recommend also developing a data/report plan. For example:

  • It is not meaningful to get a count of a single language (though it is worth noting that other languages are not available).
  • Category codes converted to reporting (words and/or abbreviations instead of acronyms)

@TimidRobot

This comment was marked as outdated.

@Goziee-git
Contributor Author

@Goziee-git please follow through on your first pull request (PR) before submitting any more:

Depending on how that one goes, I might reopen this PR.

Hello @TimidRobot, as requested I have made changes based on your review. I understand good work is emphasized over speed, and I hope my attempt to go full circle with the other PR hasn't dented my chances of contributing significantly to the project. Thank you🙏🏼

@TimidRobot TimidRobot reopened this Oct 17, 2025
@TimidRobot
Member

@Goziee-git ok, please focus on this PR

@Goziee-git
Contributor Author

Goziee-git commented Oct 20, 2025

This is a great start.

I recommend also developing a data/report plan. For example:

  • It is not meaningful to get a count of a single language (though it is worth noting that other languages are not available).
  • Category codes converted to reporting (words and/or abbreviations instead of acronyms)

@TimidRobot I have removed the query for languages, as it returns only English. Also worth noting: arXiv accepts papers in other languages but requires that abstracts be submitted in English, so it is impossible to get a good distribution of licenses by language.

Also, as suggested, I converted the category codes to reporting words that are more user-friendly and readable, using an external arxiv_category_map.yml in data/2025Q4/1-fetch. I believe this should make updates reproducible and maintainable over time. The script now produces arxiv_2_count_by_category_report.csv and arxiv_2_count_by_category_report_agg.csv for better reporting. Also, instead of dumping raw author count data as before, I implemented a bucketing approach in arxiv_4_count_by_author_bucket.csv to group author counts into meaningful ranges (1, 2-3, 4-6, 7-10, 11+). The script also generates an arxiv_provenance.json to record metadata for audit, reproducibility, and provenance.
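The external category map described above could be applied along these lines. This is only a sketch: the file path, the example category codes, and the function names are illustrative assumptions, not taken from the PR.

```python
# Sketch of mapping arXiv category codes to reader-friendly names via an
# external YAML file. Paths and names below are illustrative assumptions.


def load_category_map(path="data/arxiv_category_map.yml"):
    """Load the code -> friendly-name mapping from a YAML file."""
    import yaml  # PyYAML; assumed available in the project environment

    with open(path) as file_obj:
        return yaml.safe_load(file_obj)


def friendly_category(code, category_map):
    """Return the friendly name for a category code, falling back to the raw code."""
    return category_map.get(code, code)
```

For example, `friendly_category("cs.AI", mapping)` would yield a readable name such as "Computer Science - Artificial Intelligence", assuming that entry exists in the map; unknown codes pass through unchanged so new arXiv categories do not break reporting.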

@Goziee-git
Contributor Author

Goziee-git commented Oct 20, 2025

Hello @TimidRobot, I observed from multiple results fetched previously that the script failed to detect CC licenses recorded as hyphenated variants (e.g. CC-BY, CC-BY-NC). I have replaced the plain string matching with a compiled regex pattern for more robust license detection.

I have also looked at implementations in other PRs that use the normalize_license_text() function for consistent license identification.

I'd like to know your thoughts on these changes, and I will keep working on further improvements. Thanks.
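The hyphen-tolerant matching described above could be sketched as follows. This is an illustrative sketch, not the PR's actual code: the exact pattern and the shape of normalize_license_text() are assumptions.

```python
import re

# Illustrative sketch: matches "CC BY", "CC-BY", "CC BY-NC-SA", "CC0", etc.,
# tolerating hyphens or spaces between the license components.
CC_LICENSE_RE = re.compile(
    r"\bCC[-\s]?(BY(?:[-\s]?(?:NC|ND|SA))*|0)\b",
    re.IGNORECASE,
)


def normalize_license_text(text):
    """Return a canonical 'CC BY-...' form, or None if no CC license is found."""
    match = CC_LICENSE_RE.search(text)
    if not match:
        return None
    # Split the matched string on hyphens/whitespace and rejoin canonically
    parts = re.split(r"[-\s]+", match.group(0).upper())
    if len(parts) == 1:  # e.g. "CC0"
        return parts[0]
    return parts[0] + " " + "-".join(parts[1:])
```

With this approach, "licensed under CC-BY-NC" and "CC BY-NC 4.0" both normalize to the same "CC BY-NC" key, which is what makes downstream license counts consistent.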

@TimidRobot
Member

Also, as suggested, I converted the category codes to reporting words that are more user-friendly and readable, using an external arxiv_category_map.yml in data/2025Q4/1-fetch. I believe this should make updates reproducible and maintainable over time.

  • How was arxiv_category_map.yml created?
    • If by script, it should probably go in dev/
  • Data that persists should go in data/ not a specific quarter directory

The script now produces arxiv_2_count_by_category_report.csv and arxiv_2_count_by_category_report_agg.csv for better reporting. Also, instead of dumping raw author count data as before, I implemented a bucketing approach in arxiv_4_count_by_author_bucket.csv to group author counts into meaningful ranges (1, 2-3, 4-6, 7-10, 11+).

I'll look at data after outstanding comments are resolved.

The script also generates an arxiv_provenance.json to record metadata for audit, reproducibility, and provenance.

I'll look at data after outstanding comments are resolved. That said, I'm not excited about adding JSON to the project.

@TimidRobot
Member

Hello @TimidRobot, I observed from multiple results fetched previously that the script failed to detect CC licenses recorded as hyphenated variants (e.g. CC-BY, CC-BY-NC). I have replaced the plain string matching with a compiled regex pattern for more robust license detection.

I have also looked at implementations in other PRs that use the normalize_license_text() function for consistent license identification.

I'd like to know your thoughts on these changes, and I will keep working on further improvements. Thanks.

It's probably a good idea to create a function in the shared library eventually. Please leave that to last, however.

Refactor arxiv_fetch.py to use requests library for HTTP requests, implementing retry logic for better error handling. Update license extraction logic and CSV headers to remove PLAN_INDEX.
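A minimal sketch of the requests-based fetching with retry logic mentioned in that commit. The endpoint is the public arXiv API; the retry settings, query parameters, and function names here are illustrative assumptions, not the PR's actual implementation.

```python
# Sketch of HTTP fetching with retries, assuming the public arXiv API.
# Backoff values and function names are illustrative assumptions.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

ARXIV_API_URL = "http://export.arxiv.org/api/query"


def make_session(retries=3, backoff=1.0):
    """Build a requests.Session that retries on transient server errors."""
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=backoff,
        status_forcelist=(429, 500, 502, 503),
    )
    session.mount("http://", HTTPAdapter(max_retries=retry))
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session


def fetch_page(session, start=0, max_results=100):
    """Fetch one page of results; returns the Atom XML response body."""
    response = session.get(
        ARXIV_API_URL,
        params={"search_query": "all", "start": start, "max_results": max_results},
        timeout=30,
    )
    response.raise_for_status()
    return response.text
```

Mounting the retry-aware adapter on the session (rather than wrapping each call in a manual retry loop) keeps the fetch code itself simple and applies the same error handling to every request.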
@Goziee-git
Contributor Author

@TimidRobot Hello, are there any more changes you'd like made beyond the ones already identified and done? Also, if there are no more changes to be made here, what steps would you recommend next for me in the project, with respect to integrating arXiv as a valid source?

@TimidRobot
Member

@Goziee-git Please update sources.md for arXiv

@Goziee-git
Contributor Author

@Goziee-git Please update sources.md for arXiv

@TimidRobot, thanks, I'm happy we've gotten to this point and very excited to continue contributing to the project. I would like your permission to continue working on the processing and reporting scripts for the arXiv data source.

@TimidRobot
Member

@Goziee-git
Contributor Author

Hello @TimidRobot, as per #179 (comment):
I implemented the bucketing of AUTHOR_COUNT because you suggested that merely dumping the data was not insightful, so I thought grouping it would help us generate meaningful insights. Raw author counts would create extremely sparse data with many single-occurrence values, while bucketing ("1", "2-3", "4-6", "7-10", "11+") creates meaningful statistical groups for analysis. It also facilitates visualization and reporting by producing compact, aggregated datasets that are easier to process and analyze, supporting the project's goal of understanding "how knowledge and culture of the commons is distributed".

Although these are my thoughts, I am very happy to hear what you think. These are some helpful links that guided my decision to bucket the AUTHOR_COUNT data:

Scientometrics Journal Guidelines: https://link.springer.com/journal/11192
• Grouping author counts into ranges is standard practice in bibliometric collaboration analysis; it reduces noise and enables meaningful statistical comparisons across disciplines

FAIR Data Principles: https://www.go-fair.org/fair-principles/
• Bucketing supports Findability and Reusability by creating standardized categorical data

NIH Data Management Guidelines: https://sharing.nih.gov/data-management-and-sharing-policy

@Goziee-git
Contributor Author

@Goziee-git please see two unresolved conversations:

  1. Add arXiv data fetching functionality #179 (comment)
  2. Add arXiv data fetching functionality #179 (comment)

@TimidRobot the conversations mentioned here are identical. I'd like to know your preference for AUTHOR_COUNT, especially whether you want it removed or modified in a different way.

@TimidRobot
Member

@Goziee-git Sorry, let me clarify. I think bucketing authors into AUTHOR_BUCKET is a good and helpful course of action. I would also add https://en.wikipedia.org/wiki/Data_binning to the list of references.

I think which buckets are selected could use some refinement. My thinking is that there are three considerations:

  1. plotting
  2. precedence
  3. distribution

Plotting

For plotting, I think around five values display well.

Precedence

For precedence, we can look at the various citation styles:

Distribution

The script can be modified to provide information on all relevant author counts (only results above 1% are shown):

Authors   Count   Percent
3         131     22.55%
2         111     19.10%
4          92     15.83%
1          85     14.63%
5          46      7.92%
6          23      3.96%
7          21      3.61%
8          19      3.27%
9          15      2.58%
10         10      1.72%

Recommendation

I recommend the following buckets:

  • 1 author
  • 2 authors
  • 3 authors
  • 4 authors
  • 5+ authors
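The recommended buckets above could be implemented roughly as follows (the function name is an illustrative assumption):

```python
# Sketch of the recommended author-count binning: 1, 2, 3, 4, 5+.
# The function name is illustrative, not from the PR.
def author_bucket(author_count):
    """Map a raw author count onto the recommended reporting buckets."""
    if author_count >= 5:
        return "5+ authors"
    if author_count == 1:
        return "1 author"
    return f"{author_count} authors"
```

Exact buckets for 1 through 4 preserve the bulk of the distribution shown above (roughly 72% of papers), while the open-ended "5+ authors" bucket keeps the long tail from fragmenting the plot.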

@Goziee-git
Contributor Author

Goziee-git commented Oct 31, 2025

@TimidRobot regarding the ordering of the constants, what would be your preference here? The current order of the constants is:

  1. FILE_ARXIV_COUNT (arxiv_1_count.csv)
  2. FILE_ARXIV_CATEGORY_REPORT (arxiv_2_count_by_category_report.csv)
  3. FILE_ARXIV_YEAR (arxiv_3_count_by_year.csv)
  4. FILE_ARXIV_AUTHOR_BUCKET (arxiv_4_count_by_author_bucket.csv)

This follows the logical/sequential order based on the numbered workflow (1 → 2 → 3 → 4).

in comparison to the gcs_fetch.py (current order):

  1. FILE1_COUNT (gcs_1_count.csv)
  2. FILE2_LANGUAGE (gcs_2_count_by_language.csv)
  3. FILE3_COUNTRY (gcs_3_count_by_country.csv)

Comparison:

  • Both scripts follow the same logical/sequential ordering pattern
  • Both order constants by their numbered workflow sequence (1 → 2 → 3 → 4)
  • Both use the numbering in the filename to determine order

Do you prefer some other order here, so I can be guided to the best course of action?

@Babi-B
Contributor

Babi-B commented Oct 31, 2025

Hi @Goziee-git !

Order or sort alphabetically.

FILE_ARXIV_COUNT (arxiv_1_count.csv) shouldn't come before FILE_ARXIV_CATEGORY_REPORT

@Goziee-git
Contributor Author

Hi @Goziee-git !

Order or sort alphabetically.

FILE_ARXIV_COUNT (arxiv_1_count.csv) shouldn't come before FILE_ARXIV_CATEGORY_REPORT

@Babi-B, thanks for the clarity, I appreciate your reviews here.

@Goziee-git
Contributor Author

@Babi-B, as per #185, it looks like you haven't made additions to sources.md. A moment ago I tried to read through it and found no links or references. I think you should update that, or would you like me to go ahead with it? 🚀🚀🚀🚀🚀

@TimidRobot
Member

Both use the numbering in the filename to determine order

@Goziee-git The difference is that gcs_fetch.py puts the number in the constant name itself, so the constants follow the workflow order when sorted alphabetically:

FILE1_COUNT
FILE2_LANGUAGE
FILE3_COUNTRY

(the digit after FILE is what keeps the alphabetical sort aligned with the flow)

Member

@TimidRobot TimidRobot left a comment


Great work, thank you!!

@TimidRobot TimidRobot merged commit 0d44547 into creativecommons:main Nov 1, 2025

Development

Successfully merging this pull request may close these issues.

Integrate arXiv as data source for academic commons quantification

3 participants