Conversation

@Goziee-git
Contributor

@Goziee-git Goziee-git commented Oct 11, 2025

Fixes

Description

Implements comprehensive arXiv data collection system to quantify open access academic papers in the commons.

Type of Change

  • New feature implementing data collection from the arXiv open access academic paper repository
  • Data source addition/modification

Changes Made

  • Added arXiv API integration for fetching academic paper metadata, in line with the project requirement to automate fetching of new data sources
  • Implemented a data processing pipeline for arXiv submissions in scripts/1-fetch/arxiv_fetch.py
  • Created filtering logic for open access and CC-licensed papers
  • Added arXiv data to quarterly reporting system

Testing

  • Static analysis passes (./dev/check.sh)
  • arXiv API integration tested with sample queries
  • Data processing validated with test dataset

Data Impact

  • New data source added (arXiv academic papers)
  • Report generation affected (new academic commons metrics)

Related Documentation

  • Updated sources.md with arXiv API credentials setup
  • Added arXiv processing documentation

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI.
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@Goziee-git Goziee-git requested review from a team as code owners October 11, 2025 23:29
@Goziee-git Goziee-git requested review from TimidRobot and possumbilities and removed request for a team October 11, 2025 23:29
@cc-open-source-bot cc-open-source-bot moved this to In review in TimidRobot Oct 11, 2025
@Goziee-git Goziee-git changed the title Add arXiv data fetching and processing functionality Add arXiv data fetching functionality Oct 12, 2025
Member

@TimidRobot TimidRobot left a comment


This is a great start.

I recommend also developing a data/report plan. For example:

  • It is not meaningful to get a count of a single language (though it is worth noting that other languages are not available).
  • Category codes converted to reporting (words and/or abbreviations instead of acronyms)

@TimidRobot

This comment was marked as outdated.

@Goziee-git
Contributor Author

@Goziee-git please follow through on your first pull request (PR) before submitting any more:

Depending on how that one goes, I might reopen this PR.

Hello @TimidRobot, as requested I have made changes based on your review. I understand good work is emphasized over speed, and I hope my attempt to go full circle with the other PR hasn't dented my chances of contributing significantly to the project. Thank you🙏🏼

@TimidRobot TimidRobot reopened this Oct 17, 2025
@TimidRobot
Member

@Goziee-git ok, please focus on this PR

@Goziee-git
Contributor Author

Goziee-git commented Oct 20, 2025

This is a great start.

I recommend also developing a data/report plan. For example:

  • It is not meaningful to get a count of a single language (though it is worth noting that other languages are not available).
  • Category codes converted to reporting (words and/or abbreviations instead of acronyms)

@TimidRobot I have removed the query for languages, as it returns only English. Also worth noting: arXiv accepts papers in other languages but requires that abstracts be submitted in English, so it is impossible to get a good distribution of licenses by language.

Also, as suggested, I converted the category codes to reporting words that are more user-friendly and readable, using an external arxiv_category_map.yml in data/2025Q4/1-fetch. I believe this should make updates reproducible and maintainable over time. The script now produces arxiv_2_count_by_category_report.csv and arxiv_2_count_by_category_report_agg.csv for better reporting. Also, instead of dumping raw author count data as before, I implemented a bucketing approach in arxiv_4_count_by_author_bucket.csv to group author counts into meaningful ranges (1, 2-3, 4-6, 7-10, 11+). The script also generates an arxiv_provenance.json to record metadata for audit, reproducibility, and provenance.
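The external category map described above could be applied along these lines. This is only a sketch: the file path, the example category codes, and the function names are illustrative assumptions, not taken from the PR.

```python
# Sketch of mapping arXiv category codes to reader-friendly names via an
# external YAML file. Paths and names below are illustrative assumptions.


def load_category_map(path="data/arxiv_category_map.yml"):
    """Load the code -> friendly-name mapping from a YAML file."""
    import yaml  # PyYAML; assumed available in the project environment

    with open(path) as file_obj:
        return yaml.safe_load(file_obj)


def friendly_category(code, category_map):
    """Return the friendly name for a category code, falling back to the raw code."""
    return category_map.get(code, code)
```

For example, `friendly_category("cs.AI", mapping)` would yield a readable name such as "Computer Science - Artificial Intelligence", assuming that entry exists in the map; unknown codes pass through unchanged so new arXiv categories do not break reporting.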

@Goziee-git
Contributor Author

Goziee-git commented Oct 20, 2025

Hello @TimidRobot, I observed from multiple results fetched previously that the script failed to detect CC licenses recorded as hyphenated variants (e.g. CC-BY, CC-BY-NC). I have replaced the plain string matching with a compiled regex pattern for more robust license detection.

I have also looked at implementations in other PRs that use the normalize_license_text() function for consistent license identification.

I'd like to know your thoughts on these changes, and I will keep working on further improvements. Thanks.
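The hyphen-tolerant matching described above could be sketched as follows. This is an illustrative sketch, not the PR's actual code: the exact pattern and the shape of normalize_license_text() are assumptions.

```python
import re

# Illustrative sketch: matches "CC BY", "CC-BY", "CC BY-NC-SA", "CC0", etc.,
# tolerating hyphens or spaces between the license components.
CC_LICENSE_RE = re.compile(
    r"\bCC[-\s]?(BY(?:[-\s]?(?:NC|ND|SA))*|0)\b",
    re.IGNORECASE,
)


def normalize_license_text(text):
    """Return a canonical 'CC BY-...' form, or None if no CC license is found."""
    match = CC_LICENSE_RE.search(text)
    if not match:
        return None
    # Split the matched string on hyphens/whitespace and rejoin canonically
    parts = re.split(r"[-\s]+", match.group(0).upper())
    if len(parts) == 1:  # e.g. "CC0"
        return parts[0]
    return parts[0] + " " + "-".join(parts[1:])
```

With this approach, "licensed under CC-BY-NC" and "CC BY-NC 4.0" both normalize to the same "CC BY-NC" key, which is what makes downstream license counts consistent.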

@TimidRobot
Member

Also, as suggested, I converted the category codes to reporting words that are more user-friendly and readable, using an external arxiv_category_map.yml in data/2025Q4/1-fetch. I believe this should make updates reproducible and maintainable over time.

  • How was arxiv_category_map.yml created?
    • If by script, it should probably go in dev/
  • Data that persists should go in data/ not a specific quarter directory

The script now produces arxiv_2_count_by_category_report.csv and arxiv_2_count_by_category_report_agg.csv for better reporting. Also, instead of dumping raw author count data as before, I implemented a bucketing approach in arxiv_4_count_by_author_bucket.csv to group author counts into meaningful ranges (1, 2-3, 4-6, 7-10, 11+).

I'll look at data after outstanding comments are resolved.

The script also generates an arxiv_provenance.json to record metadata for audit, reproducibility, and provenance.

I'll look at data after outstanding comments are resolved. That said, I'm not excited about adding JSON to the project.

@TimidRobot
Member

Hello @TimidRobot, I observed from multiple results fetched previously that the script failed to detect CC licenses recorded as hyphenated variants (e.g. CC-BY, CC-BY-NC). I have replaced the plain string matching with a compiled regex pattern for more robust license detection.

I have also looked at implementations in other PRs that use the normalize_license_text() function for consistent license identification.

I'd like to know your thoughts on these changes, and I will keep working on further improvements. Thanks.

It's probably a good idea to create a function in the shared library eventually. Please leave that to last, however.

Refactor arxiv_fetch.py to use requests library for HTTP requests, implementing retry logic for better error handling. Update license extraction logic and CSV headers to remove PLAN_INDEX.
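A minimal sketch of the requests-based fetching with retry logic mentioned in that commit. The endpoint is the public arXiv API; the retry settings, query parameters, and function names here are illustrative assumptions, not the PR's actual implementation.

```python
# Sketch of HTTP fetching with retries, assuming the public arXiv API.
# Backoff values and function names are illustrative assumptions.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

ARXIV_API_URL = "http://export.arxiv.org/api/query"


def make_session(retries=3, backoff=1.0):
    """Build a requests.Session that retries on transient server errors."""
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=backoff,
        status_forcelist=(429, 500, 502, 503),
    )
    session.mount("http://", HTTPAdapter(max_retries=retry))
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session


def fetch_page(session, start=0, max_results=100):
    """Fetch one page of results; returns the Atom XML response body."""
    response = session.get(
        ARXIV_API_URL,
        params={"search_query": "all", "start": start, "max_results": max_results},
        timeout=30,
    )
    response.raise_for_status()
    return response.text
```

Mounting the retry-aware adapter on the session (rather than wrapping each call in a manual retry loop) keeps the fetch code itself simple and applies the same error handling to every request.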
@Goziee-git
Contributor Author

@TimidRobot Hello, are there any more changes you'd like made beyond the ones already identified and done? Also, if there are no more changes to be made here, what steps would you recommend next for me in the project, with respect to integrating arXiv as a valid source?

@TimidRobot
Member

@Goziee-git Please update sources.md for arXiv

@Goziee-git
Contributor Author

@Goziee-git Please update sources.md for arXiv

@TimidRobot, thanks, I'm happy we've gotten to this point and very excited to continue contributing to the project. I would like your permission to continue working on the processing and reporting scripts for the arXiv data source.

@TimidRobot
Member

@Goziee-git
Contributor Author

Hello @TimidRobot, as per #179 (comment):
I implemented the bucketing of AUTHOR_COUNT because you suggested that merely dumping the data was not insightful, so I thought grouping it would help us generate meaningful insights. Raw author counts would create extremely sparse data with many single-occurrence values, while bucketing ("1", "2-3", "4-6", "7-10", "11+") creates meaningful statistical groups for analysis. It also facilitates visualization and reporting by producing compact, aggregated datasets that are easier to process and analyze, supporting the project's goal of understanding "how knowledge and culture of the commons is distributed".

Although these are my thoughts, I am very happy to hear what you think. These are some helpful links that guided my decision to bucket the AUTHOR_COUNT data:

Scientometrics Journal Guidelines: https://link.springer.com/journal/11192
• Grouping author counts into ranges is standard practice in bibliometric collaboration analysis; it reduces noise and enables meaningful statistical comparisons across disciplines

FAIR Data Principles: https://www.go-fair.org/fair-principles/
• Bucketing supports Findability and Reusability by creating standardized categorical data

NIH Data Management Guidelines: https://sharing.nih.gov/data-management-and-sharing-policy

@Goziee-git
Contributor Author

@Goziee-git please see two unresolved conversations:

  1. Add arXiv data fetching functionality #179 (comment)
  2. Add arXiv data fetching functionality #179 (comment)

@TimidRobot the conversations mentioned here are identical. I'd like to know your preference for AUTHOR_COUNT, especially whether you want it removed or modified in a different way.

@TimidRobot
Member

@Goziee-git Sorry, let me clarify. I think bucketing authors into AUTHOR_BUCKET is a good and helpful course of action. I would also add https://en.wikipedia.org/wiki/Data_binning to the list of references.

I think which buckets are selected could use some refinement. My thinking is that there are three considerations:

  1. plotting
  2. precedence
  3. distribution

Plotting

For plotting, I think around five values display well.

Precedence

For precedence, we can look at the various citation styles:

Distribution

The script can be modified to provide information on all relevant author counts (only results above 1% are shown):

Authors   Count   Percent
3         131     22.55%
2         111     19.10%
4          92     15.83%
1          85     14.63%
5          46      7.92%
6          23      3.96%
7          21      3.61%
8          19      3.27%
9          15      2.58%
10         10      1.72%

Recommendation

I recommend the following buckets:

  • 1 author
  • 2 authors
  • 3 authors
  • 4 authors
  • 5+ authors
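The recommended buckets above could be implemented roughly as follows (the function name is an illustrative assumption):

```python
# Sketch of the recommended author-count binning: 1, 2, 3, 4, 5+.
# The function name is illustrative, not from the PR.
def author_bucket(author_count):
    """Map a raw author count onto the recommended reporting buckets."""
    if author_count >= 5:
        return "5+ authors"
    if author_count == 1:
        return "1 author"
    return f"{author_count} authors"
```

Exact buckets for 1 through 4 preserve the bulk of the distribution shown above (roughly 72% of papers), while the open-ended "5+ authors" bucket keeps the long tail from fragmenting the plot.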

@Goziee-git
Contributor Author

Goziee-git commented Oct 31, 2025

@TimidRobot regarding the ordering of the constants, what would be your preference here? The current order of the constants is:

  1. FILE_ARXIV_COUNT (arxiv_1_count.csv)
  2. FILE_ARXIV_CATEGORY_REPORT (arxiv_2_count_by_category_report.csv)
  3. FILE_ARXIV_YEAR (arxiv_3_count_by_year.csv)
  4. FILE_ARXIV_AUTHOR_BUCKET (arxiv_4_count_by_author_bucket.csv)

This follows the logical/sequential order based on the numbered workflow (1 → 2 → 3 → 4).

in comparison to the gcs_fetch.py (current order):

  1. FILE1_COUNT (gcs_1_count.csv)
  2. FILE2_LANGUAGE (gcs_2_count_by_language.csv)
  3. FILE3_COUNTRY (gcs_3_count_by_country.csv)

Comparison:

  • Both scripts follow the same logical/sequential ordering pattern
  • Both order constants by their numbered workflow sequence (1 → 2 → 3 → 4)
  • Both use the numbering in the filename to determine order

Do you prefer some other order here, so I can be guided to the best course of action?

@Babi-B
Contributor

Babi-B commented Oct 31, 2025

Hi @Goziee-git !

Order or sort alphabetically.

FILE_ARXIV_COUNT (arxiv_1_count.csv) shouldn't come before FILE_ARXIV_CATEGORY_REPORT

@Goziee-git
Contributor Author

Hi @Goziee-git !

Order or sort alphabetically.

FILE_ARXIV_COUNT (arxiv_1_count.csv) shouldn't come before FILE_ARXIV_CATEGORY_REPORT

@Babi-B, thanks for the clarity, I appreciate your reviews here.

@Goziee-git
Contributor Author

@Babi-B, as per #185, it looks like you haven't made additions to sources.md. A moment ago I tried to read through it and found no links or references. I think you should update that, or would you like me to go ahead with it? 🚀🚀🚀🚀🚀

@TimidRobot
Member

Both use the numbering in the filename to determine order

@Goziee-git The difference is that gcs_fetch.py puts the number in the constant name itself, so the constants follow the workflow order when sorted alphabetically:

FILE1_COUNT
FILE2_LANGUAGE
FILE3_COUNTRY

(the digit after FILE is what keeps the alphabetical sort aligned with the flow)

Member

@TimidRobot TimidRobot left a comment


Great work, thank you!!

@TimidRobot TimidRobot merged commit 0d44547 into creativecommons:main Nov 1, 2025

Development

Successfully merging this pull request may close these issues.

Integrate arXiv as data source for academic commons quantification

3 participants