Add arXiv data fetching functionality #179
Merged
Changes from 26 commits
Commits
64 commits
0783ca9 Add arXiv data fetching and processing functionality (Goziee-git)
6e16e63 Delete scripts/3-report/gcs_report.py (Goziee-git)
9ea0071 added fixes: refactor request and license extraction logic (Goziee-git)
f8c9774 Refactor HTTP requests and license extraction logic. (Goziee-git)
5348e8b refactord to use url library, enhanced retry and extraction logic (Goziee-git)
41ff221 Enhance ArXiv script with category reporting, author bucketing, and i… (Goziee-git)
2fd81a1 Delete data/2025Q4/1-fetch/arxiv_2_count_by_language.csv (Goziee-git)
9d09d9d Delete data/2025Q4/1-fetch/arxiv_2_count_by_category.csv (Goziee-git)
300cb25 modified regex pattern Creative Commons to Unknown CC legat tool (Goziee-git)
8f6a409 Remove HTTP adapter configuration to ensure all API calls use HTTPS (Goziee-git)
70d191f Add User-Agent header and remove HTTP adapter for HTTPS-only requests (Goziee-git)
a544cab Add category converter in /dev called by arxiv_fetch.py to generate u… (Goziee-git)
4fb8f30 Fix converter output to use correct filename and location in data/202… (Goziee-git)
587e2e0 Fix static analysis issues in arxiv_fetch.py - line length and format… (Goziee-git)
7dbf3c0 Fix static analysis issues in arxiv_category_converter.py - formattin… (Goziee-git)
03b7c69 Convert provenance output from JSON to YAML and store in /data directory (Goziee-git)
4980324 Restore gcs_report.py to upstream version (Goziee-git)
a425ee5 Delete arxiv_fetch.py (Goziee-git)
5bb4144 Add data files to gitignore to prevent accidental commits (Goziee-git)
c531b75 Delete data/2025Q4/1-fetch/arxiv_1_count.csv (Goziee-git)
1de32c7 Delete data/2025Q4/1-fetch/arxiv_3_count_by_country.csv (Goziee-git)
d8724bb Delete data/2025Q4/1-fetch/arxiv_3_count_by_year.csv (Goziee-git)
386989a Delete data/2025Q4/1-fetch/arxiv_4_count_by_author_count.csv (Goziee-git)
f14e4ce Delete .gitignore (Goziee-git)
f9e5ae7 Merge remote-tracking branch 'upstream/main' into feature/arxiv (Goziee-git)
6769a33 Merge branch 'feature/arxiv' of https://github.com/Goziee-git/quantif… (Goziee-git)
95df48a Improve arxiv_fetch.py: add debug logging, organize constants, use sh… (Goziee-git)
267105b Add PyYAML and feedparser dependencies for ArXiv functionality (Goziee-git)
0aa919c Update arxiv_fetch.py (Goziee-git)
9b83241 Remove shebang from imported module (Goziee-git)
58a9f99 Remove type hints from arxiv_fetch.py (Goziee-git)
7defab5 Add logging and fix silent exception handling in arxiv_category_conve… (Goziee-git)
076b95a feat: centralize ArXiv category management in shared.py (Goziee-git)
9183088 refactor: use shared module for comprehensive ArXiv categories (Goziee-git)
2ef6c6f refactor: use shared category functions in arxiv_fetch.py (Goziee-git)
b2f96f9 Refactor arxiv_fetch.py: move CATEGORIES constant local, reorganize c… (Goziee-git)
0798f0d Delete dev/arxiv_category_converter.py (Goziee-git)
ba8bce7 Delete dev/create_arxiv_category_map.py (Goziee-git)
69e86be Delete scripts/shared.py (Goziee-git)
37214af Restore scripts/shared.py - required dependency for arxiv_fetch.py (Goziee-git)
9993111 Revert shared.py to pre-category state, make arxiv_fetch.py fully sel… (Goziee-git)
cf29f8d Add .gitignore file (Goziee-git)
1880043 Revert "Add .gitignore file" (Goziee-git)
df6fe6b Replace HTTP retry and API constants with literal values (Goziee-git)
0414859 Remove PERCENT column and aggregated category report generation (Goziee-git)
f304aaa Move provenance file to quarterly data directory (Goziee-git)
704fa28 Move script execution log to main function (Goziee-git)
ba74f05 Clarify limit argument help text and add documentation (Goziee-git)
d04179b Replace consecutive calls logging with per-query result summary (Goziee-git)
76d2184 Reorganize constants in logical order and fix static analysis issues (Goziee-git)
a9d91de Fix error handling in arxiv_fetch.py to raise QuantifyingException (Goziee-git)
c2fcfd8 Remove verbose per-paper logging in arxiv_fetch.py (Goziee-git)
32a8c60 Revert .gitignore changes (f14e4ce and 5bb4144) (Goziee-git)
5317b77 chore: Fix encoding and newlines in arxiv_fetch.py per issue #217 (Goziee-git)
5953472 Add arXiv source (Goziee-git)
faa2b27 Update arXiv documentation links (Goziee-git)
6d50654 Refine author count bucketing to individual buckets for 1-4 authors a… (Goziee-git)
f554e91 Merge branch 'creativecommons:main' into feature/arxiv (Goziee-git)
6ddde78 Remove redundant None check in bucket_author_count function (Goziee-git)
bef203f Refactor: alphabetize file path constants in arxiv_fetch.py (Goziee-git)
7adc610 Merge branch 'main' into feature/arxiv (TimidRobot)
8a058eb order soruces and cleanup formatting and labeling (TimidRobot)
631df48 use standard backoff_factor=10 (TimidRobot)
01f4e01 order/sort data (TimidRobot)
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,73 @@ | ||
| #!/usr/bin/env python | ||
| """ | ||
| ArXiv category code to user-friendly name converter. | ||
| Called by arxiv_fetch.py to convert category codes to readable names. | ||
| """ | ||
| # Standard library | ||
| import csv | ||
| import os | ||
| | ||
| # Third-party | ||
| import yaml | ||
| | ||
| | ||
| def load_category_mapping(data_dir): | ||
|     """Load category code to label mapping from YAML file.""" | ||
|     mapping_file = os.path.join(data_dir, "arxiv_category_map.yaml") | ||
| | ||
|     if not os.path.exists(mapping_file): | ||
|         return {} | ||
| | ||
|     try: | ||
|         with open(mapping_file, "r") as f: | ||
|             return yaml.safe_load(f) or {} | ||
|     except Exception: | ||
|         return {} | ||
| | ||
| | ||
| def convert_categories_to_friendly_names(input_file, output_file, data_dir): | ||
|     """ | ||
|     Convert category codes in CSV to user-friendly names. | ||
| | ||
|     Args: | ||
|         input_file: Path to input CSV with category codes | ||
|         output_file: Path to output CSV with friendly names | ||
|         data_dir: Directory containing arxiv_category_map.yaml | ||
|     """ | ||
|     if not os.path.exists(input_file): | ||
|         return | ||
| | ||
|     # Load category mapping | ||
|     category_mapping = load_category_mapping(data_dir) | ||
| | ||
|     with ( | ||
|         open(input_file, "r") as infile, | ||
|         open(output_file, "w", newline="") as outfile, | ||
|     ): | ||
|         reader = csv.DictReader(infile) | ||
| | ||
|         # Create new fieldnames with both code and label | ||
|         fieldnames = [] | ||
|         for field in reader.fieldnames: | ||
|             fieldnames.append(field) | ||
|             if field == "CATEGORY": | ||
|                 fieldnames.append("CATEGORY_LABEL") | ||
| | ||
|         writer = csv.DictWriter(outfile, fieldnames=fieldnames, dialect="unix") | ||
|         writer.writeheader() | ||
| | ||
|         for row in reader: | ||
|             if "CATEGORY" in row: | ||
|                 category_code = row["CATEGORY"] | ||
|                 # Convert code to label, fallback to uppercase first part | ||
|                 category_label = category_mapping.get( | ||
|                     category_code, | ||
|                     ( | ||
|                         category_code.split(".")[0].upper() | ||
|                         if category_code and "." in category_code | ||
|                         else category_code | ||
|                     ), | ||
|                 ) | ||
|                 row["CATEGORY_LABEL"] = category_label | ||
| | ||
|             writer.writerow(row) | ||
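
For readers skimming the diff, the label lookup with its fallback can be tried in isolation. The following is only a sketch: `label_for` and `EXAMPLE_MAPPING` are hypothetical names introduced here, and the single mapping entry is a made-up example, since the contents of `arxiv_category_map.yaml` are not shown in this diff.

```python
# Hypothetical stand-in for the YAML mapping; the real file's contents
# are not part of this diff.
EXAMPLE_MAPPING = {"cs.AI": "Artificial Intelligence"}


def label_for(category_code, category_mapping):
    """Mirror the lookup used in convert_categories_to_friendly_names()."""
    if category_code and "." in category_code:
        # Unmapped dotted codes fall back to the uppercased archive prefix.
        fallback = category_code.split(".")[0].upper()
    else:
        # Codes without a dot (or empty values) are passed through unchanged.
        fallback = category_code
    return category_mapping.get(category_code, fallback)


if __name__ == "__main__":
    for code in ("cs.AI", "math.CO", "hep-th", ""):
        print(repr(code), "->", repr(label_for(code, EXAMPLE_MAPPING)))
```

Under this sketch a mapped code gets its label, an unmapped dotted code such as `math.CO` becomes `MATH`, and an undotted code such as `hep-th` is returned as-is.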