Add arXiv data fetching functionality #179
TimidRobot
left a comment
This is a great start.
I recommend also developing a data/report plan. For example:
- It is not meaningful to get a count of a single language (though it is worth noting that other languages are not available).
- Convert category codes to report-friendly labels (words and/or abbreviations instead of acronyms).
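As a rough illustration of the second point, category codes could be translated through a lookup table. This is only a sketch: `CATEGORY_LABELS` and `category_label` are hypothetical names, and the table covers just a handful of arXiv codes rather than the full taxonomy.

```python
# Hypothetical lookup table (assumption): a small, illustrative subset of
# arXiv category codes mapped to report-friendly labels.
CATEGORY_LABELS = {
    "cs.AI": "Artificial Intelligence",
    "cs.LG": "Machine Learning",
    "math.CO": "Combinatorics",
    "hep-th": "High Energy Physics - Theory",
}


def category_label(code):
    """Return a report-friendly label, falling back to the raw code."""
    cleaned = code.strip()
    return CATEGORY_LABELS.get(cleaned, cleaned)
```

Falling back to the raw code keeps unknown categories visible in a report instead of silently dropping them.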
Hello @TimidRobot, as requested, I have made changes based on your review. I understand that good work is emphasized over speed, and I hope my attempt to go full circle with the other PR hasn't dented my chances of contributing significantly to the project. Thank you 🙏🏼
@Goziee-git ok, please focus on this PR
@TimidRobot I have removed the language query, as it returns only English. Also worth noting: arXiv accepts papers in other languages but requires that abstracts be submitted in English, so it is not possible to get a meaningful distribution of licenses by language. Also, as suggested, I converted the category codes to reporting words that are more user-friendly and readable using an external
Hello @TimidRobot, I observed from several earlier fetch results that the script failed to detect CC licenses recorded as hyphenated variants (CC-BY, CC-BY-NC, etc.). I have implemented a compiled regex pattern that replaces the plain string matching, for more robust license detection. I have also looked at implementations in other PRs that use the normalize_license_text() function for consistent license identification. Please let me know what you think of these changes so that I can keep working on further improvements. Thanks.
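For illustration, a compiled pattern that tolerates both spaced and hyphenated variants might look like the sketch below. It is not the pattern from this PR: `CC_PATTERN` and `normalize_cc_license` are hypothetical names, and the output format ("CC BY-NC") is an assumption.

```python
import re

# Hypothetical pattern (assumption): accepts "CC BY", "CC-BY-NC",
# "cc by-nc-sa", "CC0", and similar spellings, case-insensitively.
CC_PATTERN = re.compile(
    r"\bCC(?:[\s-]?(?P<zero>0|ZERO)"
    r"|[\s-]+(?P<terms>BY(?:[\s-]+(?:NC|ND|SA))*))\b",
    re.IGNORECASE,
)


def normalize_cc_license(text):
    """Return a normalized CC license name found in text, or None."""
    match = CC_PATTERN.search(text)
    if match is None:
        return None
    if match.group("zero"):
        return "CC0"
    # Split on spaces or hyphens, then rejoin with hyphens only.
    terms = re.split(r"[\s-]+", match.group("terms").upper())
    return "CC " + "-".join(terms)
```

Compiling the pattern once at module level avoids rebuilding it for every record, which matters when scanning many results.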
I'll look at data after outstanding comments are resolved. That said, I'm not excited about adding JSON to the project.
It's probably a good idea to create a function in the shared library eventually. Please leave that to last, however.
Refactor arxiv_fetch.py to use requests library for HTTP requests, implementing retry logic for better error handling. Update license extraction logic and CSV headers to remove PLAN_INDEX.
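The retry idea in this commit can be sketched in a library-agnostic way. The following is an assumption-laden illustration, not the code from `arxiv_fetch.py`: `fetch_with_retry` is a hypothetical helper that wraps any zero-argument callable, and the exponential backoff schedule is chosen only as an example.

```python
import time


def fetch_with_retry(fetch, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the retries are exhausted.

    Hypothetical helper (assumption). It retries on OSError, which also
    covers requests.RequestException because that class derives from
    IOError/OSError.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except OSError:
            if attempt == retries:
                raise  # Out of retries: surface the last error.
            # Exponential backoff: backoff, 2*backoff, 4*backoff, ...
            sleep(backoff * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays. With the requests library, a comparable effect is often achieved by mounting an `HTTPAdapter` configured with a urllib3 `Retry` object on a session.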
@TimidRobot Hello, are there any further changes you would like made beyond those already identified and completed? Also, if there are no more changes to make here, what would you recommend as my next step in the project with respect to integrating arXiv as a valid source?
@Goziee-git Please update
@TimidRobot, thanks. I'm happy we've gotten to this point and very excited to continue contributing to the project. I would like your permission to continue working on the processing and reporting scripts for the arXiv data source.
@Goziee-git please see two unresolved conversations:
Hello @TimidRobot, regarding #179 (comment): these are my thoughts, but I would still be glad to know what you think. To explain further, here are some helpful links that guided my decision to bucket the AUTHOR_COUNT data: Scientometrics journal guidelines (https://link.springer.com/journal/11192) and FAIR Data Principles (https://www.go-fair.org/fair-principles/).
@TimidRobot the conversations mentioned here are identical. Please, I'd like to know what your preference is with the
@Goziee-git Sorry, let me clarify. I think bucketing authors into I think which buckets are selected could use some refinement. My thinking is that there are three considerations:
Plotting
For plotting, I think around five values display well.
Precedence
For precedence, we can look at the various citation styles:
Distribution
The script can be modified to provide information on all relevant author counts (only results above 1% are shown):
Recommendation
I recommend the following buckets:
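The distribution and bucketing steps discussed in this comment can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the bucket edges in `EXAMPLE_BUCKETS` are placeholders chosen only to show the mechanism, not the buckets recommended in the review.

```python
from collections import Counter


def author_count_distribution(author_counts, min_share=0.01):
    """Map each author count to its share of papers, keeping only
    shares strictly above min_share (1% by default)."""
    totals = Counter(author_counts)
    n = sum(totals.values())
    return {
        count: papers / n
        for count, papers in sorted(totals.items())
        if papers / n > min_share
    }


# Hypothetical bucket edges (assumption): inclusive (low, high, label).
EXAMPLE_BUCKETS = [(1, 1, "1"), (2, 2, "2"), (3, 5, "3-5"), (6, 10, "6-10")]


def bucket_label(count, buckets=EXAMPLE_BUCKETS, overflow="11+"):
    """Return the label of the bucket containing count, assuming count >= 1."""
    for low, high, label in buckets:
        if low <= count <= high:
            return label
    return overflow
```

Looking at the distribution first makes it easier to choose edges so that roughly five buckets each carry a meaningful share of papers.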
@TimidRobot regarding the ordering of the constants, what would be your preference? The current order of the constants is:
This follows the logical/sequential order of the numbered workflow (1 → 2 → 3 → 4), in comparison to gcs_fetch.py (current order):
Comparison:
Is there another order you would prefer, so that I can be guided toward the best course of action here?
Hi @Goziee-git! Order or sort alphabetically.
@Babi-B, thanks for the clarity. I appreciate your reviews here.
@Goziee-git The difference is that
TimidRobot
left a comment
Great work, thank you!!
Fixes
Description
Implements comprehensive arXiv data collection system to quantify open access academic papers in the commons.
Type of Change
Changes Made
scripts/1-fetch/arxiv_fetch.py
Testing
./dev/check.sh)
Data Impact
Related Documentation
sources.md with arXiv API credentials setup
Checklist
Update index.md).
main or master).
visible errors.
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."