Add GitHub data: backup and conversion scripts + backup update #653
Conversation
Thank you for this! But while on vacation I started wondering whether scraping GitHub data somehow violates their ToS. Does anyone know? |
https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies
It seems that it doesn't break the ToS 👍 (I probably should have looked this up before working on this, but hey, it's there) |
Maybe I'm too paranoid after having been locked out of my GH account, but can I count as a "Researcher" or as an "Archivist"? Also, even if the actual scraping does not violate their ToS, does publishing software that enables scraping perhaps violate their ToS? Someone who is not an archivist or a researcher, and who publishes the scraped data under open access, could take the code and do something with it that violates their ToS. Similar to the Pirate Bay not actually engaging in copyright violation, but enabling others to easily do so. Or youtube-dl, which IIRC was taken down due to a DMCA complaint. |
I mean, this is clearly for archival purposes, and you're using your own token with GitHub's official GraphQL API, which has its own rate limiting to prevent overuse. To be more precise, a full backup of the current repository costs 700 points out of the 5,000 available (which reset hourly). Even if this were done weekly, it shouldn't draw much attention. That said, you're probably right about the second part: I'm not entirely sure whether publishing tools like this complies with GitHub's ToS. I did find a similar project still active here: https://gist.github.com/animusna/b64b45d910dd3df7cd41ee0f99082766 Nevertheless, I've published it under my own account here: https://github.com/ThomasBaruzier/ghbkp If you'd prefer to keep this separate from your project, I could set up daily or weekly backups there instead. But again, it should be safe. I understand your caution after the two-day takedown we experienced, but that was likely a mistake on GitHub's part; maybe the sudden influx of stars from new users enjoying Kimi triggered their bot detection and forced an automated takedown? |
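For context on the point budget mentioned above, the GraphQL rateLimit object can be queried directly. The following is a minimal sketch, not taken from ghbkp; the helper name and output formatting are illustrative, and it assumes a personal access token in the GITHUB_TOKEN environment variable:

import os
import requests

RATE_LIMIT_QUERY = "query { rateLimit { limit cost remaining resetAt } }"

def check_rate_limit(token: str) -> dict:
    # Single GraphQL request; the rateLimit object reports the hourly point budget.
    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": RATE_LIMIT_QUERY},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["rateLimit"]

if __name__ == "__main__":
    info = check_rate_limit(os.environ["GITHUB_TOKEN"])
    print(f"{info['remaining']}/{info['limit']} points left, resets at {info['resetAt']}")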
Thanks for your repo! I've been having trouble with GitHub search in general. For fun I tried to "vibe code" a quick GitHub repo CLI tool with the latest Kimi-K2-Instruct, but it was very inefficient, making a bunch of requests. Your solution with GraphQL is quite efficient and much faster. Here are quick instructions for anyone who wants to try it; it only takes a few minutes:
# install: requests is the only Python dependency
$ uv venv ./venv --python 3.12 --python-preference=only-managed
$ source ./venv/bin/activate
$ uv pip install requests
# get the scripts
$ git clone git@github.com:ThomasBaruzier/ghbkp.git
# Create a fine-grained token with read-only access, a 30-day expiration, etc.
# https://github.com/settings/personal-access-tokens
# save data for easy searching locally
$ source ./venv/bin/activate
$ export GITHUB_TOKEN=COPYPASTETOKENJUSTMADEABOVE
$ python ghbkp/ghscrape.py ikawrakow/ik_llama.cpp
I: Fetching all issues...
I: API Rate Limit (Req #1): 4997 points remaining, resets in 59m 52s.
I: Processed 100 issues...
I: API Rate Limit (Req #2): 4994 points remaining, resets in 59m 48s.
I: Processed 174 issues...
I: Fetching all nested data for 174 items (1 pages)...
I: API Rate Limit (Req #3): 4993 points remaining, resets in 59m 48s.
I: Processed batch of 1 pages. 0 pages remaining.
I: Structuring final items for issues...
I: Finished issues: Found and processed 174 items.
I: Fetching all pull requests...
I: API Rate Limit (Req #4): 4888 points remaining, resets in 59m 45s.
I: Processed 100 pull_requests...
I: API Rate Limit (Req #5): 4783 points remaining, resets in 59m 41s.
I: Processed 200 pull_requests...
I: API Rate Limit (Req #6): 4678 points remaining, resets in 59m 35s.
I: Processed 300 pull_requests...
I: API Rate Limit (Req #7): 4573 points remaining, resets in 59m 28s.
I: Processed 400 pull_requests...
I: API Rate Limit (Req #8): 4468 points remaining, resets in 59m 20s.
I: Processed 500 pull_requests...
I: API Rate Limit (Req #9): 4363 points remaining, resets in 59m 18s.
I: Processed 514 pull_requests...
I: Fetching all nested data for 514 items (1 pages)...
I: API Rate Limit (Req #10): 4362 points remaining, resets in 59m 18s.
I: Processed batch of 1 pages. 1 pages remaining.
I: API Rate Limit (Req #11): 4361 points remaining, resets in 59m 18s.
I: Processed batch of 1 pages. 0 pages remaining.
I: Structuring final items for pull_requests...
I: Finished pull_requests: Found and processed 514 items.
I: Fetching all discussions...
I: API Rate Limit (Req #12): 4259 points remaining, resets in 59m 12s.
I: Processed 84 discussions...
I: Fetching all nested data for 84 items (1 pages)...
I: API Rate Limit (Req #13): 4258 points remaining, resets in 59m 12s.
I: Processed batch of 1 pages. 0 pages remaining.
I: Structuring final items for discussions...
I: Finished discussions: Found and processed 84 items.
I: Data successfully saved to ikawrakow__ik_llama.cpp.json
I: Total execution time: 48.67 seconds
# check the data
$ cat ikawrakow__ik_llama.cpp.json | jq '.discussions[].author' | sort | uniq -c | sort -n | tail -n 5
2 "usrlocalben"
5 "magikRUKKOLA"
6 "ubergarm"
7 "Nexesenex"
19 "ikawrakow" Now I can easily |
Thanks for the demo; I'm glad you found it useful! Have you tried ghconvert.py? I may add built-in filtering if relevant. |
Oh no, I didn't realize what that does! I'll try it:
$ python ghbkp/ghconvert.py ikawrakow__ik_llama.cpp.json -o ./outputdir/
$ tree -L 1 ./outputdir/
.
├── discussions
├── index.md
├── issues
└── pull_requests
Ok, looks like your converter turns the whole backup into organized markdown. This is great because I can browse and search it locally. Thanks for this useful tool! |
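Since ghconvert.py writes plain markdown into that tree, local searching can be as simple as walking the output directory. A rough sketch: the directory name is taken from the tree output above, and the search term is just an example.

from pathlib import Path

OUTDIR = Path("./outputdir")       # layout as shown by `tree` above
NEEDLE = "quantization"            # hypothetical search term

# Print every converted markdown file that mentions the search term.
for md in sorted(OUTDIR.rglob("*.md")):
    text = md.read_text(encoding="utf-8", errors="replace")
    if NEEDLE.lower() in text.lower():
        print(md.relative_to(OUTDIR))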
Hello!
I've refined the scraping and conversion scripts. While they should work with any repository, I haven't extensively tested them beyond the current use case. For this repository, the scripts consistently complete in ~30 seconds (initially 750s!) using just 1110 API requests to fetch all issues, pull requests, and discussions.
I initially explored resumable/incremental scraping but abandoned the idea due to reliability issues: the updatedAt field only reflects edits to the issue/PR body, not new activity. Instead, I focused on optimization, achieving the results below.
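To make the approach concrete: fetching items in pages of 100 through the GraphQL API while reporting the remaining point budget looks roughly like the sketch below. This is not the actual ghscrape.py implementation (which also batches nested comments and reviews separately); the query shape, field selection, and function name are illustrative assumptions.

import os
import requests

ISSUES_QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  rateLimit { remaining resetAt }
  repository(owner: $owner, name: $name) {
    issues(first: 100, after: $cursor, states: [OPEN, CLOSED]) {
      pageInfo { hasNextPage endCursor }
      nodes { number title state updatedAt author { login } }
    }
  }
}
"""

def fetch_all_issues(owner: str, name: str, token: str) -> list[dict]:
    # Walk the issue connection page by page using GraphQL cursors.
    headers = {"Authorization": f"Bearer {token}"}
    cursor, items = None, []
    while True:
        resp = requests.post(
            "https://api.github.com/graphql",
            json={"query": ISSUES_QUERY,
                  "variables": {"owner": owner, "name": name, "cursor": cursor}},
            headers=headers, timeout=60,
        )
        resp.raise_for_status()
        data = resp.json()["data"]
        page = data["repository"]["issues"]
        items.extend(page["nodes"])
        print(f"I: {len(items)} issues so far, "
              f"{data['rateLimit']['remaining']} points remaining")
        if not page["pageInfo"]["hasNextPage"]:
            return items
        cursor = page["pageInfo"]["endCursor"]

issues = fetch_all_issues("ikawrakow", "ik_llama.cpp", os.environ["GITHUB_TOKEN"])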
Usage
Or as a one-liner (ensure that you are in github-data/):
Scraping Demo
Conversion Demo
Relevant links:
Scripts:
Index:
Discussion example:
PR example:
Issue example:
Notes
Content extraction for reviews isn't fully implemented yet (see example). This could be added later if needed. Fixed.
Wiki backups are not implemented.
Scripts and filenames also work on Windows.
I’ve read the contributing guidelines.
Self-reported review complexity: