
Conversation

ThomasBaruzier
Contributor

@ThomasBaruzier commented Jul 26, 2025

Hello!

I've refined the scraping and conversion scripts. While they should work with any repository, I haven't extensively tested them beyond the current use case. For this repository, the scripts consistently complete in ~30 seconds (down from an initial 750s!) using just 10 API requests to fetch all issues, pull requests, and discussions.

I initially explored resumable/incremental scraping but abandoned the idea due to reliability issues: the updatedAt field only reflects edits to the issue/PR body, not new activity. Instead, I focused on optimization, achieving the results below.
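
For reference, here is a minimal sketch of the kind of batched GraphQL request that keeps the request count this low. It is illustrative only: the field selection and page sizes are my assumptions rather than the exact query ghscrape.py sends, and it only covers issues (the real script also batches pull requests, discussions, and their nested data).

import os
import requests

QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  rateLimit { remaining resetAt }
  repository(owner: $owner, name: $name) {
    issues(first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes {
        number title body createdAt
        author { login }
        comments(first: 50) { nodes { author { login } body createdAt } }
      }
    }
  }
}
"""

def fetch_issues(owner, name, token):
    # One request returns up to 100 issues, each with up to 50 comments.
    headers = {"Authorization": f"Bearer {token}"}
    cursor, items = None, []
    while True:
        resp = requests.post(
            "https://api.github.com/graphql",
            json={"query": QUERY, "variables": {"owner": owner, "name": name, "cursor": cursor}},
            headers=headers,
        )
        resp.raise_for_status()
        data = resp.json()["data"]
        rl = data["rateLimit"]
        print(f"I: {rl['remaining']} points remaining (resets at {rl['resetAt']})")
        page = data["repository"]["issues"]
        items.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            return items
        cursor = page["pageInfo"]["endCursor"]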


Usage

export GITHUB_TOKEN='github_pat_...'
cd github-data
rm -rf issues discussions pull_requests index.md ik.json
python ghscrape.py ikawrakow/ik_llama.cpp -o ik.json
python ghconvert.py ik.json -o .

Or as a one-liner (ensure that you are in github-data/):

rm -rf issues discussions pull_requests index.md ik.json && GITHUB_TOKEN='github_pat_...' python ghscrape.py ikawrakow/ik_llama.cpp -o ik.json && python ghconvert.py ik.json -o .

Scraping Demo

python ghscrape.py ikawrakow/ik_llama.cpp -o ik.json
I: Fetching all issues...
I: API Rate Limit (Req #1): 4997 points remaining, resets in 59m 54s.
I: Processed 100 issues...
I: API Rate Limit (Req #2): 4994 points remaining, resets in 59m 52s.
I: Processed 131 issues...
I: Fetching all nested data for 131 items (1 pages)...
I: API Rate Limit (Req #3): 4993 points remaining, resets in 59m 52s.
I: Processed batch of 1 pages. 0 pages remaining.
I: Structuring final items for issues...
I: Finished issues: Found and processed 131 items.
I: Fetching all pull requests...
I: API Rate Limit (Req #4): 4888 points remaining, resets in 59m 49s.
I: Processed 100 pull_requests...
I: API Rate Limit (Req #5): 4783 points remaining, resets in 59m 46s.
I: Processed 200 pull_requests...
I: API Rate Limit (Req #6): 4678 points remaining, resets in 59m 41s.
I: Processed 300 pull_requests...
I: API Rate Limit (Req #7): 4573 points remaining, resets in 59m 36s.
I: Processed 400 pull_requests...
I: API Rate Limit (Req #8): 4468 points remaining, resets in 59m 34s.
I: Processed 452 pull_requests...
I: Fetching all nested data for 452 items (0 pages)...
I: Structuring final items for pull_requests...
I: Finished pull_requests: Found and processed 452 items.
I: Fetching all discussions...
I: API Rate Limit (Req #9): 4366 points remaining, resets in 59m 30s.
I: Processed 71 discussions...
I: Fetching all nested data for 71 items (1 pages)...
I: API Rate Limit (Req #10): 4365 points remaining, resets in 59m 29s.
I: Processed batch of 1 pages. 0 pages remaining.
I: Structuring final items for discussions...
I: Finished discussions: Found and processed 71 items.
I: Data successfully saved to ik.json
I: Total execution time: 28.77 seconds

Conversion Demo

python ghconvert.py ik.json -o .
Processing 131 issues...
Processing 452 pull_requests...
Processing 71 discussions...
Generating index.md summary file...
Successfully generated 654 Markdown files.
Files are in the '.' directory.

Relevant links:

Scripts:

Index:

Discussion example:

PR example:

Issue example:


Notes

  • Content extraction for reviews wasn’t fully implemented at first (see example). This has since been fixed.

  • Wiki backups are not implemented.

  • Scripts and filenames also work on Windows.

  • I’ve read the contributing guidelines.

  • Self-reported review complexity:

    • Low
    • Medium
    • High

@ThomasBaruzier marked this pull request as draft July 26, 2025 10:54
@ThomasBaruzier force-pushed the tb/github-data-scripts branch from 9381af6 to b0c8c58 on July 28, 2025 00:49
@ThomasBaruzier marked this pull request as ready for review July 28, 2025 00:52
@ThomasBaruzier force-pushed the tb/github-data-scripts branch from b0c8c58 to 4f58ecd on July 28, 2025 01:09
@ikawrakow
Owner

Thank you for this!

But while on vacation I started wondering whether scraping GitHub data somehow violates their ToS. Does anyone know?

@ThomasBaruzier
Contributor Author

https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies

You may use information from our Service for the following reasons, regardless of whether the information was scraped, collected through our API, or obtained otherwise:

  • Researchers may use public, non-personal information from the Service for research purposes, only if any publications resulting from that research are open access.
  • Archivists may use public information from the Service for archival purposes.

It seems that it doesn't break the ToS 👍 (I probably should have looked this up before working on it, but hey, it's there)

@ikawrakow
Owner

Maybe I'm too paranoid after having been locked out of my GH account, but do I count as a "Researcher" or as an "Archivist"?

Also, even if the actual scraping does not violate their ToS, does publishing software that enables scraping perhaps violate it? Someone who is neither an archivist nor a researcher publishing the scraped data under open access could take the code and do something with it that violates their ToS. Similar to the Pirate Bay not actually engaging in copyright violation, but enabling others to easily do so. Or youtube-dl, which IIRC was taken down due to a DMCA complaint.

@ThomasBaruzier
Contributor Author

I mean, this is clearly for archival purposes, and you're using your own token with GitHub's official GraphQL API, which has its own rate limiting to prevent overuse. To be more precise, a full backup of the current repository costs 700 points out of the 5,000 available (which reset hourly). Even if this were done weekly, it shouldn’t draw much attention.
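
If anyone wants to sanity-check their own budget, the point balance can be queried directly from the GraphQL API; a quick example (not part of the scripts):

import os
import requests

resp = requests.post(
    "https://api.github.com/graphql",
    json={"query": "{ rateLimit { limit remaining used resetAt } }"},
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
)
print(resp.json()["data"]["rateLimit"])
# e.g. {'limit': 5000, 'remaining': 4300, 'used': 700, 'resetAt': '2025-07-26T12:00:00Z'}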

That said, you’re probably right about the second part; I’m not entirely sure whether publishing tools like this complies with GitHub’s ToS. I did find a similar project still active here: https://gist.github.com/animusna/b64b45d910dd3df7cd41ee0f99082766

Nevertheless, I’ve published it under my own account here: https://github.com/ThomasBaruzier/ghbkp

If you’d prefer to keep this separate from your project, I could set up daily or weekly backups there instead. But again, it should be safe. I understand your caution after the two-day takedown we experienced, but that was likely a mistake on GitHub’s part; maybe the sudden influx of stars from new users enjoying Kimi triggered their bot detection and caused an automated takedown?

@ubergarm
Contributor

@ThomasBaruzier

Thanks for your repo! I've been having trouble with GitHub search in general. For fun I tried to "vibe code" a quick GitHub repo CLI tool with the latest Kimi-K2-Instruct, but it was very inefficient, making a bunch of requests. Your GraphQL solution is quite efficient and much faster.


Here are quick instructions for anyone who wants to try it; the whole thing takes just a few minutes:

# requests is the only Python dependency
$ uv venv ./venv --python 3.12 --python-preference=only-managed
$ source ./venv/bin/activate
$ uv pip install requests

# get the scripts
$ git clone [email protected]:ThomasBaruzier/ghbkp.git

# Create a fine-grained token with read-only access and a 30-day expiration, etc.
# https://github.com/settings/personal-access-tokens

# save data for easy searching locally
$ source ./venv/bin/activate
$ export GITHUB_TOKEN=COPYPASTETOKENJUSTMADEABOVE
$ python ghbkp/ghscrape.py ikawrakow/ik_llama.cpp

I: Fetching all issues...
I: API Rate Limit (Req #1): 4997 points remaining, resets in 59m 52s.
I: Processed 100 issues...
I: API Rate Limit (Req #2): 4994 points remaining, resets in 59m 48s.
I: Processed 174 issues...
I: Fetching all nested data for 174 items (1 pages)...
I: API Rate Limit (Req #3): 4993 points remaining, resets in 59m 48s.
I: Processed batch of 1 pages. 0 pages remaining.
I: Structuring final items for issues...
I: Finished issues: Found and processed 174 items.
I: Fetching all pull requests...
I: API Rate Limit (Req #4): 4888 points remaining, resets in 59m 45s.
I: Processed 100 pull_requests...
I: API Rate Limit (Req #5): 4783 points remaining, resets in 59m 41s.
I: Processed 200 pull_requests...
I: API Rate Limit (Req #6): 4678 points remaining, resets in 59m 35s.
I: Processed 300 pull_requests...
I: API Rate Limit (Req #7): 4573 points remaining, resets in 59m 28s.
I: Processed 400 pull_requests...
I: API Rate Limit (Req #8): 4468 points remaining, resets in 59m 20s.
I: Processed 500 pull_requests...
I: API Rate Limit (Req #9): 4363 points remaining, resets in 59m 18s.
I: Processed 514 pull_requests...
I: Fetching all nested data for 514 items (1 pages)...
I: API Rate Limit (Req #10): 4362 points remaining, resets in 59m 18s.
I: Processed batch of 1 pages. 1 pages remaining.
I: API Rate Limit (Req #11): 4361 points remaining, resets in 59m 18s.
I: Processed batch of 1 pages. 0 pages remaining.
I: Structuring final items for pull_requests...
I: Finished pull_requests: Found and processed 514 items.
I: Fetching all discussions...
I: API Rate Limit (Req #12): 4259 points remaining, resets in 59m 12s.
I: Processed 84 discussions...
I: Fetching all nested data for 84 items (1 pages)...
I: API Rate Limit (Req #13): 4258 points remaining, resets in 59m 12s.
I: Processed batch of 1 pages. 0 pages remaining.
I: Structuring final items for discussions...
I: Finished discussions: Found and processed 84 items.
I: Data successfully saved to ikawrakow__ik_llama.cpp.json
I: Total execution time: 48.67 seconds

# check the data
$ cat ikawrakow__ik_llama.cpp.json | jq '.discussions[].author' | sort | uniq -c | sort -n | tail -n 5
      2 "usrlocalben"
      5 "magikRUKKOLA"
      6 "ubergarm"
      7 "Nexesenex"
     19 "ikawrakow"

Now I can easily grep to find things!

@ThomasBaruzier
Contributor Author

Thanks for the demo, I'm glad you found it useful!

Have you tried ghconvert.py? I may add built-in filtering if relevant.

@ubergarm
Contributor

ubergarm commented Sep 12, 2025

Oh no, I didn't realize what that does! I'll try it:

OK, looks like your ghconvert.py takes the .json file produced by ghscrape.py and creates a directory structure with an index Markdown file and discussions/issues/pull_requests subdirectories, with the comments inserted into each file.

$ python ghbkp/ghconvert.py ikawrakow__ik_llama.cpp.json -o ./outputdir/
$ tree -L 1 ./outputdir/
.
├── discussions
├── index.md
├── issues
└── pull_requests

This is great because I can grep -ri foobar and then grab the quick reference link from the top of the file to share with folks!
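
For anyone curious, here is a rough sketch of the idea behind that per-file layout. The JSON field names (number, url, title, body, comments, author) are my assumptions based on the jq output above, not necessarily what ghconvert.py actually uses:

import json
import pathlib
import sys

# Sketch: one Markdown file per item, with the GitHub URL at the top so
# grep hits are easy to turn into shareable reference links.
data = json.load(open(sys.argv[1]))
out = pathlib.Path(sys.argv[2] if len(sys.argv) > 2 else ".")

for kind in ("issues", "pull_requests", "discussions"):
    for item in data.get(kind, []):
        path = out / kind / f"{item['number']}.md"
        path.parent.mkdir(parents=True, exist_ok=True)
        lines = [f"Original: {item['url']}", "", f"# {item['title']}", "", item.get("body") or ""]
        for c in item.get("comments", []):
            lines += ["", f"## Comment by {c['author']}", "", c.get("body") or ""]
        path.write_text("\n".join(lines), encoding="utf-8")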

Thanks for this useful tool!
