GitHub API
This page is intended to collect all relevant information about our source data, i.e. GitHub data and the GitHub API.
As far as I know, GitHub data is composed (very roughly) of:
- User info: username, email, payment info, etc.; repo list; org list; following/followers
- Repository info:
  - Code, i.e. the actual files
  - Commits, which share data with code and also carry timestamps, users, and messages
  - Wiki pages, essentially equivalent to code files but stored in a separate repo named after the main one with a .wiki suffix
  - Issues, including issue numbers, users, timestamps, messages, status; also milestones, features, etc.
  - Forks, pull requests, watchers, etc.
  - Lots of calculated info (e.g. contributor stats)
  - User/org ownership
- Org info: repos, users...
The GitHub API provides a convenient way to access all of this data. It is documented in GitHub's official developer documentation, and is fairly straightforward. Perhaps the biggest drawback of the API is rate limiting. Every authenticated user is limited to 5000 requests/hour (unauthenticated requests are limited to only 60 per hour per IP address). Each request also returns a limited number of results, since list endpoints are paginated (typically at most 100 items per page). Thus, there are limits to the amount of data that the API can give us.
This limitation is not as bad as it seems, however, since the rate is per user, not per application. In particular, if a user authorizes our site to access GitHub on their behalf, we can make 5000 requests in an hour on behalf of that user, and another 5000 for each other user. We might be able to ask GitHub to raise this limit for this academic project, though perhaps for the academic project we don't really need that many requests. In general, the best ways to avoid running into rate limits are caching as much as possible, using the statistics GitHub has already computed instead of recomputing them from raw data, and using conditional requests, which are "free": a request that comes back as 304 Not Modified does not count against the limit.
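To make the caching and conditional-request idea concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under assumptions, not our actual client: the names `ConditionalClient`, `http_get`, and the `User-Agent` string are made up for this page, and the cache is a plain in-memory dict. It remembers the `ETag` of each response, sends it back as `If-None-Match` on the next request, and reuses the cached body when GitHub answers 304. It also records the `X-RateLimit-Remaining` header so we can see how much budget is left.

```python
import json
import urllib.error
import urllib.request

API_ROOT = "https://api.github.com"


def http_get(url, headers):
    """Perform a GET and return (status, lowercased response headers, body bytes)."""
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request) as response:
            found = {k.lower(): v for k, v in response.headers.items()}
            return response.status, found, response.read()
    except urllib.error.HTTPError as err:
        # urllib raises on 304 Not Modified; treat it as a normal result.
        if err.code == 304:
            return 304, {k.lower(): v for k, v in err.headers.items()}, b""
        raise


class ConditionalClient:
    """Hypothetical helper: caches JSON responses by URL and revalidates via ETag."""

    def __init__(self, token=None, transport=http_get):
        self.token = token          # OAuth/personal access token, if any
        self.transport = transport  # injectable so the logic can be tested offline
        self.cache = {}             # url -> (etag, parsed JSON)
        self.remaining = None       # last seen X-RateLimit-Remaining value

    def get_json(self, path):
        url = API_ROOT + path
        headers = {
            "Accept": "application/vnd.github+json",
            "User-Agent": "github-api-wiki-sketch",  # assumed name; GitHub requires some User-Agent
        }
        if self.token:
            headers["Authorization"] = "Bearer " + self.token
        cached = self.cache.get(url)
        if cached:
            headers["If-None-Match"] = cached[0]

        status, response_headers, body = self.transport(url, headers)
        if "x-ratelimit-remaining" in response_headers:
            self.remaining = int(response_headers["x-ratelimit-remaining"])

        if status == 304 and cached:
            # Nothing changed: reuse the cached copy without spending quota.
            return cached[1]

        data = json.loads(body)
        etag = response_headers.get("etag")
        if etag:
            self.cache[url] = (etag, data)
        return data
```

With something like this, calling `client.get_json("/users/octocat")` twice in a row should cost at most one request against the limit, provided the data has not changed in between. A real version would probably want a persistent cache (database or disk) and pagination handling as well.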
The other major way of accessing GitHub data is through git itself, by cloning repos locally. This gives us all files, commits, timestamps, and wikis, but not issues, forks, or per-user/org lists of repos. We also still need authentication for private repos. Still, this might be worth looking into if we want to compute statistics that GitHub doesn't have, or to work around certain rate-limit issues.
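As a rough sketch of the git route, the Python below clones a repo and reads its commit history with `git log`, without touching the API at all. The function names (`clone_and_list_commits`, `parse_git_log`, `wiki_clone_url`) are invented for this page, and it assumes the `git` command-line tool is installed and that the repo is public or that credentials are already configured.

```python
import subprocess

# Unit separator: a byte that should not appear in names or subject lines.
FIELD_SEP = "\x1f"

# git expands %x1f to that byte; fields are hash, author name, author email,
# author timestamp (Unix seconds), and subject line.
LOG_FORMAT = "--pretty=format:%H%x1f%an%x1f%ae%x1f%at%x1f%s"


def parse_git_log(output):
    """Turn the output of `git log` with LOG_FORMAT into a list of dicts."""
    commits = []
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, author, email, timestamp, subject = line.split(FIELD_SEP, 4)
        commits.append({
            "sha": sha,
            "author": author,
            "email": email,
            "timestamp": int(timestamp),
            "subject": subject,
        })
    return commits


def clone_and_list_commits(clone_url, dest_dir):
    """Clone a repo into dest_dir and return its commits, newest first."""
    subprocess.run(["git", "clone", "--quiet", clone_url, dest_dir], check=True)
    result = subprocess.run(
        ["git", "-C", dest_dir, "log", LOG_FORMAT],
        check=True, capture_output=True, text=True,
    )
    return parse_git_log(result.stdout)


def wiki_clone_url(owner, repo):
    """Clone URL of a repo's wiki, which lives in its own .wiki repo."""
    return "https://github.com/{}/{}.wiki.git".format(owner, repo)
```

The same `clone_and_list_commits` call should work for a wiki by passing it the result of `wiki_clone_url`, as long as the wiki exists. Cloning costs bandwidth and disk space rather than API requests, which is the trade-off that makes it interesting for large histories.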
TODO: figure out/describe authentication process