GitHub Diffs #36
base: working
Make sure to end your file with a newline.

LGTM, @ncoop57 can you check?

Looks good, @herbiebradley. The only thing still needed is a minimal pytest test that uses a dummy parquet file: https://docs.pytest.org/en/7.1.x/getting-started.html. We want to make sure we don't have bugs. Also, could you enable maintainer edits for the PR so that I can quickly modify something if needed? https://github.blog/2016-09-07-improving-collaboration-with-forks/

@herbiebradley @reshinthadithyan This is looking pretty solid. Could you add a quick test so that I can merge?

@reshinthadithyan you might want to add your scripts to this branch before we merge.

A scraper for GitHub diffs that takes a JSONL file containing, for each commit, the hash, commit message, and repository name as strings.
This uses PyArrow via `dask` to save to Parquet, which makes it easily parallelisable and keeps memory usage low.

See #31
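
For readers who want a picture of the input and output ends of such a pipeline, here is a minimal sketch; it is not the code from this PR. The field names `commit`, `message`, and `repo_name` and both function names are hypothetical, since the actual JSONL schema is not shown here, and the step that fetches each diff is omitted entirely. It assumes `dask` and `pyarrow` are installed.

```python
import json
from typing import Dict

# Hypothetical field names; the real JSONL schema is not shown in this PR.
REQUIRED_FIELDS = ("commit", "message", "repo_name")


def parse_commit_line(line: str) -> Dict[str, str]:
    """Parse one JSONL line and keep only the three required fields as strings."""
    record = json.loads(line)
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {field: str(record[field]) for field in REQUIRED_FIELDS}


def commits_to_parquet(jsonl_path: str, out_dir: str) -> None:
    """Write the parsed commit records to a Parquet dataset via dask."""
    # Imported lazily so the parser above can be used without dask installed.
    import dask.bag as db

    # read_text yields one string per line; each line is parsed independently,
    # so dask can process partitions in parallel without loading the whole file.
    bag = db.read_text(jsonl_path).map(parse_commit_line)
    bag.to_dataframe().to_parquet(out_dir, engine="pyarrow")
```

A call such as `commits_to_parquet("commits.jsonl", "out_parquet")` (both paths are placeholders) would write one Parquet file per partition into the output directory. Keeping the parser as a plain function also makes it easy to cover with a small pytest test.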