@herbiebradley herbiebradley commented Oct 7, 2022

A scraper for GitHub diffs. It takes a JSONL file that contains, for each commit, the hash, the commit message, and the repository name as a string.

This uses PyArrow via Dask to save to Parquet, which makes it easily parallelisable and keeps memory usage low.
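
For illustration only, here is a minimal sketch of how the JSONL input described above might be read and mapped to diff URLs. The field names `repo_name`, `commit`, and `message`, the helper `iter_diff_urls`, and the sample records are assumptions made for this example and are not necessarily what the PR uses. Appending `.diff` to a commit URL is a standard GitHub feature, but whether the scraper fetches diffs this way is not stated here. Nothing is downloaded and no Parquet file is written in this sketch.

```python
import json
from typing import Iterator


def iter_diff_urls(jsonl_text: str) -> Iterator[str]:
    """Yield one GitHub .diff URL per JSONL record (one JSON object per line)."""
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        # "repo_name" and "commit" are hypothetical field names.
        yield f"https://github.com/{record['repo_name']}/commit/{record['commit']}.diff"


# Two made-up records standing in for the real input file.
sample = "\n".join(
    [
        json.dumps({"repo_name": "octocat/Hello-World", "commit": "abc123", "message": "Fix typo"}),
        json.dumps({"repo_name": "octocat/Spoon-Knife", "commit": "def456", "message": "Add README"}),
    ]
) + "\n"

urls = list(iter_diff_urls(sample))
for url in urls:
    print(url)
```

Using a generator keeps only one record in memory at a time, which fits the low-memory goal mentioned above.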

See #31

@LouisCastricato
Contributor

Make sure to end your file with a newline.

@LouisCastricato
Contributor

LGTM, @ncoop57 can you check?

@ncoop57
Collaborator

ncoop57 commented Oct 9, 2022

Looks good, @herbiebradley. The only thing needed is a minimal test, run with pytest, that uses a dummy parquet file: https://docs.pytest.org/en/7.1.x/getting-started.html. We want to make sure we don't have bugs. Also, could you enable maintainer edits for the PR so that I can quickly modify something if I need to? https://github.blog/2016-09-07-improving-collaboration-with-forks/

@ncoop57
Collaborator

ncoop57 commented Nov 9, 2022

@herbiebradley @reshinthadithyan This is looking pretty solid. Could you add a quick test so that I can merge?

@herbiebradley
Author

@reshinthadithyan you might want to add your scripts to this branch before we merge?
