git-crawl fetches and tracks high-level repository information from specified users and organizations. It downloads various informative file types from those repositories and recursively crawls forks to a specified depth.

git-crawler

Overview

The GitHub Crawler is a Python script designed to fetch and process repository data from GitHub. It handles rate limiting, converts specific file formats, and logs detailed information about the crawling process.

Features

  • Fetch repository details, including commits, branches, forks, and more.

  • Skip specified repositories by name.

  • Convert .pdf and .mediawiki files to Markdown.

  • Handle GitHub API rate limiting.

  • Log detailed messages with timestamps, file names, function names, and line numbers.

  • Save metadata to a single file to avoid duplicate API calls.
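The conversion feature above could be sketched roughly as follows. The function names and the dispatch table are hypothetical, and the `fitz` (PyMuPDF) and `pypandoc` calls assume those libraries, plus the Pandoc binary, are installed; they are imported lazily so the rest of the script works without them:

```python
from pathlib import Path

def pdf_to_markdown(path: str) -> str:
    """Extract plain text from a PDF, page by page (PyMuPDF)."""
    import fitz  # PyMuPDF; imported lazily so the dependency stays optional
    with fitz.open(path) as doc:
        return "\n\n".join(page.get_text() for page in doc)

def mediawiki_to_markdown(path: str) -> str:
    """Convert MediaWiki markup to Markdown via Pandoc."""
    import pypandoc  # requires the pandoc binary on the PATH
    return pypandoc.convert_file(path, "markdown", format="mediawiki")

# Hypothetical dispatch table: pick a converter by file extension.
CONVERTERS = {
    ".pdf": pdf_to_markdown,
    ".mediawiki": mediawiki_to_markdown,
}

def converter_for(filename: str):
    """Return the converter for a file, or None if it needs no conversion."""
    return CONVERTERS.get(Path(filename).suffix.lower())
```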

Prerequisites

  • Python 3.x
  • GitHub personal access token
  • Required Python libraries: requests, fitz (PyMuPDF), markdownify, mwparserfromhell, pypandoc

Installation

Install Python Libraries

pip install .

Install Pandoc

macOS

brew install pandoc

Ubuntu/Debian

sudo apt-get install pandoc

Usage

  1. Set up your GitHub personal access token: Replace your_github_token_here with your actual GitHub personal access token in the script.

  2. Configure repositories: Specify the users and organizations to crawl in config/config.yaml.
    See example_config.yaml for a starting point.

  3. Run the script: It fetches and processes repository data from the specified users and organizations.
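The rate-limit handling mentioned in the features could look roughly like the sketch below. The function names are illustrative, and the token is assumed to be passed in rather than hard-coded; only the standard GitHub `X-RateLimit-*` response headers are relied on:

```python
import time

import requests

API = "https://api.github.com"

def seconds_until_reset(headers, now=None):
    """Seconds to sleep when the rate limit is exhausted (0 if requests remain)."""
    now = time.time() if now is None else now
    if int(headers.get("X-RateLimit-Remaining", 1)) > 0:
        return 0
    reset = int(headers.get("X-RateLimit-Reset", now))
    return max(0, reset - now)

def fetch_repos(owner, token):
    """Fetch an owner's repositories, waiting out the rate limit if needed."""
    resp = requests.get(
        f"{API}/users/{owner}/repos",
        headers={"Authorization": f"token {token}"},
        timeout=30,
    )
    wait = seconds_until_reset(resp.headers)
    if wait:
        time.sleep(wait)
        return fetch_repos(owner, token)  # retry once the window resets
    resp.raise_for_status()
    return resp.json()
```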

License

This project is licensed under the MIT License. See the LICENSE file for details.
