rnckp/github-get-agentsmd

GitHub AGENTS.md and CLAUDE.md Scraper

Discover and download AGENTS.md and CLAUDE.md files from GitHub repositories.

Note

For learning and inspiration. Downloaded files retain their original licenses—respect those terms.

What It Does

  1. get_repos.py — Find repos via GitHub Search API
  2. get_agentsmd.py — Download their AGENTS.md and CLAUDE.md files

By default, get_repos.py searches recent, non-archived Python repositories sorted by stars, up to a maximum of 50,000 repos. The language and all other limits are configurable via config.yaml.
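As a rough illustration of the search step, the sketch below builds a GitHub Search API URL for non-archived Python repositories sorted by stars. The helper name build_search_url, its parameters, and the reading of "recent" as a pushed: date filter are assumptions, not taken from get_repos.py; only the endpoint and the language:, archived:, stars:, and pushed: qualifiers are standard GitHub search syntax.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(language="Python", min_stars=10000,
                     pushed_since="2024-01-01", page=1):
    """Build a Search API URL (hypothetical helper, not from get_repos.py)."""
    # Search qualifiers are space-separated inside the single `q` parameter.
    query = " ".join([
        f"language:{language}",
        "archived:false",
        f"stars:>={min_stars}",
        f"pushed:>={pushed_since}",  # assumed meaning of "recent"
    ])
    params = {
        "q": query,
        "sort": "stars",
        "order": "desc",
        "per_page": 100,  # largest page size the API accepts
        "page": page,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


print(build_search_url())
```

The real script may combine qualifiers differently; the point is only that every filter ends up in one q string.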

Installation

git clone https://github.com/rnckp/github-get-agentsmd.git
cd github-get-agentsmd
pip3 install uv && uv sync

Configuration

All settings are centralized in config.yaml. Edit this file to customize:

  • Repository search: Language, date ranges, star bins, max repos
  • API settings: Timeouts, retries, backoff strategies
  • Download settings: Delays, output directories

Default values work well for most use cases. CLI arguments override config values when specified.
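One common way to let CLI arguments override config values is to give every flag a default of None and copy over only the flags that were actually supplied. The sketch below assumes hypothetical config keys (max_repos, language) and hypothetical long flag names (--max-repos, --language); only the -n flag appears in this README, and the dict stands in for values that would be loaded from config.yaml.

```python
import argparse

# Assumed defaults standing in for values loaded from config.yaml.
CONFIG_DEFAULTS = {"max_repos": 50000, "language": "Python"}


def resolve_settings(argv, config=CONFIG_DEFAULTS):
    """Merge CLI arguments over config values (illustrative only)."""
    parser = argparse.ArgumentParser()
    # default=None distinguishes "flag not given" from an explicit value.
    parser.add_argument("-n", "--max-repos", type=int, default=None)
    parser.add_argument("--language", default=None)
    args = parser.parse_args(argv)

    settings = dict(config)  # copy so the defaults are never mutated
    for key, value in vars(args).items():
        if value is not None:
            settings[key] = value
    return settings


print(resolve_settings(["-n", "1000"]))
```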

GitHub Token

Create a Personal Access Token with the repo and read:user scopes:

export GITHUB_TOKEN="ghp_..."

Usage

1. Discover Repositories

uv run python get_repos.py                 # Use defaults from config.yaml
uv run python get_repos.py -n 1000         # Limit to 1000 repos
uv run python get_repos.py --dry-run       # Preview query partitions without fetching

Output: repos_YYYY-MM-DD_HHMMSS.jsonl

2. Download AGENTS.md and CLAUDE.md Files

uv run python get_agentsmd.py              # Auto-detect newest repos file
uv run python get_agentsmd.py -w 8         # Use 8 parallel workers (faster)
uv run python get_agentsmd.py -r           # Resume interrupted download
uv run python get_agentsmd.py -r -w 8      # Resume with parallel workers

Output: agents_md_YYYY-MM-DD_HHMMSS/org/repo/AGENTS.md + download_results.jsonl
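How get_agentsmd.py picks the newest repos file is not documented here. A minimal sketch, assuming it goes by the timestamp in the filename, could look like this (newest_repos_file is a hypothetical helper name):

```python
from pathlib import Path


def newest_repos_file(directory="."):
    """Return the most recent repos_*.jsonl file, or None if there is none."""
    # The YYYY-MM-DD_HHMMSS timestamp sorts chronologically as plain text,
    # so the lexicographically largest name is the newest file.
    candidates = sorted(Path(directory).glob("repos_*.jsonl"))
    return candidates[-1] if candidates else None


print(newest_repos_file())
```

Sorting by name rather than by modification time keeps the choice stable even if an older file is touched or copied later.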

Troubleshooting

Issue                      Solution
ERROR: set GITHUB_TOKEN    export GITHUB_TOKEN="..."
403 Forbidden              Regenerate token with repo and read:user scopes
Rate limit                 Scripts auto-wait; run during off-peak hours for large jobs
Empty repos.jsonl          Adjust filters in config.yaml (or via get_repos.py CLI arguments), or verify the token works

Verify token:

curl -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/user | jq -r .login
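The same check can be sketched in Python with only the standard library. The function names are hypothetical and this is not code from the repository; it simply mirrors the curl request above.

```python
import json
import urllib.request


def build_user_request(token):
    """Create the same request as the curl command above."""
    return urllib.request.Request(
        "https://api.github.com/user",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def fetch_login(token):
    """Return the login for the token; urlopen raises HTTPError on 401/403."""
    with urllib.request.urlopen(build_user_request(token), timeout=10) as resp:
        return json.load(resp)["login"]
```

Calling fetch_login(os.environ["GITHUB_TOKEN"]) after import os should return the same login that the curl command prints.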

Scaling to More Repos

The GitHub Search API returns at most 1,000 results per query. To get more:

Method 1: Edit star bins in config.yaml to partition queries:

star_bins:
  - [10000, null]
  # - [5000, 9999]   # Uncomment for 5k-10k stars
  # - [2000, 4999]   # Uncomment for 2k-5k stars
  # ... more bins available in config

Method 2: Edit date ranges or other filters in config.yaml

Method 3: Use GitHub on BigQuery for exhaustive queries
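To make the partitioning idea in Method 1 concrete: each star bin can become its own stars: search qualifier, and since each query is capped at 1,000 results, three bins allow up to 3,000 repos. The sketch below assumes a null upper bound means "no upper limit"; stars_qualifier is a hypothetical helper, not a function from get_repos.py.

```python
def stars_qualifier(low, high):
    """Turn one [low, high] star bin into a GitHub search qualifier."""
    if high is None:  # YAML null: open-ended upper bound
        return f"stars:>={low}"
    return f"stars:{low}..{high}"


# Bins as they would look after loading config.yaml (null becomes None).
star_bins = [[10000, None], [5000, 9999], [2000, 4999]]

for low, high in star_bins:
    print(stars_qualifier(low, high))
# prints:
# stars:>=10000
# stars:5000..9999
# stars:2000..4999
```

Bins should not overlap, otherwise the same repository is returned by more than one query.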

API Limits

Resource         Limit        Notes
Search API       30 req/min   Used by get_repos.py
File downloads   N/A          0.1s delay in get_agentsmd.py

Both scripts handle rate limits with automatic retry and backoff.
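At 30 requests per minute, search calls can average at most one every 2 seconds. The exact retry logic in the scripts is not shown in this README; a generic retry with exponential backoff, with hypothetical names (RateLimited, call_with_backoff) and assumed delay values, might look like this:

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised when the API answers with a rate-limit status."""


def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on RateLimited wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimited:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Demo: fail twice, then succeed. A no-op sleep keeps the demo instant.
attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimited("simulated rate limit")
    return "ok"


print(call_with_backoff(flaky, sleep=lambda seconds: None))  # prints "ok"
```

Passing sleep as a parameter is a design choice that makes the waiting behavior easy to test without real delays.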

License

MIT License
