Each clone of RuvScan already ships with `data/ruvscan.db` filled with roughly 100 public projects from the ruvnet GitHub organization (think `ruvnet/sublinear-time-solver`, `ruvnet/FACT`, `ruvnet/MidStream`, etc.). You can run the MCP server immediately and start asking questions without touching any scripts.
The seeding workflow below is therefore optional: use it when you want to refresh that bundled catalog, add your own org/user, or rebuild the database from scratch.
```shell
# 1. Clone the repository
git clone https://github.com/Hulupeep/ruvscan.git
cd ruvscan

# 2. Set up the environment
cp .env.example .env.local
# Edit .env.local and add your GITHUB_TOKEN

# 3. (Optional) Refresh or extend the database
# Skip if you're happy with the included ruvnet dataset.
python3 scripts/seed_database.py

# 4. Start RuvScan
docker compose up -d

# 5. Start using immediately!
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"intent": "Find tools for AI applications"}'
```

The default seeding settings are in `.env.local`:
```shell
# Default repository to scan/seed
RUVSCAN_SOURCE_TYPE=org     # or 'user'
RUVSCAN_SOURCE_NAME=ruvnet  # GitHub username/org

# Database location
DATABASE_TYPE=sqlite
SQLITE_PATH=data/ruvscan.db
```

```shell
# Seed with default (ruvnet, 50 repos)
python3 scripts/seed_database.py

# Seed a specific user/org
python3 scripts/seed_database.py --org openai --limit 30

# Seed without README content (faster)
python3 scripts/seed_database.py --org vercel --limit 20 --no-readmes
```

```shell
# All options:
#   --org         GitHub user/org name
#   --limit       Max repos to fetch
#   --db          Custom database path
#   --no-readmes  Skip README fetching (much faster)
python3 scripts/seed_database.py \
  --org anthropics \
  --limit 100 \
  --db data/custom.db \
  --no-readmes
```

- Inside Claude / Codex say: "Use scan_github on org anthropics with limit 25."
- Or run the bundled CLI:

  ```shell
  ./scripts/ruvscan scan org anthropics --limit 25
  ```

Both routes feed new repositories into the same SQLite database alongside the preloaded ruvnet entries.
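You can confirm that entries from both sources share one database with a quick per-org count. A minimal sketch, shown against an in-memory stand-in with illustrative sample rows; point `sqlite3.connect` at `data/ruvscan.db` for the real database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use "data/ruvscan.db" in practice
conn.execute("CREATE TABLE repos (full_name TEXT PRIMARY KEY, org TEXT)")
conn.executemany("INSERT INTO repos VALUES (?, ?)", [
    ("ruvnet/FACT", "ruvnet"),             # preloaded entry
    ("ruvnet/MidStream", "ruvnet"),        # preloaded entry
    ("anthropics/example", "anthropics"),  # hypothetical row added by a scan
])

# One row count per org: preloaded ruvnet entries sit next to new ones.
counts = dict(conn.execute("SELECT org, COUNT(*) FROM repos GROUP BY org"))
print(counts)
```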
For each repository, the following data is stored:

| Field | Description | Example |
|---|---|---|
| `name` | Repository name | `sublinear-time-solver` |
| `org` | Owner (user/org) | `ruvnet` |
| `full_name` | Full identifier | `ruvnet/sublinear-time-solver` |
| `description` | Repo description | "TRUE O(log n) algorithms" |
| `topics` | GitHub topics | `["algorithms", "optimization"]` |
| `readme` | README content | Full markdown text |
| `stars` | Star count | 157 |
| `language` | Primary language | Python |
| `last_scan` | Scan timestamp | 2025-10-23 14:10:32 |
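These fields map naturally onto a single SQLite table. The sketch below is a hypothetical schema matching the fields above, not the actual DDL from `scripts/seed_database.py`, which may differ in detail:

```python
import sqlite3

# Hypothetical schema for the repos table described above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS repos (
    full_name   TEXT PRIMARY KEY,  -- e.g. "ruvnet/sublinear-time-solver"
    name        TEXT NOT NULL,
    org         TEXT NOT NULL,
    description TEXT,
    topics      TEXT,              -- JSON-encoded list of GitHub topics
    readme      TEXT,              -- full markdown text (optional)
    stars       INTEGER DEFAULT 0,
    language    TEXT,
    last_scan   TEXT               -- timestamp of the last scan
);
"""

conn = sqlite3.connect(":memory:")  # use "data/ruvscan.db" in practice
conn.executescript(SCHEMA)
cols = [row[1] for row in conn.execute("PRAGMA table_info(repos)")]
print(cols)  # the nine fields from the table above
```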
Every repository has a `last_scan` timestamp. Re-running the seed script will:
- ✅ Update existing repos with latest data
- ✅ Add new repos that don't exist
- ✅ Refresh the `last_scan` timestamp
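This upsert behavior can be sketched like so: a simplified stand-in for what the seed script does with SQLite's `INSERT OR REPLACE`, using a trimmed-down table:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # stand-in for data/ruvscan.db
conn.execute(
    "CREATE TABLE repos (full_name TEXT PRIMARY KEY, stars INTEGER, last_scan TEXT)"
)

def upsert_repo(conn, full_name, stars):
    # INSERT OR REPLACE updates the row if full_name already exists,
    # otherwise inserts a new one -- so reseeding never loses repos.
    conn.execute(
        "INSERT OR REPLACE INTO repos (full_name, stars, last_scan) VALUES (?, ?, ?)",
        (full_name, stars, datetime.now(timezone.utc).isoformat()),
    )

upsert_repo(conn, "ruvnet/FACT", 100)
upsert_repo(conn, "ruvnet/FACT", 157)      # rescan: updated in place
upsert_repo(conn, "ruvnet/MidStream", 42)  # new repo: added

rows = conn.execute("SELECT full_name, stars FROM repos ORDER BY full_name").fetchall()
print(rows)  # one row per repo, stars refreshed on rescan
```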
```shell
# Check when repos were last scanned (requires sqlite3)
sqlite3 data/ruvscan.db "
  SELECT full_name, last_scan
  FROM repos
  ORDER BY last_scan DESC
  LIMIT 10;
"
```

Recommended frequency:
- Active development: Weekly
- Production use: Monthly
- After major updates: Immediately
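If you want to automate that decision, a small staleness check against `last_scan` works. A sketch that assumes the `YYYY-MM-DD HH:MM:SS` timestamp format shown in the fields table; it uses an in-memory database with sample rows, so point `conn` at `data/ruvscan.db` for real use:

```python
import sqlite3
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"  # assumed last_scan format

conn = sqlite3.connect(":memory:")  # use "data/ruvscan.db" in practice
conn.execute("CREATE TABLE repos (full_name TEXT PRIMARY KEY, last_scan TEXT)")

def stale_repos(conn, max_age_days=7):
    """Repos whose last_scan is older than max_age_days (or never scanned)."""
    cutoff = (datetime.now() - timedelta(days=max_age_days)).strftime(FMT)
    rows = conn.execute(
        "SELECT full_name FROM repos WHERE last_scan IS NULL OR last_scan < ?",
        (cutoff,),
    )
    return [r[0] for r in rows]

now = datetime.now()
conn.execute("INSERT INTO repos VALUES (?, ?)",
             ("ruvnet/FACT", now.strftime(FMT)))                          # fresh
conn.execute("INSERT INTO repos VALUES (?, ?)",
             ("ruvnet/MidStream", (now - timedelta(days=30)).strftime(FMT)))  # stale
print(stale_repos(conn))  # only the month-old repo is due for a rescan
```

Lexicographic comparison is safe here because the timestamp format sorts chronologically.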
```shell
# Quick rescan (no READMEs, faster)
python3 scripts/seed_database.py --org ruvnet --no-readmes

# Full rescan (includes READMEs)
python3 scripts/seed_database.py --org ruvnet
```

You can seed from multiple organizations/users:
```shell
# Seed from multiple sources
python3 scripts/seed_database.py --org ruvnet --limit 50
python3 scripts/seed_database.py --org openai --limit 30
python3 scripts/seed_database.py --org anthropics --limit 20
python3 scripts/seed_database.py --org vercel --limit 25

# Check total repos
python3 scripts/seed_database.py --org facebook --limit 15
```

After seeding, your database will have repos from all sources!
```shell
# Fastest: just metadata, no README content
python3 scripts/seed_database.py --org ruvnet --limit 100 --no-readmes
# Speed: ~2-3 seconds per 10 repos
# Use this for initial exploration
```

```shell
# Slower: includes full README content
python3 scripts/seed_database.py --org ruvnet --limit 50
# Speed: ~5-10 seconds per 10 repos
# Use this for production-quality data
```

Without GitHub Token:
- Limit: 60 requests/hour
- Seeding: ~10-15 repos max
With GitHub Token:
- Limit: 5,000 requests/hour
- Seeding: Hundreds of repos easily
Always use a GitHub token for seeding!
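You can check which limit currently applies to you via GitHub's `/rate_limit` endpoint, which does not count against your quota. A standard-library sketch; the token below is a placeholder for illustration:

```python
import json
import urllib.request

API = "https://api.github.com/rate_limit"

def build_request(token=None):
    """Build the /rate_limit request, authenticated when a token is given."""
    req = urllib.request.Request(API)
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req

def core_rate_limit(token=None):
    """Return (limit, remaining) for the core GitHub REST API.

    Expect ~60/hour anonymous, 5,000/hour with a token.
    """
    with urllib.request.urlopen(build_request(token)) as resp:
        core = json.load(resp)["resources"]["core"]
    return core["limit"], core["remaining"]

# Placeholder token for illustration; pass os.environ["GITHUB_TOKEN"] for real use.
req = build_request("ghp_example_token")
print(req.get_header("Authorization"))  # Bearer ghp_example_token
```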
```shell
# Edit the file
nano .env.local

# Change these lines:
RUVSCAN_SOURCE_NAME=your-github-username
RUVSCAN_SOURCE_TYPE=user

# Reseed
python3 scripts/seed_database.py
```

```shell
# One-time seed of a different repo source
python3 scripts/seed_database.py --org microsoft --limit 50
# Your .env.local default stays unchanged
```

```shell
# Run the seed script with any org to see stats at the end
python3 scripts/seed_database.py --org ruvnet --limit 1
# Shows:
# - Total repos in database
# - Added/Updated counts
```

```shell
# Backup before major changes
cp data/ruvscan.db data/ruvscan.db.backup.$(date +%Y%m%d)

# Restore if needed
cp data/ruvscan.db.backup.20251023 data/ruvscan.db
docker compose restart
```

```shell
# Complete reset (deletes all data)
docker compose down
rm -f data/ruvscan.db
python3 scripts/seed_database.py
docker compose up -d
```

```shell
# If the data directory is owned by root
docker compose down
rm -rf data
mkdir -p data
python3 scripts/seed_database.py
docker compose up -d
```

```shell
# Check if the user/org exists
curl -s https://api.github.com/users/username | jq .
# If "Not Found", the username is wrong
```

```shell
# Add to .env.local
echo "GITHUB_TOKEN=ghp_your_token_here" >> .env.local
# Generate a token at: https://github.com/settings/tokens
# Needed scopes: public_repo, read:org
```

```shell
# Stop Docker to release the lock
docker compose down

# Then seed
python3 scripts/seed_database.py

# Restart
docker compose up -d
```

Create a seed script in your Dockerfile or docker-compose:

```dockerfile
# Dockerfile.python
RUN python3 scripts/seed_database.py --org ruvnet --limit 100 --no-readmes
```

Create a cron job:

```shell
# crontab -e
# Reseed every week on Sunday at 2am
0 2 * * 0 cd /path/to/ruvscan && python3 scripts/seed_database.py
```

```yaml
# .github/workflows/seed.yml
name: Seed Database
on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly
  workflow_dispatch:     # Manual trigger
jobs:
  seed:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Seed Database
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: python3 scripts/seed_database.py --org ruvnet
```

- ✅ Seed on first setup - Provides immediate functionality
- ✅ Use GitHub token - Avoids rate limiting
- ✅ Rescan periodically - Keeps data fresh
- ✅ Start with `--no-readmes` - Faster initial seed
- ✅ Add multiple sources - Broader coverage
- ✅ Backup before reseeding - Safety first
- ❌ Commit .env.local - Contains your token
- ❌ Seed without token - Very slow, limited
- ❌ Never rescan - Data gets stale
- ❌ Seed while Docker running - May cause locks
- ❌ Use massive limits initially - Start small
**Q: Does seeding replace the scanning feature?**
A: No. Seeding pre-populates the database for immediate use. The scanning feature (when fully implemented) will continuously update repos in the background.

**Q: How often should I reseed?**
A: Weekly for development, monthly for production. Or after major GitHub updates to repos you track.

**Q: Can I seed from multiple orgs/users?**
A: Yes! Just run the script multiple times with different `--org` parameters. All repos accumulate in the database.

**Q: Does reseeding delete existing data?**
A: No. It uses `INSERT OR REPLACE`, which updates existing repos and adds new ones. No data loss.

**Q: How do I change the default source?**
A: Edit `.env.local` and change `RUVSCAN_SOURCE_NAME` to your preferred GitHub user/org.

**Q: Can I seed private repositories?**
A: Yes, if your GitHub token has `repo` scope (not just `public_repo`). Add `--org your-private-org`.

**Q: Does seeding cost anything?**
A: No. It only uses the free GitHub API. Just stay within rate limits (5,000/hour with a token).

**Q: Can I automate reseeding?**
A: Yes! Use cron jobs, GitHub Actions, or systemd timers to periodically reseed.
Key Points:
- 🌱 Seed on first setup for immediate functionality
- 🔄 Rescan periodically to keep data fresh
- 🎯 Configure the default in `.env.local`
- 📊 Database tracks `last_scan` timestamps
- 🚀 Use `--no-readmes` for speed
- 💾 Backup before major changes
- 🔑 Always use a GitHub token
Quick Commands:
```shell
# First-time setup
python3 scripts/seed_database.py

# Regular rescan
python3 scripts/seed_database.py --no-readmes

# Add more repos
python3 scripts/seed_database.py --org openai --limit 30
```

Need help? See QUICK_START_LOCAL.md or open an issue!