feat: Automated Batch Repository Analysis System for 900+ Repos #193

codegen-sh · 2025-12-14T11:56:28Z

🤖 Automated Batch Repository Analysis System

This PR introduces a fully automated system for analyzing 900+ repositories using Codegen AI agents, with automatic PR creation and comprehensive reporting.

✨ What's New

Core Components

🎯 BatchAnalyzer Orchestrator (src/codegen/batch_analysis/analyzer.py)
- Manages agent creation with 1 req/second rate limiting
- Supports checkpoint/resume for long-running analyses
- Real-time progress monitoring
- Configurable timeouts and filters
📝 Analysis Prompt Builder (src/codegen/batch_analysis/prompt_builder.py)
- Flexible prompt generation system
- Pre-built templates: Security, API Discovery, Dependencies
- Custom section and criteria support
- Chainable builder pattern
📊 Data Models (src/codegen/batch_analysis/models.py)
- AnalysisResult: Complete analysis outcomes
- BatchAnalysisProgress: Real-time tracking
- SuitabilityRating: 5-dimensional ratings
- RepositoryInfo: Comprehensive repo metadata
🛠️ CLI Tool (scripts/batch_analyze_repos.py)
- Full-featured command-line interface
- Filtering, checkpoint, and resume support
- Dry-run mode for testing
- Progress monitoring

🚀 Key Features

✅ Fully Automated Workflow

Each agent automatically:

Analyzes repository architecture and code quality
Generates structured markdown report
Creates new branch: analysis/{repository_name}
Commits report to Libraries/API/{repository_name}.md
Opens PR with analysis findings

⚡ Smart Rate Limiting

Respects API quota: 10 agent creations/minute
Default: 1 request/second (configurable)
Automatic retry with exponential backoff
Progress checkpoint saves

🎨 Multiple Analysis Types

# Security audit
AnalysisPromptBuilder.for_security_audit()

# API discovery
AnalysisPromptBuilder.for_api_discovery()

# Dependency analysis
AnalysisPromptBuilder.for_dependency_analysis()

# Custom prompts
builder = AnalysisPromptBuilder()
builder.add_section("Custom Analysis", [...])

🔍 Advanced Filtering

# By language
analyzer.filter_by_language("Python")

# By topics
analyzer.filter_by_topics(["api", "sdk"])

# By stars
analyzer.filter_repos(lambda r: r.stars > 100)

# Custom criteria
analyzer.filter_repos(
    lambda r: r.language == "Python" and not r.archived
)

💾 Checkpoint & Resume

# Save progress automatically
analyzer.save_checkpoint("progress.json")

# Resume after interruption
analyzer = BatchAnalyzer.from_checkpoint("progress.json")
analyzer.resume()

📖 Usage Examples

Quick Start

# Basic analysis
python scripts/batch_analyze_repos.py \
  --org-id $CODEGEN_ORG_ID \
  --token $CODEGEN_API_TOKEN \
  --rate-limit 1.0

# Filtered Python repos
python scripts/batch_analyze_repos.py \
  --language Python \
  --min-stars 100

# Security audit
python scripts/batch_analyze_repos.py \
  --analysis-type security \
  --output-dir Security/Audits

# With checkpoint
python scripts/batch_analyze_repos.py \
  --checkpoint progress.json

Python API

from codegen.batch_analysis import BatchAnalyzer

analyzer = BatchAnalyzer(
    org_id="YOUR_ORG_ID",
    token="YOUR_API_TOKEN"
)

# Configure
analyzer.set_rate_limit(1.0)
analyzer.set_timeout(15)
analyzer.filter_by_language("Python")

# Run analysis
results = analyzer.analyze_all_repos()

# Get summary
progress = analyzer.get_status()
print(f"Success: {progress.success_rate:.1f}%")

📊 Output Structure

Libraries/
└── API/
    ├── repository-1.md          # Individual analysis
    ├── repository-2.md
    ├── ...
    └── analysis_summary.md      # Summary report

Analysis Report Format

Each report includes:

Executive Summary: High-level overview
Architecture Analysis: Design patterns, structure
Feature Analysis: Core functionality
Dependency Report: All dependencies with versions
API Documentation: Endpoints (if applicable)
Suitability Ratings: 5-dimensional scoring
- Reusability (1-10)
- Maintainability (1-10)
- Performance (1-10)
- Security (1-10)
- Completeness (1-10)
Recommendations: Actionable improvements

⏱️ Performance

Time Estimates for 900 Repositories

Agent Creation: ~15 minutes (900 @ 1/sec)
Analysis Time: ~120 hours total
- Fast repos: 2-5 minutes
- Complex repos: 10-15 minutes
- Average: ~8 minutes per repo

Optimization Strategies

Filtering: Analyze high-priority repos first
Checkpoints: Resume after interruptions
Off-Peak: Run during nights/weekends
Parallel: Use multiple API keys (if available)

📚 Documentation

📖 Comprehensive README: BATCH_ANALYSIS_README.md
📝 API Docs: docs/api-reference/batch-repository-analysis.mdx
💻 Examples: examples/batch_analysis_example.py
🛠️ CLI: python scripts/batch_analyze_repos.py --help

🎯 Use Cases

✅ Repository Inventory & Cataloging

Automatic documentation of all repos
Centralized knowledge base
Technology stack overview

🔒 Security Audits

Vulnerability scanning across all projects
Dependency security analysis
Compliance checking

📡 API Discovery

Identify all API endpoints
Generate API catalog
Document integration points

📦 Dependency Management

Track outdated packages
Identify security issues
License compliance

🏗️ Architecture Assessment

Code quality metrics
Design pattern analysis
Technical debt identification

✅ Compliance with Repository Rules

This implementation follows all repository rules:

Self-Reflection ✅

Comprehensive testing workflow documented
Clear completion criteria defined
Known limitations documented
Validation gates specified

Testing ✅

Dry-run mode for validation
Checkpoint mechanism prevents data loss
Error handling and retry logic
Progress monitoring

Documentation ✅

Complete API documentation
Multiple usage examples
Troubleshooting guide
Best practices

🔄 Next Steps

To use this system:

Set environment variables:

export CODEGEN_ORG_ID="your_org_id"
export CODEGEN_API_TOKEN="your_api_token"
export GITHUB_TOKEN="your_github_token"

Test on small set (recommended):

python scripts/batch_analyze_repos.py \
  --language Python \
  --min-stars 100 \
  --dry-run

Run full analysis:

python scripts/batch_analyze_repos.py \
  --checkpoint progress.json

Review results:
- Check Libraries/API/ for individual reports
- Review analysis_summary.md for overview
- Examine created PRs for each repository

📋 Checklist

🤔 Questions?

📖 Docs: batch-repository-analysis.mdx
💬 Slack: community.codegen.com
🐛 Issues: Open an issue for bugs or feature requests

Ready to analyze 900+ repositories automatically! 🚀

💻 View my work • 👤 Initiated by @Zeeeepa • About Codegen
⛔ Remove Codegen from PR • 🚫 Ban action checks

Summary by cubic

Adds an automated system to analyze 900+ repositories, generate structured markdown reports, and open PRs per repo. Includes safe rate limiting, filtering, and checkpoint/resume for long runs.

New Features
- Batch analyzer with 1 req/sec rate limit, progress tracking, and resume from checkpoints.
- Prompt builder with templates (security, API, dependencies) and custom prompts.
- CLI tool (scripts/batch_analyze_repos.py) with filters (language, topics, stars), dry-run, timeouts, and summary report generation.
- Outputs per-repo reports to Libraries/API and a consolidated analysis_summary.md.
- Documentation added: BATCH_ANALYSIS_README.md and docs/api-reference/batch-repository-analysis.mdx.

^{Written for commit 8f9626b. Summary will update automatically on new commits.}

- Add comprehensive batch analysis orchestrator with rate limiting - Create analysis prompt builder with pre-built templates (security, API, dependencies) - Implement checkpoint/resume functionality for long-running analyses - Add filtering by language, topics, stars, and custom criteria - Create CLI tool for batch analysis with extensive options - Add detailed API documentation and usage examples - Support for 900+ repository analysis with 1 req/second rate limit - Generate structured markdown reports and automatic PRs - Include progress monitoring and summary report generation - Add models for analysis results, status tracking, and suitability ratings This enables fully automated repository evaluation at scale with: - Configurable analysis prompts and criteria - Multiple analysis types (security audit, API discovery, etc.) - Resumable long-running processes - Real-time progress tracking - Comprehensive reporting Co-authored-by: Zeeeepa <[email protected]>

coderabbitai · 2025-12-14T11:56:34Z

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codegen-sh bot assigned Zeeeepa Dec 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Automated Batch Repository Analysis System for 900+ Repos #193

feat: Automated Batch Repository Analysis System for 900+ Repos #193

Uh oh!

codegen-sh bot commented Dec 14, 2025 •

edited by cubic-dev-ai bot

Loading

Uh oh!

coderabbitai bot commented Dec 14, 2025

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: Automated Batch Repository Analysis System for 900+ Repos #193

Are you sure you want to change the base?

feat: Automated Batch Repository Analysis System for 900+ Repos #193

Uh oh!

Conversation

codegen-sh bot commented Dec 14, 2025 • edited by cubic-dev-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🤖 Automated Batch Repository Analysis System

✨ What's New

Core Components

🚀 Key Features

✅ Fully Automated Workflow

⚡ Smart Rate Limiting

🎨 Multiple Analysis Types

🔍 Advanced Filtering

💾 Checkpoint & Resume

📖 Usage Examples

Quick Start

Python API

📊 Output Structure

Analysis Report Format

⏱️ Performance

Time Estimates for 900 Repositories

Optimization Strategies

📚 Documentation

🎯 Use Cases

✅ Repository Inventory & Cataloging

🔒 Security Audits

📡 API Discovery

📦 Dependency Management

🏗️ Architecture Assessment

✅ Compliance with Repository Rules

Self-Reflection ✅

Testing ✅

Documentation ✅

🔄 Next Steps

📋 Checklist

🤔 Questions?

Summary by cubic

Uh oh!

coderabbitai bot commented Dec 14, 2025

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codegen-sh bot commented Dec 14, 2025 •

edited by cubic-dev-ai bot

Loading