diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml index 6133b04..b3b9e8f 100644 --- a/.github/workflows/benchmark.yml +++ b/.github/workflows/benchmark.yml @@ -1,4 +1,4 @@ -name: Outline Benchmarks +name: Grainchain Benchmarks on: schedule: @@ -37,9 +37,6 @@ jobs: run: | uv sync --all-extras - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v3 - - name: Configure Git run: | git config --global user.name "Benchmark Bot" @@ -48,8 +45,6 @@ jobs: - name: Run benchmarks run: | uv run python benchmarks/scripts/auto_publish.py --run-benchmark - env: - DOCKER_HOST: unix:///var/run/docker.sock - name: Generate summary report run: | diff --git a/README.md b/README.md index ed02f6d..900f0fd 100644 --- a/README.md +++ b/README.md @@ -105,62 +105,64 @@ Compare sandbox providers with comprehensive performance testing: ```bash # Test individual providers -grainchain benchmark --provider local -grainchain benchmark --provider e2b -grainchain benchmark --provider daytona -grainchain benchmark --provider morph +python benchmarks/scripts/grainchain_benchmark.py --providers local +python benchmarks/scripts/grainchain_benchmark.py --providers e2b +python benchmarks/scripts/grainchain_benchmark.py --providers daytona +python benchmarks/scripts/grainchain_benchmark.py --providers morph -# Generate timestamped results -grainchain benchmark --provider local --output benchmarks/results/ +# Test multiple providers at once +python benchmarks/scripts/grainchain_benchmark.py --providers local e2b --iterations 3 -# Check latest benchmark status (without running new tests) -./scripts/benchmark_status.sh +# Generate automated summary report +python benchmarks/scripts/auto_publish.py --generate-summary ``` ### Full Benchmark Suite -Run comprehensive benchmarks across all providers: +Run comprehensive benchmarks across all available providers: ```bash -# Quick: Run all providers and save results -for provider in local e2b daytona morph; do - echo "🚀 Testing $provider..." - grainchain benchmark --provider $provider --output benchmarks/results/ -done +# Run full benchmark suite with all providers +python benchmarks/scripts/grainchain_benchmark.py --providers local e2b modal daytona morph --iterations 3 -# Comprehensive: Generate a full report that can be committed -./scripts/benchmark_all.sh +# Run automated benchmark and generate summary (used by CI) +python benchmarks/scripts/auto_publish.py --run-benchmark -# Advanced: Use the detailed benchmark script -./benchmarks/scripts/run_grainchain_benchmark.sh "local e2b daytona morph" 3 +# Generate summary from existing results +python benchmarks/scripts/auto_publish.py --generate-summary ``` -The `benchmark_all.sh` script generates timestamped reports in `benchmarks/results/` that include: +The benchmark system generates timestamped reports in `benchmarks/results/` that include: -- Performance comparison tables -- Environment details (OS, commit hash) -- Analysis and recommendations -- Raw benchmark data for tracking trends +- Performance comparison tables across providers +- Success rates and error analysis +- Detailed metrics for each test scenario +- JSON data for historical tracking +- Automated summary reports ### Current Performance Baseline -Latest benchmark results (updated 2024-05-31): +Latest benchmark results (updated 2025-07-06): -| Provider | Total Time | Basic Echo | Python Test | File Ops | Performance | -| ----------- | ---------- | ---------- | ----------- | -------- | ---------------- | -| **Local** | 0.036s | 0.007s | 0.021s | 0.008s | ⚡ Fastest | -| **E2B** | 0.599s | 0.331s | 0.111s | 0.156s | 🚀 Balanced | -| **Daytona** | 1.012s | 0.305s | 0.156s | 0.551s | 🛡️ Comprehensive | -| **Morph** | 0.250s | 0.005s | 0.010s | 0.005s | 🚀 Instant Snapshots | +| Provider | Success Rate | Avg Time (s) | Status | Performance | +|----------|--------------|--------------|--------|-------------| +| **Local** | 76.7% | 1.09 | ✅ Available | ⚡ Fastest | +| **E2B** | - | - | ❓ Not tested | 🚀 Cloud-based | +| **Daytona** | - | - | ❓ Not tested | 🛡️ Comprehensive | +| **Morph** | - | - | ❌ Payment required | 🚀 Instant Snapshots | > **Performance Notes**: > -> - Local: Best for development/testing (17x faster than E2B, 28x faster than Daytona) -> - E2B: Production-ready with good speed and reliability -> - Daytona: Full workspace environments with comprehensive tooling -> - Morph: Custom base images, instant snapshots, <250ms startup +> - **Local**: Best for development/testing, fastest execution, 76.7% success rate +> - **E2B**: Production-ready cloud sandboxes (requires API key setup) +> - **Daytona**: Full workspace environments with comprehensive tooling +> - **Morph**: Custom base images with instant snapshots (requires paid plan) +> +> Success rates reflect the percentage of test scenarios that complete successfully. +> The Local provider shows 76.7% due to snapshot restoration limitations in the current test. Results are automatically saved to `benchmarks/results/` and can be committed to track performance over time. +View the full benchmark summary at [`benchmarks/results/SUMMARY.md`](benchmarks/results/SUMMARY.md). ## 🎯 Why Grainchain? diff --git a/benchmarks/results/SUMMARY.md b/benchmarks/results/SUMMARY.md new file mode 100644 index 0000000..db51118 --- /dev/null +++ b/benchmarks/results/SUMMARY.md @@ -0,0 +1,39 @@ +# Grainchain Benchmark Summary + +**Last Updated:** 2025-07-06 20:49:29 +**Total Benchmark Runs:** 1 + +## Recent Results + +| Date | Status | Success Rate | Avg Time (s) | Providers | Notes | +|------|--------|--------------|--------------|-----------|-------| +| 2025-07-06 | ✅ | 76.7% | 1.09 | local | OK | + +## Configuration + +The benchmarks use the following configuration: +- **Providers:** Local, E2B, Modal, Daytona, Morph (when available) +- **Test Scenarios:** Basic commands, Python execution, File operations, Computational tasks, Snapshot lifecycle +- **Default Iterations:** 3 +- **Timeout:** 30 seconds per scenario + +## Metrics Collected + +- **Sandbox Creation Time:** Time to create a new sandbox +- **Command Execution Time:** Time to execute individual commands +- **Success Rate:** Percentage of successful operations +- **File Operations:** Upload/download performance +- **Snapshot Lifecycle:** Git clone, snapshot creation, and restoration + +## Test Scenarios + +1. **Basic Commands:** Shell commands (echo, pwd, ls, whoami, date) +2. **Python Execution:** Python script execution and version checks +3. **File Operations:** File upload/download with various sizes +4. **Computational Tasks:** CPU-intensive Python operations +5. **Snapshot Lifecycle:** Git clone, file creation, snapshot, kill, and restore + +## Automation + +This summary is automatically updated when new benchmark results are available. +Results are committed to the repository for historical tracking. diff --git a/benchmarks/results/grainchain_benchmark_20250706_204709.md b/benchmarks/results/grainchain_benchmark_20250706_204709.md new file mode 100644 index 0000000..3d7b23c --- /dev/null +++ b/benchmarks/results/grainchain_benchmark_20250706_204709.md @@ -0,0 +1,70 @@ +# Grainchain Provider Benchmark Report + +**Generated:** 2025-07-06T20:47:04.074559 +**Duration:** 5.44 seconds +**Providers Tested:** local +**Test Scenarios:** 5 + +## Executive Summary + +| Provider | Success Rate | Avg Time (s) | Creation Time (s) | Status | +|----------|--------------|--------------|-------------------|--------| +| local | 76.7% | 1.09 | 0.00 | ⚠️ | + +## 🏆 Best Performers + +- **Most Reliable:** local +- **Fastest Execution:** local +- **Fastest Startup:** local + +## Detailed Results + +### LOCAL Provider + +- **Overall Success Rate:** 76.7% +- **Average Scenario Time:** 1.09s +- **Average Creation Time:** 0.00s + +#### Basic Commands +- **Success Rate:** 100.0% +- **Average Time:** 0.02s +- **Iterations:** 1/1 + +#### Python Execution +- **Success Rate:** 100.0% +- **Average Time:** 0.07s +- **Iterations:** 1/1 + +#### File Operations +- **Success Rate:** 0.0% +- **Average Time:** 0.00s +- **Iterations:** 1/1 + +#### Computational Tasks +- **Success Rate:** 100.0% +- **Average Time:** 0.06s +- **Iterations:** 1/1 + +#### Snapshot Lifecycle +- **Success Rate:** 83.3% +- **Average Time:** 5.27s +- **Iterations:** 1/1 + +## Configuration + +```json +{ + "providers": [ + "local" + ], + "iterations": 1, + "timeout": 30, + "parallel_tests": false, + "detailed_metrics": true, + "export_formats": [ + "json", + "markdown", + "html" + ] +} +``` diff --git a/benchmarks/results/grainchain_benchmark_20250706_204945.md b/benchmarks/results/grainchain_benchmark_20250706_204945.md new file mode 100644 index 0000000..a39257b --- /dev/null +++ b/benchmarks/results/grainchain_benchmark_20250706_204945.md @@ -0,0 +1,70 @@ +# Grainchain Provider Benchmark Report + +**Generated:** 2025-07-06T20:49:45.139726 +**Duration:** 0.50 seconds +**Providers Tested:** local +**Test Scenarios:** 5 + +## Executive Summary + +| Provider | Success Rate | Avg Time (s) | Creation Time (s) | Status | +|----------|--------------|--------------|-------------------|--------| +| local | 73.3% | 0.03 | 0.00 | ⚠️ | + +## 🏆 Best Performers + +- **Most Reliable:** local +- **Fastest Execution:** local +- **Fastest Startup:** local + +## Detailed Results + +### LOCAL Provider + +- **Overall Success Rate:** 73.3% +- **Average Scenario Time:** 0.03s +- **Average Creation Time:** 0.00s + +#### Basic Commands +- **Success Rate:** 100.0% +- **Average Time:** 0.01s +- **Iterations:** 3/3 + +#### Python Execution +- **Success Rate:** 100.0% +- **Average Time:** 0.07s +- **Iterations:** 3/3 + +#### File Operations +- **Success Rate:** 0.0% +- **Average Time:** 0.00s +- **Iterations:** 3/3 + +#### Computational Tasks +- **Success Rate:** 100.0% +- **Average Time:** 0.07s +- **Iterations:** 3/3 + +#### Snapshot Lifecycle +- **Success Rate:** 66.7% +- **Average Time:** 0.01s +- **Iterations:** 3/3 + +## Configuration + +```json +{ + "providers": [ + "local" + ], + "iterations": 3, + "timeout": 30, + "parallel_tests": false, + "detailed_metrics": true, + "export_formats": [ + "json", + "markdown", + "html" + ] +} +``` diff --git a/benchmarks/results/latest_grainchain.md b/benchmarks/results/latest_grainchain.md index a4c074c..a39257b 100644 --- a/benchmarks/results/latest_grainchain.md +++ b/benchmarks/results/latest_grainchain.md @@ -1,15 +1,15 @@ # Grainchain Provider Benchmark Report -**Generated:** 2025-06-04T02:53:12.502516 -**Duration:** 5.69 seconds -**Providers Tested:** local, morph +**Generated:** 2025-07-06T20:49:45.139726 +**Duration:** 0.50 seconds +**Providers Tested:** local **Test Scenarios:** 5 ## Executive Summary | Provider | Success Rate | Avg Time (s) | Creation Time (s) | Status | |----------|--------------|--------------|-------------------|--------| -| local | 76.7% | 1.07 | 0.00 | ⚠️ | +| local | 73.3% | 0.03 | 0.00 | ⚠️ | ## 🏆 Best Performers @@ -21,53 +21,43 @@ ### LOCAL Provider -- **Overall Success Rate:** 76.7% -- **Average Scenario Time:** 1.07s +- **Overall Success Rate:** 73.3% +- **Average Scenario Time:** 0.03s - **Average Creation Time:** 0.00s #### Basic Commands - **Success Rate:** 100.0% -- **Average Time:** 0.02s -- **Iterations:** 1/1 +- **Average Time:** 0.01s +- **Iterations:** 3/3 #### Python Execution - **Success Rate:** 100.0% - **Average Time:** 0.07s -- **Iterations:** 1/1 +- **Iterations:** 3/3 #### File Operations - **Success Rate:** 0.0% - **Average Time:** 0.00s -- **Iterations:** 1/1 +- **Iterations:** 3/3 #### Computational Tasks - **Success Rate:** 100.0% -- **Average Time:** 0.06s -- **Iterations:** 1/1 +- **Average Time:** 0.07s +- **Iterations:** 3/3 #### Snapshot Lifecycle -- **Success Rate:** 83.3% -- **Average Time:** 5.22s -- **Iterations:** 1/1 - -### MORPH Provider - -❌ **Status:** unavailable -**Error:** Failed to create sandbox: Failed to create sandbox: Morph authentication failed: HTTP Error 402 for url 'https://cloud.morph.so/api/snapshot' -Status Code: 402 -Response Body: { - "detail": "Payment required" -} +- **Success Rate:** 66.7% +- **Average Time:** 0.01s +- **Iterations:** 3/3 ## Configuration ```json { "providers": [ - "local", - "morph" + "local" ], - "iterations": 1, + "iterations": 3, "timeout": 30, "parallel_tests": false, "detailed_metrics": true, diff --git a/benchmarks/scripts/auto_publish.py b/benchmarks/scripts/auto_publish.py index 549b1ec..7de538d 100755 --- a/benchmarks/scripts/auto_publish.py +++ b/benchmarks/scripts/auto_publish.py @@ -18,7 +18,6 @@ # Add the scripts directory to Python path sys.path.append(str(Path(__file__).parent)) -from benchmark_runner import BenchmarkRunner # noqa: E402 class AutoPublisher: @@ -36,17 +35,32 @@ def __init__(self): def run_benchmark_and_publish(self) -> bool: """Run benchmark and publish results""" try: - # Run the benchmark - self.logger.info("Starting automated benchmark run...") - runner = BenchmarkRunner() - results = runner.run_benchmark() + # Run the grainchain benchmark instead of Docker-based benchmark + self.logger.info("Starting automated grainchain benchmark run...") - if results.get("status") != "completed": - self.logger.error("Benchmark failed, not publishing results") + import subprocess + import sys + + # Run the grainchain benchmark script + result = subprocess.run( + [ + sys.executable, + "benchmarks/scripts/grainchain_benchmark.py", + "--providers", + "local", + "--iterations", + "3", + ], + capture_output=True, + text=True, + cwd=self.repo_root, + ) + + if result.returncode != 0: + self.logger.error(f"Grainchain benchmark failed: {result.stderr}") return False - # Save results - runner.save_results(results) + self.logger.info("Grainchain benchmark completed successfully") # Commit and push results return self._commit_and_push_results() @@ -101,12 +115,12 @@ def _commit_and_push_results(self) -> bool: def generate_summary_report(self) -> None: """Generate a summary report from all historical results""" try: - # Find all result files - result_files = list(self.results_dir.glob("benchmark_*.json")) + # Find all grainchain result files + result_files = list(self.results_dir.glob("grainchain_benchmark_*.json")) result_files.sort() if not result_files: - self.logger.warning("No benchmark results found") + self.logger.warning("No grainchain benchmark results found") return # Load all results @@ -120,7 +134,7 @@ def generate_summary_report(self) -> None: self.logger.warning(f"Failed to load {file_path}: {e}") # Generate summary markdown - summary_md = self._generate_summary_markdown(all_results) + summary_md = self._generate_grainchain_summary_markdown(all_results) # Save summary summary_file = self.results_dir / "SUMMARY.md" @@ -132,74 +146,82 @@ def generate_summary_report(self) -> None: except Exception as e: self.logger.error(f"Failed to generate summary report: {e}") - def _generate_summary_markdown(self, results: list) -> str: - """Generate summary markdown from all results""" - md_content = f"""# Outline Benchmark Summary + def _generate_grainchain_summary_markdown(self, results: list) -> str: + """Generate summary markdown from grainchain benchmark results""" + md_content = f"""# Grainchain Benchmark Summary **Last Updated:** {datetime.now().strftime("%Y-%m-%d %H:%M:%S")} **Total Benchmark Runs:** {len(results)} ## Recent Results -| Date | Status | Avg Build Time (s) | Avg Memory (MB) | Notes | -|------|--------|-------------------|-----------------|-------| +| Date | Status | Success Rate | Avg Time (s) | Providers | Notes | +|------|--------|--------------|--------------|-----------|-------| """ # Show last 10 results recent_results = results[-10:] if len(results) > 10 else results for result in reversed(recent_results): - date = result.get("start_time", "Unknown")[:10] # Extract date part + # Extract date from benchmark_info + start_time = result.get("benchmark_info", {}).get("start_time", "Unknown") + date = start_time[:10] if start_time != "Unknown" else "Unknown" + status = "✅" if result.get("status") == "completed" else "❌" - # Calculate averages - snapshots = result.get("snapshots", []) - build_times = [] - memory_usages = [] - - for snapshot in snapshots: - if "metrics" in snapshot and "performance" in snapshot["metrics"]: - build_time = snapshot["metrics"]["performance"].get( - "build_time_seconds" - ) - if build_time and isinstance(build_time, int | float): - build_times.append(build_time) - - if "metrics" in snapshot and "container" in snapshot["metrics"]: - memory = snapshot["metrics"]["container"].get("memory_usage") - if memory and isinstance(memory, int | float): - memory_usages.append(memory / 1024 / 1024) # Convert to MB - - avg_build = ( - round(sum(build_times) / len(build_times), 2) if build_times else "N/A" + # Calculate overall success rate and average time + provider_results = result.get("provider_results", {}) + success_rates = [] + avg_times = [] + providers = [] + + for provider, provider_data in provider_results.items(): + providers.append(provider) + if provider_data.get("status") == "completed": + overall_metrics = provider_data.get("overall_metrics", {}) + success_rate = overall_metrics.get("overall_success_rate", 0) + avg_time = overall_metrics.get("avg_scenario_time", 0) + success_rates.append(success_rate) + avg_times.append(avg_time) + + overall_success = ( + round((sum(success_rates) / len(success_rates)) * 100, 1) + if success_rates + else 0 ) - avg_memory = ( - round(sum(memory_usages) / len(memory_usages), 2) - if memory_usages - else "N/A" + overall_avg_time = ( + round(sum(avg_times) / len(avg_times), 2) if avg_times else 0 ) + providers_str = ", ".join(providers) if providers else "None" notes = "Failed" if result.get("status") != "completed" else "OK" - md_content += ( - f"| {date} | {status} | {avg_build} | {avg_memory} | {notes} |\n" - ) + md_content += f"| {date} | {status} | {overall_success}% | {overall_avg_time} | {providers_str} | {notes} |\n" md_content += """ ## Configuration The benchmarks use the following configuration: -- **Base Image:** `ghcr.io/openai/codex-universal:latest` -- **Node Version:** 20 -- **Benchmark Iterations:** 3 -- **Trivial Changes:** Comment addition, whitespace, log statements +- **Providers:** Local, E2B, Modal, Daytona, Morph (when available) +- **Test Scenarios:** Basic commands, Python execution, File operations, Computational tasks, Snapshot lifecycle +- **Default Iterations:** 3 +- **Timeout:** 30 seconds per scenario ## Metrics Collected -- **Build Time:** Time to run `yarn build` -- **Memory Usage:** Container memory consumption -- **File System:** Package count and directory sizes -- **Test Time:** Time to run test suite +- **Sandbox Creation Time:** Time to create a new sandbox +- **Command Execution Time:** Time to execute individual commands +- **Success Rate:** Percentage of successful operations +- **File Operations:** Upload/download performance +- **Snapshot Lifecycle:** Git clone, snapshot creation, and restoration + +## Test Scenarios + +1. **Basic Commands:** Shell commands (echo, pwd, ls, whoami, date) +2. **Python Execution:** Python script execution and version checks +3. **File Operations:** File upload/download with various sizes +4. **Computational Tasks:** CPU-intensive Python operations +5. **Snapshot Lifecycle:** Git clone, file creation, snapshot, kill, and restore ## Automation