diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml
index 6133b04..b3b9e8f 100644
--- a/.github/workflows/benchmark.yml
+++ b/.github/workflows/benchmark.yml
@@ -1,4 +1,4 @@
-name: Outline Benchmarks
+name: Grainchain Benchmarks
 
 on:
   schedule:
@@ -37,9 +37,6 @@ jobs:
       run: |
         uv sync --all-extras
 
-    - name: Set up Docker Buildx
-      uses: docker/setup-buildx-action@v3
-
     - name: Configure Git
       run: |
         git config --global user.name "Benchmark Bot"
@@ -48,8 +45,6 @@ jobs:
     - name: Run benchmarks
       run: |
         uv run python benchmarks/scripts/auto_publish.py --run-benchmark
-      env:
-        DOCKER_HOST: unix:///var/run/docker.sock
 
     - name: Generate summary report
       run: |
diff --git a/README.md b/README.md
index ed02f6d..900f0fd 100644
--- a/README.md
+++ b/README.md
@@ -105,62 +105,64 @@ Compare sandbox providers with comprehensive performance testing:
 
 ```bash
 # Test individual providers
-grainchain benchmark --provider local
-grainchain benchmark --provider e2b
-grainchain benchmark --provider daytona
-grainchain benchmark --provider morph
+python benchmarks/scripts/grainchain_benchmark.py --providers local
+python benchmarks/scripts/grainchain_benchmark.py --providers e2b
+python benchmarks/scripts/grainchain_benchmark.py --providers daytona
+python benchmarks/scripts/grainchain_benchmark.py --providers morph
 
-# Generate timestamped results
-grainchain benchmark --provider local --output benchmarks/results/
+# Test multiple providers at once
+python benchmarks/scripts/grainchain_benchmark.py --providers local e2b --iterations 3
 
-# Check latest benchmark status (without running new tests)
-./scripts/benchmark_status.sh
+# Generate automated summary report
+python benchmarks/scripts/auto_publish.py --generate-summary
 ```
 
 ### Full Benchmark Suite
 
-Run comprehensive benchmarks across all providers:
+Run comprehensive benchmarks across all available providers:
 
 ```bash
-# Quick: Run all providers and save results
-for provider in local e2b daytona morph; do
-    echo "🚀 Testing $provider..."
-    grainchain benchmark --provider $provider --output benchmarks/results/
-done
+# Run full benchmark suite with all providers
+python benchmarks/scripts/grainchain_benchmark.py --providers local e2b modal daytona morph --iterations 3
 
-# Comprehensive: Generate a full report that can be committed
-./scripts/benchmark_all.sh
+# Run automated benchmark and generate summary (used by CI)
+python benchmarks/scripts/auto_publish.py --run-benchmark
 
-# Advanced: Use the detailed benchmark script
-./benchmarks/scripts/run_grainchain_benchmark.sh "local e2b daytona morph" 3
+# Generate summary from existing results
+python benchmarks/scripts/auto_publish.py --generate-summary
 ```
 
-The `benchmark_all.sh` script generates timestamped reports in `benchmarks/results/` that include:
+The benchmark system generates timestamped reports in `benchmarks/results/` that include:
 
-- Performance comparison tables
-- Environment details (OS, commit hash)
-- Analysis and recommendations
-- Raw benchmark data for tracking trends
+- Performance comparison tables across providers
+- Success rates and error analysis
+- Detailed metrics for each test scenario
+- JSON data for historical tracking
+- Automated summary reports
 
 ### Current Performance Baseline
 
-Latest benchmark results (updated 2024-05-31):
+Latest benchmark results (updated 2025-07-06):
 
-| Provider    | Total Time | Basic Echo | Python Test | File Ops | Performance      |
-| ----------- | ---------- | ---------- | ----------- | -------- | ---------------- |
-| **Local**   | 0.036s     | 0.007s     | 0.021s      | 0.008s   | ⚡ Fastest       |
-| **E2B**     | 0.599s     | 0.331s     | 0.111s      | 0.156s   | 🚀 Balanced      |
-| **Daytona** | 1.012s     | 0.305s     | 0.156s      | 0.551s   | 🛡️ Comprehensive |
-| **Morph**   | 0.250s     | 0.005s     | 0.010s      | 0.005s   | 🚀 Instant Snapshots |
+| Provider | Success Rate | Avg Time (s) | Status | Performance |
+|----------|--------------|--------------|--------|-------------|
+| **Local** | 76.7% | 1.09 | ✅ Available | ⚡ Fastest |
+| **E2B** | - | - | ❓ Not tested | 🚀 Cloud-based |
+| **Daytona** | - | - | ❓ Not tested | 🛡️ Comprehensive |
+| **Morph** | - | - | ❌ Payment required | 🚀 Instant Snapshots |
 
 > **Performance Notes**:
 >
-> - Local: Best for development/testing (17x faster than E2B, 28x faster than Daytona)
-> - E2B: Production-ready with good speed and reliability
-> - Daytona: Full workspace environments with comprehensive tooling
-> - Morph: Custom base images, instant snapshots, <250ms startup
+> - **Local**: Best for development/testing, fastest execution, 76.7% success rate
+> - **E2B**: Production-ready cloud sandboxes (requires API key setup)
+> - **Daytona**: Full workspace environments with comprehensive tooling
+> - **Morph**: Custom base images with instant snapshots (requires paid plan)
+>
+> Success rates reflect the percentage of test scenarios that complete successfully.
+> The Local provider shows 76.7% due to snapshot restoration limitations in the current test.
 
 Results are automatically saved to `benchmarks/results/` and can be committed to track performance over time.
+View the full benchmark summary at [`benchmarks/results/SUMMARY.md`](benchmarks/results/SUMMARY.md).
 
 ## 🎯 Why Grainchain?
 
diff --git a/benchmarks/results/SUMMARY.md b/benchmarks/results/SUMMARY.md
new file mode 100644
index 0000000..db51118
--- /dev/null
+++ b/benchmarks/results/SUMMARY.md
@@ -0,0 +1,39 @@
+# Grainchain Benchmark Summary
+
+**Last Updated:** 2025-07-06 20:49:29
+**Total Benchmark Runs:** 1
+
+## Recent Results
+
+| Date | Status | Success Rate | Avg Time (s) | Providers | Notes |
+|------|--------|--------------|--------------|-----------|-------|
+| 2025-07-06 | ✅ | 76.7% | 1.09 | local | OK |
+
+## Configuration
+
+The benchmarks use the following configuration:
+- **Providers:** Local, E2B, Modal, Daytona, Morph (when available)
+- **Test Scenarios:** Basic commands, Python execution, File operations, Computational tasks, Snapshot lifecycle
+- **Default Iterations:** 3
+- **Timeout:** 30 seconds per scenario
+
+## Metrics Collected
+
+- **Sandbox Creation Time:** Time to create a new sandbox
+- **Command Execution Time:** Time to execute individual commands
+- **Success Rate:** Percentage of successful operations
+- **File Operations:** Upload/download performance
+- **Snapshot Lifecycle:** Git clone, snapshot creation, and restoration
+
+## Test Scenarios
+
+1. **Basic Commands:** Shell commands (echo, pwd, ls, whoami, date)
+2. **Python Execution:** Python script execution and version checks
+3. **File Operations:** File upload/download with various sizes
+4. **Computational Tasks:** CPU-intensive Python operations
+5. **Snapshot Lifecycle:** Git clone, file creation, snapshot, kill, and restore
+
+## Automation
+
+This summary is automatically updated when new benchmark results are available.
+Results are committed to the repository for historical tracking.
diff --git a/benchmarks/results/grainchain_benchmark_20250706_204709.md b/benchmarks/results/grainchain_benchmark_20250706_204709.md
new file mode 100644
index 0000000..3d7b23c
--- /dev/null
+++ b/benchmarks/results/grainchain_benchmark_20250706_204709.md
@@ -0,0 +1,70 @@
+# Grainchain Provider Benchmark Report
+
+**Generated:** 2025-07-06T20:47:04.074559
+**Duration:** 5.44 seconds
+**Providers Tested:** local
+**Test Scenarios:** 5
+
+## Executive Summary
+
+| Provider | Success Rate | Avg Time (s) | Creation Time (s) | Status |
+|----------|--------------|--------------|-------------------|--------|
+| local | 76.7% | 1.09 | 0.00 | ⚠️ |
+
+## 🏆 Best Performers
+
+- **Most Reliable:** local
+- **Fastest Execution:** local
+- **Fastest Startup:** local
+
+## Detailed Results
+
+### LOCAL Provider
+
+- **Overall Success Rate:** 76.7%
+- **Average Scenario Time:** 1.09s
+- **Average Creation Time:** 0.00s
+
+#### Basic Commands
+- **Success Rate:** 100.0%
+- **Average Time:** 0.02s
+- **Iterations:** 1/1
+
+#### Python Execution
+- **Success Rate:** 100.0%
+- **Average Time:** 0.07s
+- **Iterations:** 1/1
+
+#### File Operations
+- **Success Rate:** 0.0%
+- **Average Time:** 0.00s
+- **Iterations:** 1/1
+
+#### Computational Tasks
+- **Success Rate:** 100.0%
+- **Average Time:** 0.06s
+- **Iterations:** 1/1
+
+#### Snapshot Lifecycle
+- **Success Rate:** 83.3%
+- **Average Time:** 5.27s
+- **Iterations:** 1/1
+
+## Configuration
+
+```json
+{
+  "providers": [
+    "local"
+  ],
+  "iterations": 1,
+  "timeout": 30,
+  "parallel_tests": false,
+  "detailed_metrics": true,
+  "export_formats": [
+    "json",
+    "markdown",
+    "html"
+  ]
+}
+```
diff --git a/benchmarks/results/grainchain_benchmark_20250706_204945.md b/benchmarks/results/grainchain_benchmark_20250706_204945.md
new file mode 100644
index 0000000..a39257b
--- /dev/null
+++ b/benchmarks/results/grainchain_benchmark_20250706_204945.md
@@ -0,0 +1,70 @@
+# Grainchain Provider Benchmark Report
+
+**Generated:** 2025-07-06T20:49:45.139726
+**Duration:** 0.50 seconds
+**Providers Tested:** local
+**Test Scenarios:** 5
+
+## Executive Summary
+
+| Provider | Success Rate | Avg Time (s) | Creation Time (s) | Status |
+|----------|--------------|--------------|-------------------|--------|
+| local | 73.3% | 0.03 | 0.00 | ⚠️ |
+
+## 🏆 Best Performers
+
+- **Most Reliable:** local
+- **Fastest Execution:** local
+- **Fastest Startup:** local
+
+## Detailed Results
+
+### LOCAL Provider
+
+- **Overall Success Rate:** 73.3%
+- **Average Scenario Time:** 0.03s
+- **Average Creation Time:** 0.00s
+
+#### Basic Commands
+- **Success Rate:** 100.0%
+- **Average Time:** 0.01s
+- **Iterations:** 3/3
+
+#### Python Execution
+- **Success Rate:** 100.0%
+- **Average Time:** 0.07s
+- **Iterations:** 3/3
+
+#### File Operations
+- **Success Rate:** 0.0%
+- **Average Time:** 0.00s
+- **Iterations:** 3/3
+
+#### Computational Tasks
+- **Success Rate:** 100.0%
+- **Average Time:** 0.07s
+- **Iterations:** 3/3
+
+#### Snapshot Lifecycle
+- **Success Rate:** 66.7%
+- **Average Time:** 0.01s
+- **Iterations:** 3/3
+
+## Configuration
+
+```json
+{
+  "providers": [
+    "local"
+  ],
+  "iterations": 3,
+  "timeout": 30,
+  "parallel_tests": false,
+  "detailed_metrics": true,
+  "export_formats": [
+    "json",
+    "markdown",
+    "html"
+  ]
+}
+```
diff --git a/benchmarks/results/latest_grainchain.md b/benchmarks/results/latest_grainchain.md
index a4c074c..a39257b 100644
--- a/benchmarks/results/latest_grainchain.md
+++ b/benchmarks/results/latest_grainchain.md
@@ -1,15 +1,15 @@
 # Grainchain Provider Benchmark Report
 
-**Generated:** 2025-06-04T02:53:12.502516
-**Duration:** 5.69 seconds
-**Providers Tested:** local, morph
+**Generated:** 2025-07-06T20:49:45.139726
+**Duration:** 0.50 seconds
+**Providers Tested:** local
 **Test Scenarios:** 5
 
 ## Executive Summary
 
 | Provider | Success Rate | Avg Time (s) | Creation Time (s) | Status |
 |----------|--------------|--------------|-------------------|--------|
-| local | 76.7% | 1.07 | 0.00 | ⚠️ |
+| local | 73.3% | 0.03 | 0.00 | ⚠️ |
 
 ## 🏆 Best Performers
 
@@ -21,53 +21,43 @@
 
 ### LOCAL Provider
 
-- **Overall Success Rate:** 76.7%
-- **Average Scenario Time:** 1.07s
+- **Overall Success Rate:** 73.3%
+- **Average Scenario Time:** 0.03s
 - **Average Creation Time:** 0.00s
 
 #### Basic Commands
 - **Success Rate:** 100.0%
-- **Average Time:** 0.02s
-- **Iterations:** 1/1
+- **Average Time:** 0.01s
+- **Iterations:** 3/3
 
 #### Python Execution
 - **Success Rate:** 100.0%
 - **Average Time:** 0.07s
-- **Iterations:** 1/1
+- **Iterations:** 3/3
 
 #### File Operations
 - **Success Rate:** 0.0%
 - **Average Time:** 0.00s
-- **Iterations:** 1/1
+- **Iterations:** 3/3
 
 #### Computational Tasks
 - **Success Rate:** 100.0%
-- **Average Time:** 0.06s
-- **Iterations:** 1/1
+- **Average Time:** 0.07s
+- **Iterations:** 3/3
 
 #### Snapshot Lifecycle
-- **Success Rate:** 83.3%
-- **Average Time:** 5.22s
-- **Iterations:** 1/1
-
-### MORPH Provider
-
-❌ **Status:** unavailable
-**Error:** Failed to create sandbox: Failed to create sandbox: Morph authentication failed: HTTP Error 402 for url 'https://cloud.morph.so/api/snapshot'
-Status Code: 402
-Response Body: {
-  "detail": "Payment required"
-}
+- **Success Rate:** 66.7%
+- **Average Time:** 0.01s
+- **Iterations:** 3/3
 
 ## Configuration
 
 ```json
 {
   "providers": [
-    "local",
-    "morph"
+    "local"
   ],
-  "iterations": 1,
+  "iterations": 3,
   "timeout": 30,
   "parallel_tests": false,
   "detailed_metrics": true,
diff --git a/benchmarks/scripts/auto_publish.py b/benchmarks/scripts/auto_publish.py
index 549b1ec..7de538d 100755
--- a/benchmarks/scripts/auto_publish.py
+++ b/benchmarks/scripts/auto_publish.py
@@ -18,7 +18,6 @@
 
 # Add the scripts directory to Python path
 sys.path.append(str(Path(__file__).parent))
-from benchmark_runner import BenchmarkRunner  # noqa: E402
 
 
 class AutoPublisher:
@@ -36,17 +35,32 @@ def __init__(self):
     def run_benchmark_and_publish(self) -> bool:
         """Run benchmark and publish results"""
         try:
-            # Run the benchmark
-            self.logger.info("Starting automated benchmark run...")
-            runner = BenchmarkRunner()
-            results = runner.run_benchmark()
+            # Run the grainchain benchmark instead of Docker-based benchmark
+            self.logger.info("Starting automated grainchain benchmark run...")
 
-            if results.get("status") != "completed":
-                self.logger.error("Benchmark failed, not publishing results")
+            import subprocess
+            import sys
+
+            # Run the grainchain benchmark script
+            result = subprocess.run(
+                [
+                    sys.executable,
+                    "benchmarks/scripts/grainchain_benchmark.py",
+                    "--providers",
+                    "local",
+                    "--iterations",
+                    "3",
+                ],
+                capture_output=True,
+                text=True,
+                cwd=self.repo_root,
+            )
+
+            if result.returncode != 0:
+                self.logger.error(f"Grainchain benchmark failed: {result.stderr}")
                 return False
 
-            # Save results
-            runner.save_results(results)
+            self.logger.info("Grainchain benchmark completed successfully")
 
             # Commit and push results
             return self._commit_and_push_results()
@@ -101,12 +115,12 @@ def _commit_and_push_results(self) -> bool:
     def generate_summary_report(self) -> None:
         """Generate a summary report from all historical results"""
         try:
-            # Find all result files
-            result_files = list(self.results_dir.glob("benchmark_*.json"))
+            # Find all grainchain result files
+            result_files = list(self.results_dir.glob("grainchain_benchmark_*.json"))
             result_files.sort()
 
             if not result_files:
-                self.logger.warning("No benchmark results found")
+                self.logger.warning("No grainchain benchmark results found")
                 return
 
             # Load all results
@@ -120,7 +134,7 @@ def generate_summary_report(self) -> None:
                     self.logger.warning(f"Failed to load {file_path}: {e}")
 
             # Generate summary markdown
-            summary_md = self._generate_summary_markdown(all_results)
+            summary_md = self._generate_grainchain_summary_markdown(all_results)
 
             # Save summary
             summary_file = self.results_dir / "SUMMARY.md"
@@ -132,74 +146,82 @@ def generate_summary_report(self) -> None:
         except Exception as e:
             self.logger.error(f"Failed to generate summary report: {e}")
 
-    def _generate_summary_markdown(self, results: list) -> str:
-        """Generate summary markdown from all results"""
-        md_content = f"""# Outline Benchmark Summary
+    def _generate_grainchain_summary_markdown(self, results: list) -> str:
+        """Generate summary markdown from grainchain benchmark results"""
+        md_content = f"""# Grainchain Benchmark Summary
 
 **Last Updated:** {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
 **Total Benchmark Runs:** {len(results)}
 
 ## Recent Results
 
-| Date | Status | Avg Build Time (s) | Avg Memory (MB) | Notes |
-|------|--------|-------------------|-----------------|-------|
+| Date | Status | Success Rate | Avg Time (s) | Providers | Notes |
+|------|--------|--------------|--------------|-----------|-------|
 """
 
         # Show last 10 results
         recent_results = results[-10:] if len(results) > 10 else results
 
         for result in reversed(recent_results):
-            date = result.get("start_time", "Unknown")[:10]  # Extract date part
+            # Extract date from benchmark_info
+            start_time = result.get("benchmark_info", {}).get("start_time", "Unknown")
+            date = start_time[:10] if start_time != "Unknown" else "Unknown"
+
             status = "✅" if result.get("status") == "completed" else "❌"
 
-            # Calculate averages
-            snapshots = result.get("snapshots", [])
-            build_times = []
-            memory_usages = []
-
-            for snapshot in snapshots:
-                if "metrics" in snapshot and "performance" in snapshot["metrics"]:
-                    build_time = snapshot["metrics"]["performance"].get(
-                        "build_time_seconds"
-                    )
-                    if build_time and isinstance(build_time, int | float):
-                        build_times.append(build_time)
-
-                if "metrics" in snapshot and "container" in snapshot["metrics"]:
-                    memory = snapshot["metrics"]["container"].get("memory_usage")
-                    if memory and isinstance(memory, int | float):
-                        memory_usages.append(memory / 1024 / 1024)  # Convert to MB
-
-            avg_build = (
-                round(sum(build_times) / len(build_times), 2) if build_times else "N/A"
+            # Calculate overall success rate and average time
+            provider_results = result.get("provider_results", {})
+            success_rates = []
+            avg_times = []
+            providers = []
+
+            for provider, provider_data in provider_results.items():
+                providers.append(provider)
+                if provider_data.get("status") == "completed":
+                    overall_metrics = provider_data.get("overall_metrics", {})
+                    success_rate = overall_metrics.get("overall_success_rate", 0)
+                    avg_time = overall_metrics.get("avg_scenario_time", 0)
+                    success_rates.append(success_rate)
+                    avg_times.append(avg_time)
+
+            overall_success = (
+                round((sum(success_rates) / len(success_rates)) * 100, 1)
+                if success_rates
+                else 0
             )
-            avg_memory = (
-                round(sum(memory_usages) / len(memory_usages), 2)
-                if memory_usages
-                else "N/A"
+            overall_avg_time = (
+                round(sum(avg_times) / len(avg_times), 2) if avg_times else 0
             )
+            providers_str = ", ".join(providers) if providers else "None"
 
             notes = "Failed" if result.get("status") != "completed" else "OK"
 
-            md_content += (
-                f"| {date} | {status} | {avg_build} | {avg_memory} | {notes} |\n"
-            )
+            md_content += f"| {date} | {status} | {overall_success}% | {overall_avg_time} | {providers_str} | {notes} |\n"
 
         md_content += """
 ## Configuration
 
 The benchmarks use the following configuration:
-- **Base Image:** `ghcr.io/openai/codex-universal:latest`
-- **Node Version:** 20
-- **Benchmark Iterations:** 3
-- **Trivial Changes:** Comment addition, whitespace, log statements
+- **Providers:** Local, E2B, Modal, Daytona, Morph (when available)
+- **Test Scenarios:** Basic commands, Python execution, File operations, Computational tasks, Snapshot lifecycle
+- **Default Iterations:** 3
+- **Timeout:** 30 seconds per scenario
 
 ## Metrics Collected
 
-- **Build Time:** Time to run `yarn build`
-- **Memory Usage:** Container memory consumption
-- **File System:** Package count and directory sizes
-- **Test Time:** Time to run test suite
+- **Sandbox Creation Time:** Time to create a new sandbox
+- **Command Execution Time:** Time to execute individual commands
+- **Success Rate:** Percentage of successful operations
+- **File Operations:** Upload/download performance
+- **Snapshot Lifecycle:** Git clone, snapshot creation, and restoration
+
+## Test Scenarios
+
+1. **Basic Commands:** Shell commands (echo, pwd, ls, whoami, date)
+2. **Python Execution:** Python script execution and version checks
+3. **File Operations:** File upload/download with various sizes
+4. **Computational Tasks:** CPU-intensive Python operations
+5. **Snapshot Lifecycle:** Git clone, file creation, snapshot, kill, and restore
 
 ## Automation