# Benchmark Viewer Integration Design

## Overview

This document describes how to integrate benchmark evaluation results (WAA, WebArena, OSWorld) into the unified viewer, enabling side-by-side comparison of model performance across benchmark tasks.

## Current State

### Viewer Capabilities
- Step-by-step screenshot playback
- Human vs. model action comparison per step
- Checkpoint switching (None, Epoch 1, Epoch 2, etc.)
- Metrics: accuracy percentage, step count

### Benchmark Module (`openadapt_ml/benchmarks/`)
- `BenchmarkAdapter` interface for different benchmarks
- `WAAAdapter` for Windows Agent Arena
- `AzureWAAOrchestrator` for parallel VM execution
- Produces per-task success/failure results

## Design Goals

1. **Unified Experience**: Same viewer UI for captures and benchmark tasks
2. **Model Comparison**: Compare multiple models on identical tasks
3. **Drill-Down**: From aggregate metrics → task list → step-by-step replay
4. **Actionable Insights**: Identify failure patterns, common errors

## Proposed Architecture

### Data Model

```
benchmark_results/
├── waa_eval_20241214/
│   ├── metadata.json            # Benchmark config, models evaluated
│   ├── tasks/
│   │   ├── task_001/
│   │   │   ├── task.json        # Task definition, success criteria
│   │   │   ├── screenshots/     # Execution screenshots
│   │   │   ├── model_a.json     # Model A's execution trace
│   │   │   └── model_b.json     # Model B's execution trace
│   │   └── task_002/
│   │       └── ...
│   └── summary.json             # Aggregate results
```
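
The exact contents of `metadata.json` and `summary.json` are not pinned down yet. As a working assumption (every field name below is illustrative, not existing code), `summary.json` could aggregate per-model results roughly like this, shown as a Python literal:

```python
# Hypothetical shape of summary.json; all fields and numbers are placeholders.
summary = {
    "benchmark": "waa",
    "run_id": "waa_eval_20241214",
    "models": ["model_a", "model_b"],
    "results": {
        "model_a": {"tasks": 50, "succeeded": 34, "success_rate": 0.68},
        "model_b": {"tasks": 50, "succeeded": 29, "success_rate": 0.58},
    },
}
```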

### Schema Extensions

```python
from __future__ import annotations  # allows the forward reference to ExecutionStep

from dataclasses import dataclass
from typing import List, Optional

from openadapt_ml.schema import Action  # assumed import path; Action is the existing action schema

@dataclass
class BenchmarkTask:
    task_id: str
    name: str
    domain: str                  # e.g., "browser", "file_manager", "settings"
    description: str
    success_criteria: str
    max_steps: int

@dataclass
class BenchmarkExecution:
    task_id: str
    model_id: str                # e.g., "qwen3vl-2b-epoch5", "gpt-4v"
    success: bool
    steps_taken: int
    execution_time: float
    error_message: Optional[str]
    trace: List[ExecutionStep]   # Screenshots + actions

@dataclass
class ExecutionStep:
    step_idx: int
    screenshot_path: str
    action: Action               # From existing schema
    reasoning: Optional[str]
    timestamp: float
```
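
To connect these dataclasses to the directory layout above, a small serialization helper could write one model's trace per task directory. This is a minimal sketch: `save_execution` is a hypothetical name, and the `default=str` fallback assumes `Action` may contain values that are not natively JSON-serializable.

```python
import json
from dataclasses import asdict
from pathlib import Path

def save_execution(execution: BenchmarkExecution, task_dir: Path) -> Path:
    """Persist one model's trace as tasks/<task_id>/<model_id>.json."""
    task_dir.mkdir(parents=True, exist_ok=True)
    out_path = task_dir / f"{execution.model_id}.json"
    # asdict() recurses into nested dataclasses (ExecutionStep, Action);
    # default=str covers any remaining non-JSON-native values.
    out_path.write_text(json.dumps(asdict(execution), indent=2, default=str))
    return out_path
```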

### Viewer Integration

#### 1. Benchmark Dashboard Tab
Add a third tab alongside "Training" and "Viewer":

```
[Training] [Viewer] [Benchmarks]
```

The Benchmarks tab shows:
- Dropdown: Select benchmark run (by date/name)
- Summary metrics: Overall success rate, by-domain breakdown (see the aggregation sketch below)
- Task list with pass/fail status
- Model comparison table
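
The by-domain breakdown can be computed directly from the loaded data rather than stored separately. A minimal sketch, assuming the dataclasses above (the function name is illustrative):

```python
from collections import defaultdict
from typing import Dict, List

def domain_success_rates(
    tasks: Dict[str, BenchmarkTask],
    executions: List[BenchmarkExecution],
) -> Dict[str, float]:
    """Fraction of successful executions per task domain."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for execution in executions:
        domain = tasks[execution.task_id].domain
        totals[domain] += 1
        if execution.success:
            successes[domain] += 1
    return {domain: successes[domain] / totals[domain] for domain in totals}
```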

#### 2. Task Drill-Down View
Clicking a task opens a step-by-step view:
- Same UI as the current Viewer tab
- Model selector: Switch between model executions
- Side-by-side mode: Compare two models simultaneously
- Failure analysis: Highlight where execution diverged

#### 3. Model Comparison Mode
New comparison layout:

```
┌─────────────────┬─────────────────┐
│ Model A         │ Model B         │
│ [Screenshot]    │ [Screenshot]    │
│ Action: CLICK   │ Action: TYPE    │
│ Step 3 of 12    │ Step 3 of 8     │
│ ✓ Succeeded     │ ✗ Failed        │
└─────────────────┴─────────────────┘
```
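
Side-by-side playback needs the two traces aligned by step index, even though the models may take different numbers of steps. A minimal sketch, assuming the dataclasses above (`paired_steps` is a hypothetical helper):

```python
from itertools import zip_longest
from typing import Iterator, Optional, Tuple

def paired_steps(
    a: BenchmarkExecution,
    b: BenchmarkExecution,
) -> Iterator[Tuple[Optional[ExecutionStep], Optional[ExecutionStep]]]:
    """Yield (model_a_step, model_b_step) pairs; None pads the shorter trace."""
    return zip_longest(a.trace, b.trace, fillvalue=None)
```

Padding the shorter trace keeps the playback controls in lockstep even when one model finishes (or fails) earlier than the other.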

### Implementation Plan

#### Phase 1: Data Collection
1. Extend `BenchmarkRunner` to save execution traces
2. Save screenshots at each step during benchmark runs
3. Store structured results in the `benchmark_results/` directory

#### Phase 2: Viewer Backend
1. Add a `load_benchmark_results()` function to trainer.py (sketched below)
2. Generate `benchmark.html` from the results
3. Reuse existing viewer components where possible
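
A rough sketch of what `load_benchmark_results()` could look like against the directory layout above. The traces are returned as plain dicts here for simplicity; parsing them back into the dataclasses is a straightforward follow-up. All names besides the on-disk layout are assumptions:

```python
import json
from pathlib import Path
from typing import Dict, List, Tuple

def load_benchmark_results(benchmark_dir: Path) -> Tuple[dict, Dict[str, List[dict]]]:
    """Load summary metrics and per-task execution traces from disk."""
    summary = json.loads((benchmark_dir / "summary.json").read_text())
    executions: Dict[str, List[dict]] = {}
    for task_dir in sorted((benchmark_dir / "tasks").iterdir()):
        if not task_dir.is_dir():
            continue
        # Every JSON file except task.json is one model's execution trace.
        executions[task_dir.name] = [
            json.loads(path.read_text())
            for path in sorted(task_dir.glob("*.json"))
            if path.name != "task.json"
        ]
    return summary, executions
```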

#### Phase 3: UI Components
1. Benchmark summary dashboard (success rates, charts)
2. Task list with filtering (by domain, status, model)
3. Step-by-step replay with model comparison
4. Export capabilities (CSV, JSON)

#### Phase 4: Analysis Features
1. Failure clustering: Group similar failures
2. Step-level accuracy: Where do models commonly fail?
3. Difficulty estimation: Rank tasks by model success rate (see the sketch after this list)
4. Regression detection: Compare across training runs
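
Difficulty estimation falls out of the same trace data: rank tasks by how often models succeed on them. A minimal sketch, assuming the dataclasses above (the function name is illustrative):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def rank_tasks_by_difficulty(
    executions: List[BenchmarkExecution],
) -> List[Tuple[str, float]]:
    """Return (task_id, success_rate) pairs, hardest tasks first."""
    outcomes: Dict[str, List[bool]] = defaultdict(list)
    for execution in executions:
        outcomes[execution.task_id].append(execution.success)
    rates = {task_id: sum(results) / len(results) for task_id, results in outcomes.items()}
    return sorted(rates.items(), key=lambda item: item[1])
```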

## Integration with Existing Code

### Viewer Generation
The consolidated `_generate_unified_viewer_from_extracted_data()` function can be extended:

```python
from pathlib import Path

def generate_benchmark_viewer(
    benchmark_dir: Path,
    output_path: Path,
) -> None:
    """Generate viewer for benchmark results."""
    # Load benchmark metadata (config, models evaluated)
    metadata = load_benchmark_metadata(benchmark_dir)

    # Load all per-task results and execution traces
    tasks = load_task_results(benchmark_dir)

    # Generate HTML using the same template patterns as the existing viewer
    html = _generate_benchmark_viewer_html(metadata, tasks)
    output_path.write_text(html)
```

### Shared Components
Reuse from the existing viewer:
- Header/nav component (`_get_shared_header_css()`, `_generate_shared_header_html()`)
- Screenshot display with click markers
- Action comparison boxes
- Playback controls

### CLI Integration
```bash
# Run a benchmark and generate the viewer
uv run python -m openadapt_ml.benchmarks.cli run-azure --tasks 10 --viewer

# Generate the viewer from existing results
uv run python -m openadapt_ml.benchmarks.cli viewer benchmark_results/waa_eval_20241214/

# Serve the benchmark viewer
uv run python -m openadapt_ml.cloud.local serve --benchmark benchmark_results/waa_eval_20241214/
```

## Open Questions

1. **Screenshot Storage**: Benchmark runs may produce thousands of screenshots. Cloud storage (S3/Azure Blob) or local storage with lazy loading?

2. **Real-time Updates**: Should the viewer update live during benchmark runs, or only after completion?

3. **Comparison Granularity**: Compare at the task level, the step level, or both?

4. **Historical Tracking**: How do we track model improvement across multiple benchmark runs?

## Dependencies

- Existing viewer consolidation (DONE)
- WAA benchmark adapter (implemented)
- Azure orchestration (implemented)
- Screenshot capture during benchmark runs (needs work)

## Success Metrics

- Users can view benchmark results in the browser
- Side-by-side model comparison works
- Drill-down from summary to step level works
- Performance: Handles 100+ tasks without lag