
Commit f62ba9c

docs: add benchmark viewer integration design and update TODOs

Parent: 6126aeb

File tree: 2 files changed, +230 -11 lines

- CLAUDE.md
- docs/benchmark_viewer_integration.md

CLAUDE.md

Lines changed: 30 additions & 11 deletions
@@ -333,21 +333,20 @@ az ml workspace sync-keys -n openadapt-ml -g openadapt-agents
**Priority**: Low - current dashboard shows key metrics (loss, epoch, step). Terminal output mainly useful for debugging.

### Early Termination Controls
- **Status**: TODO - HIGH PRIORITY
+ **Status**: DONE

**Problem**: Training runs until completion even when loss is low enough. Wastes GPU credits ($0.75/hr for A10).

- **Requirements**:
- 1. **Auto-termination**: Stop training when loss drops below threshold (e.g., 0.5 or configurable)
- 2. **Dashboard button**: "Stop Training" button in dashboard UI that terminates Lambda instance
- 3. **Checkpoint download**: Auto-download best checkpoint before termination
- 4. **Cost awareness**: Show running cost and prompt user when approaching budget
+ **Solution implemented**:
+ 1. **Auto-termination**: `early_stop_loss` and `early_stop_patience` in stub_provider.py
+ 2. **Dashboard button**: "Stop Training" button calls `/api/stop` endpoint
+ 3. **Stop signal**: Creates `STOP_TRAINING` file that training loop checks
+ 4. **Termination status**: Dashboard shows termination reason (auto_complete, auto_low_loss, user_stop)

- **Implementation approach**:
- - Add `early_stop_loss` to training config (already exists but may not terminate instance)
- - Add terminate endpoint that dashboard can call
- - Modify Lambda monitor to download checkpoints on termination
- - Add "Stop Training" button to dashboard config section
+ **Files changed**:
+ - `openadapt_ml/cloud/local.py` - Added `/api/stop` POST endpoint
+ - `openadapt_ml/training/stub_provider.py` - Added early stop logic, termination status
+ - `openadapt_ml/training/trainer.py` - Added `updateTerminationStatus()` JS function

### Cloud Cost Estimation in Viewers
**Status**: TODO
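As context for the Early Termination Controls entry above, here is a minimal sketch of how an `early_stop_loss`/`early_stop_patience` check and the `STOP_TRAINING` marker file could combine into a single termination decision. This is illustrative only, not the actual code in `stub_provider.py`; the function name, signature, and default values are assumptions.

```python
from pathlib import Path


def should_stop(
    losses: list[float],
    early_stop_loss: float = 0.5,        # example threshold; configurable in practice
    early_stop_patience: int = 3,        # consecutive steps below threshold
    stop_file: Path = Path("STOP_TRAINING"),
) -> str | None:
    """Return a termination reason string, or None to keep training."""
    # User-requested stop: the /api/stop endpoint drops a marker file.
    if stop_file.exists():
        return "user_stop"
    # Auto stop: loss has stayed below the threshold for `patience` steps.
    recent = losses[-early_stop_patience:]
    if len(recent) == early_stop_patience and all(v < early_stop_loss for v in recent):
        return "auto_low_loss"
    return None
```

In this sketch the training loop would call `should_stop()` every step and record the returned reason (with `auto_complete` reserved for runs that finish normally), which is what the dashboard then displays.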
@@ -406,3 +405,23 @@ The README §7.1 API-backed adapters section uses correct model names:
Verified:
- API key environment variable names: ANTHROPIC_API_KEY, OPENAI_API_KEY ✓
- Backend flag options: `claude`, `openai` in CLI ✓
+
+ ### Benchmark Viewer Integration
+ **Status**: TODO - HIGH PRIORITY
+
+ **Goal**: Integrate benchmark evaluation results (WAA, WebArena, OSWorld) into the unified viewer.
+
+ **Design doc**: `docs/benchmark_viewer_integration.md`
+
+ **Key features**:
+ 1. **Benchmarks tab**: Third tab alongside Training and Viewer
+ 2. **Task-level view**: List of benchmark tasks with pass/fail status
+ 3. **Step-by-step replay**: Same UI as Viewer tab for benchmark executions
+ 4. **Model comparison**: Side-by-side comparison of different models on same task
+ 5. **Aggregate metrics**: Success rate by domain, difficulty rankings
+
+ **Implementation phases**:
+ 1. Data collection: Save screenshots during benchmark runs
+ 2. Viewer backend: `generate_benchmark_viewer()` function
+ 3. UI components: Summary dashboard, task list, replay
+ 4. Analysis: Failure clustering, regression detection
docs/benchmark_viewer_integration.md

Lines changed: 200 additions & 0 deletions
@@ -0,0 +1,200 @@
# Benchmark Viewer Integration Design

## Overview

This document describes how to integrate benchmark evaluation results (WAA, WebArena, OSWorld) into the unified viewer, enabling side-by-side comparison of model performance across benchmark tasks.

## Current State

### Viewer Capabilities
- Step-by-step screenshot playback
- Human vs model action comparison per step
- Checkpoint switching (None, Epoch 1, Epoch 2, etc.)
- Metrics: accuracy percentage, step count

### Benchmark Module (`openadapt_ml/benchmarks/`)
- `BenchmarkAdapter` interface for different benchmarks
- `WAAAdapter` for Windows Agent Arena
- `AzureWAAOrchestrator` for parallel VM execution
- Produces per-task success/failure results

## Design Goals

1. **Unified Experience**: Same viewer UI for captures and benchmark tasks
2. **Model Comparison**: Compare multiple models on identical tasks
3. **Drill-Down**: From aggregate metrics → task list → step-by-step replay
4. **Actionable Insights**: Identify failure patterns, common errors
## Proposed Architecture

### Data Model

```
benchmark_results/
├── waa_eval_20241214/
│   ├── metadata.json          # Benchmark config, models evaluated
│   ├── tasks/
│   │   ├── task_001/
│   │   │   ├── task.json      # Task definition, success criteria
│   │   │   ├── screenshots/   # Execution screenshots
│   │   │   ├── model_a.json   # Model A's execution trace
│   │   │   └── model_b.json   # Model B's execution trace
│   │   └── task_002/
│   │       └── ...
│   └── summary.json           # Aggregate results
```
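As a minimal sketch of how one model's execution of one task could be persisted into this layout: only the directory and file names come from the tree above; `save_execution` and its signature are hypothetical.

```python
import json
import shutil
from pathlib import Path


def save_execution(run_dir: Path, task_id: str, model_id: str,
                   result: dict, screenshots: list[Path]) -> None:
    """Write one model's trace and screenshots for one task into run_dir."""
    task_dir = run_dir / "tasks" / task_id
    shots_dir = task_dir / "screenshots"
    shots_dir.mkdir(parents=True, exist_ok=True)
    # Copy per-step screenshots next to the traces.
    for shot in screenshots:
        shutil.copy2(shot, shots_dir / shot.name)
    # One JSON trace per model (e.g. model_a.json, model_b.json).
    (task_dir / f"{model_id}.json").write_text(json.dumps(result, indent=2))
```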

### Schema Extensions

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Optional

# `Action` comes from the existing capture schema (import omitted in this sketch).


@dataclass
class BenchmarkTask:
    task_id: str
    name: str
    domain: str  # e.g., "browser", "file_manager", "settings"
    description: str
    success_criteria: str
    max_steps: int


@dataclass
class BenchmarkExecution:
    task_id: str
    model_id: str  # e.g., "qwen3vl-2b-epoch5", "gpt-4v"
    success: bool
    steps_taken: int
    execution_time: float
    error_message: Optional[str]
    trace: List[ExecutionStep]  # Screenshots + actions


@dataclass
class ExecutionStep:
    step_idx: int
    screenshot_path: str
    action: Action  # From existing schema
    reasoning: Optional[str]
    timestamp: float
```
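To illustrate how these dataclasses could back the planned aggregate metrics (success rate by domain, difficulty ranking), here is a small sketch; `success_rate_by_domain` is illustrative, not an existing function.

```python
from collections import defaultdict


def success_rate_by_domain(
    tasks: list[BenchmarkTask],
    executions: list[BenchmarkExecution],
) -> dict[str, float]:
    """Fraction of executions that succeeded, grouped by task domain."""
    domain_of = {t.task_id: t.domain for t in tasks}
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # domain -> [passed, total]
    for ex in executions:
        bucket = totals[domain_of[ex.task_id]]
        bucket[0] += int(ex.success)
        bucket[1] += 1
    return {domain: passed / total for domain, (passed, total) in totals.items()}
```

Filtering `executions` to a single `model_id` before calling gives per-model breakdowns; sorting tasks by their per-task success rate gives a simple difficulty ranking.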

### Viewer Integration

#### 1. Benchmark Dashboard Tab
Add a third tab alongside "Training" and "Viewer":

```
[Training] [Viewer] [Benchmarks]
```

The Benchmarks tab shows:
- Dropdown: Select benchmark run (by date/name)
- Summary metrics: Overall success rate, by-domain breakdown
- Task list with pass/fail status
- Model comparison table

#### 2. Task Drill-Down View
Clicking a task opens a step-by-step view:
- Same UI as current Viewer tab
- Model selector: Switch between model executions
- Side-by-side mode: Compare two models simultaneously
- Failure analysis: Highlight where execution diverged

#### 3. Model Comparison Mode
New comparison layout:

```
┌─────────────────┬─────────────────┐
│ Model A         │ Model B         │
│ [Screenshot]    │ [Screenshot]    │
│ Action: CLICK   │ Action: TYPE    │
│ Step 3 of 12    │ Step 3 of 8     │
│ ✓ Succeeded     │ ✗ Failed        │
└─────────────────┴─────────────────┘
```

### Implementation Plan

#### Phase 1: Data Collection
1. Extend `BenchmarkRunner` to save execution traces
2. Save screenshots at each step during benchmark runs
3. Store structured results in the `benchmark_results/` directory

#### Phase 2: Viewer Backend
1. Add `load_benchmark_results()` function to trainer.py
2. Generate `benchmark.html` from results
3. Reuse existing viewer components where possible

#### Phase 3: UI Components
1. Benchmark summary dashboard (success rates, charts)
2. Task list with filtering (by domain, status, model)
3. Step-by-step replay with model comparison
4. Export capabilities (CSV, JSON)

#### Phase 4: Analysis Features
1. Failure clustering: Group similar failures
2. Step-level accuracy: Where do models commonly fail?
3. Difficulty estimation: Rank tasks by model success rate
4. Regression detection: Compare across training runs
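
For the regression-detection item above, the simplest possible check could look like the sketch below, assuming each run can be reduced to a `{task_id: success}` mapping for a given model (the exact `summary.json` schema is left open by this design); `find_regressions` is hypothetical.

```python
def find_regressions(
    previous: dict[str, bool],
    current: dict[str, bool],
) -> list[str]:
    """Task IDs that passed in the previous run but fail in the current one."""
    return [
        task_id
        for task_id, passed in previous.items()
        if passed and current.get(task_id) is False
    ]
```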

## Integration with Existing Code

### Viewer Generation
The consolidated `_generate_unified_viewer_from_extracted_data()` function can be extended:

```python
def generate_benchmark_viewer(
    benchmark_dir: Path,
    output_path: Path,
) -> None:
    """Generate viewer for benchmark results."""
    # Load benchmark metadata
    metadata = load_benchmark_metadata(benchmark_dir)

    # Load all task results
    tasks = load_task_results(benchmark_dir)

    # Generate HTML using same template patterns
    html = _generate_benchmark_viewer_html(metadata, tasks)
    output_path.write_text(html)
```

### Shared Components
Reuse from the existing viewer:
- Header/nav component (`_get_shared_header_css()`, `_generate_shared_header_html()`)
- Screenshot display with click markers
- Action comparison boxes
- Playback controls

### CLI Integration
```bash
# Run benchmark and generate viewer
uv run python -m openadapt_ml.benchmarks.cli run-azure --tasks 10 --viewer

# Generate viewer from existing results
uv run python -m openadapt_ml.benchmarks.cli viewer benchmark_results/waa_eval_20241214/

# Serve benchmark viewer
uv run python -m openadapt_ml.cloud.local serve --benchmark benchmark_results/waa_eval_20241214/
```

## Open Questions

1. **Screenshot Storage**: Benchmark runs may produce thousands of screenshots. Cloud storage (S3/Azure Blob) or local with lazy loading?

2. **Real-time Updates**: Should the viewer update live during benchmark runs, or only after completion?

3. **Comparison Granularity**: Compare at task level, step level, or both?

4. **Historical Tracking**: How to track model improvement across multiple benchmark runs?

## Dependencies

- Existing viewer consolidation (DONE)
- WAA benchmark adapter (implemented)
- Azure orchestration (implemented)
- Screenshot capture during benchmark runs (needs work)

## Success Metrics

- Users can view benchmark results in the browser
- Side-by-side model comparison works
- Drill-down from summary to step level works
- Performance: Handles 100+ tasks without lag
