
Commit afb5bb8

fix(viewer): extract predictions from window.comparisonData and fix elapsed time loading
1 parent a31dbf9 commit afb5bb8

File tree: 6 files changed, +1135 −51 lines

CLAUDE.md

Lines changed: 224 additions & 40 deletions
@@ -61,12 +61,47 @@ openadapt-ml is a model-agnostic, domain-agnostic ML engine for GUI automation a

The benchmark integration module is implemented in `openadapt_ml/benchmarks/`:
- `base.py` - BenchmarkAdapter interface, data classes
-- `agent.py` - BenchmarkAgent, PolicyAgent, ScriptedAgent, RandomAgent
+- `agent.py` - BenchmarkAgent, PolicyAgent, APIBenchmarkAgent, ScriptedAgent, RandomAgent
- `runner.py` - evaluate_agent_on_benchmark(), compute_metrics()
- `waa.py` - WAAAdapter (requires WAA repo), WAAMockAdapter (for testing)
- `azure.py` - AzureWAAOrchestrator for parallel VM execution
- `cli.py` - Command-line interface for WAA evaluation

### APIBenchmarkAgent

The `APIBenchmarkAgent` wraps hosted VLM APIs (Claude, GPT-4V) for benchmark evaluation baselines. This enables comparing fine-tuned models against off-the-shelf VLMs.

```python
from openadapt_ml.benchmarks import APIBenchmarkAgent, evaluate_agent_on_benchmark

# Claude baseline
agent = APIBenchmarkAgent(provider="anthropic")
results = evaluate_agent_on_benchmark(agent, adapter)

# GPT-4V baseline
agent = APIBenchmarkAgent(provider="openai")
results = evaluate_agent_on_benchmark(agent, adapter)
```

CLI usage:
```bash
# Run Claude evaluation on mock tasks
uv run python -m openadapt_ml.benchmarks.cli run-api --provider anthropic --tasks 5

# Run GPT-4V evaluation
uv run python -m openadapt_ml.benchmarks.cli run-api --provider openai --tasks 5

# Disable accessibility tree in prompts
uv run python -m openadapt_ml.benchmarks.cli run-api --no-a11y --tasks 5
```

The agent:
- Converts BenchmarkObservation to API format (screenshot + structured prompt)
- Parses VLM responses into BenchmarkActions using regex patterns
- Supports CLICK(x,y), CLICK([id]), TYPE("text"), KEY(key), SCROLL(dir), DONE()
- Stores raw VLM responses in `action.raw_action` for debugging
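
The parsing code itself lives in `agent.py` and is not shown here; the following is a minimal, hypothetical sketch of regex-based parsing for the action formats listed above (pattern names and the function signature are illustrative assumptions, not the module's actual API).

```python
import re

# Illustrative patterns for the action grammar listed above; the real
# implementation in agent.py may structure this differently.
ACTION_PATTERNS = {
    "click_xy": re.compile(r"CLICK\(\s*(\d+)\s*,\s*(\d+)\s*\)"),
    "click_id": re.compile(r"CLICK\(\[([\w-]+)\]\)"),
    "type": re.compile(r'TYPE\("([^"]*)"\)'),
    "key": re.compile(r"KEY\((\w+)\)"),
    "scroll": re.compile(r"SCROLL\((\w+)\)"),
    "done": re.compile(r"DONE\(\)"),
}

def parse_vlm_response(text: str) -> dict:
    """Map a raw VLM response to a simple action dict (sketch only)."""
    for name, pattern in ACTION_PATTERNS.items():
        match = pattern.search(text)
        if match:
            # Keep the raw response around for debugging, mirroring raw_action.
            return {"type": name, "args": match.groups(), "raw_action": text}
    return {"type": "unknown", "args": (), "raw_action": text}

print(parse_vlm_response("I will click the OK button. CLICK(120, 245)"))
# -> {'type': 'click_xy', 'args': ('120', '245'), 'raw_action': '...'}
```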
### Azure Automation

`scripts/setup_azure.py` fully automates Azure setup with 15 steps:
@@ -170,6 +205,48 @@ uv run python -m openadapt_ml.scripts.compare \

## Benchmark Data Collection & Testing

```bash
# Test benchmark data collection (Phase 1)
# Creates directory structure with screenshots, execution traces, and metadata
uv run python -m openadapt_ml.benchmarks.cli test-collection --tasks 5

# Custom run name and output directory
uv run python -m openadapt_ml.benchmarks.cli test-collection \
  --tasks 10 \
  --run-name my_test_run \
  --output benchmark_results \
  --model-id "my-agent-v1"

# Run the standalone test script (equivalent to test-collection)
uv run python test_data_collection.py
```

**Output directory structure:**
```
benchmark_results/
├── {run_name}/
│   ├── metadata.json           # Benchmark name, model ID, timestamp
│   ├── summary.json            # Aggregate metrics (success rate, avg steps)
│   └── tasks/
│       ├── task_001/
│       │   ├── task.json       # Task definition
│       │   ├── execution.json  # Execution trace with steps
│       │   └── screenshots/
│       │       ├── step_000.png
│       │       ├── step_001.png
│       │       └── ...
│       └── task_002/
│           └── ...
```

**Key files:**
- `execution.json`: Contains step-by-step trace with actions, reasoning, timestamps
- `task.json`: Task definition with instruction, domain, time limits
- `summary.json`: High-level metrics suitable for benchmark viewer
- `screenshots/`: PNG screenshots at each step
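
As an illustration of how these files might be consumed downstream (for example by a future benchmark viewer), here is a hedged sketch; the JSON keys `instruction` and `steps` are assumptions based on the descriptions above, not a guaranteed schema.

```python
import json
from pathlib import Path

def summarize_run(run_dir: Path) -> None:
    """Print a quick overview of one benchmark run directory (sketch only)."""
    summary = json.loads((run_dir / "summary.json").read_text())
    print("summary:", summary)  # aggregate metrics, e.g. success rate, avg steps

    for task_dir in sorted((run_dir / "tasks").glob("task_*")):
        task = json.loads((task_dir / "task.json").read_text())
        execution = json.loads((task_dir / "execution.json").read_text())
        n_screenshots = len(list((task_dir / "screenshots").glob("step_*.png")))
        print(
            task_dir.name,
            task.get("instruction"),          # assumed key
            len(execution.get("steps", [])),  # assumed key for the step trace
            "steps,",
            n_screenshots,
            "screenshots",
        )

summarize_run(Path("benchmark_results") / "my_test_run")  # run name from the example above
```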

## Viewer Setup Troubleshooting

**Problem**: Viewer shows "No model loaded" after training.
@@ -266,12 +343,17 @@ The training dashboard and capture viewer share UI components for visual consist
## TODO / Known Issues

### PyPI Publishing
-**Status**: TODO
+**Status**: DONE

-openadapt-capture and openadapt-privacy are published to PyPI, but openadapt-ml is not yet. Should set up:
-- Package metadata in pyproject.toml
-- GitHub Actions workflow for publishing
-- Version management
+Completed by background agent:
+- Updated `pyproject.toml` with package metadata (description, authors, classifiers, URLs, license)
+- Created `LICENSE` (MIT, matching related projects)
+- Created `.github/workflows/publish.yml` for automated PyPI publishing on version tags
+- Build system: hatchling
+
+To publish:
+1. Set up PyPI trusted publishing (PyPI → Account Settings → Publishing)
+2. `git tag v0.1.0 && git push origin v0.1.0`

### Azure WAA Evaluation - ACR Auth Issue
**Status**: FIXED - setup_azure.py now configures ACR authentication automatically
@@ -316,21 +398,36 @@ az ml workspace sync-keys -n openadapt-ml -g openadapt-agents
- [ACR Pull Role Assignment](https://learn.microsoft.com/en-us/azure/container-registry/container-registry-authentication-managed-identity)

### Training Dashboard - Terminal Output Streaming
-**Status**: TODO - nice to have
+**Status**: DONE

**Goal**: Show training command line output in the browser dashboard in real-time.

-**Possible approaches**:
-1. **File-based polling** (simplest): Training writes stdout to `training_output/training.log`, browser polls and displays in a `<pre>` element with auto-scroll
-2. **WebSocket**: Run training in subprocess, stream stdout via WebSocket server to browser
-3. **Server-sent events (SSE)**: Similar to WebSocket but simpler, one-way streaming
-
-**Recommended**: File-based polling is simplest and consistent with current JSON polling approach. Add:
-- `--log-stdout` flag to train.py that tees output to training.log
-- Add scrollable terminal panel to dashboard.html
-- Poll training.log alongside training_log.json
+**Implementation**: File-based polling approach
+1. Training writes stdout to `training_output/training.log` with timestamps
+2. Browser polls training.log every 2 seconds alongside training_log.json
+3. Displays last 500 lines in scrollable terminal panel with auto-scroll
+4. Terminal panel features:
+   - Dark terminal theme (black background, green/colored text)
+   - Auto-scroll toggle (on by default)
+   - Text wrap toggle
+   - Collapse/expand button
+   - Line counter
+   - Syntax highlighting (errors in red, warnings in orange, success in green)

-**Priority**: Low - current dashboard shows key metrics (loss, epoch, step). Terminal output mainly useful for debugging.
+**Files changed**:
+- `openadapt_ml/training/trainer.py`:
+  - Added terminal panel CSS styles
+  - Added terminal panel HTML section
+  - Added JavaScript polling function `fetchTerminalOutput()`
+  - Added `TrainingLogger._log_to_terminal()` method
+  - Updated `train_supervised()` to log key messages to training.log
+- `openadapt_ml/training/stub_provider.py`:
+  - Added `_log()` method for dual stdout/file logging
+  - All training output now written to training.log
+- `openadapt_ml/cloud/local.py`:
+  - No changes needed - serve command already serves all files from training_output
+
+**Usage**: Terminal output automatically appears in dashboard during training. Works with both stub and real training.
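
The actual logger lives in `trainer.py` and `stub_provider.py`; the following is only a rough sketch of the dual stdout/file pattern described above (function name, timestamp format, and log path handling are illustrative).

```python
import sys
from datetime import datetime
from pathlib import Path

LOG_PATH = Path("training_output/training.log")  # path polled by the dashboard

def log_line(message: str) -> None:
    """Write a timestamped line to stdout and append it to training.log."""
    line = f"[{datetime.now():%Y-%m-%d %H:%M:%S}] {message}"
    print(line, file=sys.stdout, flush=True)
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(line + "\n")

log_line("epoch 1/10 started")
log_line("loss=0.4213 step=120")
```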

### Early Termination Controls
**Status**: DONE
@@ -349,19 +446,16 @@ az ml workspace sync-keys -n openadapt-ml -g openadapt-agents
- `openadapt_ml/training/trainer.py` - Added `updateTerminationStatus()` JS function

### Cloud Cost Estimation in Viewers
-**Status**: TODO
-
-**Goal**: Show estimated cloud costs in training dashboard and viewer.
+**Status**: DONE

-**Requirements**:
-- Display running cost based on instance type and elapsed time
-- Show estimated total cost for completion
-- Include cost breakdown by resource type (GPU, storage, transfer)
+Added cost display panel to viewer that shows:
+- Running cost based on instance type and elapsed time
+- Instance type and hourly rate
+- Only visible for cloud training (hidden for local/stub)

-**Implementation notes**:
+Supported rates:
- Lambda Labs: $0.75/hr for A10, $1.29/hr for A100
-- Azure ML: Variable based on VM type
-- Should be visible in both dashboard.html and viewer.html
+- Automatic detection from `instance_type` in training_log.json
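
The cost arithmetic is straightforward; a minimal sketch using the rates above (the instance-type keys are assumptions — the real lookup is keyed off `instance_type` in training_log.json).

```python
# Hourly rates from the notes above; the instance-type names are illustrative.
HOURLY_RATES_USD = {
    "gpu_1x_a10": 0.75,   # Lambda Labs A10
    "gpu_1x_a100": 1.29,  # Lambda Labs A100
}

def running_cost(instance_type: str, elapsed_seconds: float) -> float | None:
    """Return the running cost in USD, or None for local/stub training."""
    rate = HOURLY_RATES_USD.get(instance_type)
    if rate is None:
        return None  # unknown or local instance type: hide the cost panel
    return rate * elapsed_seconds / 3600.0

print(running_cost("gpu_1x_a10", 5400))  # 1.5 hours on an A10 -> 1.125
```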

### Current Working Capture
**Path**: `/Users/abrichr/oa/src/openadapt-capture/turn-off-nightshift`
@@ -370,15 +464,27 @@ az ml workspace sync-keys -n openadapt-ml -g openadapt-agents
**Notes**: Real-world macOS settings navigation capture for training/evaluation

### Evaluation Samples Display Enhancement
-**Status**: TODO - needs fleshing out
+**Status**: DONE

-**Current state**: Shows human/predicted coords, model thinking text, legend
-**Future improvements**:
-- Show the actual screenshot image (need to sync from Lambda or embed base64)
-- Visual overlay showing click positions on image
-- Side-by-side human vs predicted action comparison
-- Full model output (not truncated)
-- Filter/search evaluations by epoch or correctness
+Enhanced evaluation gallery in dashboard with:
+- **Filter controls**: Dropdown filters for epoch and correctness (All/Correct/Incorrect)
+- **Visual markers**: H (human) and AI (predicted) click markers on screenshots
+- **Expandable model output**: "Show full output" toggle for raw model reasoning
+- **Better layout**: Image container with overlay, content section with coordinates
+- **Sample count**: "Showing X of Y samples" with filter status
+
+Files changed:
+- `openadapt_ml/training/trainer.py` - Enhanced CSS, HTML, and JS for eval gallery

### Viewer Playback Controls
**Status**: DONE

Added full playback controls to the viewer:
- **Buttons**: ⏮ Rewind, ◀ Prev, ▶ Play/Pause, ▶ Next, ⏭ End
- **Speed control**: 0.5x, 1x, 2x, 4x playback speeds
- **Progress bar**: Click-to-seek to any step
- **Keyboard shortcuts**: Space (play/pause), Home/End (jump), Arrow keys (step)
- **Enhanced details panel**: Shows full model output with scrollable raw prediction data

### Viewer Code Consolidation
**Status**: DONE
@@ -407,7 +513,7 @@ Verified:
- Backend flag options: `claude`, `openai` in CLI ✓

### Benchmark Viewer Integration
-**Status**: TODO - HIGH PRIORITY
+**Status**: Phase 1 DONE, Phases 2-4 TODO

**Goal**: Integrate benchmark evaluation results (WAA, WebArena, OSWorld) into the unified viewer.

@@ -421,7 +527,85 @@ Verified:
5. **Aggregate metrics**: Success rate by domain, difficulty rankings

**Implementation phases**:
-1. Data collection: Save screenshots during benchmark runs
-2. Viewer backend: `generate_benchmark_viewer()` function
-3. UI components: Summary dashboard, task list, replay
-4. Analysis: Failure clustering, regression detection
+1. **Data collection** (DONE): Save screenshots during benchmark runs
+   - Created `openadapt_ml/benchmarks/data_collection.py` with `ExecutionTraceCollector`
+   - Updated `runner.py` to save execution traces automatically
+   - Added CLI command: `uv run python -m openadapt_ml.benchmarks.cli test-collection --tasks 5`
+   - Directory structure: `benchmark_results/{run_name}/tasks/{task_id}/`
+   - Each task has: `task.json`, `execution.json`, `screenshots/`
+   - Test script: `test_data_collection.py` validates all files are created
+2. **Viewer backend** (TODO): `generate_benchmark_viewer()` function
+3. **UI components** (TODO): Summary dashboard, task list, replay
+4. **Analysis** (TODO): Failure clustering, regression detection

**Phase 1 verification:**
```bash
# Test data collection
uv run python -m openadapt_ml.benchmarks.cli test-collection --tasks 5

# Verify output
ls -la benchmark_results/{run_name}/tasks/task_001/
# Should contain: task.json, execution.json, screenshots/

# Check JSON structure
cat benchmark_results/{run_name}/summary.json
cat benchmark_results/{run_name}/tasks/task_001/execution.json
```
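
A hedged Python equivalent of the verification commands above, checking only the files this document names (the run name is hypothetical).

```python
from pathlib import Path

def verify_task_dir(task_dir: Path) -> bool:
    """Check that a task directory contains the Phase 1 artifacts."""
    ok = all((task_dir / name).is_file() for name in ("task.json", "execution.json"))
    ok = ok and (task_dir / "screenshots").is_dir()
    print(task_dir, "OK" if ok else "MISSING FILES")
    return ok

run_dir = Path("benchmark_results") / "my_test_run"  # hypothetical run name
all_ok = all(verify_task_dir(d) for d in sorted((run_dir / "tasks").glob("task_*")))
print("run OK:", all_ok)
```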
## Preventing Stale Data Issues

**CRITICAL**: When working on dashboard/viewer code, follow this process to avoid showing stale data:

### After Code Changes

1. **Always regenerate HTML files** after modifying trainer.py, viewer.py, or local.py:
   ```bash
   uv run python -m openadapt_ml.cloud.local viewer
   ```

2. **Verify regeneration worked** by checking key values:
   ```bash
   # Check elapsed time was updated (should NOT be 0)
   grep "baseElapsedTime" training_output/current/dashboard.html

   # Check comparison data exists in viewer
   grep "predictionsByCheckpoint" training_output/current/viewer.html
   ```

3. **Hard refresh browser** to bypass cache:
   - macOS: `Cmd+Shift+R`
   - Windows/Linux: `Ctrl+Shift+R`
   - Or use DevTools → Network → "Disable cache" checkbox

4. **Use HTTP serving** (not file://) for auto-refresh:
   ```bash
   uv run python -m openadapt_ml.cloud.local serve --port 8080 --open
   ```

### Before Showing User

Before presenting the dashboard/viewer to the user, verify:
- [ ] Elapsed time shows correct value (not 0m 0s)
- [ ] Comparison screenshots load (not blank/404)
- [ ] Model predictions appear in dropdown
- [ ] Loss curve shows data
- [ ] Timestamp info panel shows recent dates

### Automatic Data Loading Checklist

The viewer should automatically load:
- [ ] Capture data from `comparison_epoch*.html` files (extracts `window.comparisonData`)
- [ ] Predictions from the same comparison HTML files (human + predicted actions per step)
- [ ] Evaluations from `training_log.json` (if present)
- [ ] Recording events from capture data (note: `recording.end` depends on capture source)
602+
### Common Issues
603+
604+
| Symptom | Cause | Fix |
605+
|---------|-------|-----|
606+
| Elapsed time shows 0m 0s | `elapsed_time` not loaded from training_log.json | Check `state.elapsed_time = data.get("elapsed_time", 0.0)` in local.py |
607+
| No comparison screenshots | Paths point to Lambda not local | Update `capture_path` in training_log.json to local path |
608+
| Missing model predictions | No `comparison_epoch*.html` files or wrong data format | Run compare script: `uv run python -m openadapt_ml.scripts.compare --capture ... --checkpoint ...` |
609+
| Predictions not extracted | HTML uses `window.comparisonData` but regex expects `const` | Use regex `(?:const\s+\|window\.)comparisonData` pattern |
610+
| Stale data after code change | Browser caching HTML | Hard refresh (Cmd+Shift+R) or disable cache |
611+
| Screenshots 404 | Screenshot symlink broken | Recreate: `ln -sf /path/to/capture/screenshots training_output/current/screenshots` |
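
The extraction fix referenced in the commit message amounts to accepting both declaration styles; below is a minimal sketch (the real code lives in the viewer-generation path and may bound the JSON payload differently).

```python
import json
import re

# Accept both `const comparisonData = {...};` and `window.comparisonData = {...};`.
# Greedy matching up to the last `};` is good enough for a single embedded object;
# a robust implementation would balance braces or parse the script block properly.
PATTERN = re.compile(
    r"(?:const\s+|window\.)comparisonData\s*=\s*(\{.*\})\s*;",
    re.DOTALL,
)

def extract_comparison_data(html: str) -> dict | None:
    """Pull the embedded comparison JSON out of a comparison_epoch*.html file."""
    match = PATTERN.search(html)
    return json.loads(match.group(1)) if match else None

html = '<script>window.comparisonData = {"steps": [{"human": {"x": 1}}]};</script>'
print(extract_comparison_data(html))  # -> {'steps': [{'human': {'x': 1}}]}
```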
