
Commit 7887890

feat: v0.1.0 release with benchmark integration, grounding module, and cloud training

1 parent afb5bb8


61 files changed: +8651 −200 lines

.github/workflows/publish.yml

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+name: Publish to PyPI
+
+on:
+  push:
+    tags:
+      - 'v*'
+
+jobs:
+  publish:
+    runs-on: ubuntu-latest
+    permissions:
+      id-token: write  # Required for trusted publishing
+      contents: read   # Required for checkout
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v4
+        with:
+          version: "latest"
+
+      - name: Set up Python
+        run: uv python install 3.12
+
+      - name: Build package
+        run: uv build
+
+      - name: Publish to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -31,8 +31,11 @@ synthetic_train_dev/
 # Ephemeral training/eval artifacts
 logs/
 checkpoints/
+checkpoints_*/
 /plots/
 eval_*.json
+benchmark_results/
+debug_*/

 # Internal documentation (not for public repo)
 docs/internal/

CLAUDE.md

Lines changed: 14 additions & 5 deletions
@@ -16,7 +16,7 @@ openadapt-ml is a model-agnostic, domain-agnostic ML engine for GUI automation a
 **Primary benchmark**: Windows Agent Arena (WAA)
 - 154 tasks across 11 Windows domains
 - MIT licensed, can run locally or on Azure
-- SOTA: ~19.5% success (GPT-4V + OmniParser)
+- SOTA: ~19.5% success (GPT-5.1 + OmniParser)

 **Future benchmarks** (not yet implemented):
 - WebArena/VisualWebArena (browser)
@@ -42,7 +42,16 @@ openadapt-ml is a model-agnostic, domain-agnostic ML engine for GUI automation a
 - Cloud providers: Azure (primary, free tier available), Lambda Labs (GPU rental)
 - See `docs/live_inference_design.md` for async inference architecture

-7. **Stub Training Adapter (HIGH PRIORITY)** - Always implement stub/mock providers first:
+7. **Schema Purity** - The schema must remain domain-agnostic and generic:
+   - **External systems adapt TO the schema**, not the other way around
+   - Never add fields to accommodate specific external data structures
+   - Data transformation belongs in importers/exporters, not core schema
+   - Use `raw` and `metadata` dict fields for integration-specific data
+   - If a proposed field feels specific to one use case, it doesn't belong in the schema
+   - This is a standard open-source library: users import and call functions, they don't shape the API
+   - See `openadapt_ml/schemas/` for canonical definitions
+
+8. **Stub Training Adapter (HIGH PRIORITY)** - Always implement stub/mock providers first:
    - **Never wait on real training to test UI/code changes**
    - Use `--stub` flag to simulate training progress without GPU
    - Generates fake loss curves, evaluations, checkpoints in seconds
@@ -69,7 +78,7 @@ The benchmark integration module is implemented in `openadapt_ml/benchmarks/`:

 ### APIBenchmarkAgent

-The `APIBenchmarkAgent` wraps hosted VLM APIs (Claude, GPT-4V) for benchmark evaluation baselines.
+The `APIBenchmarkAgent` wraps hosted VLM APIs (Claude, GPT-5.1) for benchmark evaluation baselines.
 This enables comparing fine-tuned models against off-the-shelf VLMs.

 ```python
@@ -79,7 +88,7 @@ from openadapt_ml.benchmarks import APIBenchmarkAgent, evaluate_agent_on_benchma
 agent = APIBenchmarkAgent(provider="anthropic")
 results = evaluate_agent_on_benchmark(agent, adapter)

-# GPT-4V baseline
+# GPT-5.1 baseline
 agent = APIBenchmarkAgent(provider="openai")
 results = evaluate_agent_on_benchmark(agent, adapter)
 ```
@@ -89,7 +98,7 @@ CLI usage:
 # Run Claude evaluation on mock tasks
 uv run python -m openadapt_ml.benchmarks.cli run-api --provider anthropic --tasks 5

-# Run GPT-4V evaluation
+# Run GPT-5.1 evaluation
 uv run python -m openadapt_ml.benchmarks.cli run-api --provider openai --tasks 5

 # Disable accessibility tree in prompts
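
The Schema Purity rule added above is easiest to see in code. Below is a minimal sketch of the pattern it describes — a generic event type whose `raw` and `metadata` dict fields absorb integration-specific data, plus an importer that adapts external data to the schema rather than the reverse. The `UIEvent` class and `import_capture_event` helper are illustrative names only, not the actual definitions in `openadapt_ml/schemas/`:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class UIEvent:
    """Hypothetical domain-agnostic GUI event (illustrative, not the real schema)."""
    action: str                     # e.g. "click", "type"
    x: float | None = None          # pointer coordinates, if any
    y: float | None = None
    text: str | None = None
    # Integration-specific data lives in dicts, never in new top-level fields.
    raw: dict[str, Any] = field(default_factory=dict)
    metadata: dict[str, Any] = field(default_factory=dict)


def import_capture_event(src: dict[str, Any]) -> UIEvent:
    """An importer adapts the external format TO the schema, not vice versa."""
    return UIEvent(
        action=src["event_type"],
        x=src.get("mouse_x"),
        y=src.get("mouse_y"),
        raw=src,                                   # original payload, preserved
        metadata={"source": "openadapt-capture"},  # provenance, not a schema field
    )
```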

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 MLDSAI Inc.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md

Lines changed: 11 additions & 11 deletions
@@ -87,8 +87,7 @@ uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
 This will evaluate and plot **Qwen3 base**, **Qwen3 FT**, **Claude Sonnet 4.5**,
 and **GPT-5.1** on the same synthetic login benchmark.

-For more details on configs, adapters, and evaluation metrics, see the sections
-below and `docs/state_and_next_steps_qwen_login.md`.
+For complete documentation including training setup, evaluation metrics, SoM mode results, and reproduction instructions, see **[`docs/qwen_login_experiment.md`](docs/qwen_login_experiment.md)**. For implementation details and technical notes, see `docs/state_and_next_steps_qwen_login.md`.

 ---

@@ -584,32 +583,33 @@ In particular:

 ---

-## 10. Training on Real Captures
+## 10. Training on Real Data

-OpenAdapt-ML can train on real GUI recordings captured with [openadapt-capture](https://github.com/OpenAdaptAI/openadapt-capture).
+OpenAdapt-ML supports training on real GUI recordings from two sources:
+1. **openadapt-capture** - New lightweight recording format
+2. **OpenAdapt database** - Original OpenAdapt recordings (legacy)

-### 10.1 Capture a workflow
+### 10.1 Training on openadapt-capture recordings
+
+[openadapt-capture](https://github.com/OpenAdaptAI/openadapt-capture) is a lightweight GUI recording tool.

 ```bash
 # Install openadapt-capture
 uv pip install openadapt-capture

 # Record a workflow (e.g., turning off Night Shift)
 openadapt-capture record --output ~/captures/turn-off-nightshift
-```

-### 10.2 Train on the capture
-
-```bash
+# Train on the capture
 uv run python -m openadapt_ml.scripts.train \
-    --config configs/qwen3vl_capture_4bit.yaml \
+    --config configs/qwen3vl_capture.yaml \
     --capture ~/captures/turn-off-nightshift \
     --open  # Opens training dashboard in browser
 ```

 The goal is automatically derived from the directory name (e.g., `"Turn off nightshift"`).

-### 10.3 Compare human vs AI predictions
+### 10.2 Compare human vs AI predictions

 ```bash
 uv run python -m openadapt_ml.scripts.compare \
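
The README's claim that the goal is "automatically derived from the directory name" presumably amounts to a small filename-to-text transform. A minimal sketch of that idea, using a hypothetical helper (the real logic in `openadapt_ml.scripts.train` may differ):

```python
from pathlib import Path


def goal_from_capture_dir(capture_dir: str) -> str:
    """Hypothetical sketch: turn a capture directory name into a goal string."""
    name = Path(capture_dir).expanduser().name  # "turn-off-nightshift"
    return name.replace("-", " ").capitalize()  # "Turn off nightshift"


print(goal_from_capture_dir("~/captures/turn-off-nightshift"))  # Turn off nightshift
```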

configs/qwen3vl_capture.yaml

Lines changed: 5 additions & 3 deletions
@@ -24,6 +24,8 @@ training:
   weight_decay: 0.01
   max_grad_norm: 0.5
   logging_steps: 1
-  # Early stopping: stop if loss stays below threshold
-  early_stop_loss: 0.01
-  early_stop_patience: 20
+  lr_scheduler_type: linear
+  # Early stopping: stop when loss <= 1.0 (INVARIANT: training should never continue past this)
+  # Loss <= 1.0 indicates the model has learned the task; further training is diminishing returns
+  early_stop_loss: 1.0
+  early_stop_patience: 5
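
The `early_stop_loss`/`early_stop_patience` pair encodes "stop once the loss has stayed at or below the threshold for N consecutive steps." A minimal sketch of that check under the semantics described in the config comments (illustrative; the enforced version lives in the training loop in `openadapt_ml/training/trainer.py`):

```python
def should_early_stop(losses: list[float],
                      early_stop_loss: float = 1.0,
                      early_stop_patience: int = 5) -> bool:
    """True once the last `early_stop_patience` losses are all <= early_stop_loss."""
    if len(losses) < early_stop_patience:
        return False
    return all(l <= early_stop_loss for l in losses[-early_stop_patience:])


# Example: threshold 1.0, patience 5
history = [2.3, 1.8, 1.2, 0.9, 0.95, 0.8, 0.7, 0.6]
print(should_early_stop(history))  # True: the last 5 losses are all <= 1.0
```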

configs/qwen3vl_capture_4bit.yaml

Lines changed: 4 additions & 3 deletions
@@ -23,7 +23,8 @@ training:
   weight_decay: 0.01
   max_grad_norm: 0.5
   logging_steps: 1
-  # Early stopping: stop when loss < threshold for N consecutive steps
-  early_stop_loss: 0.1
-  early_stop_patience: 10
+  # Early stopping: stop when loss <= 1.0 (INVARIANT: training should never continue past this)
+  # Loss <= 1.0 indicates the model has learned the task; further training is diminishing returns
+  early_stop_loss: 1.0
+  early_stop_patience: 5
   # Gradient checkpointing handled in trainer.py
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+model:
+  name: Qwen/Qwen3-VL-2B-Instruct
+  load_in_4bit: true  # Enable 4-bit quantization to reduce memory
+  # Image resolution: 512x512 = 262144 pixels for faster training (default is huge)
+  max_pixels: 262144
+
+lora:
+  r: 8
+  lora_alpha: 16
+  lora_dropout: 0.05
+  bias: none
+  target_modules:
+    - q_proj
+    - v_proj
+  task_type: CAUSAL_LM
+  weights_path: checkpoints/qwen3vl2b_capture_lora_batched
+
+training:
+  num_train_epochs: 5
+  per_device_train_batch_size: 4  # Batching enabled!
+  gradient_accumulation_steps: 1
+  learning_rate: 5.0e-5
+  warmup_ratio: 0.1
+  weight_decay: 0.01
+  max_grad_norm: 0.5
+  logging_steps: 1
+  # Learning rate scheduler: linear, cosine, constant, or none
+  lr_scheduler_type: linear
+  # Early stopping: stop when loss <= 1.0 (INVARIANT: training should never continue past this)
+  # Loss <= 1.0 indicates the model has learned the task; further training is diminishing returns
+  early_stop_loss: 1.0
+  early_stop_patience: 5
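
For readers unfamiliar with how a `lora:` section like this typically takes effect: below is a minimal sketch of the mapping onto Hugging Face `peft`, assuming the trainer forwards these keys roughly 1:1 to `LoraConfig` (illustrative; the actual wiring in `openadapt_ml/training/trainer.py` may differ):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# Mirrors the `lora:` block above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)

# Mirrors `model.load_in_4bit: true`; the exact Auto class for Qwen3-VL may differ.
model = AutoModelForVision2Seq.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```

Restricting `target_modules` to the attention projections keeps the trainable parameter count small, which is what makes batch size 4 viable on a single quantized GPU.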

configs/qwen3vl_synthetic_dev.yaml

Lines changed: 1 addition & 0 deletions
@@ -27,3 +27,4 @@ training:
   weight_decay: 0.0
   max_grad_norm: 1.0
   logging_steps: 1
+  lr_scheduler_type: linear

docs/auto_shutoff_design.md

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
+# Auto-Shutoff Design
+
+## Problem Statement
+
+Cloud GPU training costs $0.75-$3.29/hr. Training should automatically stop when:
+1. Model has learned sufficiently (loss plateau)
+2. Maximum budget/time is reached
+3. Training is diverging (loss exploding)
+
+Without auto-shutoff, training can run indefinitely, wasting cloud credits.
+
+## Current Implementation
+
+### 1. Config-Based Early Stopping
+
+Located in `configs/qwen3vl_capture*.yaml`:
+
+```yaml
+training:
+  # INVARIANT: Training stops when loss <= 1.0
+  early_stop_loss: 1.0
+  early_stop_patience: 5  # consecutive steps below threshold
+```
+
+**Where enforced**: `openadapt_ml/training/trainer.py` in training loop.
+
+### 2. Dashboard Auto-Stop (NEW)
+
+Located in `trainer.py` dashboard JavaScript:
+
+```javascript
+const AUTO_STOP_LOSS_THRESHOLD = 1.0;
+
+// When loss <= threshold, automatically call /api/stop
+if (!autoStopTriggered && !isTrainingComplete && data.loss <= AUTO_STOP_LOSS_THRESHOLD) {
+  fetch('/api/stop', { method: 'POST' });
+}
+```
+
+**Why both?** Redundancy - if the training loop doesn't catch it, the dashboard will.
+
+## Design Principles
+
+### 1. Defense in Depth
+Multiple layers check for stop conditions:
+- Training loop (primary)
+- Dashboard monitor (secondary)
+- Max runtime limit (failsafe)
+
+### 2. Fail-Safe Defaults
+- `early_stop_loss: 1.0` - Conservative threshold that catches most convergence
+- `max_runtime: 60` minutes - Prevents runaway training
+- Instance auto-terminate on training completion
+
+### 3. Observable
+- Dashboard shows current loss vs threshold
+- Notification when auto-stop triggers
+- Terminal logs stop reason
+
+## Stop Conditions
+
+| Condition | Threshold | Where Checked | Priority |
+|-----------|-----------|---------------|----------|
+| Loss convergence | loss <= 1.0 | Training loop, Dashboard | Primary |
+| Max runtime | 60 minutes | Lambda CLI | Failsafe |
+| User stop | Button click | Dashboard /api/stop | Manual |
+| STOP_TRAINING file | File exists | Training loop | Remote trigger |
+
+## Future Enhancements
+
+### Phase 1: Configurable Thresholds (TODO)
+Add UI controls in dashboard:
+```html
+<input type="number" id="loss-threshold" value="1.0" />
+<button onclick="updateThreshold()">Update</button>
+```
+
+Store in `training_config.json` alongside `training_log.json`.
+
+### Phase 2: Cost-Based Stopping (TODO)
+Stop when estimated cost exceeds budget:
+```javascript
+const MAX_COST = 5.00; // $5 budget
+if (currentCost >= MAX_COST) triggerStop('budget_exceeded');
+```
+
+### Phase 3: Divergence Detection (TODO)
+Stop if loss is increasing consistently:
+```javascript
+const recentLosses = data.losses.slice(-10);
+const trend = calculateTrend(recentLosses);
+if (trend > 0.1) triggerStop('diverging'); // Loss increasing
+```
+
+### Phase 4: Smart Convergence (TODO)
+Use statistical methods to detect true convergence:
+- Moving average plateau detection
+- Gradient of loss curve approaching zero
+- Validation loss not improving
+
+## Implementation Checklist
+
+- [x] Config-based early_stop_loss
+- [x] Dashboard auto-stop when loss <= threshold
+- [x] Stop notification in UI
+- [x] All capture configs updated to loss <= 1.0
+- [ ] Configurable threshold in UI
+- [ ] Cost-based stopping
+- [ ] Divergence detection
+- [ ] Cumulative cost tracking across runs
+- [ ] SQLite persistence for training history
+
+## Testing
+
+To verify auto-stop works:
+
+```bash
+# Run stub training (fast, no GPU)
+uv run python -m openadapt_ml.cloud.local serve --port 8080 --stub --open
+
+# Watch dashboard - should auto-stop when loss drops below 1.0
+```
+
+## Related Files
+
+- `configs/qwen3vl_capture.yaml` - Early stop config
+- `openadapt_ml/training/trainer.py` - Dashboard with auto-stop JS
+- `openadapt_ml/training/stub_provider.py` - Early stop logic
+- `openadapt_ml/cloud/lambda_labs.py` - Instance termination
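
Two of the TODO phases in the design doc reference helpers that are not defined (`calculateTrend`) or are left abstract ("moving average plateau detection"). A minimal Python sketch of both ideas, assuming a least-squares slope over a recent window and adjacent-window moving averages (illustrative only; this code does not exist in the repo):

```python
import statistics


def slope(values: list[float]) -> float:
    """Least-squares slope of values vs. step index (positive => loss increasing)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(values)
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den if den else 0.0


def is_diverging(losses: list[float], window: int = 10, threshold: float = 0.1) -> bool:
    """Phase 3 idea: loss consistently increasing over the last `window` steps."""
    return len(losses) >= window and slope(losses[-window:]) > threshold


def has_plateaued(losses: list[float], window: int = 10, eps: float = 0.01) -> bool:
    """Phase 4 idea: moving average barely changing between adjacent windows."""
    if len(losses) < 2 * window:
        return False
    prev = statistics.fmean(losses[-2 * window:-window])
    curr = statistics.fmean(losses[-window:])
    return abs(prev - curr) < eps


# Examples
print(is_diverging([1.0 + 0.2 * i for i in range(12)]))  # True: slope 0.2 > 0.1
print(has_plateaued([0.5] * 25))                         # True: flat moving average
```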
