
Commit 7887890

feat: v0.1.0 release with benchmark integration, grounding module, and cloud training

1 parent afb5bb8


61 files changed: +8651 −200 lines

.github/workflows/publish.yml

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+name: Publish to PyPI
+
+on:
+  push:
+    tags:
+      - 'v*'
+
+jobs:
+  publish:
+    runs-on: ubuntu-latest
+    permissions:
+      id-token: write  # Required for trusted publishing
+      contents: read   # Required for checkout
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v4
+        with:
+          version: "latest"
+
+      - name: Set up Python
+        run: uv python install 3.12
+
+      - name: Build package
+        run: uv build
+
+      - name: Publish to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -31,8 +31,11 @@ synthetic_train_dev/
 # Ephemeral training/eval artifacts
 logs/
 checkpoints/
+checkpoints_*/
 /plots/
 eval_*.json
+benchmark_results/
+debug_*/

 # Internal documentation (not for public repo)
 docs/internal/

CLAUDE.md

Lines changed: 14 additions & 5 deletions
@@ -16,7 +16,7 @@ openadapt-ml is a model-agnostic, domain-agnostic ML engine for GUI automation a
 **Primary benchmark**: Windows Agent Arena (WAA)
 - 154 tasks across 11 Windows domains
 - MIT licensed, can run locally or on Azure
-- SOTA: ~19.5% success (GPT-4V + OmniParser)
+- SOTA: ~19.5% success (GPT-5.1 + OmniParser)

 **Future benchmarks** (not yet implemented):
 - WebArena/VisualWebArena (browser)
@@ -42,7 +42,16 @@ openadapt-ml is a model-agnostic, domain-agnostic ML engine for GUI automation a
 - Cloud providers: Azure (primary, free tier available), Lambda Labs (GPU rental)
 - See `docs/live_inference_design.md` for async inference architecture

-7. **Stub Training Adapter (HIGH PRIORITY)** - Always implement stub/mock providers first:
+7. **Schema Purity** - The schema must remain domain-agnostic and generic:
+   - **External systems adapt TO the schema**, not the other way around
+   - Never add fields to accommodate specific external data structures
+   - Data transformation belongs in importers/exporters, not core schema
+   - Use `raw` and `metadata` dict fields for integration-specific data
+   - If a proposed field feels specific to one use case, it doesn't belong in the schema
+   - This is a standard open-source library: users import and call functions, they don't shape the API
+   - See `openadapt_ml/schemas/` for canonical definitions
+
+8. **Stub Training Adapter (HIGH PRIORITY)** - Always implement stub/mock providers first:
    - **Never wait on real training to test UI/code changes**
    - Use `--stub` flag to simulate training progress without GPU
    - Generates fake loss curves, evaluations, checkpoints in seconds
@@ -69,7 +78,7 @@ The benchmark integration module is implemented in `openadapt_ml/benchmarks/`:

 ### APIBenchmarkAgent

-The `APIBenchmarkAgent` wraps hosted VLM APIs (Claude, GPT-4V) for benchmark evaluation baselines.
+The `APIBenchmarkAgent` wraps hosted VLM APIs (Claude, GPT-5.1) for benchmark evaluation baselines.
 This enables comparing fine-tuned models against off-the-shelf VLMs.

 ```python
@@ -79,7 +88,7 @@ from openadapt_ml.benchmarks import APIBenchmarkAgent, evaluate_agent_on_benchma
 agent = APIBenchmarkAgent(provider="anthropic")
 results = evaluate_agent_on_benchmark(agent, adapter)

-# GPT-4V baseline
+# GPT-5.1 baseline
 agent = APIBenchmarkAgent(provider="openai")
 results = evaluate_agent_on_benchmark(agent, adapter)
 ```
@@ -89,7 +98,7 @@ CLI usage:
 # Run Claude evaluation on mock tasks
 uv run python -m openadapt_ml.benchmarks.cli run-api --provider anthropic --tasks 5

-# Run GPT-4V evaluation
+# Run GPT-5.1 evaluation
 uv run python -m openadapt_ml.benchmarks.cli run-api --provider openai --tasks 5

 # Disable accessibility tree in prompts
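
The Schema Purity rule added above is easiest to see in code. Below is a minimal sketch of the pattern it describes — a generic event type whose `raw` and `metadata` dict fields absorb integration-specific data, plus an importer that adapts external data to the schema rather than the reverse. The `UIEvent` class and `import_capture_event` helper are illustrative names only, not the actual definitions in `openadapt_ml/schemas/`:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class UIEvent:
    """Hypothetical domain-agnostic GUI event (illustrative, not the real schema)."""
    action: str                     # e.g. "click", "type"
    x: float | None = None          # pointer coordinates, if any
    y: float | None = None
    text: str | None = None
    # Integration-specific data lives in dicts, never in new top-level fields.
    raw: dict[str, Any] = field(default_factory=dict)
    metadata: dict[str, Any] = field(default_factory=dict)


def import_capture_event(src: dict[str, Any]) -> UIEvent:
    """An importer adapts the external format TO the schema, not vice versa."""
    return UIEvent(
        action=src["event_type"],
        x=src.get("mouse_x"),
        y=src.get("mouse_y"),
        raw=src,                                   # original payload, preserved
        metadata={"source": "openadapt-capture"},  # provenance, not a schema field
    )
```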

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 MLDSAI Inc.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md

Lines changed: 11 additions & 11 deletions
@@ -87,8 +87,7 @@ uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
 This will evaluate and plot **Qwen3 base**, **Qwen3 FT**, **Claude Sonnet 4.5**,
 and **GPT-5.1** on the same synthetic login benchmark.

-For more details on configs, adapters, and evaluation metrics, see the sections
-below and `docs/state_and_next_steps_qwen_login.md`.
+For complete documentation including training setup, evaluation metrics, SoM mode results, and reproduction instructions, see **[`docs/qwen_login_experiment.md`](docs/qwen_login_experiment.md)**. For implementation details and technical notes, see `docs/state_and_next_steps_qwen_login.md`.

 ---

@@ -584,32 +583,33 @@ In particular:

 ---

-## 10. Training on Real Captures
+## 10. Training on Real Data

-OpenAdapt-ML can train on real GUI recordings captured with [openadapt-capture](https://github.com/OpenAdaptAI/openadapt-capture).
+OpenAdapt-ML supports training on real GUI recordings from two sources:
+1. **openadapt-capture** - New lightweight recording format
+2. **OpenAdapt database** - Original OpenAdapt recordings (legacy)

-### 10.1 Capture a workflow
+### 10.1 Training on openadapt-capture recordings
+
+[openadapt-capture](https://github.com/OpenAdaptAI/openadapt-capture) is a lightweight GUI recording tool.

 ```bash
 # Install openadapt-capture
 uv pip install openadapt-capture

 # Record a workflow (e.g., turning off Night Shift)
 openadapt-capture record --output ~/captures/turn-off-nightshift
-```

-### 10.2 Train on the capture
-
-```bash
+# Train on the capture
 uv run python -m openadapt_ml.scripts.train \
-    --config configs/qwen3vl_capture_4bit.yaml \
+    --config configs/qwen3vl_capture.yaml \
     --capture ~/captures/turn-off-nightshift \
     --open  # Opens training dashboard in browser
 ```

 The goal is automatically derived from the directory name (e.g., `"Turn off nightshift"`).

-### 10.3 Compare human vs AI predictions
+### 10.2 Compare human vs AI predictions

 ```bash
 uv run python -m openadapt_ml.scripts.compare \
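
The README's claim that the goal is "automatically derived from the directory name" presumably amounts to a small filename-to-text transform. A minimal sketch of that idea, using a hypothetical helper (the real logic in `openadapt_ml.scripts.train` may differ):

```python
from pathlib import Path


def goal_from_capture_dir(capture_dir: str) -> str:
    """Hypothetical sketch: turn a capture directory name into a goal string."""
    name = Path(capture_dir).expanduser().name  # "turn-off-nightshift"
    return name.replace("-", " ").capitalize()  # "Turn off nightshift"


print(goal_from_capture_dir("~/captures/turn-off-nightshift"))  # Turn off nightshift
```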

configs/qwen3vl_capture.yaml

Lines changed: 5 additions & 3 deletions
@@ -24,6 +24,8 @@ training:
   weight_decay: 0.01
   max_grad_norm: 0.5
   logging_steps: 1
-  # Early stopping: stop if loss stays below threshold
-  early_stop_loss: 0.01
-  early_stop_patience: 20
+  lr_scheduler_type: linear
+  # Early stopping: stop when loss <= 1.0 (INVARIANT: training should never continue past this)
+  # Loss <= 1.0 indicates the model has learned the task; further training is diminishing returns
+  early_stop_loss: 1.0
+  early_stop_patience: 5
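
The `early_stop_loss`/`early_stop_patience` pair encodes "stop once the loss has stayed at or below the threshold for N consecutive steps." A minimal sketch of that check under the semantics described in the config comments (illustrative; the enforced version lives in the training loop in `openadapt_ml/training/trainer.py`):

```python
def should_early_stop(losses: list[float],
                      early_stop_loss: float = 1.0,
                      early_stop_patience: int = 5) -> bool:
    """True once the last `early_stop_patience` losses are all <= early_stop_loss."""
    if len(losses) < early_stop_patience:
        return False
    return all(l <= early_stop_loss for l in losses[-early_stop_patience:])


# Example: threshold 1.0, patience 5
history = [2.3, 1.8, 1.2, 0.9, 0.95, 0.8, 0.7, 0.6]
print(should_early_stop(history))  # True: the last 5 losses are all <= 1.0
```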

configs/qwen3vl_capture_4bit.yaml

Lines changed: 4 additions & 3 deletions
@@ -23,7 +23,8 @@ training:
   weight_decay: 0.01
   max_grad_norm: 0.5
   logging_steps: 1
-  # Early stopping: stop when loss < threshold for N consecutive steps
-  early_stop_loss: 0.1
-  early_stop_patience: 10
+  # Early stopping: stop when loss <= 1.0 (INVARIANT: training should never continue past this)
+  # Loss <= 1.0 indicates the model has learned the task; further training is diminishing returns
+  early_stop_loss: 1.0
+  early_stop_patience: 5
   # Gradient checkpointing handled in trainer.py
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+model:
+  name: Qwen/Qwen3-VL-2B-Instruct
+  load_in_4bit: true  # Enable 4-bit quantization to reduce memory
+  # Image resolution: 512x512 = 262144 pixels for faster training (default is huge)
+  max_pixels: 262144
+
+lora:
+  r: 8
+  lora_alpha: 16
+  lora_dropout: 0.05
+  bias: none
+  target_modules:
+    - q_proj
+    - v_proj
+  task_type: CAUSAL_LM
+  weights_path: checkpoints/qwen3vl2b_capture_lora_batched
+
+training:
+  num_train_epochs: 5
+  per_device_train_batch_size: 4  # Batching enabled!
+  gradient_accumulation_steps: 1
+  learning_rate: 5.0e-5
+  warmup_ratio: 0.1
+  weight_decay: 0.01
+  max_grad_norm: 0.5
+  logging_steps: 1
+  # Learning rate scheduler: linear, cosine, constant, or none
+  lr_scheduler_type: linear
+  # Early stopping: stop when loss <= 1.0 (INVARIANT: training should never continue past this)
+  # Loss <= 1.0 indicates the model has learned the task; further training is diminishing returns
+  early_stop_loss: 1.0
+  early_stop_patience: 5
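
For readers unfamiliar with how a `lora:` section like this typically takes effect: below is a minimal sketch of the mapping onto Hugging Face `peft`, assuming the trainer forwards these keys roughly 1:1 to `LoraConfig` (illustrative; the actual wiring in `openadapt_ml/training/trainer.py` may differ):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# Mirrors the `lora:` block above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)

# Mirrors `model.load_in_4bit: true`; the exact Auto class for Qwen3-VL may differ.
model = AutoModelForVision2Seq.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```

Restricting `target_modules` to the attention projections keeps the trainable parameter count small, which is what makes batch size 4 viable on a single quantized GPU.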

configs/qwen3vl_synthetic_dev.yaml

Lines changed: 1 addition & 0 deletions
@@ -27,3 +27,4 @@ training:
   weight_decay: 0.0
   max_grad_norm: 1.0
   logging_steps: 1
+  lr_scheduler_type: linear

docs/auto_shutoff_design.md

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
+# Auto-Shutoff Design
+
+## Problem Statement
+
+Cloud GPU training costs $0.75-$3.29/hr. Training should automatically stop when:
+1. Model has learned sufficiently (loss plateau)
+2. Maximum budget/time is reached
+3. Training is diverging (loss exploding)
+
+Without auto-shutoff, training can run indefinitely, wasting cloud credits.
+
+## Current Implementation
+
+### 1. Config-Based Early Stopping
+
+Located in `configs/qwen3vl_capture*.yaml`:
+
+```yaml
+training:
+  # INVARIANT: Training stops when loss <= 1.0
+  early_stop_loss: 1.0
+  early_stop_patience: 5  # consecutive steps below threshold
+```
+
+**Where enforced**: `openadapt_ml/training/trainer.py` in training loop.
+
+### 2. Dashboard Auto-Stop (NEW)
+
+Located in `trainer.py` dashboard JavaScript:
+
+```javascript
+const AUTO_STOP_LOSS_THRESHOLD = 1.0;
+
+// When loss <= threshold, automatically call /api/stop
+if (!autoStopTriggered && !isTrainingComplete && data.loss <= AUTO_STOP_LOSS_THRESHOLD) {
+  fetch('/api/stop', { method: 'POST' });
+}
+```
+
+**Why both?** Redundancy - if the training loop doesn't catch it, the dashboard will.
+
+## Design Principles
+
+### 1. Defense in Depth
+Multiple layers check for stop conditions:
+- Training loop (primary)
+- Dashboard monitor (secondary)
+- Max runtime limit (failsafe)
+
+### 2. Fail-Safe Defaults
+- `early_stop_loss: 1.0` - Conservative threshold that catches most convergence
+- `max_runtime: 60` minutes - Prevents runaway training
+- Instance auto-terminate on training completion
+
+### 3. Observable
+- Dashboard shows current loss vs threshold
+- Notification when auto-stop triggers
+- Terminal logs stop reason
+
+## Stop Conditions
+
+| Condition | Threshold | Where Checked | Priority |
+|-----------|-----------|---------------|----------|
+| Loss convergence | loss <= 1.0 | Training loop, Dashboard | Primary |
+| Max runtime | 60 minutes | Lambda CLI | Failsafe |
+| User stop | Button click | Dashboard /api/stop | Manual |
+| STOP_TRAINING file | File exists | Training loop | Remote trigger |
+
+## Future Enhancements
+
+### Phase 1: Configurable Thresholds (TODO)
+Add UI controls in dashboard:
+```html
+<input type="number" id="loss-threshold" value="1.0" />
+<button onclick="updateThreshold()">Update</button>
+```
+
+Store in `training_config.json` alongside `training_log.json`.
+
+### Phase 2: Cost-Based Stopping (TODO)
+Stop when estimated cost exceeds budget:
+```javascript
+const MAX_COST = 5.00; // $5 budget
+if (currentCost >= MAX_COST) triggerStop('budget_exceeded');
+```
+
+### Phase 3: Divergence Detection (TODO)
+Stop if loss is increasing consistently:
+```javascript
+const recentLosses = data.losses.slice(-10);
+const trend = calculateTrend(recentLosses);
+if (trend > 0.1) triggerStop('diverging'); // Loss increasing
+```
+
+### Phase 4: Smart Convergence (TODO)
+Use statistical methods to detect true convergence:
+- Moving average plateau detection
+- Gradient of loss curve approaching zero
+- Validation loss not improving
+
+## Implementation Checklist
+
+- [x] Config-based early_stop_loss
+- [x] Dashboard auto-stop when loss <= threshold
+- [x] Stop notification in UI
+- [x] All capture configs updated to loss <= 1.0
+- [ ] Configurable threshold in UI
+- [ ] Cost-based stopping
+- [ ] Divergence detection
+- [ ] Cumulative cost tracking across runs
+- [ ] SQLite persistence for training history
+
+## Testing
+
+To verify auto-stop works:
+
+```bash
+# Run stub training (fast, no GPU)
+uv run python -m openadapt_ml.cloud.local serve --port 8080 --stub --open
+
+# Watch dashboard - should auto-stop when loss drops below 1.0
+```
+
+## Related Files
+
+- `configs/qwen3vl_capture.yaml` - Early stop config
+- `openadapt_ml/training/trainer.py` - Dashboard with auto-stop JS
+- `openadapt_ml/training/stub_provider.py` - Early stop logic
+- `openadapt_ml/cloud/lambda_labs.py` - Instance termination
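
Two of the TODO phases in the design doc reference helpers that are not defined (`calculateTrend`) or are left abstract ("moving average plateau detection"). A minimal Python sketch of both ideas, assuming a least-squares slope over a recent window and adjacent-window moving averages (illustrative only; this code does not exist in the repo):

```python
import statistics


def slope(values: list[float]) -> float:
    """Least-squares slope of values vs. step index (positive => loss increasing)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(values)
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den if den else 0.0


def is_diverging(losses: list[float], window: int = 10, threshold: float = 0.1) -> bool:
    """Phase 3 idea: loss consistently increasing over the last `window` steps."""
    return len(losses) >= window and slope(losses[-window:]) > threshold


def has_plateaued(losses: list[float], window: int = 10, eps: float = 0.01) -> bool:
    """Phase 4 idea: moving average barely changing between adjacent windows."""
    if len(losses) < 2 * window:
        return False
    prev = statistics.fmean(losses[-2 * window:-window])
    curr = statistics.fmean(losses[-window:])
    return abs(prev - curr) < eps


# Examples
print(is_diverging([1.0 + 0.2 * i for i in range(12)]))  # True: slope 0.2 > 0.1
print(has_plateaued([0.5] * 25))                         # True: flat moving average
```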
