# Auto-Shutoff Design

## Problem Statement

Cloud GPU training costs $0.75-$3.29/hr. Training should automatically stop when:
1. The model has learned sufficiently (loss plateau)
2. The maximum budget or runtime is reached
3. Training is diverging (loss exploding)

Without auto-shutoff, training can run indefinitely, wasting cloud credits.

## Current Implementation

### 1. Config-Based Early Stopping

Located in `configs/qwen3vl_capture*.yaml`:

```yaml
training:
  # INVARIANT: Training stops when loss <= 1.0
  early_stop_loss: 1.0
  early_stop_patience: 5  # consecutive steps below threshold
```

**Where enforced**: `openadapt_ml/training/trainer.py`, in the training loop.

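For reference, a minimal sketch of how a training loop might enforce `early_stop_loss` and `early_stop_patience`. This is not the actual `trainer.py` code; `step_fn` and `train_with_early_stop` are placeholder names introduced here for illustration.

```python
from typing import Callable, Iterable


def train_with_early_stop(
    batches: Iterable,
    step_fn: Callable[[object], float],
    early_stop_loss: float = 1.0,
    early_stop_patience: int = 5,
) -> str:
    """Run step_fn over batches; stop once the loss has stayed at or below
    `early_stop_loss` for `early_stop_patience` consecutive steps."""
    steps_below = 0
    for batch in batches:
        loss = step_fn(batch)  # one optimizer step; returns the scalar loss
        if loss <= early_stop_loss:
            steps_below += 1
            if steps_below >= early_stop_patience:
                return "early_stop_loss"  # converged: threshold held for `patience` steps
        else:
            steps_below = 0  # a step above the threshold resets the counter
    return "max_steps"  # ran out of data without hitting the early-stop condition
```
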
### 2. Dashboard Auto-Stop (NEW)

Located in `trainer.py` dashboard JavaScript:

```javascript
const AUTO_STOP_LOSS_THRESHOLD = 1.0;

// When loss <= threshold, automatically call /api/stop
if (!autoStopTriggered && !isTrainingComplete && data.loss <= AUTO_STOP_LOSS_THRESHOLD) {
    fetch('/api/stop', { method: 'POST' });
}
```

**Why both?** Redundancy: if the training loop doesn't catch it, the dashboard will.

## Design Principles

### 1. Defense in Depth
Multiple layers check for stop conditions:
- Training loop (primary)
- Dashboard monitor (secondary)
- Max runtime limit (failsafe)

### 2. Fail-Safe Defaults
- `early_stop_loss: 1.0` - Conservative threshold that catches most converged runs
- `max_runtime: 60` minutes - Prevents runaway training
- Instance auto-termination on training completion

### 3. Observable
- Dashboard shows current loss vs. threshold
- Notification when auto-stop triggers
- Terminal logs the stop reason

## Stop Conditions

| Condition | Threshold | Where Checked | Priority |
|-----------|-----------|---------------|----------|
| Loss convergence | loss <= 1.0 | Training loop, dashboard | Primary |
| Max runtime | 60 minutes | Lambda CLI | Failsafe |
| User stop | Button click | Dashboard `/api/stop` | Manual |
| `STOP_TRAINING` file | File exists | Training loop | Remote trigger |

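The `STOP_TRAINING` file and loss conditions reduce to cheap per-step checks inside the training loop. A minimal sketch, assuming the file sits in the working directory and that `should_stop` is a new helper introduced here (not an existing function in the codebase):

```python
from pathlib import Path

STOP_FILE = Path("STOP_TRAINING")  # assumed location; the real path may differ


def should_stop(loss: float, threshold: float = 1.0) -> str | None:
    """Return a stop reason if any stop condition is met, else None."""
    if STOP_FILE.exists():
        return "stop_file"  # remote trigger: the STOP_TRAINING file exists
    if loss <= threshold:
        return "loss_convergence"  # primary condition: loss at or below threshold
    return None
```

One way to wire the dashboard's `/api/stop` into this path would be for the handler to create the file, though the current implementation may signal the loop differently.
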
## Future Enhancements

### Phase 1: Configurable Thresholds (TODO)
Add UI controls in the dashboard:
```html
<input type="number" id="loss-threshold" value="1.0" />
<button onclick="updateThreshold()">Update</button>
```

Store in `training_config.json` alongside `training_log.json`.

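One possible shape for the persistence side: the endpoint behind `updateThreshold()` writes the new value to `training_config.json`, and the training loop re-reads it between steps. A sketch only; the file layout and field name are assumptions.

```python
import json
from pathlib import Path

# Assumed filename from the text above; it would sit alongside training_log.json.
CONFIG_PATH = Path("training_config.json")


def save_threshold(value: float) -> None:
    """Persist the dashboard-selected loss threshold."""
    config = json.loads(CONFIG_PATH.read_text()) if CONFIG_PATH.exists() else {}
    config["early_stop_loss"] = value
    CONFIG_PATH.write_text(json.dumps(config, indent=2))


def load_threshold(default: float = 1.0) -> float:
    """Read the current threshold, falling back to the config default."""
    if not CONFIG_PATH.exists():
        return default
    return float(json.loads(CONFIG_PATH.read_text()).get("early_stop_loss", default))
```
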
### Phase 2: Cost-Based Stopping (TODO)
Stop when estimated cost exceeds budget:
```javascript
const MAX_COST = 5.00; // $5 budget
if (currentCost >= MAX_COST) triggerStop('budget_exceeded');
```

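Cost-based stopping needs a running estimate of `currentCost`. A sketch that derives it from elapsed wall-clock time and the instance's hourly rate ($3.29/hr below is just the upper figure quoted in the problem statement, not a fixed rate):

```python
import time


def estimate_cost(start_time: float, hourly_rate: float) -> float:
    """Estimated spend so far, in dollars, from elapsed wall-clock time."""
    elapsed_hours = (time.time() - start_time) / 3600.0
    return elapsed_hours * hourly_rate


# Example: a $5 budget on a $3.29/hr instance.
MAX_COST = 5.00
run_start = time.time()
if estimate_cost(run_start, hourly_rate=3.29) >= MAX_COST:
    print("budget_exceeded")
```
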
### Phase 3: Divergence Detection (TODO)
Stop if loss is increasing consistently:
```javascript
const recentLosses = data.losses.slice(-10);
const trend = calculateTrend(recentLosses);
if (trend > 0.1) triggerStop('diverging'); // Loss increasing
```

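The snippet above assumes a `calculateTrend` helper that is not defined in this doc. One straightforward definition is a least-squares slope over the recent loss window; a Python sketch of the same idea (the 0.1 cutoff mirrors the snippet and would need tuning):

```python
def loss_trend(losses: list[float]) -> float:
    """Least-squares slope of loss vs. step index; positive means loss is rising."""
    n = len(losses)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(losses) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(losses))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


recent = [1.8, 1.9, 2.1, 2.4, 2.8, 3.3, 3.9, 4.6, 5.4, 6.3]
if loss_trend(recent[-10:]) > 0.1:
    print("diverging")  # the last 10 losses have been trending upward
```
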
### Phase 4: Smart Convergence (TODO)
Use statistical methods to detect true convergence (a sketch of the first approach follows the list):
- Moving-average plateau detection
- Gradient of the loss curve approaching zero
- Validation loss not improving

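A sketch of moving-average plateau detection: compare the mean loss of the two most recent windows and treat a near-zero improvement as convergence. The window size and tolerance are illustrative values, not tuned defaults.

```python
def loss_plateaued(losses: list[float], window: int = 20, tolerance: float = 0.01) -> bool:
    """True if the moving average over the last `window` steps improved by less
    than `tolerance` compared to the window before it."""
    if len(losses) < 2 * window:
        return False  # not enough history to compare two full windows
    prev_avg = sum(losses[-2 * window:-window]) / window
    last_avg = sum(losses[-window:]) / window
    return (prev_avg - last_avg) < tolerance
```
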
## Implementation Checklist

- [x] Config-based `early_stop_loss`
- [x] Dashboard auto-stop when loss <= threshold
- [x] Stop notification in UI
- [x] All capture configs updated to loss <= 1.0
- [ ] Configurable threshold in UI
- [ ] Cost-based stopping
- [ ] Divergence detection
- [ ] Cumulative cost tracking across runs
- [ ] SQLite persistence for training history

## Testing

To verify auto-stop works:

```bash
# Run stub training (fast, no GPU)
uv run python -m openadapt_ml.cloud.local serve --port 8080 --stub --open

# Watch dashboard - should auto-stop when loss drops below 1.0
```

## Related Files

- `configs/qwen3vl_capture.yaml` - Early stop config
- `openadapt_ml/training/trainer.py` - Dashboard with auto-stop JS
- `openadapt_ml/training/stub_provider.py` - Early stop logic
- `openadapt_ml/cloud/lambda_labs.py` - Instance termination