
Commit b291f66

manage-pod skill: refresh + always-compile + benchmark step
- Compound early stopping note now lists all four reset criteria (val_loss, late_legal, game_completion_rate, avg_plies_completed), matching the trainer change in this PR.
- New "Benchmark the Pod" startup step before launching trials, so the agent has ground-truth step times, compile speedup, and concurrency scaling for *this* pod when planning.
- Always use torch.compile by default; the warmup is cheap relative to the 1.5-2.2x speedup, even for short runs.
- VRAM caveat removed (skill is pod-focused; pod GPUs aren't VRAM-constrained).
- max_seq_len default updated to 512.
- Tools reference: add lab_resume, document the tag filter on lab_results, the health_warning event type, and graceful-checkpoint behavior on lab_kill.
- Drop the stale 15-30 min compile overhead figure; replace with the measured 10-30 s (NVIDIA) / 1-2 min (AMD) numbers.
- Note that uv run works in dev images post #53.
- .dockerignore: un-ignore .claude/skills so the manage-pod skill ships with the dev image (the rest of .claude stays excluded).
1 parent 681ab32 commit b291f66

File tree

2 files changed: +31, -14 lines

.claude/skills/manage-pod/SKILL.md

Lines changed: 30 additions & 14 deletions
@@ -32,8 +32,23 @@ First, read `runs/lab-notes.md` — this is your handwritten research log with w
 
 Also read `/workspace/pod_manager.md` (auto-maintained by the lab server) for the structured tables and recent events.
 
+### 3. Benchmark the Pod (skip if a benchmark for this pod already exists in lab-notes)
 
-### 3. Start Check-In Cron
+Before launching real trials, run `scripts/benchmark.py` to characterize the actual performance of *this specific pod*. GPU model name alone isn't enough — driver version, CUDA/ROCm version, host CPU, PCIe topology, and thermal headroom all affect throughput. The benchmark script gives you ground-truth numbers for:
+
+- **Step time per variant** (eager vs. compiled) → informs how long a target step count will actually take
+- **Compile speedup ratio** → confirms `torch.compile` is working and tells you if any models hit a known compile-bug regression
+- **Concurrency scaling** → how many models you can fit per GPU and the total throughput at each level (relevant for sweeps and `train_all.py`)
+- **Adapter step times** → cost estimate for adapter sweeps
+- **Engine throughput** → tells you if the data pipeline will keep the GPU fed
+
+```bash
+python3 scripts/benchmark.py --json /workspace/benchmark_<pod-name>.json 2>&1 | tee /workspace/benchmark_<pod-name>.txt
+```
+
+This takes ~5-10 minutes. Skim the output, capture the headline numbers (compiled step time for `base`, compile speedup, peak concurrency throughput) into `runs/lab-notes.md`, and use them to plan everything that follows. If a previous run already benchmarked this pod and the numbers are in lab-notes, you can skip this step.
+
+### 4. Start Check-In Cron
 
 Use CronCreate to schedule a recurring check-in. The interval depends on the workload — see [Check-In Intervals](#check-in-intervals) below.
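With the measured step times in hand, trial planning is simple arithmetic. A back-of-envelope sketch (the step times below are placeholders, not real measurements; substitute the values from the benchmark output for your pod):

```python
# Rough trial-duration planner from benchmarked step times.
# The step times below are PLACEHOLDERS -- read the real values
# from the benchmark JSON written to /workspace for this pod.

def trial_eta_minutes(total_steps: int, step_time_s: float) -> float:
    """Wall-clock estimate for a run at a measured per-step time."""
    return total_steps * step_time_s / 60.0

compiled_step_s = 0.045  # placeholder: compiled step time for `base`
eager_step_s = 0.090     # placeholder: eager step time for `base`

print(f"compile speedup: {eager_step_s / compiled_step_s:.1f}x")
print(f"20K steps, compiled: {trial_eta_minutes(20_000, compiled_step_s):.0f} min")
```

The same arithmetic answers the planning questions the benchmark step exists for: whether a target step count fits the check-in interval, and whether the compile warmup is amortized.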

@@ -46,7 +61,7 @@ The cron is your heartbeat — it's how you:
 - Make strategic decisions (what to try next, kill underperformers)
 - Warn about idle pods
 
-### 4. Act on the Objective
+### 5. Act on the Objective
 
 **Monitoring an existing run:**
 - Call `lab_status` to see what's running
@@ -65,12 +80,12 @@ You drive the loop manually. At each check-in:
 - `lab_launch(config={"run_type": "pretrain", "variant": "base", "max_seq_len": 512, ...})`
 - Key tunable parameters:
 - **Architecture:** `d_model`, `n_layers`, `n_heads`, `d_ff` (override variant defaults)
-- **Sequence length:** `max_seq_len` (default 256; set to 512 for long-game training)
+- **Sequence length:** `max_seq_len` (default 512 for long-game training)
 - **Data generation:** `mate_boost` (0.0-1.0), `discard_ply_limit` (bool), `no_outcome_token` (bool — strips outcome conditioning)
 - **Training:** `lr`, `batch_size`, `accumulation_steps`, `warmup_frac`, `weight_decay`
+- **Validation:** `val_games` (default 512; bump to 2048+ for finer forfeit-rate detection)
 - **Early stopping:** `patience` (evals without improvement), `legality_late_ply` (ply threshold for late-game legality, defaults to `max_seq_len // 2`)
-- **Compound early stopping:** Patience resets when *either* val_loss *or* late-game legality improves. This keeps training alive when loss plateaus but the model is still learning to play legal moves deeper into games. Check `lab_log` to see the patience counter (`pat N/M`).
-- **VRAM note:** 512-token sequences double attention memory. Start with smaller `batch_size` (128 or 64) and use `accumulation_steps` to maintain effective batch size.
+- **Compound early stopping:** Patience resets when *any* of the following improves: `val_loss`, `late_legal_move_rate`, `game_completion_rate`, or `avg_plies_completed`. This keeps training alive when loss plateaus but the model is still learning to play longer stretches of legal moves — which has been the dominant late-phase signal. Check `lab_log` to see the patience counter (`pat N/M`) and the per-eval line: `complete X.XXX | avg_ply N | forfeit [min-max med N]`.
 
 **Single training run:**
 - Call `lab_launch` with the strategy and exact params
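Putting the tunables above together, a launch config might look like the following sketch (the parameter values here are illustrative choices, not recommendations from the skill doc):

```python
# Illustrative PretrainConfig-style dict for lab_launch, built from
# the tunable parameters listed above. Values are examples only.
config = {
    "run_type": "pretrain",
    "variant": "base",
    "max_seq_len": 512,        # the new default
    "lr": 5e-4,
    "batch_size": 128,
    "accumulation_steps": 2,   # effective batch of 256
    "warmup_frac": 0.05,
    "val_games": 2048,         # finer forfeit-rate detection
    "patience": 5,             # evals without improvement
}
# lab_launch(config=config)   # call lab_schema first to verify fields
```

Calling `lab_schema` before launching (as the tools table advises) is the way to confirm which of these fields actually exist and what their valid ranges are.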
@@ -115,9 +130,9 @@ To switch intervals, delete the old cron with `CronDelete` and create a new one.
 | Tool | Purpose |
 |------|---------|
 | `lab_status` | GPUs, running trials with ETAs, cost. Best for first-contact orientation. Only updates val_loss/acc at eval intervals, so don't poll repeatedly for progress. |
-| `lab_results` | All trials with metrics + Pareto front + Optuna suggestions. Optionally pass `strategy` to filter. |
-| `lab_schema` | Returns JSON Schema for all RunConfig fields. Call before `lab_launch` to discover available parameters. |
-| `lab_events` | New events since sequence N. Types: trial_started, trial_completed, trial_failed, trial_killed, gpu_idle. Call at every check-in. |
+| `lab_results` | All trials with metrics + Pareto front + Optuna suggestions. Optionally filter by `strategy` and/or `tag`. |
+| `lab_schema` | Returns JSON Schema for `PretrainConfig` and `AdapterConfig`. Call before `lab_launch` to discover available parameters. |
+| `lab_events` | New events since sequence N (auto-tracked if omitted; pass `since=0` for full history). Types: trial_started, trial_completed, trial_failed, trial_killed, gpu_idle, health_warning. Call at every check-in. |
 
 ### Monitoring tools (call at check-ins)
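At a check-in, the event stream can be triaged before anything else. A sketch, assuming events arrive as dicts with a `type` field (the exact return shape of `lab_events` is not documented here):

```python
# Triage a batch of events: surface failures and health warnings
# first. The event-dict shape below is an assumption for illustration.
URGENT_TYPES = {"trial_failed", "health_warning"}

def triage(events):
    """Split events into (urgent, routine) by type."""
    urgent = [e for e in events if e["type"] in URGENT_TYPES]
    routine = [e for e in events if e["type"] not in URGENT_TYPES]
    return urgent, routine

events = [
    {"seq": 41, "type": "trial_completed", "trial": "t12"},
    {"seq": 42, "type": "health_warning", "detail": "GPU thermal throttle"},
    {"seq": 43, "type": "gpu_idle", "gpu": 1},
]
urgent, routine = triage(events)
print([e["seq"] for e in urgent])   # the health_warning event
```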

@@ -130,8 +145,9 @@ To switch intervals, delete the old cron with `CronDelete` and create a new one.
 
 | Tool | Purpose |
 |------|---------|
-| `lab_launch` | Launch one trial from a RunConfig dict. Call `lab_schema` first to see all fields. |
-| `lab_kill` | Kill a trial by ID (SIGTERM) |
+| `lab_launch` | Launch one trial from a RunConfig dict. Call `lab_schema` first to see all fields. Optionally pass `tags` for grouping. |
+| `lab_resume` | Resume a completed/paused trial from its best checkpoint. Creates a new trial with the same config plus `--resume`. Can override `total_steps` or `pause_after_steps`. |
+| `lab_kill` | Kill a trial by ID (SIGTERM). The trainer handles SIGTERM gracefully and writes a final checkpoint before exiting, so resume/launch with `resume=<path>` picks up where it left off — no need to wait for a 5K-interval checkpoint. |
 | `lab_set_cost` | Set $/hr rate for cost tracking |
 
 ---
@@ -171,7 +187,7 @@ You are the decision-maker. Your job at each check-in is **expert judgment**:
 - What does it tell us about this strategy/param range?
 
 4. **Strategic decisions.** Based on accumulated results:
-- What should the next trial be? Check `lab_results(suggest_strategy=...)` for an Optuna suggestion as a starting point.
+- What should the next trial be? `lab_results` returns Optuna suggestions alongside the results table — use them as a starting point when they're in-range.
 - Should we change phase? (exploration → exploitation → validation)
 - Should we kill any running trials that look unpromising?
 - Are there obvious gaps in coverage?
@@ -185,14 +201,14 @@ You are the decision-maker. Your job at each check-in is **expert judgment**:
 
 ## Infrastructure Notes (PAWN-specific)
 
-- **`uv run` may be broken** on pods. Use `python3` directly if `uv` fails.
+- **`uv run`** works on the dev image (PR #53 registered the engine wheel in uv's workspace). Runtime images install the engine via `uv pip install` outside the workspace — use `python3` directly there if `uv run` complains about the workspace member.
 - **Persistent storage** is at `/workspace`. Code is at `/opt/pawn`. Always write results to `/workspace`.
 - **`--log-dir /workspace/logs`** — always pass this explicitly.
 - **`--local-checkpoints`** — use unless you have a specific HF repo.
-- **`--no-compile`** for trials under 20K steps. `torch.compile` overhead is 15-30 min.
+- **Always use `torch.compile`** (the default). Warmup is ~10-30 s on NVIDIA and ~1-2 min on AMD, then step time is steady. Even short exploration runs amortize the cost, and the compile speedup (1.5-2.2x) is too valuable to give up. Only pass `--no-compile` if compile is actively broken on the target hardware (e.g. it has been an issue with adapters >20M params on MI300X in the past — verify with the benchmark step before assuming).
 - **`--num-workers`** — keep total across all processes under `nproc - 4`.
 - **AMP float16 can NaN** at high learning rates (>7e-4) after 25-40K steps. Use `--amp-dtype bfloat16` for long runs.
-- **SIGTERM for graceful shutdown** of training processes. They save a checkpoint before exiting.
+- **SIGTERM for graceful shutdown** of training processes. They save a final checkpoint at the current step (not the last 5K-interval one) before exiting, so `lab_kill` + relaunch loses almost no work.
 - **HF backups**: Periodically `hf sync /workspace hf://buckets/<repo>` if a bucket is configured.
 
 ## Lab Notes

.dockerignore

Lines changed: 1 addition & 0 deletions
@@ -1,5 +1,6 @@
 .git
 .claude
+!.claude/skills
 .venv
 .env
 __pycache__
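Note the ordering: in `.dockerignore`, the last matching pattern wins, so the `!.claude/skills` exception only takes effect because it follows the broader `.claude` exclusion it carves out of:

```
# exclude the whole directory, then re-include just the skills subtree
.claude
!.claude/skills
```

Reversed, `.claude` would match last and the skills directory would still be excluded from the build context.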
