**.claude/skills/manage-pod/SKILL.md** (30 additions, 14 deletions)
@@ -32,8 +32,23 @@ First, read `runs/lab-notes.md` — this is your handwritten research log with w

Also read `/workspace/pod_manager.md` (auto-maintained by the lab server) for the structured tables and recent events.

### 3. Benchmark the Pod (skip if a benchmark for this pod already exists in lab-notes)

Before launching real trials, run `scripts/benchmark.py` to characterize the actual performance of *this specific pod*. GPU model name alone isn't enough — driver version, CUDA/ROCm version, host CPU, PCIe topology, and thermal headroom all affect throughput. The benchmark script gives you ground-truth numbers for:

- **Step time per variant** (eager vs. compiled) → informs how long a target step count will actually take
- **Compile speedup ratio** → confirms `torch.compile` is working and tells you if any models hit a known compile-bug regression
- **Concurrency scaling** → how many models you can fit per GPU and the total throughput at each level (relevant for sweeps and `train_all.py`)
- **Adapter step times** → cost estimate for adapter sweeps
- **Engine throughput** → tells you if the data pipeline will keep the GPU fed

```bash
python3 scripts/benchmark.py --json /workspace/benchmark_<pod-name>.json 2>&1 | tee /workspace/benchmark_<pod-name>.txt
```

This takes ~5-10 minutes. Skim the output, capture the headline numbers (compiled step time for `base`, compile speedup, peak concurrency throughput) into `runs/lab-notes.md`, and use them to plan everything that follows. If a previous run already benchmarked this pod and the numbers are in lab-notes, you can skip this step.
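Capturing the headline numbers can be scripted against the JSON output. A minimal sketch, assuming a plausible report shape (the key names `variants`, `concurrency`, `eager_step_s`, etc. are hypothetical; check what `benchmark.py` actually emits and adjust):

```python
import json

def headline_numbers(report: dict) -> dict:
    """Extract the numbers worth copying into runs/lab-notes.md.

    NOTE: the key names used here are assumptions about the report
    schema, not the documented output of benchmark.py.
    """
    base = report["variants"]["base"]
    return {
        "base_compiled_step_s": base["compiled_step_s"],
        # Speedup = eager step time / compiled step time.
        "compile_speedup": base["eager_step_s"] / base["compiled_step_s"],
        # Best total throughput across all concurrency levels tried.
        "peak_concurrency_throughput": max(
            level["throughput"] for level in report["concurrency"]
        ),
    }

# Illustrative report in the assumed shape; in practice this would be
# json.load(open("/workspace/benchmark_<pod-name>.json")).
sample = {
    "variants": {"base": {"eager_step_s": 0.40, "compiled_step_s": 0.20}},
    "concurrency": [
        {"models": 1, "throughput": 4.8},
        {"models": 2, "throughput": 7.5},
    ],
}
print(headline_numbers(sample))
```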
### 4. Start Check-In Cron
Use CronCreate to schedule a recurring check-in. The interval depends on the workload — see [Check-In Intervals](#check-in-intervals) below.

@@ -46,7 +61,7 @@ The cron is your heartbeat — it's how you:

- Make strategic decisions (what to try next, kill underperformers)
- Warn about idle pods

### 5. Act on the Objective

**Monitoring an existing run:**
- Call `lab_status` to see what's running
@@ -65,12 +80,12 @@ You drive the loop manually. At each check-in:

- **Validation:** `val_games` (default 512; bump to 2048+ for finer forfeit-rate detection)
- **Early stopping:** `patience` (evals without improvement), `legality_late_ply` (ply threshold for late-game legality, defaults to `max_seq_len // 2`)
- **Compound early stopping:** Patience resets when *any* of the following improve: `val_loss`, `late_legal_move_rate`, `game_completion_rate`, or `avg_plies_completed`. This keeps training alive when loss plateaus but the model is still learning to play longer stretches of legal moves — which has been the dominant late-phase signal. Check `lab_log` to see the patience counter (`pat N/M`) and the per-eval line: `complete X.XXX | avg_ply N | forfeit [min-max med N]`.
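As a mental model, the compound reset rule sketches to something like this (a simplified illustration of the rule described above, not the trainer's actual code):

```python
class CompoundPatience:
    """Early stopping where improvement in *any* tracked metric resets patience.

    Simplified sketch: the real trainer tracks val_loss (lower is better)
    plus late_legal_move_rate, game_completion_rate, and
    avg_plies_completed (higher is better).
    """

    def __init__(self, patience: int):
        self.patience = patience
        self.bad_evals = 0
        self.best = {}  # metric name -> best value seen so far

    def update(self, metrics: dict, lower_is_better=("val_loss",)) -> bool:
        """Record one eval. Returns True when training should stop."""
        improved = False
        for name, value in metrics.items():
            best = self.best.get(name)
            # Flip the sign for lower-is-better metrics so ">" means "improved".
            sign = -1 if name in lower_is_better else 1
            if best is None or sign * value > sign * best:
                self.best[name] = value
                improved = True
        self.bad_evals = 0 if improved else self.bad_evals + 1
        return self.bad_evals >= self.patience

stopper = CompoundPatience(patience=2)
stopper.update({"val_loss": 1.90, "late_legal_move_rate": 0.80})  # baseline eval
stopper.update({"val_loss": 1.90, "late_legal_move_rate": 0.83})  # loss flat, legality up: reset
```

The key property is that a flat `val_loss` alone never triggers a stop while the legality or completion metrics are still climbing.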
**Single training run:**
- Call `lab_launch` with the strategy and exact params
@@ -115,9 +130,9 @@ To switch intervals, delete the old cron with `CronDelete` and create a new one.

| Tool | Purpose |
|------|---------|
| `lab_status` | GPUs, running trials with ETAs, cost. Best for first-contact orientation. Only updates val_loss/acc at eval intervals, so don't poll repeatedly for progress. |
| `lab_results` | All trials with metrics + Pareto front + Optuna suggestions. Optionally filter by `strategy` and/or `tag`. |
| `lab_schema` | Returns JSON Schema for `PretrainConfig` and `AdapterConfig`. Call before `lab_launch` to discover available parameters. |
| `lab_events` | New events since sequence N (auto-tracked if omitted; pass `since=0` for full history). Types: trial_started, trial_completed, trial_failed, trial_killed, gpu_idle, health_warning. Call at every check-in. |
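The sequence-cursor bookkeeping behind `lab_events` amounts to "remember the highest sequence number you've seen". A pure-Python sketch (here `fetch` is a stand-in for the actual tool call, whose real signature may differ):

```python
from typing import Callable

def poll_events(fetch: Callable[[int], list], since: int):
    """One check-in: fetch events newer than `since`, return (new_cursor, events).

    Each event is assumed to carry a monotonically increasing "seq" field;
    `fetch` stands in for the lab_events tool (hypothetical interface).
    """
    events = fetch(since)
    if events:
        since = max(e["seq"] for e in events)
    return since, events

# Stubbed event stream for illustration:
log = [
    {"seq": 1, "type": "trial_started"},
    {"seq": 2, "type": "trial_completed"},
]
fetch = lambda since: [e for e in log if e["seq"] > since]

cursor, new = poll_events(fetch, since=0)  # first check-in: both events are new
cursor, new = poll_events(fetch, cursor)   # subsequent check-in: nothing new
```

Passing `since=0` replays the full history, matching the table's description.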
### Monitoring tools (call at check-ins)

@@ -130,8 +145,9 @@ To switch intervals, delete the old cron with `CronDelete` and create a new one.

| Tool | Purpose |
|------|---------|
| `lab_launch` | Launch one trial from a RunConfig dict. Call `lab_schema` first to see all fields. Optionally pass `tags` for grouping. |
| `lab_resume` | Resume a completed/paused trial from its best checkpoint. Creates a new trial with the same config plus `--resume`. Can override `total_steps` or `pause_after_steps`. |
| `lab_kill` | Kill a trial by ID (SIGTERM). The trainer handles SIGTERM gracefully and writes a final checkpoint before exiting, so resume/launch with `resume=<path>` picks up where it left off — no need to wait for a 5K-interval checkpoint. |
| `lab_set_cost` | Set $/hr rate for cost tracking |
---
@@ -171,7 +187,7 @@ You are the decision-maker. Your job at each check-in is **expert judgment**:

   - What does it tell us about this strategy/param range?

4. **Strategic decisions.** Based on accumulated results:
   - What should the next trial be? `lab_results` returns Optuna suggestions alongside the results table — use them as a starting point when they're in-range.
   - Should we change phase? (exploration → exploitation → validation)
   - Should we kill any running trials that look unpromising?
   - Are there obvious gaps in coverage?
@@ -185,14 +201,14 @@ You are the decision-maker. Your job at each check-in is **expert judgment**:
185
201
186
202
## Infrastructure Notes (PAWN-specific)
187
203
188
-
-**`uv run` may be broken** on pods. Use `python3` directly if `uv` fails.
204
+
-**`uv run`**works on the dev image (PR #53 registered the engine wheel in uv's workspace). Runtime images install the engine via `uv pip install` outside the workspace — use `python3` directly there if `uv run` complains about the workspace member.
189
205
-**Persistent storage** is at `/workspace`. Code is at `/opt/pawn`. Always write results to `/workspace`.
190
206
-**`--log-dir /workspace/logs`** — always pass this explicitly.
191
207
-**`--local-checkpoints`** — use unless you have a specific HF repo.
192
-
-**`--no-compile`**for trials under 20K steps. `torch.compile`overhead is 15-30 min.
208
+
-**Always use `torch.compile`**(the default). Warmup is ~10-30 s on NVIDIA and ~1-2 min on AMD, then step time is steady. Even short exploration runs amortize the cost, and the compile speedup (1.5-2.2x) is too valuable to give up. Only pass `--no-compile`if compile is actively broken on the target hardware (e.g. has been an issue with adapters >20M params on MI300X in the past — verify with the benchmark step before assuming).
193
209
-**`--num-workers`** — keep total across all processes under `nproc - 4`.
194
210
-**AMP float16 can NaN** at high learning rates (>7e-4) after 25-40K steps. Use `--amp-dtype bfloat16` for long runs.
195
-
-**SIGTERM for graceful shutdown** of training processes. They save a checkpoint before exiting.
211
+
-**SIGTERM for graceful shutdown** of training processes. They save a final checkpoint at the current step (not the last 5K-interval) before exiting, so `lab_kill` + relaunch loses almost no work.
196
212
-**HF backups**: Periodically `hf sync /workspace hf://buckets/<repo>` if a bucket is configured.
---

**README.md** (11 additions, 0 deletions)
@@ -10,6 +10,16 @@ Feel free to use PAWN in your own experiments. Note that PAWN was developed as a

**PAWN is under active development and is not yet stable. All results are preliminary.**

> [!important]
> I am actively re-training the model with:
>
> - A new vocabulary borrowed from Google DeepMind's [searchless_chess project (Amortized Planning with Large-Scale Transformers: A Case Study on Chess)](https://github.com/google-deepmind/searchless_chess), which doesn't include impossible moves.
> - A wider 512-token context window.
>
> The information below applies to the existing models, which use the previous architecture. The last commit prior to these changes is tagged [pre-vocab-transition](https://github.com/thomas-schweich/PAWN/tree/pre-vocab-transition). View the repository at that commit to see the implementation of the previous architecture.
## Model Variants
Three sizes, trained for 100K steps on random games (~25.6M games each):
@@ -197,6 +207,7 @@ PAWN builds on ideas and tools from the following projects and publications:

| Linear probes | [Alain & Bengio, "Understanding Intermediate Layers Using Linear Classifier Probes", ICLR Workshop 2017](https://arxiv.org/abs/1610.01644) |
| MAIA | [McIlroy-Young et al., "Aligning Superhuman AI with Human Behavior: Chess as a Model System", KDD 2020](https://arxiv.org/abs/2006.01855) |
| AlphaZero | [Silver et al., "A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play", Science 2018](https://arxiv.org/abs/1712.01815) |
| Searchless Chess | [Ruoss et al., "Amortized Planning with Large-Scale Transformers: A Case Study on Chess", 2024](https://arxiv.org/abs/2402.04494) |
| Leela Chess Zero | [github.com/LeelaChessZero/lc0](https://github.com/LeelaChessZero/lc0) |