Commit c4f3744
Add co-training support to pawn-lab MCP server (#58)
* Add co-training support to pawn-lab MCP server
lab_launch now accepts run_type="cotrain" with a list of variant specs,
enabling multi-model pretraining runs (the equivalent of train_all.py)
to be launched, monitored, killed, and resumed through the lab tools.
- Add CotrainVariant and CotrainConfig to pawn/run_config.py with
validators for unique names, non-empty variants, and shm/hf coupling
- Extract ModelSlot + training loop from train_all.py into pawn/cotrain.py
with resume support (per-variant checkpoint loading) and pause_after_steps
- Convert scripts/train_all.py to thin CLI shim over run_cotrain()
- Add cotrain dispatch branch in scripts/train.py
- Update lab runner: _validate_config accepts cotrain, resume_trial
discovers per-variant checkpoints and sets per-variant resume paths
- Update lab monitor: multi-file metrics discovery for cotrain trials
with per-variant offset tracking and aggregation to trial level
- Update lab server: lab_schema exposes cotrain, updated docstrings
- Add Trial.variants field for per-variant state tracking
- 19 new tests (config validation, serialization, monitor aggregation)
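The validators listed above (unique names, non-empty variants) could look roughly like the sketch below. This is an illustration only: `VariantSpec` and `CotrainSpec` are hypothetical stand-ins rather than the real `CotrainVariant` and `CotrainConfig` in pawn/run_config.py, and the shm/hf coupling check is left out because its rules are not described here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VariantSpec:
    """Hypothetical stand-in for CotrainVariant: one model in the run."""
    name: str
    resume: str | None = None  # per-variant checkpoint path, if resuming


@dataclass
class CotrainSpec:
    """Hypothetical stand-in for CotrainConfig."""
    variants: list[VariantSpec] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A cotrain run with no variants has nothing to train.
        if not self.variants:
            raise ValueError("cotrain requires at least one variant")
        # Variant names must be unique so per-variant state can be keyed on them.
        names = [v.name for v in self.variants]
        dupes = sorted({n for n in names if names.count(n) > 1})
        if dupes:
            raise ValueError(f"duplicate variant names: {dupes}")


cfg = CotrainSpec(variants=[VariantSpec("small"), VariantSpec("base")])
```

Validating at construction time means a bad spec fails before any process is launched.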
* Fix pyright errors: drop redundant epoch parameter from log_train/log_val
The explicit `epoch: int | None` parameter alongside `**metrics: object`
caused pyright to reject callers that spread a `dict[str, float]` (since
a key named "epoch" would be float, not int). The parameter was redundant
— epoch flows through **metrics like every other field. Removing it fixes
all 11 pyright errors in pawn/cotrain.py at the source.
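A minimal sketch of the shape after the fix, assuming a logger that takes a step plus arbitrary metrics. The `step` parameter and the returned record are assumptions for illustration, not the actual log_train signature.

```python
from __future__ import annotations


def log_train(step: int, **metrics: object) -> dict[str, object]:
    """Hypothetical logger: epoch, if present, travels inside **metrics."""
    return {"type": "train", "step": step, **metrics}


# With an extra explicit `epoch: int | None` parameter, pyright has to assume
# that a spread dict[str, float] may supply "epoch" as a float, which does not
# match `int | None`. With only **metrics: object, any value type is accepted.
metrics: dict[str, float] = {"loss": 0.42, "epoch": 3}
record = log_train(10, **metrics)
```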
* Address PR review feedback
Bug fixes:
- Pass sdpa_math/no_compile/no_amp flags to configure_gpu() in
run_cotrain so --sdpa-math actually takes effect
- Fix _extract_variant_name to handle underscores in variant names by
joining parts[3:-1] (variant is between timestamp and slug)
- Reject top-level 'resume' field on CotrainConfig with a helpful error
directing users to per-variant resume fields
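The variant-name fix can be sketched as follows, assuming run directories are named run_DATE_TIME_VARIANT_SLUG and that the date, time, and slug contain no underscores themselves. The function below is a standalone illustration, not the actual `_extract_variant_name`, and the directory names in the usage lines are made up.

```python
def extract_variant_name(dir_name: str) -> str:
    """Return the variant portion of a run_DATE_TIME_VARIANT_SLUG name."""
    parts = dir_name.split("_")
    # parts is ["run", DATE, TIME, <one or more variant pieces>, SLUG]
    if len(parts) < 5 or parts[0] != "run":
        raise ValueError(f"unexpected run directory name: {dir_name!r}")
    # Everything between the timestamp and the trailing slug is the variant,
    # so underscores inside the variant name are preserved by re-joining.
    return "_".join(parts[3:-1])


simple = extract_variant_name("run_20260101_120000_base_ab12cd")
compound = extract_variant_name("run_20260101_120000_small_wide_ab12cd")
```

Taking a single element such as parts[3] would return only `small` for the second name, which is the failure mode the fix addresses.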
Improvements:
- Rename _find_best_checkpoint → _find_latest_checkpoint with accurate
docstring (pretrain/cotrain only write step_* dirs, no best/ symlink)
- Populate last_train_acc for cotrain trials in monitor aggregation
- Add explicit multiprocessing_context="spawn" to DataLoader in
run_cotrain to prevent rayon deadlocks independent of caller context
- Fix misleading comment on _hf_push_future.result() — the closure
catches all exceptions, so result() blocks but never raises
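The point about `_hf_push_future.result()` can be illustrated with a small sketch, assuming the future comes from a `concurrent.futures` executor (the commit does not say which executor is used). `push_checkpoint` and `safe_push` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

errors: list[str] = []


def push_checkpoint() -> None:
    """Hypothetical upload step that fails."""
    raise RuntimeError("simulated upload failure")


def safe_push() -> None:
    """Closure-style wrapper: every exception is caught and recorded."""
    try:
        push_checkpoint()
    except Exception as exc:
        errors.append(str(exc))


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(safe_push)
    # result() blocks until the task finishes. Because safe_push swallows
    # the error, nothing is re-raised here and the return value is None.
    outcome = future.result()
```

Since the wrapper never lets an exception escape, failures have to be surfaced through whatever the wrapper records, not through `result()`.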
Tests:
- Fix test dir names to match actual MetricsLogger format
(run_DATE_TIME_VARIANT_SLUG not run_DATE_TIME_SLUG_VARIANT)
- Add test for underscore-containing variant names
- Add test for top-level resume rejection
1 parent 9bf71c9 commit c4f3744
11 files changed: +1312 -531 lines changed