Add co-training support to pawn-lab MCP server #58
`lab_launch` now accepts `run_type="cotrain"` with a list of variant specs, enabling multi-model pretraining runs (the equivalent of `train_all.py`) to be launched, monitored, killed, and resumed through the lab tools.

- Add `CotrainVariant` and `CotrainConfig` to `pawn/run_config.py` with validators for unique names, non-empty variants, and shm/hf coupling
- Extract `ModelSlot` + training loop from `train_all.py` into `pawn/cotrain.py` with resume support (per-variant checkpoint loading) and `pause_after_steps`
- Convert `scripts/train_all.py` to a thin CLI shim over `run_cotrain()`
- Add cotrain dispatch branch in `scripts/train.py`
- Update lab runner: `_validate_config` accepts cotrain; `resume_trial` discovers per-variant checkpoints and sets per-variant resume paths
- Update lab monitor: multi-file metrics discovery for cotrain trials with per-variant offset tracking and aggregation to trial level
- Update lab server: `lab_schema` exposes cotrain; updated docstrings
- Add `Trial.variants` field for per-variant state tracking
- 19 new tests (config validation, serialization, monitor aggregation)
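The config validators mentioned above (unique names, non-empty variants) might look roughly like this minimal sketch; the field shapes are assumptions beyond what the description states, and the real `CotrainConfig` carries many more fields:

```python
from dataclasses import dataclass

@dataclass
class CotrainVariant:
    # Hypothetical minimal shape; the real spec also carries
    # arbitrary architecture overrides.
    name: str
    variant: str  # e.g. "toy", "small", "base"

@dataclass
class CotrainConfig:
    variants: list[CotrainVariant]

    def __post_init__(self) -> None:
        # Validator: at least one variant must be specified.
        if not self.variants:
            raise ValueError("cotrain requires at least one variant")
        # Validator: variant names must be unique, since they key
        # per-variant checkpoints and metrics files.
        names = [v.name for v in self.variants]
        if len(names) != len(set(names)):
            raise ValueError(f"duplicate variant names: {names}")
```

Constructing a config with duplicate or empty variants raises immediately, before a run is launched.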
Claude finished @thomas-schweich's task in 3m 56s. PR #58 Code Review

Overall this is a clean, well-structured extraction.

Bug:
…_val: the explicit `epoch: int | None` parameter alongside `**metrics: object` caused pyright to reject callers that spread a `dict[str, float]`, since a key named "epoch" would then be typed `float`, not `int`. The parameter was redundant: epoch flows through `**metrics` like every other field. Removing it fixes all 11 pyright errors in `pawn/cotrain.py` at the source.
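A minimal illustration of that fix (the logger method name below is hypothetical, since the real name is elided above):

```python
# Before (pyright error when a caller spreads a dict[str, float]):
#     def log_val(self, epoch: int | None = None, **metrics: object) -> None: ...
#     logger.log_val(**row)  # "epoch" may map to a float, not int | None

# After: epoch travels through **metrics like every other field.
def log_val(**metrics: object) -> None:
    # A real implementation would write the record out; this sketch just
    # accepts any keyword fields uniformly.
    return None

row: dict[str, float] = {"loss": 0.5, "epoch": 3.0}
log_val(**row)  # type-checks: every value is acceptable as object
```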
Bug fixes:
- Pass `sdpa_math`/`no_compile`/`no_amp` flags to `configure_gpu()` in `run_cotrain` so `--sdpa-math` actually takes effect
- Fix `_extract_variant_name` to handle underscores in variant names by joining `parts[3:-1]` (the variant sits between the timestamp and the slug)
- Reject a top-level `resume` field on `CotrainConfig` with a helpful error directing users to the per-variant resume fields

Improvements:
- Rename `_find_best_checkpoint` → `_find_latest_checkpoint` with an accurate docstring (pretrain/cotrain only write `step_*` dirs; there is no `best/` symlink)
- Populate `last_train_acc` for cotrain trials in monitor aggregation
- Add explicit `multiprocessing_context="spawn"` to the `DataLoader` in `run_cotrain` to prevent rayon deadlocks independent of caller context
- Fix the misleading comment on `_hf_push_future.result()`: the closure catches all exceptions, so `result()` blocks but never raises

Tests:
- Fix test dir names to match the actual MetricsLogger format (`run_DATE_TIME_VARIANT_SLUG`, not `run_DATE_TIME_SLUG_VARIANT`)
- Add a test for underscore-containing variant names
- Add a test for top-level resume rejection
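The underscore fix described above can be sketched as follows, assuming the `run_DATE_TIME_VARIANT_SLUG` layout from the test names and that the slug itself contains no underscores:

```python
def extract_variant_name(dir_name: str) -> str:
    # Layout: run_<DATE>_<TIME>_<VARIANT...>_<SLUG>.
    # parts[0] == "run", parts[1:3] is the two-part timestamp, and
    # parts[-1] is the slug, so the variant is everything in between --
    # which may itself contain underscores.
    parts = dir_name.split("_")
    return "_".join(parts[3:-1])
```

For example, `extract_variant_name("run_20260410_151230_base_wide_mytrial")` recovers `"base_wide"`, where a naive fixed-index split would have truncated it to `"base"`.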
@claude please follow up review
Claude finished @thomas-schweich's task in 3m 34s. PR #58 Follow-up Review

All 7 issues from the original review have been addressed.
Bug: the sdpa/compile/AMP flags now reach `configure_gpu()`:

```python
gpu_cfg = configure_gpu(
    device,
    no_compile=config.no_compile,
    no_amp=(config.amp_dtype == "none"),
    sdpa_math=config.sdpa_math,
)
```

Bug: `_extract_variant_name` now joins `parts[3:-1]`, so variant names containing underscores are parsed correctly.

Misleading: `_find_best_checkpoint` is renamed to `_find_latest_checkpoint` with an accurate docstring.

Minor: `last_train_acc` is now populated for cotrain trials in monitor aggregation.

Minor: the DataLoader was missing `multiprocessing_context`; now fixed:

```python
multiprocessing_context="spawn" if config.num_workers > 0 else None,
```

Correctly conditions on `config.num_workers > 0`.

Nit: the misleading comment on `_hf_push_future.result()` has been corrected.

Nit: top-level `resume` on `CotrainConfig` is now rejected with a helpful error.
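Since pretrain/cotrain runs only write `step_*` directories and there is no `best/` symlink, the renamed `_find_latest_checkpoint` presumably just picks the highest step number. A sketch under that assumption (the real helper's signature is not shown in this PR):

```python
from __future__ import annotations

from pathlib import Path

def find_latest_checkpoint(run_dir: Path) -> Path | None:
    # Pretrain/cotrain only write step_<N> checkpoint directories, so
    # "latest" means the largest step number present; return None when
    # the run has not checkpointed yet.
    steps = [p for p in run_dir.glob("step_*") if p.is_dir()]
    if not steps:
        return None
    return max(steps, key=lambda p: int(p.name.split("_", 1)[1]))
```

Note that a lexicographic `max` would pick `step_900` over `step_2000`; parsing the step as an integer avoids that.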
One remaining observation

The directory name is assembled as `run_<timestamp>_<suffix>_<slug>`, and the timestamp itself contains an underscore:

```python
ts = datetime.now().strftime("%Y%m%d_%H%M%S")  # e.g. "20260410_151230"
parts = [run_prefix, ts]
if suffix:
    parts.append(suffix)
parts.append(self.slug)
dir_name = "_".join(parts)
```
Summary

All 7 review items are fixed correctly.
Summary
- Adds `run_type="cotrain"` to the lab MCP server, enabling multi-model pretraining runs (the `train_all.py` workflow) to be launched, monitored, killed, and resumed through `lab_launch`/`lab_status`/`lab_kill`/`lab_resume`
- Runs are described by `CotrainVariant` specs (arbitrary architecture overrides, not just small/base/large), with per-variant resume paths for seamless `lab_resume`
- Extracts `ModelSlot` + the training loop from `scripts/train_all.py` into `pawn/cotrain.py` so both the lab and CLI share one implementation; `train_all.py` becomes a thin argparse shim

Key changes
| File | Change |
| --- | --- |
| `pawn/run_config.py` | `CotrainVariant`, `CotrainConfig`, updated `RunConfig` union |
| `pawn/cotrain.py` | `ModelSlot` + `run_cotrain()` with resume & pause support |
| `scripts/train_all.py` | `CotrainConfig` → `run_cotrain()` |
| `scripts/train.py` | `cotrain` dispatch branch |
| `pawn/lab/runner.py` | `_validate_config` + `resume_trial` cotrain support |
| `pawn/lab/monitor.py` | Multi-file metrics discovery and per-variant aggregation |
| `pawn/lab/state.py` | `Trial.variants` field |
| `pawn/lab/server.py` | `lab_schema` + docstring updates |

Test plan
- `lab_launch({"run_type": "cotrain", "variants": [{"name": "a", "variant": "toy"}, {"name": "b", "variant": "toy"}], "total_steps": 50, "batch_size": 16, "local_checkpoints": true})` on a GPU pod
- `lab_resume(trial_id, total_steps=100)` after completion
- `scripts/train_all.py --local-checkpoints --total-steps 50`
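The monitor-side aggregation described above (multi-file metrics discovery with per-variant offset tracking) might be sketched roughly like this; the function name, JSONL format, and field names are all assumptions, not the actual `pawn/lab/monitor.py` API:

```python
import json
from pathlib import Path

def aggregate_cotrain_metrics(
    metrics_files: dict[str, Path],  # variant name -> metrics JSONL file
    offsets: dict[str, int],         # variant name -> bytes already consumed
) -> dict[str, dict[str, object]]:
    """Tail each variant's metrics file from its saved byte offset and return
    the newest record per variant. Offsets are updated in place, so each poll
    only reads lines appended since the previous poll."""
    latest: dict[str, dict[str, object]] = {}
    for name, path in metrics_files.items():
        with path.open("rb") as f:
            f.seek(offsets.get(name, 0))
            data = f.read()
            offsets[name] = f.tell()
        for line in data.decode().splitlines():
            if line.strip():
                latest[name] = json.loads(line)
    return latest
```

Trial-level status could then be derived from the per-variant records (e.g. the minimum step across variants, or the mean of each variant's latest loss).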