Refactor Dockerfile: factor out common layers and add dev entrypoint #53
thomas-schweich merged 3 commits into main
- Extract deps-common and dev-common stages shared by CUDA and ROCm
- Move caddy/uv COPYs last in common stages (external, don't invalidate our layers)
- Add deploy/entrypoint-dev.sh with recursive chown on /opt/pawn and /workspace
- Install Rust in dev-common and run uv sync in dev stages so the engine is workspace-registered (fixes uv run triggering a rebuild on first use)
- Entrypoint path simplified: COPY . . already includes it, CMD references deploy/entrypoint.sh directly instead of copying it to a separate location
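For illustration, the dev entrypoint described in this PR could look roughly like the sketch below. This is not the contents of deploy/entrypoint-dev.sh. The function names `fix_ownership` and `dev_entrypoint` are hypothetical. Skipping directories that do not exist and calling the main entrypoint by a relative path (which presumes the working directory is the repo root) are assumptions. Only the `pawn:pawn` owner, the two directories, and the delegation to deploy/entrypoint.sh come from the PR description.

```shell
#!/bin/sh
# Hypothetical sketch of a dev entrypoint; not the script merged in this PR.
set -eu

# Recursively chown each existing directory to the given owner.
# Skipping missing directories is an assumption (e.g. no volume attached).
fix_ownership() {
    owner="$1"
    shift
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            chown -R "$owner" "$dir"
        fi
    done
}

# Fix ownership, then hand off to the main entrypoint.
# The relative path assumes the working directory is the repo root.
dev_entrypoint() {
    fix_ownership pawn:pawn /opt/pawn /workspace
    exec deploy/entrypoint.sh "$@"
}
```

A real script would end with `dev_entrypoint "$@"`. Using `exec` replaces the shell with the main entrypoint, so signals sent to the container reach that process directly.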
Claude finished @thomas-schweich's task in 1m 47s
Code Review
The refactor achieves its goals cleanly. The Bug: Missing
…installation in dev-common layer
- Compound early stopping note now lists all four reset criteria (val_loss, late_legal, game_completion_rate, avg_plies_completed) matching the trainer change in this PR.
- New "Benchmark the Pod" startup step before launching trials, so the agent has ground-truth step times, compile speedup, and concurrency scaling for *this* pod when planning.
- Always use torch.compile by default; the warmup is cheap relative to the 1.5-2.2x speedup, even for short runs.
- VRAM caveat removed (skill is pod-focused; pod GPUs aren't VRAM-constrained).
- max_seq_len default updated to 512.
- Tools reference: add lab_resume, document tag filter on lab_results, health_warning event type, and graceful-checkpoint behavior on lab_kill.
- Drop the stale 15-30 min compile overhead figure; replace with the measured 10-30 s (NVIDIA) / 1-2 min (AMD) numbers.
- Note that uv run works in dev images post #53.

.dockerignore: un-ignore .claude/skills so the manage-pod skill ships with the dev image (the rest of .claude stays excluded).
Summary
- Extract `deps-common` and `dev-common` stages shared by CUDA and ROCm, eliminating ~80 lines of duplication
- Add `deploy/entrypoint-dev.sh` that runs `chown -R pawn:pawn` on `/opt/pawn` and `/workspace` before delegating to the main entrypoint (fixes volume mount ownership)
- Install Rust in `dev-common` and run `uv sync --frozen` (without `--no-install-workspace`) in dev stages so the chess engine is properly workspace-registered; this fixes `uv run` triggering a full engine rebuild on first use
- Simplify the entrypoint path: `COPY . .` already includes it, so CMD now references `deploy/entrypoint.sh` directly instead of copying it to a separate location

Test plan
- `uv run python -c "import chess_engine"` works without triggering a Rust build
- `/opt/pawn` and `/workspace` owned by pawn after boot with a volume attached
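The test plan could be scripted as a small smoke check run inside a booted dev container, along the lines of the sketch below. The helper names `owned_by` and `check_dev_image` are hypothetical and not part of this PR, and `stat -c '%U'` assumes GNU coreutils. The check confirms only that the import succeeds and that the top-level directories are owned by pawn. It does not detect whether a Rust build ran and does not walk the trees recursively.

```shell
#!/bin/sh
# Hypothetical smoke check for the test plan; helper names are illustrative.
set -eu

# Succeed when the path's owning user name matches the expected name.
# Relies on GNU stat (-c '%U'), which is an assumption about the image.
owned_by() {
    expected="$1"
    path="$2"
    actual="$(stat -c '%U' "$path")"
    [ "$actual" = "$expected" ]
}

# Run both test-plan items; any failure aborts because of `set -e`.
check_dev_image() {
    uv run python -c "import chess_engine"
    owned_by pawn /opt/pawn
    owned_by pawn /workspace
}
```

Calling `check_dev_image` after boot with a volume attached exercises both items. Watching its output for compiler activity is still a manual step.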