Refactor Dockerfile: factor out common layers and add dev entrypoint by thomas-schweich · Pull Request #53 · thomas-schweich/PAWN

thomas-schweich · 2026-04-09T04:54:43Z

Summary

Extract deps-common and dev-common stages shared by CUDA and ROCm, eliminating ~80 lines of duplication
Move caddy/uv COPYs last in common stages — they're pulled from external images and don't depend on our layers, so placing them late avoids cache invalidation on upstream releases
Add deploy/entrypoint-dev.sh that runs chown -R pawn:pawn on /opt/pawn and /workspace before delegating to the main entrypoint (fixes volume mount ownership)
Install Rust toolchain in dev-common and run uv sync --frozen (without --no-install-workspace) in dev stages so the chess engine is properly workspace-registered — fixes uv run triggering a full engine rebuild on first use
Simplify entrypoint path: COPY . . already includes it, so CMD now references deploy/entrypoint.sh directly instead of copying it to a separate location

Test plan

CI Docker builds pass for all four targets (runtime, runtime-rocm, dev, dev-rocm)
Dev image: uv run python -c "import chess_engine" works without triggering a Rust build
Dev image: /opt/pawn and /workspace owned by pawn after boot with a volume attached
Runtime images: entrypoint starts sshd + dashboard as before

- Extract deps-common and dev-common stages shared by CUDA and ROCm - Move caddy/uv COPYs last in common stages (external, don't invalidate our layers) - Add deploy/entrypoint-dev.sh with recursive chown on /opt/pawn and /workspace - Install Rust in dev-common and run uv sync in dev stages so the engine is workspace-registered (fixes uv run triggering a rebuild on first use) - Entrypoint path simplified: COPY . . already includes it, CMD references deploy/entrypoint.sh directly instead of copying it to a separate location

thomas-schweich · 2026-04-09T04:55:12Z

@claude

claude · 2026-04-09T04:55:30Z

Claude finished @thomas-schweich's task in 1m 47s —— View job

Code Review

Read changed files
Analyze Dockerfile refactor
Review entrypoint-dev.sh
Post review feedback

The refactor achieves its goals cleanly. The deps-common / dev-common split eliminates real duplication and the cache-ordering rationale for caddy/uv COPYs is sound. A few things worth flagging:

Bug: Missing `--no-install-recommends` in `dev-common` (line 154)

# dev-common — line 154 (current)
RUN apt-get update && apt-get install -y build-essential \
        openssh-server tini tmux ripgrep jq curl git \
    && rm -rf /var/lib/apt/lists/* \
    && mkdir -p /run/sshd

The original per-GPU dev stages used --no-install-recommends. dev-common dropped it. With build-essential and everything else recommending transitive packages, this likely inflates the image by tens to hundreds of MB. Fix this →

Observation: `chown -R pawn:pawn /opt/pawn` at runtime may be slow

entrypoint-dev.sh runs chown -R pawn:pawn /opt/pawn on every container start. Since /opt/pawn contains the entire venv (including a multi-GB PyTorch install with thousands of files), this scan runs every boot even though the venv files were already correctly owned at build time via --chown=pawn:pawn. The chown is only needed for files that arrived via a volume mount — and RunPod volumes typically mount to /workspace, not /opt/pawn.

If /workspace is the only mount point, the /opt/pawn chown can be dropped with no loss of correctness. If you want to keep it as a safety net, consider scoping it more narrowly (e.g., just chown pawn:pawn /opt/pawn/logs or the specific subdirectory that the volume might shadow).

Nit: entrypoint scripts are not executable in git

Both deploy/entrypoint.sh and deploy/entrypoint-dev.sh are mode 100644 in git. The Dockerfile chmod +x corrects this inside the image, but checking out the repo and running the scripts directly will require a manual chmod. Conventional practice is 100755 for shell scripts. Fix with git update-index --chmod=+x deploy/entrypoint.sh deploy/entrypoint-dev.sh.

Looks good

Caddy/uv COPYs moved last in deps-common and dev-common — good cache hygiene
--chmod=755 on the heredoc COPY (line 185) is a clean replacement for the old RUN chmod +x
Installing Rust in dev-common as USER pawn so .cargo/bin lands under /home/pawn/.cargo is correct; PATH in the dev stages picks it up
uv sync --extra cuXXX --frozen (without --no-install-workspace) in dev stages correctly triggers workspace registration / engine build, while --no-install-workspace in the deps stages is the right split — pre-built wheel goes in first, workspace link comes later
CUDA and ROCm dev stages are now structurally symmetric — easier to maintain

…installation in dev-common layer

- Compound early stopping note now lists all four reset criteria (val_loss, late_legal, game_completion_rate, avg_plies_completed) matching the trainer change in this PR. - New "Benchmark the Pod" startup step before launching trials, so the agent has ground-truth step times, compile speedup, and concurrency scaling for *this* pod when planning. - Always use torch.compile by default; the warmup is cheap relative to the 1.5-2.2x speedup, even for short runs. - VRAM caveat removed (skill is pod-focused; pod GPUs aren't VRAM-constrained). - max_seq_len default updated to 512. - Tools reference: add lab_resume, document tag filter on lab_results, health_warning event type, and graceful-checkpoint behavior on lab_kill. - Drop the stale 15-30 min compile overhead figure; replace with the measured 10-30 s (NVIDIA) / 1-2 min (AMD) numbers. - Note that uv run works in dev images post #53. .dockerignore: un-ignore .claude/skills so the manage-pod skill ships with the dev image (the rest of .claude stays excluded).

thomas-schweich added 2 commits April 8, 2026 22:09

Auto export HF_TOKEN in dev containers.

515298a

Refactor Dockerfile: use --no-install-recommends for build-essential …

9e1df99

…installation in dev-common layer

thomas-schweich merged commit 46a8c3e into main Apr 9, 2026
1 check passed

thomas-schweich deleted the refactor-dockerfile-common-layers branch April 9, 2026 05:19

thomas-schweich mentioned this pull request Apr 12, 2026

Fix dev image: HF_TOKEN crash and claude binary accessibility #59

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor Dockerfile: factor out common layers and add dev entrypoint#53

Refactor Dockerfile: factor out common layers and add dev entrypoint#53
thomas-schweich merged 3 commits intomainfrom
refactor-dockerfile-common-layers

thomas-schweich commented Apr 9, 2026

Uh oh!

thomas-schweich commented Apr 9, 2026

Uh oh!

claude bot commented Apr 9, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

thomas-schweich commented Apr 9, 2026

Summary

Test plan

Uh oh!

thomas-schweich commented Apr 9, 2026

Uh oh!

claude bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Bug: Missing --no-install-recommends in dev-common (line 154)

Observation: chown -R pawn:pawn /opt/pawn at runtime may be slow

Nit: entrypoint scripts are not executable in git

Looks good

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude bot commented Apr 9, 2026 •

edited

Loading

Bug: Missing `--no-install-recommends` in `dev-common` (line 154)

Observation: `chown -R pawn:pawn /opt/pawn` at runtime may be slow