Skip to content

Refactor Dockerfile: factor out common layers and add dev entrypoint#53

Merged
thomas-schweich merged 3 commits intomainfrom
refactor-dockerfile-common-layers
Apr 9, 2026
Merged

Refactor Dockerfile: factor out common layers and add dev entrypoint#53
thomas-schweich merged 3 commits intomainfrom
refactor-dockerfile-common-layers

Conversation

@thomas-schweich
Copy link
Copy Markdown
Owner

Summary

  • Extract deps-common and dev-common stages shared by CUDA and ROCm, eliminating ~80 lines of duplication
  • Move caddy/uv COPYs last in common stages — they're pulled from external images and don't depend on our layers, so placing them late avoids cache invalidation on upstream releases
  • Add deploy/entrypoint-dev.sh that runs chown -R pawn:pawn on /opt/pawn and /workspace before delegating to the main entrypoint (fixes volume mount ownership)
  • Install Rust toolchain in dev-common and run uv sync --frozen (without --no-install-workspace) in dev stages so the chess engine is properly workspace-registered — fixes uv run triggering a full engine rebuild on first use
  • Simplify entrypoint path: COPY . . already includes it, so CMD now references deploy/entrypoint.sh directly instead of copying it to a separate location

Test plan

  • CI Docker builds pass for all four targets (runtime, runtime-rocm, dev, dev-rocm)
  • Dev image: uv run python -c "import chess_engine" works without triggering a Rust build
  • Dev image: /opt/pawn and /workspace owned by pawn after boot with a volume attached
  • Runtime images: entrypoint starts sshd + dashboard as before

- Extract deps-common and dev-common stages shared by CUDA and ROCm
- Move caddy/uv COPYs last in common stages (external, don't invalidate our layers)
- Add deploy/entrypoint-dev.sh with recursive chown on /opt/pawn and /workspace
- Install Rust in dev-common and run uv sync in dev stages so the engine
  is workspace-registered (fixes uv run triggering a rebuild on first use)
- Entrypoint path simplified: COPY . . already includes it, CMD references
  deploy/entrypoint.sh directly instead of copying it to a separate location
@thomas-schweich
Copy link
Copy Markdown
Owner Author

@claude

@claude
Copy link
Copy Markdown

claude bot commented Apr 9, 2026

Claude finished @thomas-schweich's task in 1m 47s —— View job


Code Review

  • Read changed files
  • Analyze Dockerfile refactor
  • Review entrypoint-dev.sh
  • Post review feedback

The refactor achieves its goals cleanly. The deps-common / dev-common split eliminates real duplication and the cache-ordering rationale for caddy/uv COPYs is sound. A few things worth flagging:

Bug: Missing --no-install-recommends in dev-common (line 154)

# dev-common — line 154 (current)
RUN apt-get update && apt-get install -y build-essential \
        openssh-server tini tmux ripgrep jq curl git \
    && rm -rf /var/lib/apt/lists/* \
    && mkdir -p /run/sshd

The original per-GPU dev stages used --no-install-recommends. dev-common dropped it. With build-essential and everything else recommending transitive packages, this likely inflates the image by tens to hundreds of MB. Fix this →

Observation: chown -R pawn:pawn /opt/pawn at runtime may be slow

entrypoint-dev.sh runs chown -R pawn:pawn /opt/pawn on every container start. Since /opt/pawn contains the entire venv (including a multi-GB PyTorch install with thousands of files), this scan runs every boot even though the venv files were already correctly owned at build time via --chown=pawn:pawn. The chown is only needed for files that arrived via a volume mount — and RunPod volumes typically mount to /workspace, not /opt/pawn.

If /workspace is the only mount point, the /opt/pawn chown can be dropped with no loss of correctness. If you want to keep it as a safety net, consider scoping it more narrowly (e.g., just chown pawn:pawn /opt/pawn/logs or the specific subdirectory that the volume might shadow).

Nit: entrypoint scripts are not executable in git

Both deploy/entrypoint.sh and deploy/entrypoint-dev.sh are mode 100644 in git. The Dockerfile chmod +x corrects this inside the image, but checking out the repo and running the scripts directly will require a manual chmod. Conventional practice is 100755 for shell scripts. Fix with git update-index --chmod=+x deploy/entrypoint.sh deploy/entrypoint-dev.sh.

Looks good

  • Caddy/uv COPYs moved last in deps-common and dev-common — good cache hygiene
  • --chmod=755 on the heredoc COPY (line 185) is a clean replacement for the old RUN chmod +x
  • Installing Rust in dev-common as USER pawn so .cargo/bin lands under /home/pawn/.cargo is correct; PATH in the dev stages picks it up
  • uv sync --extra cuXXX --frozen (without --no-install-workspace) in dev stages correctly triggers workspace registration / engine build, while --no-install-workspace in the deps stages is the right split — pre-built wheel goes in first, workspace link comes later
  • CUDA and ROCm dev stages are now structurally symmetric — easier to maintain

@thomas-schweich thomas-schweich merged commit 46a8c3e into main Apr 9, 2026
1 check passed
@thomas-schweich thomas-schweich deleted the refactor-dockerfile-common-layers branch April 9, 2026 05:19
thomas-schweich added a commit that referenced this pull request Apr 10, 2026
- Compound early stopping note now lists all four reset criteria
  (val_loss, late_legal, game_completion_rate, avg_plies_completed)
  matching the trainer change in this PR.
- New "Benchmark the Pod" startup step before launching trials, so
  the agent has ground-truth step times, compile speedup, and
  concurrency scaling for *this* pod when planning.
- Always use torch.compile by default; the warmup is cheap relative
  to the 1.5-2.2x speedup, even for short runs.
- VRAM caveat removed (skill is pod-focused; pod GPUs aren't VRAM-constrained).
- max_seq_len default updated to 512.
- Tools reference: add lab_resume, document tag filter on lab_results,
  health_warning event type, and graceful-checkpoint behavior on lab_kill.
- Drop the stale 15-30 min compile overhead figure; replace with the
  measured 10-30 s (NVIDIA) / 1-2 min (AMD) numbers.
- Note that uv run works in dev images post #53.

.dockerignore: un-ignore .claude/skills so the manage-pod skill ships
with the dev image (the rest of .claude stays excluded).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant