feat: Docker containerization + parallel multi-agent execution #103
mtgibbs wants to merge 10 commits into snarktank:main
Add parallel mode that runs N containerized Claude Code agents simultaneously against the same PRD, with network sandboxing, resource limits, and git-based story claiming.

New directories:
- docker/: Dockerfile, container entrypoint, iptables firewall scripts
- parallel/: orchestrator, stop, status, parallel prompt, lib helpers

Upstream ralph.sh and all existing files are untouched.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch from env-var token passing to Docker volume-based auth:
- Mount ralph-claude-auth volume at /claude-auth:ro
- agent-loop.sh copies credentials to writable ~/.claude/
- Add check_auth_volume() to verify volume before launch
- Remove CLAUDE_CODE_OAUTH_TOKEN env var requirement

Add --project DIR flag to orchestrator, status, and stop scripts so ralph can target external project directories.

Bug fixes discovered during smoke test:
- Fix UID 1000 conflict in Dockerfile (node:20-slim uses 1000)
- Fix macOS seq counting down when count=0 (guard with -le 0)
- Fix PARALLEL_PROMPT path resolution for external projects

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
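The `seq` fix in this commit can be illustrated with a minimal sketch. BSD `seq` on macOS counts down when the last value is below the first (`seq 1 0` prints `1` then `0`), while GNU `seq` prints nothing, so a loop over `seq 1 "$count"` behaves differently per platform when `count=0`. The function name `agent_indices` is hypothetical and not taken from the PR.

```shell
# Hypothetical helper: print the agent indices 1..count, one per line.
# The -le 0 guard returns early so BSD seq never gets the chance to count down.
agent_indices() {
  count="$1"
  if [ "$count" -le 0 ]; then
    return 0
  fi
  seq 1 "$count"
}

# Usage: loop over the indices; with count=0 the loop body never runs.
for i in $(agent_indices 3); do
  echo "would launch agent-$i"
done
```

With the guard in place, the loop body is skipped on both macOS and Linux when no agents are requested.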
Allow projects to specify a custom Docker image via --image flag, enabling project-specific tooling (e.g., Deno, Python) without modifying the base ralph-agent image. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Redirect all git output in claim_story() to stderr so only the story ID goes to stdout (prevents garbage in CLAIMED_STORY)
- Wrap claim_story call in if-statement to prevent set -e from killing the script when claim returns non-zero
- Fix setup_workspace to reset to current branch on restart, not hard-coded origin/main (preserves feature branches)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
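A small sketch of the first two fixes, under the assumption that `claim_story` prints the story ID on success and returns non-zero when nothing could be claimed. The body below is a stand-in, not the PR's implementation: the real function runs git commands, which are replaced here by a single `echo` to stderr.

```shell
set -e

# Hypothetical stand-in for claim_story(): chatter goes to stderr, only the
# story ID goes to stdout, and a non-zero status means "nothing claimed".
claim_story() {
  echo "[git] pulling, committing claim, pushing..." >&2
  if [ -z "$1" ]; then
    return 1
  fi
  echo "$1"
}

# The call sits in an if-condition, so a failed claim does not trip set -e.
if CLAIMED_STORY=$(claim_story "US-001"); then
  echo "claimed: $CLAIMED_STORY"
else
  echo "no story available"
fi

# A failing claim: the script keeps running and the variable stays empty.
if MISSED_STORY=$(claim_story ""); then
  echo "claimed: $MISSED_STORY"
else
  echo "no story available"
fi
```

Because command substitution captures stdout only, anything the git commands print would otherwise end up inside `CLAIMED_STORY`; sending it to stderr keeps the captured value to the bare ID.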
Deno projects need access to jsr.io (Deno's package registry) for dependency resolution and type checking. Without this, agents can't run `deno task check` inside builder containers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…city

- Mount only stop_requested file instead of entire .ralph/ directory, preventing agents from reading plaintext auth tokens
- Switch stop signal from file-existence to file-content (-s not -f) since the file must exist for Docker bind-mount
- Remove hardcoded --platform linux/arm64 so builds work on any arch
- Replace hardcoded npm/jsr/deno firewall whitelist with --allow-domain flag, making the firewall language-agnostic
- Use treeless bare clone (--filter=blob:none) to avoid exposing old file content that may contain secrets
- Add SETENV to sudoers so RALPH_EXTRA_DOMAINS passes through sudo
- Document custom image contract and extension pattern in README

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
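The stop-signal change can be sketched as follows. A single-file bind mount needs the file to exist on the host before the container starts, so existence (`-f`) is always true and cannot carry the signal; non-empty content (`-s`) can. The function name `stop_requested` and the written text are illustrative assumptions, not taken from the PR.

```shell
# Hypothetical check: the stop file always exists (it is the bind-mount
# target), so test for non-empty content with -s rather than existence with -f.
stop_requested() {
  [ -s "$1" ]
}

stop_file=$(mktemp)   # exists but empty, like the pre-created mount target

if stop_requested "$stop_file"; then
  echo "stopping"
else
  echo "continuing"
fi

# The host side requests a stop by writing anything into the file.
echo "stop" > "$stop_file"

if stop_requested "$stop_file"; then
  echo "stopping"
else
  echo "continuing"
fi
```

Truncating the file again would clear the request without having to recreate the mount.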
- Add push.autoSetupRemote to git config so first push on a new branch automatically sets up tracking
- Skip git pull --rebase when remote branch doesn't exist yet (new branch from prd.json branchName)
- Use file:// prefix for bare clone so --filter=blob:none takes effect (git ignores filters on local path clones)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Projects can now be "ralph-ready" by adding a Dockerfile.ralph to their root. When detected, ralph automatically builds a project-specific image (tagged ralph-agent-<project>:latest) without needing --image. Resolution order: --image flag > Dockerfile.ralph > default base image. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
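The resolution order described above can be sketched as a small function. The name `resolve_image`, the default tag `ralph-agent:latest`, and the lowercasing of the project name (Docker repository names must be lowercase) are assumptions for illustration; only the precedence and the `ralph-agent-<project>:latest` pattern come from the commit message.

```shell
# Hypothetical resolver: --image flag > Dockerfile.ralph > default base image.
# $1 = value of --image (may be empty), $2 = project directory
resolve_image() {
  image_flag="$1"
  project_dir="$2"
  if [ -n "$image_flag" ]; then
    echo "$image_flag"
  elif [ -f "$project_dir/Dockerfile.ralph" ]; then
    # Assumption: derive the tag from the directory name, lowercased.
    name=$(basename "$project_dir" | tr '[:upper:]' '[:lower:]')
    echo "ralph-agent-$name:latest"
  else
    # Assumption: the default base image is tagged ralph-agent:latest.
    echo "ralph-agent:latest"
  fi
}

# Usage: an explicit flag wins regardless of what the project contains.
resolve_image "my-org/custom-agent:1.2" "$PWD"
```

The sketch only picks a name; building the project-specific image when `Dockerfile.ralph` is found would be a separate step.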
Greptile Overview
Greptile Summary
This PR adds Docker containerization and parallel multi-agent execution to Ralph, enabling multiple Claude Code agents to work simultaneously on PRD stories in isolated sandboxes. The implementation uses git atomic push for story claiming coordination, iptables firewall for network restrictions, and per-agent progress files to avoid merge conflicts.
Key additions:
Issues found:
Architecture highlights:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant H as Host (ralph-parallel.sh)
participant D as Docker
participant A1 as Agent Container 1
participant A2 as Agent Container 2
participant BR as Bare Repo (.ralph/repo.git)
participant Claude as Claude API
H->>D: Build ralph-agent image
H->>D: Create networks (builder/researcher)
H->>D: Launch agent containers
D->>A1: Start agent-loop.sh
D->>A2: Start agent-loop.sh
A1->>A1: Init firewall (iptables)
A2->>A2: Init firewall (iptables)
A1->>A1: Copy Claude credentials
A2->>A2: Copy Claude credentials
A1->>BR: git clone
A2->>BR: git clone
loop Agent Work Loop
A1->>BR: git pull --rebase
A1->>A1: Find unclaimed story in prd.json
A1->>A1: Set claimed_by field
A1->>BR: git push (atomic claim)
alt Push succeeds
A1->>Claude: Run claude with story
Claude-->>A1: Implementation
A1->>A1: Run tests
A1->>A1: Update prd.json (passes: true)
A1->>BR: git push (with retries)
A1->>A1: Append to progress-agent-1.txt
else Push fails (concurrent claim)
A1->>BR: git reset --hard HEAD~1
A1->>BR: git pull --rebase
A1->>A1: Retry with different story
end
A2->>BR: git pull --rebase
A2->>A2: Claim different story
A2->>BR: git push (atomic claim)
A2->>Claude: Run claude with story
Claude-->>A2: Implementation
A2->>BR: git push results
H->>BR: Check if all stories complete
H->>H: Recover stale claims (>30min)
H->>D: Health check containers
end
H->>BR: All stories have passes: true
H->>D: Stop all containers
H->>BR: Sync bare repo to project dir
Last reviewed commit: 2c82c08
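The claim path in the diagram relies on the bare repo rejecting a non-fast-forward push. The following is a rough, self-contained sketch of that race using throwaway repositories. It simplifies heavily: claims are written to a plain `claims.txt` rather than a `claimed_by` field in `prd.json`, and the helper names `new_clone` and `try_claim` are hypothetical.

```shell
work=$(mktemp -d)
git init --quiet --bare "$work/repo.git"

# Hypothetical helper: clone the bare repo and set a throwaway identity.
new_clone() {
  git clone --quiet "$work/repo.git" "$work/$1" 2>/dev/null
  git -C "$work/$1" config user.name "$1"
  git -C "$work/$1" config user.email "$1@example.invalid"
}

# Hypothetical helper: commit a claim locally, then try to publish it.
# The push only succeeds if nobody else has pushed since this clone last synced.
try_claim() {
  echo "$2 claimed_by $1" > "$work/$1/claims.txt"
  git -C "$work/$1" add claims.txt
  git -C "$work/$1" commit --quiet -m "claim $2 for $1"
  if git -C "$work/$1" push --quiet origin HEAD 2>/dev/null; then
    echo "claimed"
  else
    # Lost the race: drop the local claim commit, as in the diagram.
    git -C "$work/$1" reset --quiet --hard HEAD~1
    echo "lost"
  fi
}

# Seed the bare repo with an unclaimed story.
new_clone seed
echo "US-001 unclaimed" > "$work/seed/claims.txt"
git -C "$work/seed" add claims.txt
git -C "$work/seed" commit --quiet -m "seed"
git -C "$work/seed" push --quiet origin HEAD

# Both agents clone the same state, then race for the same story.
new_clone agent-1
new_clone agent-2
first=$(try_claim agent-1 US-001)
second=$(try_claim agent-2 US-001)
echo "agent-1: $first, agent-2: $second"
```

After losing, an agent would pull with rebase and look for a different unclaimed story, which the sketch leaves out.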
parallel/ralph-parallel.sh
Outdated
# Parse claimed_at timestamp (macOS date -j, fallback to GNU date -d)
local claimed_epoch
claimed_epoch=$(date -j -f "%Y-%m-%dT%H:%M:%SZ" "$claimed_at" +%s 2>/dev/null \
    || date -d "$claimed_at" +%s 2>/dev/null \
    || echo "0")
date -j is macOS-specific BSD date syntax. Will fail on Linux containers (the actual target platform for this code).
Suggested change
- # Parse claimed_at timestamp (macOS date -j, fallback to GNU date -d)
- local claimed_epoch
- claimed_epoch=$(date -j -f "%Y-%m-%dT%H:%M:%SZ" "$claimed_at" +%s 2>/dev/null \
-     || date -d "$claimed_at" +%s 2>/dev/null \
-     || echo "0")
+ # Parse claimed_at timestamp (use GNU date in Linux containers)
+ local claimed_epoch
+ claimed_epoch=$(date -d "$claimed_at" +%s 2>/dev/null || echo "0")
Thanks! That was an oversight on my part! I'll do better next time!
# --- Step 4: Set git identity ---
setup_git_identity() {
    git config user.name "$AGENT_ID"
    git config user.email "${AGENT_ID}@ralph-agent.local"
    git config pull.rebase true
    git config push.autoSetupRemote true
}
Setting pull.rebase and push.autoSetupRemote globally for the agent user affects all repos. If an agent ever works with multiple repos or submodules, this could cause unexpected behavior.
# Resolve and allow each whitelisted domain
for domain in "${ALLOWED_DOMAINS[@]}"; do
    # Resolve all IPs for the domain
    ips=$(dig +short "$domain" 2>/dev/null | grep -E '^[0-9]+\.' || true)
    for ip in $ips; do
        iptables -A OUTPUT -p tcp -d "$ip" --dport 443 -j ACCEPT
        echo "[firewall] Allowed: $domain -> $ip:443"
    done

    # Also resolve CNAME targets (CDNs etc)
    cnames=$(dig +short "$domain" 2>/dev/null | grep -v -E '^[0-9]+\.' || true)
    for cname in $cnames; do
        cname_ips=$(dig +short "$cname" 2>/dev/null | grep -E '^[0-9]+\.' || true)
        for ip in $cname_ips; do
            iptables -A OUTPUT -p tcp -d "$ip" --dport 443 -j ACCEPT
            echo "[firewall] Allowed: $domain (via $cname) -> $ip:443"
        done
    done
done
DNS resolution is static at firewall init time. If IPs change after containers start (CDN rotation, DNS updates), agents lose access until container restart.
if [ ! -d "$BARE_REPO" ]; then
    log_info "Creating bare repo for agent coordination..."
    mkdir -p "$PROJECT_DIR/.ralph"
    git clone --bare --filter=blob:none "file://$PROJECT_DIR" "$BARE_REPO"
Using --filter=blob:none for bare repo may cause issues if agents need to access file content from history. Consider implications for projects with large binary assets.
claude --dangerously-skip-permissions \
    --print \
    --model "$CLAUDE_MODEL" \
    -p "$PROMPT" \
    &> "$LOGFILE" || {
    echo "[$AGENT_ID] Claude exited with error (code: $?). Check log: $LOGFILE"
}
Using --dangerously-skip-permissions with AI agents operating autonomously is risky. While sandboxing provides some protection, agents can still execute arbitrary commands within the container.
I used this for the POC, but I think there's probably some middle ground using --permission-mode dontAsk (Claude Docs - Permissions). If this containerization is something we want to explore further, I'll go define some conventions around what permissions each agent can have. @snarktank
- Swap date parsing order to try GNU date -d first, macOS date -j as fallback (orchestrator runs on the host which could be either OS)
- Remove dead token_refresh docs and unused check_token_refresh_file() function: auth model is now volume-based, not token-file-based

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
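A sketch of the swapped parsing order together with the 30-minute stale-claim check from the sequence diagram. The function names `to_epoch` and `is_stale` are hypothetical, the added `-u` on the BSD branch is an assumption so the literal `Z` is read as UTC, and the current time is passed in as an argument purely to keep the example deterministic.

```shell
# Hypothetical helper: ISO-8601 UTC timestamp -> epoch seconds.
# Tries GNU date -d first, then BSD/macOS date -j, then falls back to 0.
to_epoch() {
  date -u -d "$1" +%s 2>/dev/null \
    || date -u -j -f "%Y-%m-%dT%H:%M:%SZ" "$1" +%s 2>/dev/null \
    || echo "0"
}

# Hypothetical helper: is a claim older than 30 minutes (1800 seconds)?
# $1 = claimed_at timestamp, $2 = current time in epoch seconds
# Note: an unparseable timestamp becomes 0 and is therefore treated as stale.
is_stale() {
  claimed=$(to_epoch "$1")
  [ $(( $2 - claimed )) -gt 1800 ]
}

# Usage: check a claim against the current time.
if is_stale "2024-01-01T00:00:00Z" "$(date +%s)"; then
  echo "claim is stale, releasing story"
fi
```

Trying GNU syntax first keeps the Linux path free of error-handling detours while still working when the orchestrator runs on a macOS host.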
Thanks for the review! Addressed the actionable items and wanted to document our reasoning on the rest:

Fixed

Token refresh documented but not wired up: Correct. This was leftover from an earlier token-file auth model. We've since moved to volume-based auth (

Acknowledged (acceptable as-is)

Git config set globally for agent user: Each container runs exactly one user working on exactly one repo. There are no submodules or secondary repos in this pattern. The container is ephemeral and torn down after use, so global git config has no side effects.

DNS resolution is static at firewall init: True. If CDN IPs rotate mid-session, the agent would lose access until the container restarts. In practice, agent iterations are short-lived (minutes) and the orchestrator auto-restarts crashed containers, which re-resolves DNS. We considered re-resolving periodically but it adds complexity for a very unlikely failure mode.
Previously both agent-loop.sh AND Claude claimed stories. The script
would claim US-001, then Claude would read CLAUDE-parallel.md's claim
protocol and grab US-002 and US-003 before doing any work.
Now:
- agent-loop.sh injects {{CLAIMED_STORY}} into the prompt via sed
- CLAUDE-parallel.md tells Claude which story is pre-assigned
- Claude is explicitly told not to claim additional stories
- Claiming is solely the script's responsibility
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
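The placeholder substitution described in this commit might look roughly like the sketch below. The function name `render_prompt` and the template wording are made up for illustration; only the `{{CLAIMED_STORY}}` placeholder and the use of `sed` come from the commit message. The `|` delimiter is an assumption that story IDs never contain `|` or `&`.

```shell
# Hypothetical helper: fill the {{CLAIMED_STORY}} placeholder in a prompt template.
# $1 = template file, $2 = story ID already claimed by the script
render_prompt() {
  sed "s|{{CLAIMED_STORY}}|$2|g" "$1"
}

# Usage with a throwaway template standing in for CLAUDE-parallel.md.
template=$(mktemp)
echo "Your pre-assigned story is {{CLAIMED_STORY}}. Do not claim additional stories." > "$template"
PROMPT=$(render_prompt "$template" "US-001")
echo "$PROMPT"
```

Keeping the claim in the script and passing only the result into the prompt means there is a single place where stories can be claimed.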
Hey! I really like this paradigm and have been playing with it a bit! I wanted to add containerization to it for a bit of safety and try to define a `Dockerfile.ralph` convention to help projects onboard. This way we can sandbox our AI assistants to avoid them writing or deleting on our file system with little oversight while they work independently. Worst case is that an agent nukes itself in the process of doing work. Thanks for taking a look! Below are the Claude-assisted changes and some information on how it works:

What's Added

Container sandbox (`docker/`): 4 new files
- `Dockerfile`: Base image `node:20-slim` + Claude Code, non-root `agent` user (UID 1001), iptables for firewall
- `agent-loop.sh`: Container entrypoint that initializes the firewall, copies auth, clones from the bare repo, claims stories via git atomic push, runs Claude in a loop, and pushes results
- `init-firewall-builder.sh`: iptables whitelist of the Claude API plus user-specified domains via `--allow-domain`. Everything else is denied.
- `init-firewall-researcher.sh`: Full internet access for research-role agents

Parallel orchestrator (`parallel/`): 10 new files
- `ralph-parallel.sh`: Host-side orchestrator that builds the image (auto-detecting `Dockerfile.ralph`), creates Docker networks, launches N containers, monitors health, recovers stale story claims, detects PRD completion, and shuts down
- `stop.sh` / `status.sh`: Graceful shutdown and live dashboard
- `CLAUDE-parallel.md`: Parallel-aware prompt guiding agents through the claim/implement/push cycle
- `lib/`: Auth (env var > file > 1Password), Docker helpers, network setup, logging

Existing file changes: 3 files touched
- `.gitignore`: Added `.ralph/`, `agent_logs/`, per-agent progress files
- `AGENTS.md` / `README.md`: Documented parallel mode, CLI options, quick start

Key Design Decisions
- Each agent writes to its own `progress-agent-N.txt` instead of all appending to one `progress.txt`, avoiding merge conflicts.
- `Dockerfile.ralph` convention: Projects declare their runtime needs by adding a `Dockerfile.ralph` that extends the base image. Ralph auto-detects and builds it. Resolution: `--image` flag > `Dockerfile.ralph` > default base.
- `--allow-domain`: No hardcoded package registries. Users whitelist what their project needs (`registry.npmjs.org`, `pypi.org`, etc.). Only `api.anthropic.com` and `statsig.anthropic.com` are always allowed.
- Credentials live in a Docker volume (`ralph-claude-auth`), populated once via `claude login`. Agents copy credentials at startup, so no host token files are mounted into containers.

Usage
🤖 Generated with Claude Code