feat: Docker containerization + parallel multi-agent execution #103

Open
mtgibbs wants to merge 10 commits into snarktank:main from mtgibbs:main

Conversation

@mtgibbs mtgibbs commented Feb 13, 2026

Hey! I really like this paradigm and have been playing with it a bit! I wanted to add containerization to it for a bit of safety and try to define a Dockerfile.ralph convention to help projects onboard. This way we can sandbox our AI assistants so they can't write to or delete from our file system with little oversight while they work independently. The worst case is that an agent nukes itself in the process of doing work. Thanks for taking a look! Below are the Claude-assisted changes and some information on how it works:


What's Added

Container sandbox (docker/) — 4 new files

  • Dockerfile — Base image: node:20-slim + Claude Code, non-root agent user (UID 1001), iptables for firewall
  • agent-loop.sh — Container entrypoint: initializes firewall, copies auth, clones from bare repo, claims stories via git atomic push, runs Claude in a loop, pushes results
  • init-firewall-builder.sh — iptables whitelist: Claude API + user-specified domains via --allow-domain. Everything else is denied.
  • init-firewall-researcher.sh — Full internet access for research-role agents

Parallel orchestrator (parallel/) — 10 new files

  • ralph-parallel.sh — Host-side orchestrator: builds image (auto-detecting Dockerfile.ralph), creates Docker networks, launches N containers, monitors health, recovers stale story claims, detects PRD completion and shuts down
  • stop.sh / status.sh — Graceful shutdown and live dashboard
  • CLAUDE-parallel.md — Parallel-aware prompt guiding agents through the claim/implement/push cycle
  • lib/ — Auth (env var > file > 1Password), Docker helpers, network setup, logging

Existing file changes — 3 files touched

  • .gitignore — Added .ralph/, agent_logs/, per-agent progress files
  • AGENTS.md / README.md — Documented parallel mode, CLI options, quick start

Key Design Decisions

  • Git as the coordination layer — A shared bare repo + atomic push for story claiming. No external database, no lock server. If two agents race to claim the same story, one push wins and the other retries with a different story.
  • Per-agent progress files — Each agent writes progress-agent-N.txt instead of all appending to one progress.txt, avoiding merge conflicts.
  • Dockerfile.ralph convention — Projects declare their runtime needs by adding a Dockerfile.ralph that extends the base image. Ralph auto-detects and builds it. Resolution: --image flag > Dockerfile.ralph > default base.
  • Configurable firewall via --allow-domain — No hardcoded package registries. Users whitelist what their project needs (registry.npmjs.org, pypi.org, etc.). Only api.anthropic.com and statsig.anthropic.com are always-allowed.
  • Volume-based auth — Claude credentials live in a Docker volume (ralph-claude-auth), populated once via claude login. Agents copy credentials at startup — no host token files mounted into containers.
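To see the "one push wins" behavior in isolation, here is a minimal sketch, assuming a throwaway bare repo in a temp directory. The `claims.txt` file, the `claim` helper, and the agent names are hypothetical stand-ins for the real prd.json edit in agent-loop.sh. Both clones start from the same commit, so the second push should be rejected as a non-fast-forward.

```shell
#!/usr/bin/env bash
set -euo pipefail

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Run git with a throwaway identity so commits work on any machine.
g() { git -c user.name=demo -c user.email=demo@example.invalid "$@"; }

# Shared bare repo, standing in for .ralph/repo.git.
git init --quiet --bare "$work/repo.git"
git -C "$work/repo.git" symbolic-ref HEAD refs/heads/main

# Seed the bare repo with one unclaimed story (hypothetical file format).
git init --quiet "$work/seed"
git -C "$work/seed" symbolic-ref HEAD refs/heads/main
echo "US-001 unclaimed" > "$work/seed/claims.txt"
g -C "$work/seed" add claims.txt
g -C "$work/seed" commit --quiet -m "seed"
g -C "$work/seed" push --quiet "$work/repo.git" main

# Both agents clone the same starting state.
git clone --quiet "$work/repo.git" "$work/agent-1"
git clone --quiet "$work/repo.git" "$work/agent-2"

# claim <clone-dir> <agent-id>: succeeds only if the push is accepted.
claim() {
  local dir=$1 agent=$2
  echo "US-001 claimed_by $agent" > "$dir/claims.txt"
  g -C "$dir" commit --quiet -am "claim US-001 for $agent"
  g -C "$dir" push --quiet origin main 2>/dev/null
}

if claim "$work/agent-1" agent-1; then first=won; else first=lost; fi
if claim "$work/agent-2" agent-2; then second=won; else second=lost; fi
echo "agent-1: $first, agent-2: $second"
```

The loser would then reset its local claim commit, pull, and try a different story, which is the retry path described in the list above.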

Usage

# One-time auth setup
docker run -it --entrypoint bash \
  -v ralph-claude-auth:/home/agent/.claude \
  ralph-agent:latest
# Inside: claude login && exit

# Run 3 agents against a project
./parallel/ralph-parallel.sh \
  --project /path/to/my-project \
  --allow-domain registry.npmjs.org \
  --agents 3

# Monitor / stop
./parallel/status.sh --project /path/to/my-project
./parallel/stop.sh --project /path/to/my-project

🤖 Generated with Claude Code

mtgibbs and others added 8 commits February 12, 2026 23:41
Add parallel mode that runs N containerized Claude Code agents
simultaneously against the same PRD, with network sandboxing,
resource limits, and git-based story claiming.

New directories:
- docker/ — Dockerfile, container entrypoint, iptables firewall scripts
- parallel/ — orchestrator, stop, status, parallel prompt, lib helpers

Upstream ralph.sh and all existing files are untouched.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch from env-var token passing to Docker volume-based auth:
- Mount ralph-claude-auth volume at /claude-auth:ro
- agent-loop.sh copies credentials to writable ~/.claude/
- Add check_auth_volume() to verify volume before launch
- Remove CLAUDE_CODE_OAUTH_TOKEN env var requirement

Add --project DIR flag to orchestrator, status, and stop scripts
so ralph can target external project directories.

Bug fixes discovered during smoke test:
- Fix UID 1000 conflict in Dockerfile (node:20-slim uses 1000)
- Fix macOS seq counting down when count=0 (guard with -le 0)
- Fix PARALLEL_PROMPT path resolution for external projects

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allow projects to specify a custom Docker image via --image flag,
enabling project-specific tooling (e.g., Deno, Python) without
modifying the base ralph-agent image.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Redirect all git output in claim_story() to stderr so only the
  story ID goes to stdout (prevents garbage in CLAIMED_STORY)
- Wrap claim_story call in if-statement to prevent set -e from
  killing the script when claim returns non-zero
- Fix setup_workspace to reset to current branch on restart,
  not hard-coded origin/main (preserves feature branches)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
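The stdout/stderr split and the `if` wrapper from this commit can be illustrated with a small sketch. The `claim_story` body below is a hypothetical stand-in (the real one talks to git); only the calling pattern is the point.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for claim_story(): diagnostics go to stderr,
# only the story ID is written to stdout.
claim_story() {
  local candidate=${1:-}
  echo "[claim] trying to claim a story..." >&2
  if [ -z "$candidate" ]; then
    echo "[claim] nothing left to claim" >&2
    return 1
  fi
  echo "$candidate"
}

# Calling inside `if` keeps `set -e` from aborting on a non-zero return,
# and the command substitution captures stdout only.
if CLAIMED_STORY=$(claim_story "US-001"); then
  echo "claimed: $CLAIMED_STORY"
fi

if ! CLAIMED_STORY=$(claim_story ""); then
  echo "no story available, script keeps running"
fi
```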
Deno projects need access to jsr.io (Deno's package registry) for
dependency resolution and type checking. Without this, agents can't
run `deno task check` inside builder containers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…city

- Mount only stop_requested file instead of entire .ralph/ directory,
  preventing agents from reading plaintext auth tokens
- Switch stop signal from file-existence to file-content (-s not -f)
  since the file must exist for Docker bind-mount
- Remove hardcoded --platform linux/arm64 so builds work on any arch
- Replace hardcoded npm/jsr/deno firewall whitelist with --allow-domain
  flag, making the firewall language-agnostic
- Use blobless bare clone (--filter=blob:none) to avoid exposing old
  file content that may contain secrets
- Add SETENV to sudoers so RALPH_EXTRA_DOMAINS passes through sudo
- Document custom image contract and extension pattern in README

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
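A minimal sketch of the `-s` versus `-f` distinction, assuming the stop file is pre-created empty so Docker can bind-mount it. The `stop_requested` helper name and the file contents are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# True only when the file exists AND is non-empty (-s), so an empty
# placeholder created for the bind mount does not count as a stop signal.
stop_requested() { [ -s "$1" ]; }

stop_file=$(mktemp)   # exists but empty, like the pre-created mount target
if stop_requested "$stop_file"; then before=yes; else before=no; fi

echo "stop" > "$stop_file"   # the host signals a stop by writing content
if stop_requested "$stop_file"; then after=yes; else after=no; fi

echo "before: $before, after: $after"
rm -f "$stop_file"
```

With `-f` the first check would already report a stop, since the file has to exist before the container starts.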
- Add push.autoSetupRemote to git config so first push on a new branch
  automatically sets up tracking
- Skip git pull --rebase when remote branch doesn't exist yet (new
  branch from prd.json branchName)
- Use file:// prefix for bare clone so --filter=blob:none takes effect
  (git ignores filters on local path clones)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
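One way to express the "skip the pull when the remote branch doesn't exist yet" check is with `git ls-remote --exit-code`, which exits with status 2 when no ref matches. This is a sketch under that assumption, not necessarily how agent-loop.sh does it; `remote_branch_exists` and `sync_branch` are hypothetical names.

```shell
#!/usr/bin/env bash
set -euo pipefail

# remote_branch_exists <remote-or-url> <branch>
# Succeeds only if the remote has refs/heads/<branch>.
remote_branch_exists() {
  git ls-remote --exit-code --heads "$1" "refs/heads/$2" >/dev/null 2>&1
}

# sync_branch <branch>: rebase only when there is something to rebase onto.
# Must be run from inside a clone that has an "origin" remote.
sync_branch() {
  local branch=$1
  if remote_branch_exists origin "$branch"; then
    git pull --rebase origin "$branch"
  else
    echo "origin/$branch not found, skipping pull" >&2
  fi
}
```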
Projects can now be "ralph-ready" by adding a Dockerfile.ralph to their
root. When detected, ralph automatically builds a project-specific image
(tagged ralph-agent-<project>:latest) without needing --image.

Resolution order: --image flag > Dockerfile.ralph > default base image.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
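The resolution order can be sketched as a small function. `resolve_image` is a hypothetical name, and deriving `<project>` from the directory basename is an assumption; the PR only states the tag pattern ralph-agent-<project>:latest and the default ralph-agent:latest image.

```shell
#!/usr/bin/env bash
set -euo pipefail

# resolve_image <project-dir> [image-flag]
# Precedence: --image flag > Dockerfile.ralph > default base image.
resolve_image() {
  local project_dir=$1 image_flag=${2:-}
  if [ -n "$image_flag" ]; then
    echo "$image_flag"
  elif [ -f "$project_dir/Dockerfile.ralph" ]; then
    # Assumption: the project name is the directory's basename.
    echo "ralph-agent-$(basename "$project_dir"):latest"
  else
    echo "ralph-agent:latest"
  fi
}
```

For example, `resolve_image /path/to/my-project` would print `ralph-agent-my-project:latest` if that directory contains a Dockerfile.ralph, and `ralph-agent:latest` otherwise.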

greptile-apps bot commented Feb 13, 2026

Greptile Overview

Greptile Summary

This PR adds Docker containerization and parallel multi-agent execution to Ralph, enabling multiple Claude Code agents to work simultaneously on PRD stories in isolated sandboxes. The implementation uses git atomic push for story claiming coordination, iptables firewall for network restrictions, and per-agent progress files to avoid merge conflicts.

Key additions:

  • Docker image based on node:20-slim with Claude Code, non-root user, and firewall capabilities
  • Agent loop script that clones from bare repo, claims stories atomically, runs Claude, and pushes results
  • Orchestrator that launches N containers, monitors health, recovers stale claims, and detects completion
  • Network firewall restricting builder agents to Claude API + whitelisted domains (configurable via --allow-domain)
  • Dockerfile.ralph convention for project-specific runtime requirements
  • Comprehensive documentation with examples, auth setup, and debugging guide

Issues found:

  • Date parsing incompatibility: ralph-parallel.sh:357-361 uses macOS-specific date -j syntax that will fail on Linux (the actual container platform)
  • Incomplete token refresh feature: documented in README but check_token_refresh_file() never called from orchestrator
  • Static DNS resolution in firewall may cause connectivity loss if IPs change
  • --dangerously-skip-permissions flag used with autonomous agents (security trade-off documented in PR description)

Architecture highlights:

  • Git as coordination layer eliminates need for external lock server or database
  • Bare repo (.ralph/repo.git) enables reliable multi-agent pushing without conflicts on checked-out branches
  • Resource limits (--memory, --cpus, --pids-limit) prevent runaway containers
  • Graceful shutdown with 120s timeout before force-kill
  • Automatic stale claim recovery (30min threshold)

Confidence Score: 4/5

  • Safe to merge with one critical fix needed for Linux compatibility
  • The implementation is well-architected with solid error handling, security sandboxing, and comprehensive documentation. The git-based coordination layer is elegant and the Docker containerization achieves the stated security goals. However, the date parsing bug in ralph-parallel.sh will cause runtime failures on Linux (the primary target platform), and the token refresh feature is incomplete. These issues are fixable but prevent a score of 5.
  • Pay close attention to parallel/ralph-parallel.sh (date parsing bug on line 357) and verify token refresh implementation if that feature is needed

Important Files Changed

Filename Overview
docker/Dockerfile Clean base image setup with proper non-root user, minimal dependencies, and secure sudo configuration
docker/agent-loop.sh Comprehensive agent loop with git-based story claiming, firewall init, and robust error handling; minor git config concerns
docker/init-firewall-builder.sh Solid iptables firewall with DNS whitelisting; DNS resolution happens before lockdown
parallel/ralph-parallel.sh Feature-rich orchestrator with agent management, stale claim recovery, auto image building; date compatibility issue exists
parallel/lib/docker-helpers.sh Clean container lifecycle management with proper volume mounts and resource limits
parallel/CLAUDE-parallel.md Excellent parallel-aware prompt with clear claim protocol, conflict resolution, and push protocol

Sequence Diagram

sequenceDiagram
    participant H as Host (ralph-parallel.sh)
    participant D as Docker
    participant A1 as Agent Container 1
    participant A2 as Agent Container 2
    participant BR as Bare Repo (.ralph/repo.git)
    participant Claude as Claude API

    H->>D: Build ralph-agent image
    H->>D: Create networks (builder/researcher)
    H->>D: Launch agent containers
    D->>A1: Start agent-loop.sh
    D->>A2: Start agent-loop.sh
    
    A1->>A1: Init firewall (iptables)
    A2->>A2: Init firewall (iptables)
    A1->>A1: Copy Claude credentials
    A2->>A2: Copy Claude credentials
    
    A1->>BR: git clone
    A2->>BR: git clone
    
    loop Agent Work Loop
        A1->>BR: git pull --rebase
        A1->>A1: Find unclaimed story in prd.json
        A1->>A1: Set claimed_by field
        A1->>BR: git push (atomic claim)
        
        alt Push succeeds
            A1->>Claude: Run claude with story
            Claude-->>A1: Implementation
            A1->>A1: Run tests
            A1->>A1: Update prd.json (passes: true)
            A1->>BR: git push (with retries)
            A1->>A1: Append to progress-agent-1.txt
        else Push fails (concurrent claim)
            A1->>A1: git reset --hard HEAD~1
            A1->>BR: git pull --rebase
            A1->>A1: Retry with different story
        end
        
        A2->>BR: git pull --rebase
        A2->>A2: Claim different story
        A2->>BR: git push (atomic claim)
        A2->>Claude: Run claude with story
        Claude-->>A2: Implementation
        A2->>BR: git push results
        
        H->>BR: Check if all stories complete
        H->>H: Recover stale claims (>30min)
        H->>D: Health check containers
    end
    
    H->>BR: All stories have passes: true
    H->>D: Stop all containers
    H->>BR: Sync bare repo to project dir

Last reviewed commit: 2c82c08

@greptile-apps greptile-apps bot left a comment

16 files reviewed, 6 comments

Comment on lines 357 to 361
# Parse claimed_at timestamp (macOS date -j, fallback to GNU date -d)
local claimed_epoch
claimed_epoch=$(date -j -f "%Y-%m-%dT%H:%M:%SZ" "$claimed_at" +%s 2>/dev/null \
|| date -d "$claimed_at" +%s 2>/dev/null \
|| echo "0")

date -j is macOS-specific BSD date syntax. Will fail on Linux containers (the actual target platform for this code).

Suggested change
- # Parse claimed_at timestamp (macOS date -j, fallback to GNU date -d)
- local claimed_epoch
- claimed_epoch=$(date -j -f "%Y-%m-%dT%H:%M:%SZ" "$claimed_at" +%s 2>/dev/null \
-     || date -d "$claimed_at" +%s 2>/dev/null \
-     || echo "0")
+ # Parse claimed_at timestamp (use GNU date in Linux containers)
+ local claimed_epoch
+ claimed_epoch=$(date -d "$claimed_at" +%s 2>/dev/null || echo "0")

Author

Thanks! That was an oversight on my part! I'll do better next time!

Comment on lines +76 to +82
# --- Step 4: Set git identity ---
setup_git_identity() {
    git config user.name "$AGENT_ID"
    git config user.email "${AGENT_ID}@ralph-agent.local"
    git config pull.rebase true
    git config push.autoSetupRemote true
}

Setting pull.rebase and push.autoSetupRemote globally for the agent user affects all repos. If the agent ever works with multiple repos or submodules, this could cause unexpected behavior.

Author

🤔

Comment on lines +42 to +60
# Resolve and allow each whitelisted domain
for domain in "${ALLOWED_DOMAINS[@]}"; do
    # Resolve all IPs for the domain
    ips=$(dig +short "$domain" 2>/dev/null | grep -E '^[0-9]+\.' || true)
    for ip in $ips; do
        iptables -A OUTPUT -p tcp -d "$ip" --dport 443 -j ACCEPT
        echo "[firewall] Allowed: $domain -> $ip:443"
    done

    # Also resolve CNAME targets (CDNs etc)
    cnames=$(dig +short "$domain" 2>/dev/null | grep -v -E '^[0-9]+\.' || true)
    for cname in $cnames; do
        cname_ips=$(dig +short "$cname" 2>/dev/null | grep -E '^[0-9]+\.' || true)
        for ip in $cname_ips; do
            iptables -A OUTPUT -p tcp -d "$ip" --dport 443 -j ACCEPT
            echo "[firewall] Allowed: $domain (via $cname) -> $ip:443"
        done
    done
done

DNS resolution is static at firewall init time. If IPs change after containers start (CDN rotation, DNS updates), agents lose access until container restart.
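To make the filtering step in the snippet above concrete, here is a small sketch that partitions `dig +short`-style output into IPv4 addresses and CNAME targets using the same grep patterns. The sample answer is made up: `cdn.example.net.` is a hypothetical name and the addresses come from the 192.0.2.0/24 documentation range. Nothing here touches iptables or the network.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Filters read dig-style output on stdin. `|| true` keeps a no-match grep
# from tripping `set -e` / pipefail, as in the original script.
ipv4_only()   { grep -E '^[0-9]+\.' || true; }
cnames_only() { grep -v -E '^[0-9]+\.' || true; }

# Hypothetical answer for one domain: a CNAME followed by two A records.
sample=$'cdn.example.net.\n192.0.2.10\n192.0.2.11'

ips=$(printf '%s\n' "$sample" | ipv4_only)
cnames=$(printf '%s\n' "$sample" | cnames_only)

for ip in $ips; do
  echo "would allow $ip:443"
done
echo "cname targets: $cnames"
```

Whatever produces the IP list, the rules built from it are a snapshot: an address that appears after this step has no ACCEPT rule until the lookup is run again.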

if [ ! -d "$BARE_REPO" ]; then
    log_info "Creating bare repo for agent coordination..."
    mkdir -p "$PROJECT_DIR/.ralph"
    git clone --bare --filter=blob:none "file://$PROJECT_DIR" "$BARE_REPO"

Using --filter=blob:none for bare repo may cause issues if agents need to access file content from history. Consider implications for projects with large binary assets.

Comment on lines +311 to +317
claude --dangerously-skip-permissions \
    --print \
    --model "$CLAUDE_MODEL" \
    -p "$PROMPT" \
    &> "$LOGFILE" || {
    echo "[$AGENT_ID] Claude exited with error (code: $?). Check log: $LOGFILE"
}

Using --dangerously-skip-permissions with AI agents operating autonomously is risky. While sandboxing provides some protection, agents can still execute arbitrary commands within the container.

Author

I've used this for the POC, but I actually think there's probably some middle ground using --permission-mode dontAsk (Claude Docs - Permissions). If this containerization is something we want to explore further, I'll go define some conventions around what permissions each agent can have. @snarktank

- Swap date parsing order to try GNU date -d first, macOS date -j as
  fallback (orchestrator runs on the host which could be either OS)
- Remove dead token_refresh docs and unused check_token_refresh_file()
  function — auth model is now volume-based, not token-file-based

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author

mtgibbs commented Feb 13, 2026

Thanks for the review! Addressed the actionable items and wanted to document our reasoning on the rest:

Fixed

date -j is macOS-specific — Good catch. Swapped the order so GNU date -d is tried first with macOS date -j as fallback. The orchestrator runs on the host (could be either OS), so both paths are needed. Fixed in 991e2b9.
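A sketch of the GNU-first, BSD-fallback parse, combined with the 30-minute stale threshold mentioned in the review summary. `to_epoch`, `is_stale`, and `STALE_AFTER_SECONDS` are hypothetical names. The `-u` on the BSD branch is an addition here: as far as I know, BSD `date -j -f` treats the literal `Z` as plain text and reads the fields as local time unless `-u` is given.

```shell
#!/usr/bin/env bash
set -euo pipefail

STALE_AFTER_SECONDS=$((30 * 60))   # 30-minute threshold

# Convert a UTC timestamp such as 2026-02-13T04:41:00Z to epoch seconds.
# GNU date is tried first, BSD/macOS date second; prints 0 if both fail.
to_epoch() {
  date -u -d "$1" +%s 2>/dev/null \
    || date -u -j -f "%Y-%m-%dT%H:%M:%SZ" "$1" +%s 2>/dev/null \
    || echo "0"
}

# is_stale <claimed_at> [now_epoch]
# Succeeds when the claim is older than the threshold.
is_stale() {
  local claimed_epoch now
  claimed_epoch=$(to_epoch "$1")
  now=${2:-$(date +%s)}
  [ $((now - claimed_epoch)) -gt "$STALE_AFTER_SECONDS" ]
}
```

Note that with the `echo "0"` fallback an unparsable timestamp is treated as epoch 0, so such a claim would always look stale.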

Token refresh documented but not wired up — Correct. This was leftover from an earlier token-file auth model. We've since moved to volume-based auth (claude login into a Docker volume), so the token_refresh mechanism doesn't apply. Removed the dead function and docs in 991e2b9.

Acknowledged (acceptable as-is)

Git config set globally for agent user — Each container runs exactly one user working on exactly one repo. There are no submodules or secondary repos in this pattern. The container is ephemeral and torn down after use, so global git config has no side effects.

DNS resolution is static at firewall init — True. If CDN IPs rotate mid-session, the agent would lose access until the container restarts. In practice, agent iterations are short-lived (minutes) and the orchestrator auto-restarts crashed containers, which re-resolves DNS. We considered re-resolving periodically but it adds complexity for a very unlikely failure mode.

--filter=blob:none implications — This is intentional. Agents work on HEAD and don't need old file blobs — the blobless clone gives them commit history and branch refs for orientation without exposing file content from old commits (which could contain leaked secrets). Projects with large binary assets that agents need to reference could use --image with a custom clone strategy, but for PRD-based feature work this is the right tradeoff.

--dangerously-skip-permissions is risky — Agreed, and this is exactly why the containerization exists. The flag is required for headless operation (no human to approve tool calls). The container is the mitigation: no host filesystem access, restricted network, resource limits, non-root user. Worst case is an agent damages its own container, which is ephemeral and disposable.

Previously both agent-loop.sh AND Claude claimed stories. The script
would claim US-001, then Claude would read CLAUDE-parallel.md's claim
protocol and grab US-002 and US-003 before doing any work.

Now:
- agent-loop.sh injects {{CLAIMED_STORY}} into the prompt via sed
- CLAUDE-parallel.md tells Claude which story is pre-assigned
- Claude is explicitly told not to claim additional stories
- Claiming is solely the script's responsibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
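The placeholder injection can be sketched as follows. `render_prompt` and the two-line template are hypothetical; only the `{{CLAIMED_STORY}}` placeholder and the use of sed come from the commit message above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# render_prompt <template-file> <story-id>
# Replaces every {{CLAIMED_STORY}} with the story ID. Assumes the ID contains
# no "/" or "&", which would need escaping in the sed replacement.
render_prompt() {
  sed "s/{{CLAIMED_STORY}}/$2/g" "$1"
}

template=$(mktemp)
printf '%s\n' 'Your pre-assigned story is {{CLAIMED_STORY}}.' \
              'Do not claim additional stories.' > "$template"

PROMPT=$(render_prompt "$template" "US-001")
echo "$PROMPT"
rm -f "$template"
```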