RFC: Docker Swarm mode (--mode=swarm) with atomic Caddyfile distribution via Swarm configs (no Admin API) #773

@oneingan

Hey @lucaslorentz — thanks, happy to discuss design.

  1. Controller responsibilities (create vs update)
  • I’d strongly prefer the Swarm controller to only update an existing “caddy” Swarm service (name/ID passed via flag/env), not auto-create it.
  • That keeps it composable with user stacks (ports, placement/constraints, resources, volumes for cert storage, etc.). The controller’s job is: render desired config + distribute it + trigger a safe rollout.
  2. Worker-only + “atomic” config distribution (no Admin API exposure)
  • Render a full Caddyfile, publish it as an immutable Swarm object (Docker config; or secret if people want encrypted-at-rest), named by content hash (e.g. caddyfile-<sha>).
  • Update the Caddy service to swap the mounted config to the new object (remove old config, add new config at /etc/caddy/Caddyfile).
  • This is atomic at the task/container level (each task mounts one immutable object), avoids partial-write windows, and avoids exposing/broadcasting Caddy’s Admin API on unprivileged worker nodes or running Caddy on managers.
  3. Reliability: service update vs hot reload (trade-off + mitigations)
  • I agree hot reload is best for preserving long-lived connections. The trade-off here is convenience/security boundary vs connection continuity:
    • Hot reload usually implies reaching each instance’s Admin API (or exec’ing into tasks), which I’m trying to avoid in a “workers-only, no-admin-port” mode.
    • Service update is operationally simpler and keeps the Admin API closed, but replacing tasks can terminate long-lived connections (e.g. WebSockets that outlive the stop grace period).
  • Mitigations (practical Swarm knobs + guidance):
    • Conservative rollout: update-parallelism=1, small update-delay, sensible update-monitor.
    • Prefer update-order=start-first where feasible (replicated / no host-port conflicts) so capacity stays up while the new task becomes healthy.
    • If running global + host-mode published ports (common for edge proxies), start-first may not be possible due to port binding; then do stop-first but keep disruption bounded with serial updates + delays.
    • Enable auto-rollback on failure (update-failure-action=rollback) and configure rollback parameters.
    • Add healthchecks and set an adequate stop-grace-period so Swarm can verify readiness and Caddy can shut down cleanly.
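
As a rough illustration of the mitigations above, here is a minimal Python sketch that assembles the corresponding `docker service update` flags. The helper name `rollout_flags`, the service name `caddy`, and the default durations are assumptions made for this example, not settings proposed in the RFC; only the CLI flag names come from Docker itself.

```python
from typing import List


def rollout_flags(global_host_ports: bool,
                  delay: str = "10s",
                  monitor: str = "30s",
                  grace: str = "30s") -> List[str]:
    """Return `docker service update` flags for a conservative rollout.

    The durations are illustrative defaults (assumptions), not recommendations.
    """
    # start-first requires the new task to bind its ports while the old one
    # is still running, which fails for global services with host-mode ports.
    order = "stop-first" if global_host_ports else "start-first"
    return [
        "--update-parallelism", "1",            # one task at a time
        "--update-delay", delay,                # pause between task updates
        "--update-monitor", monitor,            # watch each new task for failure
        "--update-order", order,
        "--update-failure-action", "rollback",  # auto-rollback on failure
        "--rollback-parallelism", "1",          # roll back serially as well
        "--stop-grace-period", grace,           # time for Caddy to shut down cleanly
    ]


if __name__ == "__main__":
    # "caddy" is a placeholder for the service name/ID passed via flag/env.
    cmd = ["docker", "service", "update", *rollout_flags(False), "caddy"]
    print(" ".join(cmd))
```

Calling `rollout_flags(True)` models the global + host-mode case and falls back to `stop-first`, while keeping the serial update and the delay in place.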

If you’re open to it, I’d implement this incrementally:

  • Phase 1: “config distribution + safe service update” Swarm mode (with docs + recommended update/rollback settings and clear limitations).
  • Optional later: explore a per-node local reload agent (admin bound to 127.0.0.1 only) if we want hot-reload semantics without exposing Admin across the cluster.
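
To make the Phase 1 config-distribution step concrete, the sketch below derives a content-addressed config name and the two Docker CLI invocations that would swap it into the service. It is only an illustration under stated assumptions: SHA-256 truncated to 12 hex characters is one possible reading of `caddyfile-<sha>`, and `config_name`, `swap_commands`, and the service name `caddy` are hypothetical names, not part of the project.

```python
import hashlib
from typing import List, Optional

TARGET = "/etc/caddy/Caddyfile"  # mount path named in the proposal


def config_name(caddyfile: bytes, prefix: str = "caddyfile-") -> str:
    """Name the config by content hash (assumption: SHA-256, first 12 hex chars)."""
    digest = hashlib.sha256(caddyfile).hexdigest()
    return prefix + digest[:12]


def swap_commands(service: str, caddyfile: bytes,
                  current: Optional[str]) -> List[List[str]]:
    """Return the CLI commands that swap the mounted config, or [] if unchanged."""
    new = config_name(caddyfile)
    if new == current:
        return []  # same content hash: nothing to roll out
    # `docker config create NAME -` reads the rendered Caddyfile from stdin.
    commands = [["docker", "config", "create", new, "-"]]
    update = ["docker", "service", "update"]
    if current:
        update += ["--config-rm", current]  # detach the old immutable config
    update += ["--config-add", f"source={new},target={TARGET}", service]
    commands.append(update)
    return commands


if __name__ == "__main__":
    rendered = b"example.com {\n\trespond \"ok\"\n}\n"
    for cmd in swap_commands("caddy", rendered, current=None):
        print(" ".join(cmd))
```

Because the name is derived from the content, re-rendering an identical Caddyfile produces no commands at all, so an unchanged config never triggers a rollout.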

Originally posted by @oneingan in #766 (comment)
