RFC: Docker Swarm mode (--mode=swarm) with atomic Caddyfile distribution via Swarm configs (no Admin API) #773

Hey @lucaslorentz — thanks, happy to discuss design.

1) Controller responsibilities (create vs update)
- I’d strongly prefer the Swarm controller to only *update* an existing “caddy” Swarm service (name/ID passed via flag/env), not auto-create it.
- That keeps it composable with user stacks (ports, placement/constraints, resources, volumes for cert storage, etc.). The controller’s job is: render desired config + distribute it + trigger a safe rollout.
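
To make the update-only contract concrete, here is a minimal sketch in Python (used for brevity; the project itself is Go). `InMemorySwarm`, `reconcile`, and `ServiceNotFound` are hypothetical names for illustration, not existing caddy-docker-proxy or Docker SDK APIs. The only point is that a missing target service is an error, never a trigger to create one, and that an unchanged Caddyfile triggers no rollout.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class ServiceNotFound(Exception):
    """The user-managed Caddy service does not exist."""


@dataclass
class InMemorySwarm:
    """Hypothetical stand-in for the Docker API: service name -> mounted Caddyfile."""

    services: Dict[str, Optional[str]] = field(default_factory=dict)

    def service_exists(self, service: str) -> bool:
        return service in self.services

    def current_caddyfile(self, service: str) -> Optional[str]:
        return self.services[service]

    def swap_caddyfile(self, service: str, caddyfile: str) -> None:
        self.services[service] = caddyfile


def reconcile(swarm: InMemorySwarm, service: str, caddyfile: str) -> bool:
    """Push a rendered Caddyfile to an existing service.

    Returns True if a rollout was triggered, False if nothing changed.
    Never creates the service: that stays the user's responsibility.
    """
    if not swarm.service_exists(service):
        raise ServiceNotFound(
            f"service {service!r} not found; define it in your stack first"
        )
    if swarm.current_caddyfile(service) == caddyfile:
        return False  # desired state already deployed, no task restarts
    swarm.swap_caddyfile(service, caddyfile)
    return True
```

A real controller would replace `InMemorySwarm` with calls to the Swarm API, but the control flow (fail on a missing service, no-op on identical config, otherwise swap) would stay the same.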

2) Worker-only + “atomic” config distribution (no Admin API exposure)
- Render a full Caddyfile, publish it as an immutable Swarm object (a Docker *config*, or a *secret* if people want it encrypted at rest), named by content hash (e.g. `caddyfile-<sha>`).
- Update the Caddy service to swap the mounted config to the new object (remove old config, add new config at `/etc/caddy/Caddyfile`).
- This is atomic at the task/container level (each task mounts one immutable object), avoids partial-write windows, and avoids exposing/broadcasting Caddy’s Admin API on unprivileged worker nodes or running Caddy on managers.
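
The naming and swap steps above could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, the digest is SHA-256 truncated to 12 hex characters (my choice, since I believe Swarm object names have a length limit that a full digest plus prefix would exceed), and the controller is shown building `docker` CLI argument lists rather than calling the API directly, purely to keep the example self-contained.

```python
import hashlib
from typing import List, Optional

CADDYFILE_TARGET = "/etc/caddy/Caddyfile"


def config_name(caddyfile: str, digest_len: int = 12) -> str:
    """Derive a content-addressed config name, e.g. caddyfile-<sha>."""
    digest = hashlib.sha256(caddyfile.encode("utf-8")).hexdigest()
    return f"caddyfile-{digest[:digest_len]}"


def create_config_argv(name: str) -> List[str]:
    """argv for creating the config; "-" makes docker read the content from stdin."""
    return ["docker", "config", "create", name, "-"]


def swap_config_argv(service: str, old_name: Optional[str], new_name: str) -> List[str]:
    """argv for one service update that removes the old config and mounts the new one."""
    argv = ["docker", "service", "update"]
    if old_name is not None:
        argv += ["--config-rm", old_name]
    argv += ["--config-add", f"source={new_name},target={CADDYFILE_TARGET}"]
    argv.append(service)
    return argv
```

Because the name is derived from the content, re-rendering an identical Caddyfile yields the same name, so the controller can skip the update entirely. Removing and adding in a single `docker service update` means each new task starts with exactly one Caddyfile mounted.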

3) Reliability: service update vs hot reload (trade-off + mitigations)
- I agree hot reload is best for preserving long-lived connections. The trade-off here is convenience/security boundary vs connection continuity:
  - Hot reload usually implies reaching each instance’s Admin API (or exec’ing into tasks), which I’m trying to avoid in a “workers-only, no-admin-port” mode.
  - Service update is operationally simpler and keeps the Admin API closed, but replacing tasks can terminate long-lived connections (e.g. WebSockets that outlive the stop grace period).
- Mitigations (practical Swarm knobs + guidance):
  - Conservative rollout: `update-parallelism=1`, small `update-delay`, sensible `update-monitor`.
  - Prefer `update-order=start-first` where feasible (replicated / no host-port conflicts) so capacity stays up while the new task becomes healthy.
  - If running global + host-mode published ports (common for edge proxies), start-first may not be possible due to port binding; then do `stop-first` but keep disruption bounded with serial updates + delays.
  - Enable auto-rollback on failure (`update-failure-action=rollback`) and configure rollback parameters.
  - Add healthchecks and set an adequate `stop-grace-period` so Swarm can verify readiness and Caddy can shut down cleanly.
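
As a rough illustration of how these knobs combine, the sketch below picks `docker service update` flags depending on whether start-first is feasible. The function name and the duration values are placeholders I made up, not recommendations; healthcheck flags are left out and would live in the user's service definition.

```python
from typing import List


def rollout_flags(
    can_start_first: bool,
    delay: str = "10s",
    monitor: str = "30s",
    stop_grace: str = "30s",
) -> List[str]:
    """Build conservative update/rollback flags for `docker service update`.

    can_start_first should be False for global services with host-mode
    published ports, where the new task cannot bind while the old one runs.
    """
    order = "start-first" if can_start_first else "stop-first"
    return [
        "--update-parallelism", "1",          # one task at a time
        "--update-delay", delay,              # pause between tasks
        "--update-monitor", monitor,          # watch each new task for failure
        "--update-order", order,
        "--update-failure-action", "rollback",
        "--rollback-parallelism", "1",
        "--rollback-order", order,
        "--stop-grace-period", stop_grace,    # time for Caddy to drain and exit
    ]
```

These flags would be appended to the same `docker service update` call that swaps the config, so the rollout policy and the new Caddyfile are applied together.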

If you’re open to it, I’d implement this incrementally:
- Phase 1: “config distribution + safe service update” Swarm mode (with docs + recommended update/rollback settings and clear limitations).
- Optional later: explore a per-node local reload agent (admin bound to 127.0.0.1 only) if we want hot-reload semantics without exposing Admin across the cluster.

_Originally posted by @oneingan in https://github.com/lucaslorentz/caddy-docker-proxy/issues/766#issuecomment-3940722409_
            
