Hey @lucaslorentz — thanks, happy to discuss design.
- Controller responsibilities (create vs update)
- I’d strongly prefer the Swarm controller to only update an existing “caddy” Swarm service (name/ID passed via flag/env), not auto-create it.
- That keeps it composable with user stacks (ports, placement/constraints, resources, volumes for cert storage, etc.). The controller’s job is: render desired config + distribute it + trigger a safe rollout.
- Worker-only + “atomic” config distribution (no Admin API exposure)
- Render a full Caddyfile, publish it as an immutable Swarm object (Docker config; or secret if people want encrypted-at-rest), named by content hash (e.g. `caddyfile-<sha>`).
- Update the Caddy service to swap the mounted config to the new object (remove old config, add new config at `/etc/caddy/Caddyfile`).
- This is atomic at the task/container level (each task mounts one immutable object), avoids partial-write windows, and avoids exposing/broadcasting Caddy’s Admin API on unprivileged worker nodes or running Caddy on managers.
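A rough sketch of what the controller would run (or do via the Docker API). The service name `caddy` and the `OLD_NAME` bookkeeping are assumptions for illustration; the flags themselves are standard `docker config` / `docker service update` options:

```shell
#!/bin/sh
set -eu

# OLD_NAME: name of the previously mounted config object, tracked by the
# controller between rollouts (assumed to be set in the environment).
: "${OLD_NAME:?previous config object name required}"

# Name the rendered Caddyfile by content hash so every object is immutable
# and a re-render of identical content yields the same name.
NEW_HASH=$(sha256sum Caddyfile | cut -c1-12)
NEW_NAME="caddyfile-${NEW_HASH}"

# Publish the new immutable config object.
docker config create "$NEW_NAME" Caddyfile

# Swap the mounted config on the service; Swarm replaces tasks according
# to the service's update policy, so each task sees exactly one object.
docker service update \
  --config-rm "$OLD_NAME" \
  --config-add source="$NEW_NAME",target=/etc/caddy/Caddyfile \
  caddy
```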
- Reliability: service update vs hot reload (trade-off + mitigations)
- I agree hot reload is best for preserving long-lived connections. The trade-off here is convenience/security boundary vs connection continuity:
- Hot reload usually implies reaching each instance’s Admin API (or exec’ing into tasks), which I’m trying to avoid in a “workers-only, no-admin-port” mode.
- Service update is operationally simpler and keeps the Admin API closed, but replacing tasks can terminate long-lived connections (websockets beyond the stop grace period).
- Mitigations (practical Swarm knobs + guidance):
- Conservative rollout: `update-parallelism=1`, small `update-delay`, sensible `update-monitor`.
- Prefer `update-order=start-first` where feasible (replicated / no host-port conflicts) so capacity stays up while the new task becomes healthy.
- If running global + host-mode published ports (common for edge proxies), start-first may not be possible due to port binding; then do `stop-first` but keep disruption bounded with serial updates + delays.
- Enable auto-rollback on failure (`update-failure-action=rollback`) and configure rollback parameters.
- Add healthchecks and set an adequate `stop-grace-period` so Swarm can verify readiness and Caddy can shut down cleanly.
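These knobs map directly onto `docker service update` flags. A one-time policy setup might look like this (service name `caddy`, health command, and all durations are illustrative; the flags are real CLI options):

```shell
# Conservative rollout + auto-rollback policy for the edge proxy service.
# Values shown are placeholders to tune per deployment.
docker service update \
  --update-parallelism 1 \
  --update-delay 10s \
  --update-monitor 30s \
  --update-order start-first \
  --update-failure-action rollback \
  --rollback-parallelism 1 \
  --rollback-delay 5s \
  --health-cmd "wget -q --spider http://localhost/ || exit 1" \
  --health-interval 10s \
  --stop-grace-period 60s \
  caddy
```

With this in place, the config-swap update above inherits the serial, monitored, rollback-on-failure behavior without the controller re-specifying it each time.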
If you’re open to it, I’d implement this incrementally:
- Phase 1: “config distribution + safe service update” Swarm mode (with docs + recommended update/rollback settings and clear limitations).
- Optional later: explore a per-node local reload agent (admin bound to 127.0.0.1 only) if we want hot-reload semantics without exposing Admin across the cluster.
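For that optional phase, one hypothetical shape of the per-node agent: if the Caddyfile sets the global option `admin 127.0.0.1:2019`, the Admin API is only reachable from inside the task, and an agent co-located on the node can reload via `docker exec` without anything being exposed cluster-wide (service name and filter label shown are illustrative):

```shell
# Find the local task of the `caddy` service on this node and trigger a
# graceful in-place reload; long-lived connections survive because no
# task is replaced. Requires admin bound to 127.0.0.1 inside the task.
TASK_ID=$(docker ps -q --filter label=com.docker.swarm.service.name=caddy | head -n1)
docker exec "$TASK_ID" caddy reload --config /etc/caddy/Caddyfile
```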
Originally posted by @oneingan in #766 (comment)