Description
Join me on a small adventure as I struggled and failed to define a health check for the Portainer agent in a swarm stack.
It's perhaps not well known that adding a healthcheck: object to the agent service will cause it to fail to start up if it's attached to a new network (i.e., when you deploy a new stack) or, more precisely, when there are no other 'healthy' agent containers on the network. The root cause is a race condition/dependency loop: Swarm service lookup relies on 'healthy' containers. If a container is not 'healthy', its IP will not be returned by a Swarm service lookup. And if a service has no 'healthy' containers at all, DNS doesn't respond with an empty list; it just responds with "no such host".
For reference, here is my healthcheck:
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "--no-check-certificate", "https://localhost:9001/ping"] # Using 127.0.0.1 makes no difference

This creates a really interesting (frustrating) dependency loop: agent can't start because DNS has no hosts -> DNS has no hosts because service has no healthy containers -> no container is healthy because the healthcheck fails -> healthcheck fails because agent did not start -> agent can't start ... (oops)
So hopefully this explains why:
Line 140 in f8c8db2:

    time.Sleep(3 * time.Second)
I'm not a Go person myself, so I would like a maintainer to use this information to change the startup logic. My own suggestion is that instead of terminating fatally, the agent should assume that it is the first node and start anyway. This also requires thinking about what happens when agents start in parallel (you don't want more than one assuming it is the only one in the cluster). To my knowledge, Docker does not provide a way to make replicas start sequentially rather than simultaneously. There's also no way to tell the difference between a DNS failure, a lookup of an invalid host, and a correct host lookup with a legitimate "no hosts" response. This may be a bigger lift than I'm understanding, so at least an acknowledgement that I understand the problem would be appreciated.
Parent issue:
portainer/portainer#13076
Depend on this issue:
#731
Related:
#161