Demystifying Swarm DNS

Join me on a small adventure as I struggled and failed to define a health check for the Portainer agent in a swarm stack.

It's perhaps not well known that adding a `healthcheck:` object to the agent service will cause it to fail to start IF it's attached to a new network (i.e., when you deploy a new stack) or, more precisely, when there are no other 'healthy' agent containers on the network. The root cause is a race condition/dependency loop: Swarm service lookup relies on 'healthy' containers. If a container is not 'healthy', its IP will not be returned by a Swarm service lookup. And if a service has no 'healthy' containers at all, DNS doesn't respond with an empty list; it responds with "no such host".

For reference, here is my healthcheck:

```yaml
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "--no-check-certificate", "https://localhost:9001/ping"] # Using 127.0.0.1 makes no difference
```

This creates a really interesting (frustrating) dependency loop: the agent can't start because DNS has no hosts -> DNS has no hosts because the service has no healthy containers -> no container is healthy because the healthcheck fails -> the healthcheck fails because the agent did not start -> the agent can't start ... (oops)

Hopefully this explains why this line exists: https://github.com/portainer/agent/blob/f8c8db21d864fe6ad83f2004b204c27d82abf659/cmd/agent/main.go#L140. Basically, the reason startup works at all without a health check is that the first lookup would _always_ fail for the first container to start. But because sleep is called first, and because a container with no health check is treated as healthy, Docker has time to advertise its IP in DNS, so all the other containers start without a problem. In other words, the sleep is just a workaround for the race condition that occurs when the agent performs a DNS lookup before even its own container has been marked healthy.

I'm not a Go person myself, so I would like a maintainer to use this information to change the startup logic. My own suggestion is that instead of terminating fatally, the agent should assume it is the first node and start anyway. This also requires thinking about what happens when agents start in parallel (we don't want more than one assuming it is the only one in the cluster). To my knowledge, Docker does not provide a way to make replicas start sequentially rather than simultaneously. There's also no way to tell the difference between a DNS failure, an invalid host being looked up, and a correct host lookup with a legitimate "no hosts" response. This may be a bigger lift than I realize, so at least an acknowledgement that I understand the problem would be appreciated.

Parent issue:
https://github.com/portainer/portainer/issues/13076

Depend on this issue:
https://github.com/portainer/agent/issues/731

Related:
https://github.com/portainer/agent/issues/161
