Demystifying swarm dns #738

@SheepReaper

Join me on a small adventure in which I struggled and failed to define a health check for the Portainer agent in a Swarm stack.

It's perhaps not so well known that adding a healthcheck: object to the agent service will cause it to fail to start IF it's attached to a new network (i.e., when you deploy a new stack) or, more precisely, when there are no other 'healthy' agent containers on the network. The root cause is a dependency loop: Swarm service lookup only returns 'healthy' containers. If a container is not 'healthy', its IP will not be available for a Swarm service lookup. And if there are no 'healthy' containers for a service at all, DNS doesn't respond with an empty list; it just responds with "no such host".

For reference, here is my healthcheck:

    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "--no-check-certificate", "https://localhost:9001/ping"] # Using 127.0.0.1 makes no difference

This creates a really interesting (frustrating) dependency loop: the agent can't start because DNS has no hosts -> DNS has no hosts because the service has no healthy containers -> no container is healthy because the healthcheck fails -> the healthcheck fails because the agent did not start -> the agent can't start ... (oops)

So hopefully this explains why:

    time.Sleep(3 * time.Second)

is here. Basically, the reason this works at all without a health check is the sleep: without it, the first lookup would always fail for the first container to start. With no health check defined, Docker treats the container as healthy as soon as it is running and advertises its IP in DNS while the agent sleeps, so the lookup that follows succeeds and all the other containers start without a problem. In this case, sleep is just a workaround for the race condition that occurs when the agent performs a DNS lookup before even its own container has been marked healthy.
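To make the race concrete, here is a minimal sketch of polling the lookup instead of sleeping for a fixed time. It is not the agent's actual startup code: `waitForPeers`, `fakeLookup`, `simulateStartup`, and the host name `tasks.agent` are all hypothetical names made up for illustration, and the fake resolver only imitates the "no such host" answer described above. Note that polling alone does not break the loop when a healthcheck is defined, since the container never becomes healthy while it waits.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// lookupFunc has the same shape as net.LookupIP, so a fake can stand in for it.
type lookupFunc func(host string) ([]net.IP, error)

// waitForPeers (hypothetical helper) retries lookup until it returns at least
// one address or ctx is cancelled, instead of sleeping once and hoping.
func waitForPeers(ctx context.Context, lookup lookupFunc, host string, interval time.Duration) ([]net.IP, error) {
	for {
		ips, err := lookup(host)
		if err == nil && len(ips) > 0 {
			return ips, nil
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("gave up resolving %s: %w", host, ctx.Err())
		case <-time.After(interval):
			// try again
		}
	}
}

// fakeLookup imitates Swarm DNS for the first container on a new network:
// "no such host" for the first `failures` calls, then a single address.
func fakeLookup(failures int) lookupFunc {
	calls := 0
	return func(host string) ([]net.IP, error) {
		calls++
		if calls <= failures {
			return nil, &net.DNSError{Err: "no such host", Name: host, IsNotFound: true}
		}
		return []net.IP{net.IPv4(10, 0, 0, 2)}, nil
	}
}

// simulateStartup runs waitForPeers against the fake resolver and reports
// how many addresses came back. "tasks.agent" is an assumed service name.
func simulateStartup(failures int, timeoutMillis int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMillis)*time.Millisecond)
	defer cancel()
	ips, err := waitForPeers(ctx, fakeLookup(failures), "tasks.agent", 10*time.Millisecond)
	return len(ips), err
}

func main() {
	n, err := simulateStartup(2, 1000)
	fmt.Println(n, err) // prints "1 <nil>": two failed lookups, then one address
}
```

In a real agent, `net.LookupIP` could be passed where `fakeLookup` is used here; the timeout and interval values are arbitrary.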

I'm not a Go person myself, so I would like a maintainer to use this information to change the startup logic. My own suggestion is that instead of terminating fatally, the agent should assume it is the first node and start anyway, but this also requires thinking about what happens when agents start in parallel (you don't want more than one assuming it is the only one in the cluster). To my knowledge, Docker does not provide a way to make replicas start sequentially rather than simultaneously. And there's also no way to tell the difference between a DNS failure, an invalid host being looked up, and a correct host lookup with a legitimate "no hosts" response. This may be a bigger lift than I realize, so at least an acknowledgement that I've understood the problem correctly would be appreciated.
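As a rough illustration of the "start anyway" suggestion, the sketch below classifies a lookup result and only falls back to starting alone on a "no such host" answer. Everything here (`classify`, `startAlone`, `fakeDNSError`, the outcome names, and the host `tasks.agent`) is hypothetical and not taken from the agent's code. Go's `net.DNSError` does expose `IsNotFound` and `IsTimeout`, so a resolver timeout can at least be told apart from "no such host", but a misspelled name and a service with zero healthy tasks still look identical, and this sketch does nothing about several agents starting in parallel.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

type lookupOutcome int

const (
	peersFound   lookupOutcome = iota // at least one address came back
	noSuchHost                        // not found: no healthy tasks yet OR a wrong name
	lookupFailed                      // timeout, resolver unreachable, anything else
)

// classify (hypothetical) maps the result of a net.LookupIP-style call
// onto one of the three outcomes above.
func classify(ips []net.IP, err error) lookupOutcome {
	if err == nil {
		if len(ips) > 0 {
			return peersFound
		}
		return noSuchHost
	}
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) && dnsErr.IsNotFound {
		return noSuchHost
	}
	return lookupFailed
}

// startAlone reports whether the agent should assume it is the first node
// instead of exiting. Only a "no such host" answer qualifies in this sketch.
func startAlone(o lookupOutcome) bool {
	return o == noSuchHost
}

// fakeDNSError builds a simulated resolver error for the demo below.
func fakeDNSError(notFound, timeout bool) error {
	return &net.DNSError{Err: "simulated", Name: "tasks.agent", IsNotFound: notFound, IsTimeout: timeout}
}

func main() {
	cases := []error{fakeDNSError(true, false), fakeDNSError(false, true)}
	for _, err := range cases {
		o := classify(nil, err)
		fmt.Printf("outcome=%d startAlone=%v\n", o, startAlone(o))
	}
}
```

An agent that starts alone this way would presumably still need to keep re-resolving the service name in the background to pick up peers that appear later.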

Parent issue:
portainer/portainer#13076

Depend on this issue:
#731

Related:
#161
