
Conversation

vyavdoshenko (Contributor):

Fixes zombie process accumulation (#5844) and missing snapshot loading detection (#5863) in the Docker healthcheck.

Key Changes:

  • Add tini as PID 1 to reap zombie processes
  • Use redis-cli + trap cleanup EXIT for reliable subprocess management
  • Check INFO PERSISTENCE loading state to prevent traffic during snapshot loading
  • Fix process detection to work with tini (/dragonfly vs 1/dragonfly)

Impact:

  • Eliminates zombie process buildup
  • Prevents 3+ minute downtime during Kubernetes rollouts

Fixes: #5844
Fixes: #5863
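
For orientation, here is a minimal sketch of the flow the Key Changes describe. It is illustrative only: the variable names (HEALTHCHECK_HOST, TMP, CLI_PID) and exact commands are assumptions, not the final contents of healthcheck.sh.

#!/bin/sh
# Illustrative sketch only - names and exact commands are assumptions.
HOST="${HEALTHCHECK_HOST:-localhost}"
PORT="${HEALTHCHECK_PORT:-6379}"
REDIS_CLI="redis-cli -h $HOST -p $PORT"
TMP="$(mktemp)"

# trap cleanup EXIT: remove the temp file and kill any still-running child,
# so the probe never leaves processes behind for PID 1 (tini) to reap as zombies.
cleanup() {
  rm -f "$TMP"
  [ -n "$CLI_PID" ] && kill "$CLI_PID" 2>/dev/null
  return 0
}
trap cleanup EXIT

# Step 1: liveness - the server must answer PING with PONG.
timeout 3 $REDIS_CLI PING >"$TMP" 2>/dev/null &
CLI_PID=$!
wait "$CLI_PID" || exit 1
grep -q PONG "$TMP" || exit 1

# Step 2: readiness - fail while a snapshot is still being loaded.
# The 'loading' field is reported in the PERSISTENCE section of INFO.
INFO_OUTPUT="$(timeout 3 $REDIS_CLI INFO PERSISTENCE 2>/dev/null)" || exit 1
echo "$INFO_OUTPUT" | grep -q '^loading:1' && exit 1

exit 0

With tini running as PID 1, any probe child that does outlive the script is reparented and reaped instead of lingering as a zombie.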

healthcheck.sh (diff context for the comment below):

fi

_healthcheck="nc -q1 $HOST $PORT"
# Use redis-cli instead of nc for better reliability
romange (Collaborator):

We used redis-cli in the past but had problems with that - please check the history and see why.
I prefer reverting to nc if this is not crucial for the fixes referenced in the PR.

vyavdoshenko (Contributor, Author):

Fixed.

healthcheck.sh (diff context for the comment below):

# Step 2: Check if server is in LOADING state
# During snapshot loading, the server responds to PING but is not ready for traffic
# Note: 'loading' field is in PERSISTENCE section
INFO_OUTPUT=$(timeout 3 $REDIS_CLI INFO PERSISTENCE 2>/dev/null)
romange (Collaborator):

I do not understand why we need two invocations here.
Why not save the response into a variable and check whether it's PONG or a "LOADING" error?
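
A minimal sketch of the single-invocation variant suggested here, reusing $REDIS_CLI from the snippet above; the exact text redis-cli prints for a -LOADING error is an assumption, hence the loose pattern match:

PING_REPLY="$(timeout 3 $REDIS_CLI PING 2>&1)"

case "$PING_REPLY" in
  *PONG*)    exit 0 ;;   # healthy and serving traffic
  *LOADING*) exit 1 ;;   # alive, but still loading the snapshot
  *)         exit 1 ;;   # no reply / unexpected reply: unhealthy
esac

This saves a round trip: the -LOADING error already tells us the server is up but not yet ready, so a separate INFO PERSISTENCE call becomes unnecessary.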

vyavdoshenko (Contributor, Author):

Fixed

romange (Collaborator) commented Sep 30, 2025:

So the current state is that a pod loading a snapshot returns healthy, and you are changing it to a non-healthy status?
I am confused now.

vyavdoshenko (Contributor, Author) commented Sep 30, 2025:

> So the current state is that a pod loading a snapshot returns healthy, and you are changing it to a non-healthy status? I am confused now.

Yes.

Current behavior (the problem):

  • Pod is loading snapshot into memory (takes 3+ minutes)
  • Healthcheck sends PING → receives PONG
  • Kubernetes thinks the pod is healthy and ready for traffic
  • Kubernetes kills the old pod and routes traffic to the new one
  • Result: 3+ minutes of downtime because the new pod is still loading data

After this fix:

  • Pod is loading snapshot into memory
  • Healthcheck sends PING → receives -LOADING Dragonfly is loading the dataset in memory
  • Kubernetes sees pod as unhealthy/not ready
  • Kubernetes keeps the old pod running until the new pod finishes loading
  • Result: zero downtime during rolling updates

Issue #5863 reported this problem: during helm upgrades, the readiness probe passes too early (on PING), causing 3+ minutes of downtime while the new pod loads the snapshot.
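
For anyone who wants to see this by hand, a quick manual check against a pod that is still restoring its snapshot could look like the following (assuming redis-cli is available inside the container; <pod-name> is a placeholder):

# Expected replies while the pod is still loading (per the description above):
kubectl exec <pod-name> -- redis-cli PING
#   LOADING Dragonfly is loading the dataset in memory
kubectl exec <pod-name> -- redis-cli INFO PERSISTENCE | grep '^loading:'
#   loading:1        (flips to loading:0 once the snapshot is restored)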

romange (Collaborator) commented Oct 1, 2025:

> [vyavdoshenko's full reply above, quoted verbatim]

I disagree that the loading phase should be considered "not healthy". K8s can decide to restart unhealthy pods, leading to an infinite loop of the pod loading and restarting.
I think the previous healthcheck logic is perfect, and it will stay this way until there is a specific use case that proves otherwise. Let's revert that part of the code.

romange (Collaborator) commented Oct 1, 2025:

But I am ready to be proven otherwise; I am just worried that it will ruin other use cases.

romange (Collaborator) commented Oct 1, 2025:

@vyavdoshenko Is it possible to demonstrate the faulty behavior using a local cluster and kubectl, or something similar?

vyavdoshenko (Contributor, Author):

@romange
I reproduced the issue on my local cluster: demo.tar.gz
I think we can keep the current behavior and add a parameter to healthcheck.sh for strict checking of the LOADING state.
Also, it might be worth creating a help page on how to write a custom health-check script using the added volume.

vyavdoshenko (Contributor, Author):

@romange
I added an optional flag, -l or --check-loading, for checking the LOADING state. It doesn't break the current behavior and can be used as an opt-in if needed (see the sketch below).
If you are against it, I will revert to the original behavior.
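
As a rough sketch, the opt-in flag could be wired in roughly like this (flag names from the comment above; the parsing and the reuse of $REDIS_CLI are assumptions about the script's internals):

# Default: keep the existing behavior (PING check only).
CHECK_LOADING=0
for arg in "$@"; do
  case "$arg" in
    -l|--check-loading) CHECK_LOADING=1 ;;
  esac
done

# ... existing PING check stays as-is ...

# Only in strict mode does the probe also fail while a snapshot is loading.
if [ "$CHECK_LOADING" = "1" ]; then
  timeout 3 $REDIS_CLI INFO PERSISTENCE 2>/dev/null | grep -q '^loading:1' && exit 1
fi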

romange (Collaborator) left a comment:

How is it possible to run healthchecks with this argument?

vyavdoshenko (Contributor, Author):

> How is it possible to run healthchecks with this argument?

I mean that it is possible to use it externally, passing the argument explicitly:

docker run -d --name df-test docker.dragonflydb.io/dragonflydb/dragonfly

docker exec df-test /usr/local/bin/healthcheck.sh && echo "Pass" || echo "Fail"

docker exec df-test /usr/local/bin/healthcheck.sh --check-loading && echo "Pass" || echo "Fail"
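
If the strict check is wanted from Docker itself rather than via docker exec, the image's default healthcheck can also be overridden at run time, for example (assuming the --check-loading flag ships as proposed):

docker run -d --name df-test \
  --health-cmd '/usr/local/bin/healthcheck.sh --check-loading' \
  --health-interval 5s \
  docker.dragonflydb.io/dragonflydb/dragonfly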

romange (Collaborator) left a comment:

I do not think this argument helps anyone, as it's not triggered by Docker.
I do not know if it's true, but based on Gemini, K8s does not rely on Docker healthchecks, as they do not distinguish between liveness and readiness:

Why Kubernetes Ignores the Docker Health Check
Kubernetes needs a single, unified mechanism to manage the lifecycle of a Pod and its containers. It has three distinct probe types that offer granular control over its self-healing logic, which the single Docker Health Check cannot provide:

Liveness Probe: Should I restart this container?

Readiness Probe: Should I send traffic to this Pod? (The Docker Health Check has no equivalent for this essential function).

Startup Probe: Is the container finished starting up yet?

Because the Docker Health Check conflates "alive" and "ready," Kubernetes replaces it entirely with its own superior probing system. You should always use Kubernetes Liveness and Readiness Probes for applications running on Kubernetes.

vyavdoshenko (Contributor, Author):

I just wanted to say that this parameter can be used externally.
It can also be used in k8s as a probe parameter:

readinessProbe:
  exec:
    command:
    - /usr/local/bin/healthcheck.sh
    - --check-loading

I can create an example configuration for a local cluster showing how to use it both with and without the parameter.
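
A minimal sketch of what that example could look like: keep the default PING-only check for liveness and enable the strict check only for readiness, so a loading pod is held out of Service endpoints without being restarted. The flag name is as proposed above; the probe timings are illustrative.

livenessProbe:           # "should I restart this container?" - plain PING check
  exec:
    command:
    - /usr/local/bin/healthcheck.sh
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:          # "should I send traffic to this Pod?" - also fail while loading
  exec:
    command:
    - /usr/local/bin/healthcheck.sh
    - --check-loading
  periodSeconds: 5

This also speaks to the restart-loop concern raised earlier: liveness keeps passing while the snapshot loads, and only readiness holds traffic back until loading finishes.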

Successfully merging this pull request may close these issues:

  • DragonflyDB's health check script using nc has created zombie processes
  • Pass in custom logic to Dragonfly health check