
Conversation

vyavdoshenko (Contributor):

Fixes zombie process accumulation (#5844) and missing snapshot loading detection (#5863) in the Docker healthcheck.

Key Changes:

  • Add tini as PID 1 to reap zombie processes
  • Use redis-cli + trap cleanup EXIT for reliable subprocess management
  • Check INFO PERSISTENCE loading state to prevent traffic during snapshot loading
  • Fix process detection to work with tini (/dragonfly vs 1/dragonfly)

Impact:

  • Eliminates zombie process buildup
  • Prevents 3+ minute downtime during Kubernetes rollouts

Fixes: #5844
Fixes: #5863
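
For orientation, here is a minimal sketch of the flow the Key Changes describe. It is illustrative only: the variable names (HEALTHCHECK_HOST, TMP, CLI_PID) and exact commands are assumptions, not the final contents of healthcheck.sh.

#!/bin/sh
# Illustrative sketch only - names and exact commands are assumptions.
HOST="${HEALTHCHECK_HOST:-localhost}"
PORT="${HEALTHCHECK_PORT:-6379}"
REDIS_CLI="redis-cli -h $HOST -p $PORT"
TMP="$(mktemp)"

# trap cleanup EXIT: remove the temp file and kill any still-running child,
# so the probe never leaves processes behind for PID 1 (tini) to reap as zombies.
cleanup() {
  rm -f "$TMP"
  [ -n "$CLI_PID" ] && kill "$CLI_PID" 2>/dev/null
  return 0
}
trap cleanup EXIT

# Step 1: liveness - the server must answer PING with PONG.
timeout 3 $REDIS_CLI PING >"$TMP" 2>/dev/null &
CLI_PID=$!
wait "$CLI_PID" || exit 1
grep -q PONG "$TMP" || exit 1

# Step 2: readiness - fail while a snapshot is still being loaded.
# The 'loading' field is reported in the PERSISTENCE section of INFO.
INFO_OUTPUT="$(timeout 3 $REDIS_CLI INFO PERSISTENCE 2>/dev/null)" || exit 1
echo "$INFO_OUTPUT" | grep -q '^loading:1' && exit 1

exit 0

With tini running as PID 1, any probe child that does outlive the script is reparented and reaped instead of lingering as a zombie.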

healthcheck.sh (diff context for the comment below):

fi

_healthcheck="nc -q1 $HOST $PORT"
# Use redis-cli instead of nc for better reliability
romange (Collaborator):

We used redis-cli in the past but had problems with that - please check the history and see why.
I prefer reverting to nc if this is not crucial for the fixes referenced in the PR.

vyavdoshenko (Contributor, Author):

Fixed.

healthcheck.sh (diff context for the comment below):

# Step 2: Check if server is in LOADING state
# During snapshot loading, the server responds to PING but is not ready for traffic
# Note: 'loading' field is in PERSISTENCE section
INFO_OUTPUT=$(timeout 3 $REDIS_CLI INFO PERSISTENCE 2>/dev/null)
romange (Collaborator):

I do not understand why we need two invocations here.
Why not save the response into a variable and check whether it's PONG or a "LOADING" error?
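
A minimal sketch of the single-invocation variant suggested here, reusing $REDIS_CLI from the snippet above; the exact text redis-cli prints for a -LOADING error is an assumption, hence the loose pattern match:

PING_REPLY="$(timeout 3 $REDIS_CLI PING 2>&1)"

case "$PING_REPLY" in
  *PONG*)    exit 0 ;;   # healthy and serving traffic
  *LOADING*) exit 1 ;;   # alive, but still loading the snapshot
  *)         exit 1 ;;   # no reply / unexpected reply: unhealthy
esac

This saves a round trip: the -LOADING error already tells us the server is up but not yet ready, so a separate INFO PERSISTENCE call becomes unnecessary.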

vyavdoshenko (Contributor, Author):

Fixed

romange (Collaborator) commented Sep 30, 2025:

So the current state is that a pod loading a snapshot returns healthy, and you are changing it to a non-healthy status?
I am confused now.

vyavdoshenko (Contributor, Author) commented Sep 30, 2025:

> So the current state is that a pod loading a snapshot returns healthy, and you are changing it to a non-healthy status? I am confused now.

Yes.

Current behavior (the problem):

  • Pod is loading snapshot into memory (takes 3+ minutes)
  • Healthcheck sends PING → receives PONG
  • Kubernetes thinks the pod is healthy and ready for traffic
  • Kubernetes kills the old pod and routes traffic to the new one
  • Result: 3+ minutes of downtime because the new pod is still loading data

After this fix:

  • Pod is loading snapshot into memory
  • Healthcheck sends PING → receives -LOADING Dragonfly is loading the dataset in memory
  • Kubernetes sees pod as unhealthy/not ready
  • Kubernetes keeps the old pod running until the new pod finishes loading
  • Result: zero downtime during rolling updates

Issue #5863 reported this problem: during helm upgrades, the readiness probe passes too early (on PING), causing 3+ minutes of downtime while the new pod loads the snapshot.
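
For anyone who wants to see this by hand, a quick manual check against a pod that is still restoring its snapshot could look like the following (assuming redis-cli is available inside the container; <pod-name> is a placeholder):

# Expected replies while the pod is still loading (per the description above):
kubectl exec <pod-name> -- redis-cli PING
#   LOADING Dragonfly is loading the dataset in memory
kubectl exec <pod-name> -- redis-cli INFO PERSISTENCE | grep '^loading:'
#   loading:1        (flips to loading:0 once the snapshot is restored)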

romange (Collaborator) commented Oct 1, 2025:

> [vyavdoshenko's full reply above, quoted verbatim]

I disagree that the loading phase should be considered "not healthy". K8s can decide to restart unhealthy pods, leading to an infinite loop of the pod loading and restarting.
I think the previous healthcheck logic is perfect, and it will stay this way until there is a specific use case that proves otherwise. Let's revert that part of the code.

romange (Collaborator) commented Oct 1, 2025:

But I am ready to be proven otherwise; I am just worried that it will ruin other use cases.

romange (Collaborator) commented Oct 1, 2025:

@vyavdoshenko Is it possible to demonstrate the faulty behavior using a local cluster and kubectl, or something similar?

vyavdoshenko (Contributor, Author):

@romange
I reproduced the issue on my local cluster: demo.tar.gz
I think we can keep the current behavior and add a parameter to healthcheck.sh for strict checking of the LOADING state.
Also, it might be worth creating a help page on how to write a custom health-check script using the added volume.

vyavdoshenko (Contributor, Author):

@romange
I added an optional flag, -l or --check-loading, for checking the LOADING state. It doesn't break the current behavior and can be used as an opt-in if needed (see the sketch below).
If you are against it, I will revert to the original behavior.
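
As a rough sketch, the opt-in flag could be wired in roughly like this (flag names from the comment above; the parsing and the reuse of $REDIS_CLI are assumptions about the script's internals):

# Default: keep the existing behavior (PING check only).
CHECK_LOADING=0
for arg in "$@"; do
  case "$arg" in
    -l|--check-loading) CHECK_LOADING=1 ;;
  esac
done

# ... existing PING check stays as-is ...

# Only in strict mode does the probe also fail while a snapshot is loading.
if [ "$CHECK_LOADING" = "1" ]; then
  timeout 3 $REDIS_CLI INFO PERSISTENCE 2>/dev/null | grep -q '^loading:1' && exit 1
fi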

romange (Collaborator) left a comment:

How is it possible to run healthchecks with this argument?

vyavdoshenko (Contributor, Author):

> How is it possible to run healthchecks with this argument?

I mean that it is possible to use it externally, passing the argument explicitly:

docker run -d --name df-test docker.dragonflydb.io/dragonflydb/dragonfly

docker exec df-test /usr/local/bin/healthcheck.sh && echo "Pass" || echo "Fail"

docker exec df-test /usr/local/bin/healthcheck.sh --check-loading && echo "Pass" || echo "Fail"
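
If the strict check is wanted from Docker itself rather than via docker exec, the image's default healthcheck can also be overridden at run time, for example (assuming the --check-loading flag ships as proposed):

docker run -d --name df-test \
  --health-cmd '/usr/local/bin/healthcheck.sh --check-loading' \
  --health-interval 5s \
  docker.dragonflydb.io/dragonflydb/dragonfly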

romange (Collaborator) left a comment:

I do not think this argument helps anyone, as it's not triggered by Docker.
I do not know if it's true, but based on Gemini, K8s does not rely on Docker healthchecks, as they do not distinguish between liveness and readiness:

Why Kubernetes Ignores the Docker Health Check
Kubernetes needs a single, unified mechanism to manage the lifecycle of a Pod and its containers. It has three distinct probe types that offer granular control over its self-healing logic, which the single Docker Health Check cannot provide:

Liveness Probe: Should I restart this container?

Readiness Probe: Should I send traffic to this Pod? (The Docker Health Check has no equivalent for this essential function).

Startup Probe: Is the container finished starting up yet?

Because the Docker Health Check conflates "alive" and "ready," Kubernetes replaces it entirely with its own superior probing system. You should always use Kubernetes Liveness and Readiness Probes for applications running on Kubernetes.

vyavdoshenko (Contributor, Author):

I just wanted to say that this parameter can be used externally.
It can also be used in k8s as a probe parameter:

readinessProbe:
  exec:
    command:
    - /usr/local/bin/healthcheck.sh
    - --check-loading

I can create an example configuration for a local cluster showing how to use it both with and without the parameter.
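
A minimal sketch of what that example could look like: keep the default PING-only check for liveness and enable the strict check only for readiness, so a loading pod is held out of Service endpoints without being restarted. The flag name is as proposed above; the probe timings are illustrative.

livenessProbe:           # "should I restart this container?" - plain PING check
  exec:
    command:
    - /usr/local/bin/healthcheck.sh
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:          # "should I send traffic to this Pod?" - also fail while loading
  exec:
    command:
    - /usr/local/bin/healthcheck.sh
    - --check-loading
  periodSeconds: 5

This also speaks to the restart-loop concern raised earlier: liveness keeps passing while the snapshot loads, and only readiness holds traffic back until loading finishes.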

Successfully merging this pull request may close these issues:

  • DragonflyDB's health check script using nc has created zombie processes
  • Pass in custom logic to Dragonfly health check