-
Notifications
You must be signed in to change notification settings - Fork 1.1k
fix(docker): prevent zombie processes and improve healthcheck #5878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
tools/docker/healthcheck.sh
Outdated
fi | ||
|
||
_healthcheck="nc -q1 $HOST $PORT" | ||
# Use redis-cli instead of nc for better reliability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we used redis-cli in the past but had problems with that - please check the history and see why.
I prefer reverting to nc
if this is not crucial for the fixes references in the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
tools/docker/healthcheck.sh
Outdated
# Step 2: Check if server is in LOADING state | ||
# During snapshot loading, the server responds to PING but is not ready for traffic | ||
# Note: 'loading' field is in PERSISTENCE section | ||
INFO_OUTPUT=$(timeout 3 $REDIS_CLI INFO PERSISTENCE 2>/dev/null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand why do we need two invocations here.
why not save the response into a variable and check if it's PONG or a "LOADING " error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
so the current state that the pod loading into snapshot returns healthy and you are changing it to non-healthy status? |
Yes. Current behavior (the problem):
After this fix:
The issue (#5863): |
I disagree that the loading phase should be considered as "not healthy". K8S can decide to restart unhealthy pods leading to an infinite loop of a pod loading / restarting. |
but I am ready to be proven otherwise, I am just worried that it will ruin other usecases |
@vyavdoshenko is it possible to demonstrate the faulty behavior using local kubectl or something? |
@romange |
@romange |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how it is possible to run healthchecks with this argument?
I mean that this is possible to use outside with an argument.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think this argument helps anyone as it's not trigerred by docker.
I do not know if it's true but based on gemini, K8s does not rely on docker healthchecks as they do not distinguish between liveness and readiness:
Why Kubernetes Ignores the Docker Health Check
Kubernetes needs a single, unified mechanism to manage the lifecycle of a Pod and its containers. It has three distinct probe types that offer granular control over its self-healing logic, which the single Docker Health Check cannot provide:
Liveness Probe: Should I restart this container?
Readiness Probe: Should I send traffic to this Pod? (The Docker Health Check has no equivalent for this essential function).
Startup Probe: Is the container finished starting up yet?
Because the Docker Health Check conflates "alive" and "ready," Kubernetes replaces it entirely with its own superior probing system. You should always use Kubernetes Liveness and Readiness Probes for applications running on Kubernetes.
I just wanted to say that this parameter can be used outside.
I can create an example configuration for a local cluster on how to use it with parameters and without. |
Fixes zombie process accumulation (#5844) and snapshot loading detection (#5863) in Docker healthcheck.
Key Changes:
tini
as PID 1 to reap zombie processesredis-cli
+trap cleanup EXIT
for reliable subprocess managementINFO PERSISTENCE
loading state to prevent traffic during snapshot loading/dragonfly
vs1/dragonfly
)Impact:
Fixes: #5844
Fixes: #5863