Commit 51edf8e

restarting kubelet does not change pod status.

1 parent 20c9001 commit 51edf8e

6 files changed: +420 -0 lines changed

Lines changed: 3 additions & 0 deletions

kep-number: 4781
deprecated:
approver: "@jpbetz"

Lines changed: 378 additions & 0 deletions

# KEP-4781: Restarting kubelet does not change pod status

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [Inconsistency with other Kubernetes components](#inconsistency-with-other-kubernetes-components)
    - [Delayed Health Check Updates](#delayed-health-check-updates)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
    - [deprecated](#deprecated)
    - [GA](#ga)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [ ] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

<!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

A kubelet restart, for whatever reason, usually has no impact on the pods already running on the node. Today, however, a restart causes the kubelet to set the `Started` and `Ready` statuses of every pod on that node to false, which can disrupt services that were actually functioning normally. This KEP proposes improving Pod readiness management in the kubelet so that Pod status is not unilaterally reset and is instead preserved across kubelet restarts.

## Motivation

Ensuring high availability and minimizing service disruptions are critical considerations for Kubernetes clusters.
When the kubelet restarts, it resets the `Started` and `Ready` states of all containers to false by default. Any successful probe results established before the restart are therefore lost. As a result, services may be inaccurately flagged as unavailable despite having been operational before the kubelet's restart. This reset can create an erroneous view of service health and negatively affect the cluster, potentially triggering unnecessary alerts or load-balancing changes.

Some users have reported that this issue has been causing them trouble. Here are some links to historical discussions related to this problem:

- https://github.com/kubernetes/kubernetes/issues/100277
- https://github.com/kubernetes/kubernetes/issues/100277#issuecomment-1179412974
- https://github.com/kubernetes/kubernetes/issues/102367

It's essential to implement strategies to ensure that the service states accurately reflect their operational status, even during kubelet interruptions.

### Goals

- Ensure consistency in container start and ready states across kubelet restarts.
- Minimize unnecessary service disruptions caused by temporary ready state changes.

### Non-Goals

- If the kubelet fails to renew its lease within the `nodeMonitorGracePeriod` because the restart takes too long, the Ready status of the containers in the pods on the node will be set to false. In this situation, we should not manually set the Ready status back to true. It should remain false until the probes execute again and restore it.
- Modifying the fundamental logic of how readiness probes work.

## Proposal

### User Stories (Optional)

<!--
Detail the things that people will be able to do if this KEP is implemented.
Include as much detail as possible so that people can understand the "how" of
the system. The goal here is to make this feel real for users without getting
bogged down.
-->

#### Story 1
As a user of Kubernetes, I want the container's Ready state to remain consistent across kubelet restarts so that my services do not experience unnecessary downtime.
Today, however, a kubelet restart causes a brief "Not Ready" storm in which every Pod on the node is reported as Not Ready, impacting the availability of my services.

### Risks and Mitigations

#### Inconsistency with other Kubernetes components
If other parts of Kubernetes (e.g., the API server, controllers) expect certain behavior regarding container readiness states, these changes might cause inconsistencies.

#### Delayed Health Check Updates
By preserving the old state without immediate health checks, there is a delay in recognizing containers that have become unhealthy during or after kubelet's downtime. Services relying on Pod readiness for service discovery might continue directing traffic to Pods with containers that are no longer healthy but are still reported as Ready.

## Design Details

We will add a feature gate, `ChangeContainerStatusOnKubeletRestart`, introduced directly at the Deprecated stage and disabled by default. While it is disabled, the kubelet will not change container statuses after a restart. Users can enable `ChangeContainerStatusOnKubeletRestart` to restore the old behavior, in which the kubelet resets container statuses after a restart.

Regarding this feature gate, the changes we will make in the kubelet codebase are as follows:

* We will retrieve the `Started` field from the container status of the Pod as recorded in the API server. After the kubelet restarts, during its first pass through `SyncPod`, we will propagate this value to the newly generated container status (a sketch follows this list).

* We will ensure that if the `Started` field in the container status is true, the container is considered started, since the `startupProbe` only runs during container startup and will not execute again once it has completed.

* If the kubelet is down longer than the `nodeMonitorGracePeriod` and the Pod's Ready condition has therefore been set to false, we will set the container's ready status to false. It will remain in this state until subsequent probes reset it to true.

* We will modify the logic in the `doProbe` function. When it detects a container that was already running before the kubelet restarted (on the first probe cycle after the restart), it will skip writing the initial Failure status, so the probe `result` retains its default `Success` value. If the container's state changed while the kubelet was down and a probe then returns an abnormal result, the status is updated to a non-Success value in the next probe cycle, and the subsequent `SyncPod` sets the container's Ready status to false.
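
A minimal sketch of the first two bullets, assuming a hypothetical helper name (`preserveStartedAcrossRestart`) and that the status last reported to the API server is available as `oldStatus`; it illustrates the intended behavior rather than the actual implementation:

```go
package kubelet

import (
	v1 "k8s.io/api/core/v1"
)

// preserveStartedAcrossRestart is a hypothetical helper: on the first SyncPod
// after a kubelet restart, it carries the Started field from the status last
// written to the API server (oldStatus) over to the freshly generated status
// (newStatus). Because the startupProbe only runs while a container is
// starting and never again once it has succeeded, a container that was last
// reported Started=true can safely be treated as started after the restart.
func preserveStartedAcrossRestart(oldStatus, newStatus *v1.PodStatus) {
	started := make(map[string]bool, len(oldStatus.ContainerStatuses))
	for _, cs := range oldStatus.ContainerStatuses {
		if cs.Started != nil && *cs.Started {
			started[cs.Name] = true
		}
	}
	for i := range newStatus.ContainerStatuses {
		if started[newStatus.ContainerStatuses[i].Name] {
			isStarted := true
			newStatus.ContainerStatuses[i].Started = &isStarted
		}
	}
}
```
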
**Before the Changes:**
If kubelet restarts, the pod status transition process is as follows:

1. Kubelet uses `SyncPod` to reconcile the pod state. During the first execution of `SyncPod`, the pod has not yet been added to the `probeManager`, so `SyncPod` assumes the pod has no probes configured (note: for a newly created pod, the first execution of `SyncPod` does not go through this step). It therefore sets the container's `Ready` status to true and reports it to the API server.

2. After updating the container status, `SyncPod` adds the pod to the `probeManager`. The pod then begins executing probes.

3. During the first execution of `doProbe` (which skips the `initialDelaySeconds` period, because the container's startup time already exceeds it), `doProbe` sets the result of every probe to its `initialValue`: `Failure` for the `readinessProbe` and `Unknown` for the `startupProbe`. Based on these results, it updates the `Started` and `Ready` fields of the container status in the API server to false (a simplified illustration follows the diagram).

The sequence diagram for this process is as follows:
![Before changes](./before-changes.png)
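
For reference, a simplified illustration of the `initialValue` behavior described in step 3; the `Result` type and `initialResultFor` helper below are stand-ins for the kubelet prober's internal types, not quotes from its source:

```go
package prober

// Result is a simplified stand-in for the probe result type used by the
// kubelet's prober package.
type Result int

const (
	Unknown Result = iota
	Success
	Failure
)

// initialResultFor mirrors step 3 above: before any probe has actually run,
// a readiness probe worker starts from Failure and a startup probe worker
// starts from Unknown. Publishing these defaults right after a kubelet
// restart is what resets the container's Ready and Started fields to false.
func initialResultFor(probeType string) Result {
	switch probeType {
	case "readiness":
		return Failure
	case "startup":
		return Unknown
	default:
		// Liveness is assumed to start from Success so that containers are
		// not killed before the first real probe has run.
		return Success
	}
}
```
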

**After the Changes:**

**Scenario 1:**
After the changes, if kubelet restarts, the pod status transition process is as follows:

* (The first two steps are the same as before the changes and are omitted here.)

1. During the first execution of `doProbe` (which again skips the `initialDelaySeconds` period, because the container's startup time already exceeds it), if the pod's creation time precedes the kubelet's start time by more than 10 seconds (a tolerance for clock skew) and the container's readiness state is true, `doProbe` skips setting the probe results to their `initialValue` and proceeds with the subsequent probe steps (see the sketch after the diagram). This ensures that the kubelet immediately probes whether the container is still functioning properly after restarting, and avoids a situation in which the container became unhealthy during the kubelet restart but the kubelet fails to update the container's `Ready` field to false in a timely manner.

The sequence diagram for this process is as follows:
![Scenario 1](./scenario-1.png)
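
A minimal sketch of the Scenario 1 check, assuming hypothetical names (`wasRunningBeforeRestart`, `clockSkewTolerance`): the probe worker would skip publishing the `initialValue` only when the pod predates the kubelet's start time by more than the 10-second tolerance and the container was last reported Ready.

```go
package prober

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// clockSkewTolerance is the 10-second tolerance mentioned in Scenario 1 for
// comparing the pod's creation time with the kubelet's start time.
const clockSkewTolerance = 10 * time.Second

// wasRunningBeforeRestart is a hypothetical helper: it returns true when the
// pod was created well before the kubelet (re)started and the container was
// last reported Ready. In that case doProbe would skip writing the probe's
// initialValue and go straight to a real probe, so a healthy container never
// flaps to Not Ready, while an unhealthy one is caught on the next cycle.
func wasRunningBeforeRestart(pod *v1.Pod, containerName string, kubeletStart time.Time) bool {
	if !pod.CreationTimestamp.Time.Add(clockSkewTolerance).Before(kubeletStart) {
		return false
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name == containerName {
			return cs.Ready
		}
	}
	return false
}
```
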

**Scenario 2:**
After the changes, if the kubelet stays down long enough (exceeding the `nodeMonitorGracePeriod`) that the pod's `Ready` condition is set to false:

1. Kubelet uses `SyncPod` to reconcile the pod state. During the first execution of `SyncPod`, if the kubelet detects that the pod's `Ready` condition is false, it directly sets the container's `Ready` field to false and reports it to the API server (see the sketch after the diagram).

2. After updating the container status, `SyncPod` adds the pod to the `probeManager`. The pod then begins executing probes.

3. The logic here is the same as in Scenario 1. Since the container's `Ready` field is false, `doProbe` sets the result of every probe to its `initialValue` and, based on those results, updates the `Started` and `Ready` fields of the container status in the API server to false. Subsequent executions of `doProbe` then transition the pod status to the desired state.

The sequence diagram for this process is as follows:
![Scenario 2](./scenario-2.png)
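
A minimal sketch of the Scenario 2 decision taken during the first `SyncPod` after the restart; `markContainersNotReady` is a hypothetical helper illustrating the intent, not the actual implementation:

```go
package kubelet

import (
	v1 "k8s.io/api/core/v1"
)

// podReadyConditionFalse reports whether the pod's Ready condition, as last
// recorded in the API server, is explicitly False - which is what the node
// controller sets once the node has been unreachable for longer than
// nodeMonitorGracePeriod.
func podReadyConditionFalse(pod *v1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == v1.PodReady {
			return cond.Status == v1.ConditionFalse
		}
	}
	return false
}

// markContainersNotReady is a hypothetical helper for Scenario 2: when the
// restart outlasted nodeMonitorGracePeriod, the first SyncPod does not
// preserve the old readiness and instead reports Ready=false, leaving it to
// subsequent probes to bring the containers back to Ready.
func markContainersNotReady(status *v1.PodStatus) {
	for i := range status.ContainerStatuses {
		status.ContainerStatuses[i].Ready = false
	}
}
```
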

### Test Plan

[ ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

##### Prerequisite testing updates

##### Unit tests

- `pkg/kubelet/prober`: `2025-08-25` - `77.4%`
- `k8s.io/kubernetes/pkg/kubelet`: `2025-08-25` - `71.2%`

##### Integration tests

- <test>: <link to test coverage>

##### e2e tests

* Add an e2e test case to verify that restarting kubelet does not affect pod status when the pod has no probes.
* Add an e2e test case to verify that restarting kubelet does not affect pod status when the pod has a `startupProbe`.
* Add an e2e test case to verify that restarting kubelet does not affect pod status when the pod has a `readinessProbe` (a sketch of this case follows the list).
* Add an e2e test case to verify that restarting kubelet does not affect pod status when the pod has both `startupProbe` and `readinessProbe`.
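
A heavily simplified sketch of the `readinessProbe` case, assuming Ginkgo and the e2e framework; `startPodWithReadinessProbeAndWaitReady` and `restartKubelet` are hypothetical helpers standing in for whichever node e2e utilities the real test would use.

```go
package e2enode

import (
	"context"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/test/e2e/framework"
)

var _ = ginkgo.Describe("Kubelet restart pod status", func() {
	f := framework.NewDefaultFramework("kubelet-restart-pod-status")

	ginkgo.It("should not change the Ready status of a pod with a readinessProbe", func(ctx context.Context) {
		// Hypothetical helper: creates a pod with a readinessProbe and waits
		// until its container is reported Ready=true.
		pod := startPodWithReadinessProbeAndWaitReady(ctx, f)

		// Hypothetical helper: restarts the kubelet on the node under test.
		restartKubelet(ctx)

		// The container statuses reported after the restart should still show
		// Ready=true; the kubelet must not reset them just because it restarted.
		got, err := f.ClientSet.CoreV1().Pods(pod.Namespace).Get(ctx, pod.Name, metav1.GetOptions{})
		framework.ExpectNoError(err)
		for _, cs := range got.Status.ContainerStatuses {
			gomega.Expect(cs.Ready).To(gomega.BeTrue(), "container %s should remain Ready across the kubelet restart", cs.Name)
		}
	})
})
```
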

### Graduation Criteria

#### deprecated

- Implement the code and add the `ChangeContainerStatusOnKubeletRestart` feature gate.
- Add e2e tests to ensure the functionality meets expectations.

#### GA

- No issues reported by users during the Deprecated phase.

### Upgrade / Downgrade Strategy

### Version Skew Strategy

N/A

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [x] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `ChangeContainerStatusOnKubeletRestart`
  - Components depending on the feature gate: `kubelet`
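
A sketch of how the gate might be declared using the component-base feature gate machinery; the exact location and registration map in `pkg/features` are assumptions, but the default and stage reflect this KEP (disabled by default, introduced as Deprecated).

```go
package features

import (
	"k8s.io/component-base/featuregate"
)

const (
	// ChangeContainerStatusOnKubeletRestart restores the pre-KEP behavior in
	// which the kubelet resets container Started/Ready status after a restart.
	// It is introduced directly as Deprecated and disabled by default, so the
	// new "preserve status" behavior is what clusters get out of the box.
	ChangeContainerStatusOnKubeletRestart featuregate.Feature = "ChangeContainerStatusOnKubeletRestart"
)

// A sketch of the registration entry; the surrounding registration mechanism
// in pkg/features is assumed, not quoted from the implementation.
var defaultKubeletFeatureGates = map[featuregate.Feature]featuregate.FeatureSpec{
	ChangeContainerStatusOnKubeletRestart: {Default: false, PreRelease: featuregate.Deprecated},
}
```
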

###### Does enabling the feature change any default behavior?

<!--
Any change of default behavior may be surprising to users or break existing
automations, so be extremely careful here.
-->

Yes. Currently, when a kubelet restarts, the status of the Pods and containers on that node is reported as Not Ready. This feature changes the behavior to inherit the last known state of Pods and containers, avoiding spurious service disruptions, but it may delay updates to the Not Ready state for containers that actually became unhealthy while the kubelet was down.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. By setting the feature gate to false and restarting the kubelet, the cluster reverts to the previous default behavior. This rollback is safe because the feature does not modify how API objects are stored.

###### What happens if we reenable the feature if it was previously rolled back?

If the feature is re-enabled, the kubelet will once again adopt the new behavior of preserving pod status during restarts. Re-enabling the feature will not cause side effects, as it is stateless and only affects the kubelet's startup logic.

###### Are there any tests for feature enablement/disablement?

### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?

Rolling back (disabling the feature gate) is inherently safe, as it simply restores the kubelet's long-standing default behavior.
The likelihood of failure is extremely low, as both rollout and rollback only require a kubelet restart.
Running pods (workloads) will not be restarted. Rolling out the feature minimizes the impact on workloads, while rolling back restores today's "impactful but predictable" behavior.

###### What specific metrics should inform a rollback?

N/A

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

N/A

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

N/A

### Monitoring Requirements

<!--
This section must be completed when targeting beta to a release.

For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->

###### How can an operator determine if the feature is in use by workloads?

<!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->

###### How can someone using this feature know that it is working for their instance?

<!--
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
for each individual pod.
Pick one more of these and delete the rest.
Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->

- [ ] Events
  - Event Reason:
- [ ] API .status
  - Condition name:
  - Other field:
- [ ] Other (treat as last resort)
  - Details:

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

<!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.

It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors <= 1%
- 99% percentile over day of absolute value from (job creation time minus expected
  job creation time) for cron job <= 10%
- 99.9% of /health requests per day finish with 200 code

These goals will help you determine what you need to measure (SLIs) in the next
question.
-->

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

<!--
Pick one more of these and delete the rest.
-->

- [ ] Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- [ ] Other (treat as last resort)
  - Details:

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

No

### Scalability

###### Will enabling / using this feature result in any new API calls?

No

###### Will enabling / using this feature result in introducing new API types?

No

###### Will enabling / using this feature result in any new calls to the cloud provider?

No

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

No

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

###### What are other known failure modes?

###### What steps should be taken if SLOs are not being met to determine the problem?

## Implementation History

## Drawbacks

If a container becomes unhealthy during the kubelet restart, the kubelet may still report a Ready status until the Readiness probe completes its check. This can lead to other Kubernetes components making decisions based on stale information, such as directing traffic to an unhealthy Pod, resulting in service degradation or failed user requests.

## Alternatives

## Infrastructure Needed (Optional)