Commit 30f3b92

Merge pull request kubernetes#3320 from xing-yang/non-graceful-node-shutdown-beta
Update non-graceful node shutdown to beta
2 parents 32bccf4 + bc19dea commit 30f3b92

3 files changed: 119 additions & 19 deletions
Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 2268
 alpha:
+  approver: "@deads2k"
+beta:
   approver: "@deads2k"

keps/sig-storage/2268-non-graceful-shutdown/README.md

Lines changed: 113 additions & 15 deletions
@@ -16,8 +16,10 @@ This includes the Summary and Motivation sections.
 - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
 - [Test Plan](#test-plan)
-- [Unit tests](#unit-tests)
-- [E2E tests](#e2e-tests)
+- [Prerequisite testing updates](#prerequisite-testing-updates)
+- [Unit tests](#unit-tests)
+- [Integration tests](#integration-tests)
+- [E2E tests](#e2e-tests)
 - [Graduation Criteria](#graduation-criteria)
 - [Alpha](#alpha)
 - [Alpha -> Beta Graduation](#alpha---beta-graduation)
@@ -81,7 +83,7 @@ Similarly, this approach can also be applied to the case when a node is in a non
 ### Use Cases

 * If user wants to intentionally shutdown a node, he/she can validate whether graceful node shutdown feature works. If Kubelet is able to detect node is shutting down, it will gracefully delete pods, and new pods will be created on another running node.
-* If graceful shutdown is not working or node is in non-recoverable state due to hardware failure or broken OS, etc., user now can enable this feature, and add `out-of-service=nodeshutdown:NoExecute` taint which will be explained in detail below to trigger non-graceful shutdown behavior.
+* If graceful shutdown is not working, or the node is in a non-recoverable state due to hardware failure, a broken OS, etc., the user can enable this feature and add the `node.kubernetes.io/out-of-service=nodeshutdown:NoExecute` taint (explained in detail below) to trigger the non-graceful shutdown behavior.

 ### Goals

@@ -112,7 +114,7 @@ Proposed logic change:
 1. [Proposed change] This proposal requires a user to apply a `out-of-service` taint on a node when the user has confirmed that this node is shutdown or in a non-recoverable state due to the hardware failure or broken OS. Note that user should only add this taint if the node is not coming back at least for some time. If the node is in the middle of restarting, this taint should not be used.

 1. [Proposed change] In the Pod GC Controller, part of the kube-controller-manager, add a new function called gcTerminating. This function would need to go through all the Pods in terminating state, verify that the node the pod scheduled on is NotReady. If so, do the following:
-1. Upon seeing the `out-of-service` taint, the Pod GC Controller will forcefully delete the pods on the node if there are no matching tolation on the pods. This new `out-of-service` taint has `NoExecute` effect, meaning the pod will be evicted and a new pod will not schedule on the shutdown node unless it has a matching toleration. For example, `out-of-service=nodeshutdown:NoExecute` or `out-of-service=hardwarefailure:NoExecute`. We suggest to use `NoExecute` effect in taint to make sure pods will be evicted (deleted) and fail over to other nodes.
+1. Upon seeing the `out-of-service` taint, the Pod GC Controller will forcefully delete the pods on the node if there is no matching toleration on the pods. This new `out-of-service` taint has the `NoExecute` effect, meaning the pod will be evicted and a new pod will not be scheduled on the shutdown node unless it has a matching toleration. For example, `node.kubernetes.io/out-of-service=nodeshutdown:NoExecute` or `node.kubernetes.io/out-of-service=hardwarefailure:NoExecute`. We suggest using the `NoExecute` effect in the taint to make sure pods are evicted (deleted) and fail over to other nodes (see the taint-application sketch after this hunk).
 1. We'll follow taint and toleration policy. If a pod is set to tolerate all taints and effects, that means user does NOT want to evict pods when node is not ready. So GC controller will filter out those pods and only forcefully delete pods that do not have a matching toleration. If your pod tolerates the `out-of-service` taint, then it will not be terminated by the taint logic, therefore none of this applies.

 1. [Proposed change] Once pods are selected and forcefully deleted, the attachdetach reconciler should check the `out-of-service` taint on the node. If the taint is present, the attachdetach reconciler will not wait for 6 minutes to do force detach. Instead it will force detach right away and allow `volumeAttachment` to be deleted.
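A minimal client-go sketch of the taint-application step described in the list above. This is illustrative only and not part of the KEP: the package and function names are hypothetical, clientset construction is assumed to happen elsewhere, and `kubectl taint nodes <node> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute` achieves roughly the same thing from the command line.

```go
package outofservice // hypothetical helper package, not part of the KEP

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ApplyOutOfServiceTaint adds the out-of-service taint to a node that the
// user has confirmed is shut down and not coming back for some time.
// NoExecute ensures pods without a matching toleration are evicted and no
// new pods are scheduled onto the node.
func ApplyOutOfServiceTaint(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, t := range node.Spec.Taints {
		if t.Key == "node.kubernetes.io/out-of-service" {
			return nil // taint already present, nothing to do
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, v1.Taint{
		Key:    "node.kubernetes.io/out-of-service",
		Value:  "nodeshutdown",
		Effect: v1.TaintEffectNoExecute,
	})
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```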
@@ -144,15 +146,32 @@ To mitigate this we plan to have a high test coverage and to introduce this enha

 ### Test Plan

-### Unit tests
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+#### Prerequisite testing updates
+
+There are existing tests for the Pod GC controller: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/node/pod_gc.go
+
+There are existing tests for the attach detach controller. Creating a pod that uses PVCs will test attach: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/pod/create.go
+Deleting a pod will trigger the PVC to be detached: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/pod/delete.go
+
+#### Unit tests
+
 * Add unit tests to affected components in kube-controller-manager:
 * Add tests in Pod GC Controller for the new logic to clean up pods and the `out-of-service` taint.
 * Add tests in Attachdetach Controller for the changed logic that allow volumes to be forcefully detached without wait.

-### E2E tests
-* New E2E tests to validate workloads move successfully to another running node when a node is shutdown.
-* Feature gate for `NonGracefulFailover` is disabled, feature is not active.
-* Feature gate for `NonGracefulFailover` is enabled. Add `out-of-service` taint after node is shutdown:
+#### Integration tests
+
+After reviewing the tests, we decided that the best place to add a test for this feature is under test/e2e/storage:
+https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/non_graceful_node_shutdown.go
+
+#### E2E tests
+
+* Added E2E tests to validate workloads move successfully to another running node when a node is shut down: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/non_graceful_node_shutdown.go
+* Feature gate for `NodeOutOfServiceVolumeDetach` is enabled. Add the `out-of-service` taint after the node is shut down (removing the taint afterwards is sketched after this list):
 * Verify workloads are moved to another node successfully.
 * Verify the `out-of-service` taint is removed after the shutdown node is cleaned up.
 * Add stress and scale tests before moving from beta to GA.
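For the cleanup step above (removing the `out-of-service` taint once the shutdown node is cleaned up), here is a minimal illustrative client-go sketch. It is not part of the KEP: the package and function names are hypothetical, and `kubectl taint nodes <node> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-` does roughly the same from the command line.

```go
package outofservice // hypothetical helper package, not part of the KEP

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// RemoveOutOfServiceTaint drops the out-of-service taint once the shutdown
// node has been cleaned up or recovered, so it can rejoin the cluster normally.
func RemoveOutOfServiceTaint(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	kept := make([]v1.Taint, 0, len(node.Spec.Taints))
	for _, t := range node.Spec.Taints {
		if t.Key != "node.kubernetes.io/out-of-service" {
			kept = append(kept, t)
		}
	}
	node.Spec.Taints = kept
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```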
@@ -161,7 +180,7 @@ We also plan to test this with different version Skews.

 ### Graduation Criteria

-This KEP will be treated as a new feature, and will be introduced with a new feature gate, `NonGracefulFailover`.
+This KEP will be treated as a new feature, and will be introduced with a new feature gate, `NodeOutOfServiceVolumeDetach`.

 This enhancement will go through the following maturity levels: alpha, beta and stable.

@@ -217,7 +236,7 @@ _This section must be completed when targeting alpha to a release._

 * **How can this feature be enabled / disabled in a live cluster?**
 - [ ] Feature gate (also fill in values in `kep.yaml`)
-- Feature gate name: NonGracefulFailover
+- Feature gate name: NodeOutOfServiceVolumeDetach
 - Components depending on the feature gate: kube-controller-manager
 - [ ] Other
 - Describe the mechanism:
@@ -266,17 +285,35 @@ _This section must be completed when targeting beta graduation to a release._
 * **How can a rollout fail? Can it impact already running workloads?**
 Try to be as paranoid as possible - e.g., what if some components will restart
 mid-rollout?
+The rollout should not fail. The feature gate only needs to be enabled on
+`kube-controller-manager`, so it is either enabled or disabled.
+In an HA cluster, assume a 1.N kube-controller-manager has the feature gate enabled
+and takes the lease first, and the GC Controller has already forcefully deleted
+the pods. Then it loses the lease to a 1.N-1 kube-controller-manager which has the
+feature gate disabled. In this case, the Attach Detach Controller won't have the new
+behavior, so it will wait for 6 minutes before force-detaching the volume.
+If the 1.N kube-controller-manager has the feature gate disabled while the 1.N-1
+kube-controller-manager has the feature gate enabled, the GC Controller will not
+forcefully delete the pods, so the Attach Detach Controller will not be triggered
+to force detach. In the latter case, the old behavior is kept.

 * **What specific metrics should inform a rollback?**
+If for some reason the user does not want the workload to fail over to a
+different running node after the original node is shut down and the
+`out-of-service` taint is applied, a rollback can be done. It should rarely be
+needed, though, because the user can prevent the failover from happening by not
+applying the `out-of-service` taint.
+Since use of this feature requires the user to apply a taint manually,
+it should not specifically require a rollback.

 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
 Describe manual testing that was done and the outcomes.
 Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
+We will manually test upgrade from 1.25 to 1.26 and rollback from 1.26 to 1.25.

 * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
 fields of API types, flags, etc.?**
 Even if applying deprecation policies, they may still surprise some users.
+No.

 ### Monitoring Requirements

@@ -286,15 +323,26 @@ _This section must be completed when targeting beta graduation to a release._
 Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
 checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
+An operator, the person who operates the cluster, can check if the
+`NodeOutOfServiceVolumeDetach` feature gate is enabled and if there is an
+`out-of-service` taint on the shutdown node (a sketch of that check follows).
+The usage of this feature requires the manual step of applying a taint,
+so the operator should be the one applying it.
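A minimal illustrative sketch of that taint check with client-go. It is not part of the KEP: the package and function names are hypothetical, and `kubectl describe node <node>` shows the same taint information from the command line.

```go
package outofservice // hypothetical helper package, not part of the KEP

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ListOutOfServiceNodes prints the nodes that currently carry the
// out-of-service taint, which indicates the feature is being used.
func ListOutOfServiceNodes(ctx context.Context, cs kubernetes.Interface) error {
	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, n := range nodes.Items {
		for _, t := range n.Spec.Taints {
			if t.Key == "node.kubernetes.io/out-of-service" {
				fmt.Printf("node %s has taint %s=%s:%s\n", n.Name, t.Key, t.Value, t.Effect)
			}
		}
	}
	return nil
}
```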

 * **What are the SLIs (Service Level Indicators) an operator can use to determine
 the health of the service?**
 - [ ] Metrics
-- Metric name:
+- Metric name: We can add new metrics `deleting_pods_total` and `deleting_pods_error_total`
+in the Pod GC Controller.
+For the Attach Detach Controller, there is already a metric:
+`attachdetach_controller_forced_detaches`.
+It is also useful to know how many nodes have taints. We can explore [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics), which generates metrics about the state of the objects.
 - [Optional] Aggregation method:
 - Components exposing the metric:
 - [ ] Other (treat as last resort)
-- Details:
+- Details: Check whether the workload moved to a different running node
+after the original node is shut down and the `out-of-service` taint
+is applied on the shutdown node.

 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
 At a high level, this usually will be in the form of "high percentile of SLI
@@ -304,6 +352,12 @@ the health of the service?**
 - 99% percentile over day of absolute value from (job creation time minus expected
 job creation time) for cron job <= 10%
 - 99,9% of /health requests per day finish with 200 code
+The failover should always happen if the feature gate is enabled, the taint
+is applied, and there are other running nodes.
+We can also check the `deleting_pods_total` and `deleting_pods_error_total` metrics
+in the Pod GC Controller, and the `attachdetach_controller_forced_detaches` and
+`attachdetach_controller_forced_detaches_taint` metrics in the Attach Detach
+Controller (an illustrative registration sketch for the new counters follows).
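The new counters named above could be declared and registered roughly as follows. This is an illustrative sketch only, not the actual upstream code: the package name, subsystem, label set, and help strings are assumptions.

```go
package podgcmetrics // hypothetical package name, not the actual upstream location

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// Illustrative declarations for the counters mentioned above.
var (
	deletingPodsTotal = metrics.NewCounterVec(
		&metrics.CounterOpts{
			Subsystem:      "pod_gc_collector",
			Name:           "deleting_pods_total",
			Help:           "Number of pods the Pod GC Controller has deleted.",
			StabilityLevel: metrics.ALPHA,
		},
		[]string{"namespace"},
	)
	deletingPodsErrorTotal = metrics.NewCounterVec(
		&metrics.CounterOpts{
			Subsystem:      "pod_gc_collector",
			Name:           "deleting_pods_error_total",
			Help:           "Number of errors encountered while deleting pods.",
			StabilityLevel: metrics.ALPHA,
		},
		[]string{"namespace"},
	)
)

// RegisterMetrics registers the counters with the controller-manager's registry.
func RegisterMetrics() {
	legacyregistry.MustRegister(deletingPodsTotal)
	legacyregistry.MustRegister(deletingPodsErrorTotal)
}
```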

 * **Are there any missing metrics that would be useful to have to improve observability
 of this feature?**
@@ -320,13 +374,18 @@ _This section must be completed when targeting beta graduation to a release._
 optional services that are needed. For example, if this feature depends on
 a cloud provider API, or upon an external software-defined storage or network
 control plane.
+This feature relies on the kube-controller-manager running. If the
+workload is running on a StatefulSet, it also depends on the CSI driver.

 For each of these, fill in the following—thinking about running existing user workloads
 and creating new ones, as well as about cluster-level services (e.g. DNS):
-- [Dependency name]
+- [Dependency name] CSI driver
 - Usage description:
-- Impact of its outage on the feature:
+- Impact of its outage on the feature: If the CSI driver is not running,
+the pod cannot use the persistent volume any more, so the workload will
+not run properly.
 - Impact of its degraded performance or high-error rates on the feature:
+The workload does not work properly if the CSI driver is down.

 ### Scalability

@@ -345,31 +404,39 @@ previous answers based on experience in the field._
 - originating component(s) (e.g. Kubelet, Feature-X-controller)
 focusing mostly on:
 - components listing and/or watching resources they didn't before
+Nothing new.
 - API calls that may be triggered by changes of some Kubernetes resources
 (e.g. update of object X triggers new updates of object Y)
+This feature will trigger pod deletion, volume detachment, new pod
+creation, and volume attachment calls if the new taint is applied.
 - periodic API calls to reconcile state (e.g. periodic fetching state,
 heartbeats, leader election, etc.)
+Nothing new.

 * **Will enabling / using this feature result in introducing new API types?**
 Describe them, providing:
 - API type
 - Supported number of objects per cluster
 - Supported number of objects per namespace (for namespace-scoped objects)
+No.

 * **Will enabling / using this feature result in any new calls to the cloud
 provider?**
+Volume detach/attach could trigger cloud provider calls.

 * **Will enabling / using this feature result in increasing size or count of
 the existing API objects?**
 Describe them, providing:
 - API type(s):
 - Estimated increase in size: (e.g., new annotation of size 32B)
 - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+No.

 * **Will enabling / using this feature result in increasing time taken by any
 operations covered by [existing SLIs/SLOs]?**
 Think about adding additional work or introducing new steps in between
 (e.g. need to do X to start a container), etc. Please describe the details.
+No.

 * **Will enabling / using this feature result in non-negligible increase of
 resource usage (CPU, RAM, disk, IO, ...) in any components?**
@@ -378,6 +445,7 @@ resource usage (CPU, RAM, disk, IO, ...) in any components?**
 volume), significant amount of data sent and/or received over network, etc.
 This through this both in small and large cases, again with respect to the
 [supported limits].
+No.

 ### Troubleshooting

@@ -388,20 +456,50 @@ details). For now, we leave it here.
 _This section must be completed when targeting beta graduation to a release._

 * **How does this feature react if the API server and/or etcd is unavailable?**
+If the API server or etcd is not available, we can't get an accurate status of the node or pod.
+However, the usage of this feature is very manual, so an operator can verify
+the state before applying the taint.

 * **What are other known failure modes?**
 For each of them, fill in the following information by copying the below template:
 - [Failure mode brief description]
 - Detection: How can it be detected via metrics? Stated another way:
 how can an operator troubleshoot without logging into a master or worker node?
+After applying the `out-of-service` taint, if the workload does not move
+to a different running node, that is an indicator that something might be wrong.
+Note that since we removed the 6-minute wait in the Attach Detach Controller,
+force detach will happen without delay if the feature gate is enabled and the
+taint is applied.
 - Mitigations: What can be done to stop the bleeding, especially for already
 running user workloads?
+If the workload does not fail over, it behaves the same as when this
+feature is not enabled. The operator should try to find out why the
+failover didn't happen. Check messages from the GC Controller and Attach Detach
+Controller in the kube-controller-manager logs.
+Check if there are the following messages from the GC Controller:
+"failed to get node <node name>: <error>"
+"Garbage collecting <number> pods that are terminating on node tainted with node.kubernetes.io/out-of-service".
+Check if there is the following message from the Attach Detach Controller:
+"failed to get taint specs for node <node name>: <error>"
 - Diagnostics: What are the useful log messages and their required logging
 levels that could help debug the issue?
 Not required until feature graduated to beta.
+Set the log level to at least 2.
+For example, the following message is logged by the GC Controller if the feature is
+enabled and the `out-of-service` taint is applied. If the pods are forcefully
+deleted by the GC Controller, this message should show up:
+klog.V(2).Infof("garbage collecting pod %s that is terminating. Phase [%v]", pod.Name, pod.Status.Phase)
+There is also a message in the Attach Detach Controller that checks the taint.
+If the taint is applied and the feature gate is enabled, it force detaches the
+volume without waiting for 6 minutes:
+klog.V(2).Infof("node %q has out-of-service taint", attachedVolume.NodeName)
 - Testing: Are there any tests for failure mode? If not, describe why.
+We have unit tests that cover different combinations of pod and node statuses.

 * **What steps should be taken if SLOs are not being met to determine the problem?**
+In that case, we need to go through the logs and find the root cause.
+Check messages from the GC Controller and Attach Detach Controller in the
+kube-controller-manager logs as described in the section above.

 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

keps/sig-storage/2268-non-graceful-shutdown/kep.yaml

Lines changed: 4 additions & 4 deletions
@@ -20,18 +20,18 @@ see-also:
 replaces:

 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.24"
+latest-milestone: "v1.26"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.24"
-  beta: "v1.25"
-  stable: "v1.26"
+  beta: "v1.26"
+  stable: "v1.27"

 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
