
Commit d63205d

Update non-graceful node shutdown to beta
1 parent 32bccf4 commit d63205d

3 files changed: +74 additions, -12 deletions
Lines changed: 2 additions & 0 deletions

@@ -1,3 +1,5 @@
 kep-number: 2268
 alpha:
+  approver: "@deads2k"
+beta:
   approver: "@deads2k"

keps/sig-storage/2268-non-graceful-shutdown/README.md

Lines changed: 70 additions & 10 deletions

@@ -16,8 +16,10 @@ This includes the Summary and Motivation sections.
 - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
 - [Test Plan](#test-plan)
-- [Unit tests](#unit-tests)
-- [E2E tests](#e2e-tests)
+- [Prerequisite testing updates](#prerequisite-testing-updates)
+- [Unit tests](#unit-tests)
+- [Integration tests](#integration-tests)
+- [E2E tests](#e2e-tests)
 - [Graduation Criteria](#graduation-criteria)
 - [Alpha](#alpha)
 - [Alpha -> Beta Graduation](#alpha---beta-graduation)
@@ -144,15 +146,28 @@ To mitigate this we plan to have a high test coverage and to introduce this enha
 
 ### Test Plan
 
-### Unit tests
+#### Prerequisite testing updates
+
+There are existing tests for the Pod GC controller: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/node/pod_gc.go
+
+There are existing tests for the attach detach controller. Creating a pod that uses PVCs exercises the attach path: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/pod/create.go
+Deleting a pod triggers the PVC to be detached: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/pod/delete.go
+
+#### Unit tests
 * Add unit tests to affected components in kube-controller-manager:
   * Add tests in the Pod GC Controller for the new logic to clean up pods and the `out-of-service` taint.
   * Add tests in the Attach Detach Controller for the changed logic that allows volumes to be forcefully detached without waiting.
 
-### E2E tests
+#### Integration tests
+
+Add a test for forcefully terminating pods in https://github.com/kubernetes/kubernetes/blob/master/test/integration/garbagecollector/garbage_collector_test.go.
+
+Add a test for force detach without waiting for 6 minutes in https://github.com/kubernetes/kubernetes/blob/master/test/integration/volume/attach_detach_test.go.
+
+#### E2E tests
 * New E2E tests to validate that workloads move successfully to another running node when a node is shut down.
-  * Feature gate for `NonGracefulFailover` is disabled, feature is not active.
-  * Feature gate for `NonGracefulFailover` is enabled. Add `out-of-service` taint after node is shutdown:
+  * Feature gate for `NodeOutOfServiceVolumeDetach` is disabled, feature is not active.
+  * Feature gate for `NodeOutOfServiceVolumeDetach` is enabled. Add the `out-of-service` taint after the node is shut down:
     * Verify workloads are moved to another node successfully.
     * Verify the `out-of-service` taint is removed after the shutdown node is cleaned up.
 * Add stress and scale tests before moving from beta to GA.
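
To make the tested condition concrete, here is a minimal Go sketch of the check these unit, integration, and E2E tests revolve around: a terminating pod on a node carrying the `out-of-service` taint becomes a candidate for forced cleanup and force detach. The helper names and the `node.kubernetes.io/out-of-service` key are illustrative assumptions, not the controllers' actual identifiers.

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// hasOutOfServiceTaint reports whether a node carries the out-of-service
// taint that this KEP keys its behavior on. The key is an assumption for
// illustration; the controllers use their own constants.
func hasOutOfServiceTaint(node *v1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == "node.kubernetes.io/out-of-service" {
			return true
		}
	}
	return false
}

// shouldForceCleanUp sketches the condition the Pod GC Controller tests would
// exercise: a terminating pod stuck on a tainted, shut-down node may be
// forcefully deleted, which in turn lets its volumes be force detached.
func shouldForceCleanUp(pod *v1.Pod, node *v1.Node) bool {
	return pod.DeletionTimestamp != nil && hasOutOfServiceTaint(node)
}
```

The real controllers also consider node readiness and pod phase; the sketch captures only the taint-driven part described in this KEP.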
@@ -217,7 +232,7 @@ _This section must be completed when targeting alpha to a release._
 
 * **How can this feature be enabled / disabled in a live cluster?**
   - [ ] Feature gate (also fill in values in `kep.yaml`)
-    - Feature gate name: NonGracefulFailover
+    - Feature gate name: NodeOutOfServiceVolumeDetach
     - Components depending on the feature gate: kube-controller-manager
   - [ ] Other
     - Describe the mechanism:
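
For reference, a standalone Go sketch of how controller behavior is typically guarded behind such a gate. The local registration below exists only so the example runs on its own; in-tree, the gate is defined and registered centrally, and the printed strings are illustrative.

```go
package main

import (
	"fmt"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

// NodeOutOfServiceVolumeDetach mirrors the gate name from this KEP; in-tree
// code references it via the features package rather than a local constant.
const NodeOutOfServiceVolumeDetach featuregate.Feature = "NodeOutOfServiceVolumeDetach"

func main() {
	// Register the gate for this standalone sketch (beta => on by default).
	if err := utilfeature.DefaultMutableFeatureGate.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		NodeOutOfServiceVolumeDetach: {Default: true, PreRelease: featuregate.Beta},
	}); err != nil {
		panic(err)
	}

	// Controllers guard the new cleanup / force-detach paths behind the gate.
	if utilfeature.DefaultFeatureGate.Enabled(NodeOutOfServiceVolumeDetach) {
		fmt.Println("non-graceful node shutdown handling is active")
	} else {
		fmt.Println("feature gate disabled; existing behavior unchanged")
	}
}
```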
@@ -266,17 +281,24 @@ _This section must be completed when targeting beta graduation to a release._
 * **How can a rollout fail? Can it impact already running workloads?**
   Try to be as paranoid as possible - e.g., what if some components will restart
   mid-rollout?
+  The rollout should not fail. The feature gate only needs to be enabled on
+  `kube-controller-manager`, so the feature is simply either enabled or disabled.
 
 * **What specific metrics should inform a rollback?**
+  If, for some reason, the user does not want the workload to fail over to a
+  different running node after the original node is shut down and the
+  `out-of-service` taint is applied, a rollback can be done. It should rarely
+  be needed, though, as the user can prevent the failover from happening by
+  not applying the `out-of-service` taint.
 
 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
   Describe manual testing that was done and the outcomes.
   Longer term, we may want to require automated upgrade/rollback tests, but we
   are missing a bunch of machinery and tooling and can't do that now.
+  We will manually test upgrade from 1.24 to 1.25 and rollback from 1.25 to 1.24.
 
 * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
   fields of API types, flags, etc.?**
   Even if applying deprecation policies, they may still surprise some users.
+  No.
 
 ### Monitoring Requirements
 
@@ -286,6 +308,10 @@ _This section must be completed when targeting beta graduation to a release._
   Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
   checking if there are objects with field X set) may be a last resort. Avoid
   logs or events for this purpose.
+  An operator can check whether the `NodeOutOfServiceVolumeDetach` feature gate is
+  enabled and whether an `out-of-service` taint is present on the shut-down node.
+  Using this feature requires the manual step of applying the taint,
+  so the operator should be the one applying it.
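
A small client-go sketch of the manual check described above: list nodes and report which ones carry the out-of-service taint. The `node.kubernetes.io/out-of-service` key and the default kubeconfig location are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the operator's kubeconfig (assumed to be at the default location).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, taint := range node.Spec.Taints {
			// Assumed taint key; the KEP refers to it as the `out-of-service` taint.
			if taint.Key == "node.kubernetes.io/out-of-service" {
				fmt.Printf("node %s is marked out-of-service (effect %s)\n", node.Name, taint.Effect)
			}
		}
	}
}
```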
 
 * **What are the SLIs (Service Level Indicators) an operator can use to determine
   the health of the service?**
@@ -294,7 +320,9 @@ the health of the service?**
     - [Optional] Aggregation method:
     - Components exposing the metric:
   - [ ] Other (treat as last resort)
-    - Details:
+    - Details: Check whether the workload moved to a different running node
+      after the original node is shut down and the `out-of-service` taint
+      is applied on the shut-down node.
 
 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
   At a high level, this usually will be in the form of "high percentile of SLI
@@ -304,6 +332,8 @@ the health of the service?**
   - 99% percentile over day of absolute value from (job creation time minus expected
     job creation time) for cron job <= 10%
   - 99,9% of /health requests per day finish with 200 code
+  The failover should always happen if the feature gate is enabled, the taint
+  is applied, and there are other running nodes.
 
 * **Are there any missing metrics that would be useful to have to improve observability
   of this feature?**
@@ -320,13 +350,17 @@ _This section must be completed when targeting beta graduation to a release._
   optional services that are needed. For example, if this feature depends on
   a cloud provider API, or upon an external software-defined storage or network
   control plane.
+  If the workload is running in a StatefulSet, it depends on the CSI driver.
 
   For each of these, fill in the following—thinking about running existing user workloads
   and creating new ones, as well as about cluster-level services (e.g. DNS):
-  - [Dependency name]
+  - [Dependency name] CSI driver
     - Usage description:
-      - Impact of its outage on the feature:
+      - Impact of its outage on the feature: If the CSI driver is not running,
+        the pod cannot use the persistent volume any more, so the workload will
+        not run properly.
       - Impact of its degraded performance or high-error rates on the feature:
+        The workload does not work properly if the CSI driver is degraded or down.
 
 ### Scalability
 
@@ -349,27 +383,32 @@ previous answers based on experience in the field._
     (e.g. update of object X triggers new updates of object Y)
   - periodic API calls to reconcile state (e.g. periodic fetching state,
     heartbeats, leader election, etc.)
+  No.
 
 * **Will enabling / using this feature result in introducing new API types?**
   Describe them, providing:
   - API type
   - Supported number of objects per cluster
   - Supported number of objects per namespace (for namespace-scoped objects)
+  No.
 
 * **Will enabling / using this feature result in any new calls to the cloud
   provider?**
+  No.
 
 * **Will enabling / using this feature result in increasing size or count of
   the existing API objects?**
   Describe them, providing:
   - API type(s):
   - Estimated increase in size: (e.g., new annotation of size 32B)
   - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+  No.
 
 * **Will enabling / using this feature result in increasing time taken by any
   operations covered by [existing SLIs/SLOs]?**
   Think about adding additional work or introducing new steps in between
   (e.g. need to do X to start a container), etc. Please describe the details.
+  No.
 
 * **Will enabling / using this feature result in non-negligible increase of
   resource usage (CPU, RAM, disk, IO, ...) in any components?**
378417
volume), significant amount of data sent and/or received over network, etc.
379418
This through this both in small and large cases, again with respect to the
380419
[supported limits].
420+
No.
381421

382422
### Troubleshooting
383423

@@ -388,20 +428,40 @@ details). For now, we leave it here.
388428
_This section must be completed when targeting beta graduation to a release._
389429

390430
* **How does this feature react if the API server and/or etcd is unavailable?**
431+
If API server or etcd is not available, we can't get accurate status of node or pod.
432+
However the usage of this feature is very manual so an operator can verify
433+
before applying the taint.
391434

392435
* **What are other known failure modes?**
393436
For each of them, fill in the following information by copying the below template:
394437
- [Failure mode brief description]
395438
- Detection: How can it be detected via metrics? Stated another way:
396439
how can an operator troubleshoot without logging into a master or worker node?
440+
After applying the `out-of-service` taint, if the workload does not move
441+
to a different running node immediately, that is an indicator something
442+
might be wrong.
397443
- Mitigations: What can be done to stop the bleeding, especially for already
398444
running user workloads?
445+
So if the workload does not failover, it behaves the same as when this
446+
feature is not enabled. The operator should try to find out why the
447+
failover didn't happen.
399448
- Diagnostics: What are the useful log messages and their required logging
400449
levels that could help debug the issue?
401450
Not required until feature graduated to beta.
451+
Set log level to at least 4.
452+
For example, the following message is in GC Controller if the feature is
453+
enabled and `out-of-service` taint is applied. If the pods are forcefully
454+
deleted by the GC Controller, this message should show up.
455+
klog.V(4).Infof("garbage collecting pod %s that is terminating. Phase [%v]", pod.Name, pod.Status.Phase)
456+
There is also a message in Attach Detach Controller that checks the taint.
457+
If the taint is applied and feature gate is enabled, it force detaches the
458+
volume without waiting for 6 minutes.
459+
klog.V(4).Infof("node %q has out-of-service taint", attachedVolume.NodeName)
402460
- Testing: Are there any tests for failure mode? If not, describe why.
461+
We have unit tests that cover different combination of pod and node statuses.
403462

404463
* **What steps should be taken if SLOs are not being met to determine the problem?**
464+
In that case, we need to go through the logs and find out the root cause.
405465

406466
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
407467
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

keps/sig-storage/2268-non-graceful-shutdown/kep.yaml

Lines changed: 2 additions & 2 deletions

@@ -20,12 +20,12 @@ see-also:
 replaces:
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.24"
+latest-milestone: "v1.25"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
