### Use Cases
* If a user wants to intentionally shut down a node, they can validate that the graceful node shutdown feature works. If the Kubelet is able to detect that the node is shutting down, it will gracefully delete pods, and new pods will be created on another running node.
* If graceful shutdown is not working or the node is in a non-recoverable state due to hardware failure, a broken OS, etc., the user can now enable this feature and add the `node.kubernetes.io/out-of-service=nodeshutdown:NoExecute` taint, which is explained in detail below, to trigger the non-graceful shutdown behavior (an example of applying the taint follows this list).
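
For illustration, a minimal sketch of applying the taint with `kubectl`; the node name is a placeholder:

```shell
# Mark a node that is confirmed shut down or non-recoverable so that the
# non-graceful shutdown handling can kick in ("worker-1" is a placeholder).
kubectl taint nodes worker-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
```
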
### Goals

Proposed logic change:

1. [Proposed change] This proposal requires a user to apply an `out-of-service` taint on a node when the user has confirmed that the node is shut down or in a non-recoverable state due to hardware failure or a broken OS. Note that the user should only add this taint if the node is not expected to come back for some time. If the node is in the middle of restarting, this taint should not be used.
1. [Proposed change] In the Pod GC Controller, part of the kube-controller-manager, add a new function called gcTerminating. This function goes through all the pods in terminating state and verifies that the node each pod is scheduled on is NotReady. If so, do the following:
   1. Upon seeing the `out-of-service` taint, the Pod GC Controller will forcefully delete the pods on the node if there are no matching tolerations on those pods. This new `out-of-service` taint has the `NoExecute` effect, meaning the pod will be evicted and no new pod will schedule on the shutdown node unless it has a matching toleration. For example, `node.kubernetes.io/out-of-service=nodeshutdown:NoExecute` or `node.kubernetes.io/out-of-service=hardwarefailure:NoExecute`. We suggest using the `NoExecute` effect in the taint to make sure pods are evicted (deleted) and fail over to other nodes.
   1. We'll follow the taint and toleration policy. If a pod is set to tolerate all taints and effects, that means the user does NOT want to evict pods when the node is not ready, so the GC controller will filter out those pods and only forcefully delete pods that do not have a matching toleration. If your pod tolerates the `out-of-service` taint, it will not be terminated by the taint logic, so none of this applies (see the toleration sketch after this list).
1. [Proposed change] Once pods are selected and forcefully deleted, the attachdetach reconciler should check the `out-of-service` taint on the node. If the taint is present, the attachdetach reconciler will not wait 6 minutes to force detach. Instead, it will force detach right away and allow the `VolumeAttachment` to be deleted.
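
The toleration behavior can be pictured with a small sketch: a hypothetical pod carrying a matching toleration (name and image are placeholders) will not be forcefully deleted, while pods without one are deleted and their volumes force detached, which can be observed through `VolumeAttachment` objects.

```shell
# A pod with a matching toleration is NOT forcefully deleted by the Pod GC
# Controller when the out-of-service taint is applied to its node.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: tolerating-pod                 # placeholder name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9   # placeholder image
  tolerations:
  - key: node.kubernetes.io/out-of-service
    operator: Exists
    effect: NoExecute
EOF

# For pods without such a toleration, the force detach can be observed as their
# VolumeAttachment objects are deleted without the 6-minute wait.
kubectl get volumeattachments
```
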
### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

#### Prerequisite testing updates

There are existing tests for the Pod GC Controller: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/node/pod_gc.go

There are existing tests for the Attach Detach Controller. Creating a pod that uses PVCs will test attach: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/pod/create.go
Deleting a pod will trigger the PVC to be detached: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/pod/delete.go

#### Unit tests

* Add unit tests to affected components in kube-controller-manager:
  * Add tests in Pod GC Controller for the new logic to clean up pods and the `out-of-service` taint.
  * Add tests in Attach Detach Controller for the changed logic that allows volumes to be forcefully detached without waiting.

#### Integration tests

#### e2e tests

After reviewing the tests, we decided that the best place to add a test for this feature is under test/e2e/storage:
* Added E2E tests to validate that workloads move successfully to another running node when a node is shut down: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/non_graceful_node_shutdown.go
* Feature gate for `NodeOutOfServiceVolumeDetach` is enabled. Add the `out-of-service` taint after the node is shut down (a manual verification sketch follows this list):
  * Verify workloads are moved to another node successfully.
  * Verify the `out-of-service` taint is removed after the shutdown node is cleaned up.
* Add stress and scale tests before moving from beta to GA.
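
Outside the e2e suite, a rough manual verification flow might look like the following; node and label names are placeholders:

```shell
# After shutting the node down and applying the out-of-service taint, confirm
# the workload's pods were recreated on another running node.
kubectl get pods -o wide -l app=my-stateful-app

# Once the shut-down node is cleaned up or recovered, remove the taint
# (the trailing "-" deletes the taint).
kubectl taint nodes worker-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
```
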
We also plan to test this with different version skews.

### Graduation Criteria

This KEP will be treated as a new feature, and will be introduced with a new feature gate, `NodeOutOfServiceVolumeDetach`.

This enhancement will go through the following maturity levels: alpha, beta and stable.

* **How can this feature be enabled / disabled in a live cluster?**
  - [ ] Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name: NodeOutOfServiceVolumeDetach
    - Components depending on the feature gate: kube-controller-manager
  - [ ] Other
    - Describe the mechanism:
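
As a sketch of the mechanism, enabling the gate uses the standard `--feature-gates` flag on the kube-controller-manager; the exact wiring depends on how the control plane is deployed (for kubeadm, the static pod manifest under `/etc/kubernetes/manifests`):

```shell
# Add the gate to kube-controller-manager's existing command line flags.
kube-controller-manager \
  --feature-gates=NodeOutOfServiceVolumeDetach=true
```
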

* **How can a rollout fail? Can it impact already running workloads?**
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?

The rollout should not fail. The feature gate only needs to be enabled on the
`kube-controller-manager`, so it is either enabled or disabled.
In an HA cluster, assume a 1.N kube-controller-manager has the feature gate enabled
and takes the lease first, and the GC Controller has already forcefully deleted
the pods. Then it loses the lease to a 1.N-1 kube-controller-manager which has the
feature gate disabled. In this case, the Attach Detach Controller won't have the new
behavior, so it will wait for 6 minutes before force detaching the volume.
If the 1.N kube-controller-manager has the feature gate disabled while the 1.N-1
kube-controller-manager has the feature gate enabled, the GC Controller will not
forcefully delete the pods, so the Attach Detach Controller will not be triggered
to force detach. In the latter case, it will still keep the old behavior.

* **What specific metrics should inform a rollback?**
If for some reason the user does not want the workload to fail over to a
different running node after the original node is shut down and the `out-of-service`
taint is applied, a rollback can be done. It should rarely be needed, though, as the
user can prevent the failover from happening by not applying the `out-of-service`
taint.
Since use of this feature requires the user to apply a taint manually,
it should not specifically require rollback.

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
We will manually test upgrade from 1.25 to 1.26 and rollback from 1.26 to 1.25.

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
fields of API types, flags, etc.?**
Even if applying deprecation policies, they may still surprise some users.
No.

### Monitoring Requirements

* **How can an operator determine if the feature is in use by workloads?**
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.

An operator, the person who operates the cluster, can check whether the
`NodeOutOfServiceVolumeDetach` feature gate is enabled and whether there is an
`out-of-service` taint on the shutdown node.
The usage of this feature requires the manual step of applying a taint,
so the operator should be the one applying it.
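
For example, both checks can be done with `kubectl`; the node name is a placeholder, and the feature-gate check assumes a kubeadm-style static pod for the controller manager:

```shell
# Is the out-of-service taint present on the shut-down node?
kubectl get node worker-1 -o jsonpath='{.spec.taints}'

# Is the feature gate set on kube-controller-manager (kubeadm static pod layout)?
kubectl -n kube-system get pods -l component=kube-controller-manager \
  -o jsonpath='{.items[*].spec.containers[*].command}' | tr ',' '\n' | grep feature-gates
```
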

* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
  - [ ] Metrics
    - Metric name: We can add new metrics `deleting_pods_total` and `deleting_pods_error_total`
      in the Pod GC Controller.
      For the Attach Detach Controller, there is already a metric:
      `attachdetach_controller_forced_detaches`.
      It is also useful to know how many nodes have taints. We can explore [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics), which generates metrics about the state of the objects.
    - [Optional] Aggregation method:
    - Components exposing the metric:
  - [ ] Other (treat as last resort)
    - Details: Check whether the workload moved to a different running node
      after the original node is shut down and the `out-of-service` taint
      is applied on the shutdown node.

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
At a high level, this usually will be in the form of "high percentile of SLI
per day <= X", for example:
- 99% percentile over day of absolute value from (job creation time minus expected
  job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code

The failover should always happen if the feature gate is enabled, the taint
is applied, and there are other running nodes.
We can also check the `deleting_pods_total` and `deleting_pods_error_total` metrics
in the Pod GC Controller and the `attachdetach_controller_forced_detaches` and
`attachdetach_controller_forced_detaches_taint` metrics in the Attach Detach
Controller.

* **Are there any missing metrics that would be useful to have to improve observability
of this feature?**

* **Does this feature depend on any specific services running in the cluster?**
Focus on external or optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.
This feature relies on the kube-controller-manager being running. If the
workload is running as a StatefulSet, it also depends on the CSI driver.

For each of these, fill in the following—thinking about running existing user workloads
and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name] CSI driver
  - Usage description:
  - Impact of its outage on the feature: If the CSI driver is not running,
    the pod cannot use the persistent volume any more, so the workload will
    not run properly.
  - Impact of its degraded performance or high-error rates on the feature:
    The workload does not work properly if the CSI driver is down.
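
A quick, illustrative health check for this dependency (the driver label is a placeholder):

```shell
# Confirm a CSI driver is registered and its pods are running.
kubectl get csidrivers
kubectl get pods -n kube-system -l app=example-csi-driver
```
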
### Scalability