@@ -144,15 +146,28 @@ To mitigate this we plan to have a high test coverage and to introduce this enha
144
146
145
147
### Test Plan
146
148
147
-
### Unit tests
149
+
#### Prerequisite testing updates
150
+
151
+
There are existing tests for the Pod GC controller: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/node/pod_gc.go
152
+
153
+
There are existing tests for the attach/detach controller. Creating a pod that uses PVCs exercises attach: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/pod/create.go
154
+
Deleting a pod triggers the PVC to be detached: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/pod/delete.go
155
+
156
+
#### Unit tests
148
157
* Add unit tests to affected components in kube-controller-manager:
149
158
* Add tests in Pod GC Controller for the new logic to clean up pods and the `out-of-service` taint.
150
159
* Add tests in Attachdetach Controller for the changed logic that allows volumes to be forcefully detached without waiting.
151
160
152
-
### E2E tests
161
+
#### Integration tests
162
+
163
+
Add a test for forcefully terminating pods in https://github.com/kubernetes/kubernetes/blob/master/test/integration/garbagecollector/garbage_collector_test.go.
164
+
165
+
Add a test for force detach without waiting for 6 minutes in https://github.com/kubernetes/kubernetes/blob/master/test/integration/volume/attach_detach_test.go
166
+
167
+
#### E2E tests
153
168
* New E2E tests to validate that workloads move successfully to another running node when a node is shut down.
154
-
* Feature gate for `NonGracefulFailover` is disabled, feature is not active.
155
-
* Feature gate for `NonGracefulFailover` is enabled. Add `out-of-service` taint after node is shutdown:
169
+
* Feature gate for `NodeOutOfServiceVolumeDetach` is disabled, feature is not active.
170
+
* Feature gate for `NodeOutOfServiceVolumeDetach` is enabled. Add the `out-of-service` taint after the node is shut down:
156
171
* Verify workloads are moved to another node successfully.
157
172
* Verify the `out-of-service` taint is removed after the shutdown node is cleaned up.
158
173
* Add stress and scale tests before moving from beta to GA.
@@ -217,7 +232,7 @@ _This section must be completed when targeting alpha to a release._
217
232
218
233
***How can this feature be enabled / disabled in a live cluster?**
219
234
-[ ] Feature gate (also fill in values in `kep.yaml`)
220
-
- Feature gate name: NonGracefulFailover
235
+
- Feature gate name: NodeOutOfServiceVolumeDetach
221
236
- Components depending on the feature gate: kube-controller-manager
222
237
-[ ] Other
223
238
- Describe the mechanism:
@@ -266,17 +281,24 @@ _This section must be completed when targeting beta graduation to a release._
266
281
***How can a rollout fail? Can it impact already running workloads?**
267
282
Try to be as paranoid as possible - e.g., what if some components will restart
268
283
mid-rollout?
284
+
The rollout should not fail. The feature gate only needs to be enabled on `kube-controller-manager`, so the feature is either enabled or disabled as a whole.
269
285
270
286
***What specific metrics should inform a rollback?**
287
+
If, for some reason, the user does not want the workload to fail over to a
288
+
different running node after the original node is shut down and the `out-of-service` taint is applied, a rollback can be done. This should rarely be needed, though, as the
289
+
user can prevent the failover from happening by not applying the `out-of-service`
290
+
taint.
271
291
272
292
***Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
273
293
Describe manual testing that was done and the outcomes.
274
294
Longer term, we may want to require automated upgrade/rollback tests, but we
275
295
are missing a bunch of machinery and tooling and can't do that now.
296
+
We will manually test upgrade from 1.24 to 1.25 and rollback from 1.25 to 1.24.
276
297
277
298
***Is the rollout accompanied by any deprecations and/or removals of features, APIs,
278
299
fields of API types, flags, etc.?**
279
300
Even if applying deprecation policies, they may still surprise some users.
301
+
No.
280
302
281
303
### Monitoring Requirements
282
304
@@ -286,6 +308,10 @@ _This section must be completed when targeting beta graduation to a release._
286
308
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
287
309
checking if there are objects with field X set) may be a last resort. Avoid
288
310
logs or events for this purpose.
311
+
An operator can check if the `NodeOutOfServiceVolumeDetach` feature gate is
312
+
enabled and if there is an `out-of-service` taint on the shutdown node.
313
+
The usage of this feature requires the manual step of applying a taint
314
+
so the operator should be the one applying it.
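
The taint check described above can be sketched in Go as follows. This is a minimal illustration, not the controller's code: the `Taint` struct is a local stand-in for the Kubernetes node taint type, `HasOutOfServiceTaint` is a hypothetical helper, and the full taint key `node.kubernetes.io/out-of-service` is an assumption, since this section only refers to it as `out-of-service`.

```go
package main

// Taint is a local stand-in that mirrors the Key/Value/Effect fields of a
// Kubernetes node taint, so that this sketch needs no external dependencies.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// OutOfServiceTaintKey is the full taint key assumed for this sketch; the
// text above only calls it `out-of-service`.
const OutOfServiceTaintKey = "node.kubernetes.io/out-of-service"

// HasOutOfServiceTaint (hypothetical helper) reports whether any taint in
// the list carries the out-of-service key, regardless of value or effect.
func HasOutOfServiceTaint(taints []Taint) bool {
	for _, t := range taints {
		if t.Key == OutOfServiceTaintKey {
			return true
		}
	}
	return false
}
```

An operator-side tool could feed this helper the taints read from the shut-down node's spec and combine the result with the state of the `NodeOutOfServiceVolumeDetach` feature gate.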
289
315
290
316
***What are the SLIs (Service Level Indicators) an operator can use to determine
291
317
the health of the service?**
@@ -294,7 +320,9 @@ the health of the service?**
294
320
-[Optional] Aggregation method:
295
321
- Components exposing the metric:
296
322
-[ ] Other (treat as last resort)
297
-
- Details:
323
+
- Details: Check whether the workload moved to a different running node
324
+
after the original node is shut down and the `out-of-service` taint
325
+
is applied on the shutdown node.
298
326
299
327
***What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
300
328
At a high level, this usually will be in the form of "high percentile of SLI
@@ -304,6 +332,8 @@ the health of the service?**
304
332
- 99% percentile over day of absolute value from (job creation time minus expected
305
333
job creation time) for cron job <= 10%
306
334
- 99,9% of /health requests per day finish with 200 code
335
+
The failover should always happen if the feature gate is enabled, the taint
336
+
is applied, and there are other running nodes.
307
337
308
338
***Are there any missing metrics that would be useful to have to improve observability
309
339
of this feature?**
@@ -320,13 +350,17 @@ _This section must be completed when targeting beta graduation to a release._
320
350
optional services that are needed. For example, if this feature depends on
321
351
a cloud provider API, or upon an external software-defined storage or network
322
352
control plane.
353
+
If the workload is managed by a StatefulSet, it depends on the CSI driver.
323
354
324
355
For each of these, fill in the following—thinking about running existing user workloads
325
356
and creating new ones, as well as about cluster-level services (e.g. DNS):
326
-
-[Dependency name]
357
+
-[Dependency name] CSI driver
327
358
- Usage description:
328
-
- Impact of its outage on the feature:
359
+
- Impact of its outage on the feature: If the CSI driver is not running,
360
+
the pod cannot use the persistent volume anymore, so the workload will
361
+
not be running properly.
329
362
- Impact of its degraded performance or high-error rates on the feature:
363
+
Workload does not work properly if the CSI driver is down.
330
364
331
365
### Scalability
332
366
@@ -349,27 +383,32 @@ previous answers based on experience in the field._
349
383
(e.g. update of object X triggers new updates of object Y)
350
384
- periodic API calls to reconcile state (e.g. periodic fetching state,
351
385
heartbeats, leader election, etc.)
386
+
No.
352
387
353
388
***Will enabling / using this feature result in introducing new API types?**
354
389
Describe them, providing:
355
390
- API type
356
391
- Supported number of objects per cluster
357
392
- Supported number of objects per namespace (for namespace-scoped objects)
393
+
No.
358
394
359
395
***Will enabling / using this feature result in any new calls to the cloud
360
396
provider?**
397
+
No.
361
398
362
399
***Will enabling / using this feature result in increasing size or count of
363
400
the existing API objects?**
364
401
Describe them, providing:
365
402
- API type(s):
366
403
- Estimated increase in size: (e.g., new annotation of size 32B)
367
404
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
405
+
No.
368
406
369
407
***Will enabling / using this feature result in increasing time taken by any
370
408
operations covered by [existing SLIs/SLOs]?**
371
409
Think about adding additional work or introducing new steps in between
372
410
(e.g. need to do X to start a container), etc. Please describe the details.
411
+
No.
373
412
374
413
***Will enabling / using this feature result in non-negligible increase of
375
414
resource usage (CPU, RAM, disk, IO, ...) in any components?**
@@ -378,6 +417,7 @@ resource usage (CPU, RAM, disk, IO, ...) in any components?**
378
417
volume), significant amount of data sent and/or received over network, etc.
379
418
Think through this both in small and large cases, again with respect to the
380
419
[supported limits].
420
+
No.
381
421
382
422
### Troubleshooting
383
423
@@ -388,20 +428,40 @@ details). For now, we leave it here.
388
428
_This section must be completed when targeting beta graduation to a release._
389
429
390
430
***How does this feature react if the API server and/or etcd is unavailable?**
431
+
If the API server or etcd is unavailable, we cannot get the accurate status of nodes or pods.
432
+
However, using this feature is a manual process, so an operator can verify the state
433
+
before applying the taint.
391
434
392
435
***What are other known failure modes?**
393
436
For each of them, fill in the following information by copying the below template:
394
437
-[Failure mode brief description]
395
438
- Detection: How can it be detected via metrics? Stated another way:
396
439
how can an operator troubleshoot without logging into a master or worker node?
440
+
After applying the `out-of-service` taint, if the workload does not move
441
+
to a different running node immediately, that is an indicator something
442
+
might be wrong.
397
443
- Mitigations: What can be done to stop the bleeding, especially for already
398
444
running user workloads?
445
+
If the workload does not fail over, it behaves the same as when this
446
+
feature is not enabled. The operator should try to find out why the
447
+
failover didn't happen.
399
448
- Diagnostics: What are the useful log messages and their required logging
400
449
levels that could help debug the issue?
401
450
Not required until feature graduated to beta.
451
+
Set log level to at least 4.
452
+
For example, the Pod GC Controller logs the following message if the feature is
453
+
enabled and the `out-of-service` taint is applied. If the pods are forcefully
454
+
deleted by the GC Controller, this message should show up.
455
+
klog.V(4).Infof("garbage collecting pod %s that is terminating. Phase [%v]", pod.Name, pod.Status.Phase)
456
+
The Attach Detach Controller also logs a message when it checks the taint.
457
+
If the taint is applied and the feature gate is enabled, it force detaches the
458
+
volume without waiting for 6 minutes.
459
+
klog.V(4).Infof("node %q has out-of-service taint", attachedVolume.NodeName)
402
460
- Testing: Are there any tests for failure mode? If not, describe why.
461
+
We have unit tests that cover different combinations of pod and node statuses.
403
462
404
463
***What steps should be taken if SLOs are not being met to determine the problem?**
464
+
In that case, we need to go through the logs and find out the root cause.