keps/sig-storage/2268-non-graceful-shutdown/README.md
43 additions & 11 deletions
@@ -282,12 +282,23 @@ _This section must be completed when targeting beta graduation to a release._
 Try to be as paranoid as possible - e.g., what if some components will restart
 mid-rollout?
 The rollout should not fail. Feature gate only needs to be enabled on `kube-controller-manager`. So it is either enabled or disabled.
+In an HA cluster, assume a 1.N kube-controller-manager has the feature gate enabled
+and takes the lease first, and the GC Controller has already forcefully deleted
+the pods. Then it loses the lease to a 1.N-1 kube-controller-manager which has the feature gate
+disabled. In this case, the Attach Detach Controller won't have the new behavior
+so it will wait for 6 minutes before force detaching the volume.
+If the 1.N kube-controller-manager has the feature gate disabled while the 1.N-1
+kube-controller-manager has the feature gate enabled, the GC Controller will not
+forcefully delete the pods. So the Attach Detach Controller will not be triggered
+to force detach. In the latter case, it will still keep the old behavior.
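
The version-skew cases described in the added paragraph can be written out as a small decision function. This is only an illustrative sketch, not code from kube-controller-manager: `Outcome` and `skewOutcome` are hypothetical names, and the "both enabled" branch (detach without the 6 minute wait) is inferred from the statement that the old behavior waits 6 minutes.

```go
package main

import "fmt"

// Outcome summarizes what happens to a shut-down node that carries the
// out-of-service taint. Hypothetical type, used only for illustration.
type Outcome struct {
	PodsForceDeleted bool   // did the Pod GC Controller force delete the pods?
	Detach           string // how the Attach Detach Controller behaves
}

// skewOutcome models the two-step failover under a lease handover.
// gcGate: feature gate state on the instance holding the lease when Pod GC runs.
// adGate: feature gate state on the instance holding the lease when the
// Attach Detach Controller runs.
func skewOutcome(gcGate, adGate bool) Outcome {
	switch {
	case !gcGate:
		// Pods are not force deleted, so force detach is never triggered.
		return Outcome{PodsForceDeleted: false, Detach: "not triggered, old behavior"}
	case adGate:
		// Assumption: the new behavior skips the 6 minute wait.
		return Outcome{PodsForceDeleted: true, Detach: "force detach without the 6 minute wait"}
	default:
		// Pods were force deleted, but the new leader lacks the new behavior.
		return Outcome{PodsForceDeleted: true, Detach: "force detach after waiting 6 minutes"}
	}
}

func main() {
	cases := []struct{ gc, ad bool }{{true, true}, {true, false}, {false, true}, {false, false}}
	for _, c := range cases {
		fmt.Printf("gc=%v ad=%v -> %+v\n", c.gc, c.ad, skewOutcome(c.gc, c.ad))
	}
}
```

In every branch the worst case is the pre-existing behavior, which is why the rollout is not expected to fail.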

 * **What specific metrics should inform a rollback?**
 If for some reason, the user does not want the workload to failover to a
 different running node after the original node is shut down and the `out-of-service` taint is applied, a rollback can be done. I don't see why it is needed though as
 user can prevent the failover from happening by not applying the `out-of-service`
 taint.
+Since use of this feature requires applying a taint manually by the user,
+it should not specifically require rollback.

 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
 Describe manual testing that was done and the outcomes.
@@ -308,15 +319,19 @@ _This section must be completed when targeting beta graduation to a release._
 Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
 checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
-An operator can check if the `NodeOutOfServiceVolumeDetach` feature gate is
-enabled and if there is an `out-of-service` taint on the shutdown node.
+An operator, the person who operates the cluster, can check if the
+`NodeOutOfServiceVolumeDetach` feature gate is enabled and if there is an
+`out-of-service` taint on the shutdown node.
 The usage of this feature requires the manual step of applying a taint
 so the operator should be the one applying it.
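
The taint check mentioned above could look roughly like the following sketch. It uses a local `Taint` struct instead of the Kubernetes API types so that it stays self-contained. The full key `node.kubernetes.io/out-of-service` and the sample value `nodeshutdown` are assumptions here, since the text only refers to the taint as `out-of-service`; `hasOutOfServiceTaint` is a hypothetical helper.

```go
package main

import "fmt"

// Taint is a simplified stand-in for a node taint (key, value, effect).
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// Assumed full taint key; the KEP text only calls it `out-of-service`.
const outOfServiceTaintKey = "node.kubernetes.io/out-of-service"

// hasOutOfServiceTaint reports whether any taint in the list uses the
// out-of-service key, regardless of its value or effect.
func hasOutOfServiceTaint(taints []Taint) bool {
	for _, t := range taints {
		if t.Key == outOfServiceTaintKey {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical taints on a shut-down node.
	taints := []Taint{
		{Key: "node.kubernetes.io/unreachable", Effect: "NoExecute"},
		{Key: outOfServiceTaintKey, Value: "nodeshutdown", Effect: "NoExecute"},
	}
	fmt.Println("out-of-service taint present:", hasOutOfServiceTaint(taints))
}
```

A check like this only shows that the taint was applied; it says nothing about whether the feature gate is enabled on `kube-controller-manager`.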

 * **What are the SLIs (Service Level Indicators) an operator can use to determine
 the health of the service?**
 - [ ] Metrics
-  - Metric name:
+  - Metric name: We can add new metrics `deleting_pods_total` and `deleting_pods_error_total`
+    in the Pod GC Controller.
+    For the Attach Detach Controller, there's already a metric:
+    `attachdetach_controller_forced_detaches`.
   - [Optional] Aggregation method:
   - Components exposing the metric:
 - [ ] Other (treat as last resort)
@@ -334,6 +349,9 @@ the health of the service?**
 - 99.9% of /health requests per day finish with 200 code
 The failover should always happen if the feature gate is enabled, the taint
 is applied, and there are other running nodes.
+We can also check the `deleting_pods_total` and `deleting_pods_error_total` metrics
+in the Pod GC Controller and the `attachdetach_controller_forced_detaches` metric
+in the Attach Detach Controller.
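
As a rough illustration of how these counters might be read, the sketch below sums samples by metric name from Prometheus text exposition output and derives a pod deletion error ratio. The metric names come from the text above; the sample values, `sumMetric`, and `errorRatio` are hypothetical. The parser is deliberately simplified and assumes label values contain no spaces and samples carry no timestamps.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sampleExposition is made-up scrape output, used only for illustration.
const sampleExposition = `# TYPE deleting_pods_total counter
deleting_pods_total 12
deleting_pods_error_total 3
attachdetach_controller_forced_detaches 2
`

// sumMetric adds up every sample whose metric name equals name.
// Simplification: label values must not contain spaces, no timestamps.
func sumMetric(exposition, name string) float64 {
	var total float64
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		metric := fields[0]
		if i := strings.IndexByte(metric, '{'); i >= 0 {
			metric = metric[:i] // drop the label set
		}
		if metric != name {
			continue
		}
		if v, err := strconv.ParseFloat(fields[1], 64); err == nil {
			total += v
		}
	}
	return total
}

// errorRatio returns failed pod deletions divided by all pod deletions,
// or 0 when no deletions were recorded.
func errorRatio(exposition string) float64 {
	all := sumMetric(exposition, "deleting_pods_total")
	if all == 0 {
		return 0
	}
	return sumMetric(exposition, "deleting_pods_error_total") / all
}

func main() {
	fmt.Println("pod deletion error ratio:", errorRatio(sampleExposition))
	fmt.Println("forced detaches:", sumMetric(sampleExposition, "attachdetach_controller_forced_detaches"))
}
```

A rising error ratio alongside a flat forced-detach count would suggest that failover is stalling at the pod deletion step rather than at volume detach.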

 * **Are there any missing metrics that would be useful to have to improve observability
 of this feature?**
@@ -350,7 +368,8 @@ _This section must be completed when targeting beta graduation to a release._
 optional services that are needed. For example, if this feature depends on
 a cloud provider API, or upon an external software-defined storage or network
 control plane.
-If the workload is running on a StatefulSet, it depends on the CSI driver.
+This feature relies on the kube-controller-manager running. If the
+workload is running on a StatefulSet, it also depends on the CSI driver.

 For each of these, fill in the following, thinking about running existing user workloads
 and creating new ones, as well as about cluster-level services (e.g. DNS):
@@ -379,11 +398,14 @@ previous answers based on experience in the field._