Skip to content

Commit f982ff9

Browse files
committed
Address review comments
1 parent d63205d commit f982ff9

File tree

1 file changed

+43
-11
lines changed
  • keps/sig-storage/2268-non-graceful-shutdown

1 file changed

+43
-11
lines changed

keps/sig-storage/2268-non-graceful-shutdown/README.md

Lines changed: 43 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -282,12 +282,23 @@ _This section must be completed when targeting beta graduation to a release._
282282
Try to be as paranoid as possible - e.g., what if some components will restart
283283
mid-rollout?
284284
The rollout should not fail. Feature gate only needs to be enabled on `kube-controller-manager`. So it is either enabled or disabled.
285+
In an HA cluster, assume a 1.N kube-controller-manager has feature gate enabled
286+
and takes the lease first, and the GC Controller has already forcefully deleted
287+
the pods. Then it loses it to a 1.N-1 kube-controller-manager which as feature gate
288+
disabled. In this case, the Attach Detach Controller won't have the new behavior
289+
so it will wait for 6 minutes before force detach the volume.
290+
If the 1.N kube-controller-manager has feature gate disabled while the 1.N-1
291+
kube-controller-manager has feature gate enabled, the GC Controller will not
292+
forcefully delete the pods. So the Attach Detach Controller will not be triggered
293+
to force detach. In the later case, it will still keep the old behavior.
285294

286295
* **What specific metrics should inform a rollback?**
287296
If for some reason, the user does not want the workload to failover to a
288297
different running node after the original node is shutdown and the `out-of-service` taint is applied, a rollback can be done. I don't see why it is needed though as
289298
user can prevent the failover from happening by not applying the `out-of-service`
290299
taint.
300+
Since use of this feature requires applying a taint manually by the user,
301+
it should not specifically require rollback.
291302

292303
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
293304
Describe manual testing that was done and the outcomes.
@@ -308,15 +319,19 @@ _This section must be completed when targeting beta graduation to a release._
308319
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
309320
checking if there are objects with field X set) may be a last resort. Avoid
310321
logs or events for this purpose.
311-
An operator can check if the `NodeOutOfServiceVolumeDetach` feature gate is
312-
enabled and if there is an `out-of-service` taint on the shutdown node.
322+
An operator, the person who operates the cluster, can check if the
323+
`NodeOutOfServiceVolumeDetach` feature gate is enabled and if there is an
324+
`out-of-service` taint on the shutdown node.
313325
The usage of this feature requires the manual step of applying a taint
314326
so the operator should be the one applying it.
315327

316328
* **What are the SLIs (Service Level Indicators) an operator can use to determine
317329
the health of the service?**
318330
- [ ] Metrics
319-
- Metric name:
331+
- Metric name: We can add new metrics deleting_pods_total, deleting_pods_error_total
332+
in Pod GC Controller.
333+
For Attach Detach Controller, there's already a metric:
334+
attachdetach_controller_forced_detaches.
320335
- [Optional] Aggregation method:
321336
- Components exposing the metric:
322337
- [ ] Other (treat as last resort)
@@ -334,6 +349,9 @@ the health of the service?**
334349
- 99,9% of /health requests per day finish with 200 code
335350
The failover should always happen if the feature gate is enabled, the taint
336351
is applied, and there are other running nodes.
352+
We can also check the deleting_pods_total, deleting_pods_error_total metrics
353+
in Pod GC Controller and the attachdetach_controller_forced_detaches metric
354+
in the Attach Detach Controller.
337355

338356
* **Are there any missing metrics that would be useful to have to improve observability
339357
of this feature?**
@@ -350,7 +368,8 @@ _This section must be completed when targeting beta graduation to a release._
350368
optional services that are needed. For example, if this feature depends on
351369
a cloud provider API, or upon an external software-defined storage or network
352370
control plane.
353-
If the workload is running on a StatefulSet, it depends on the CSI driver.
371+
This feature relies on the kube-controller-manager being running. If the
372+
workload is running on a StatefulSet, it also depends on the CSI driver.
354373

355374
For each of these, fill in the following—thinking about running existing user workloads
356375
and creating new ones, as well as about cluster-level services (e.g. DNS):
@@ -379,11 +398,14 @@ previous answers based on experience in the field._
379398
- originating component(s) (e.g. Kubelet, Feature-X-controller)
380399
focusing mostly on:
381400
- components listing and/or watching resources they didn't before
401+
Nothing new.
382402
- API calls that may be triggered by changes of some Kubernetes resources
383403
(e.g. update of object X triggers new updates of object Y)
404+
This feature will trigger pods deletion, volume detachment, new pods
405+
creation, and volume attachment calls if the new taint is applied.
384406
- periodic API calls to reconcile state (e.g. periodic fetching state,
385407
heartbeats, leader election, etc.)
386-
No.
408+
Nothing new.
387409

388410
* **Will enabling / using this feature result in introducing new API types?**
389411
Describe them, providing:
@@ -438,30 +460,40 @@ _This section must be completed when targeting beta graduation to a release._
438460
- Detection: How can it be detected via metrics? Stated another way:
439461
how can an operator troubleshoot without logging into a master or worker node?
440462
After applying the `out-of-service` taint, if the workload does not move
441-
to a different running node immediately, that is an indicator something
442-
might be wrong.
463+
to a different running node, that is an indicator something might be wrong.
464+
Note that since we removed the 6 minute wait in the Attach Detach controller,
465+
force detach will happen if the feature gate is enabled and the taint is
466+
applied without delay.
443467
- Mitigations: What can be done to stop the bleeding, especially for already
444468
running user workloads?
445469
So if the workload does not failover, it behaves the same as when this
446470
feature is not enabled. The operator should try to find out why the
447-
failover didn't happen.
471+
failover didn't happen. Check messages from GC Controller and Attach Detach
472+
Controller in the kube-controller-manager logs.
473+
Check if there are following messages from GC Controller:
474+
"failed to get node <node name>: <error>"
475+
"Garbage collecting <number> pods that are terminating on node tainted with node.kubernetes.io/out-of-service".
476+
Check if there is the following message from Attach Detach Controller:
477+
"failed to get taint specs for node <node name>: <error>"
448478
- Diagnostics: What are the useful log messages and their required logging
449479
levels that could help debug the issue?
450480
Not required until feature graduated to beta.
451-
Set log level to at least 4.
481+
Set log level to at least 2.
452482
For example, the following message is in GC Controller if the feature is
453483
enabled and `out-of-service` taint is applied. If the pods are forcefully
454484
deleted by the GC Controller, this message should show up.
455-
klog.V(4).Infof("garbage collecting pod %s that is terminating. Phase [%v]", pod.Name, pod.Status.Phase)
485+
klog.V(2).Infof("garbage collecting pod %s that is terminating. Phase [%v]", pod.Name, pod.Status.Phase)
456486
There is also a message in Attach Detach Controller that checks the taint.
457487
If the taint is applied and feature gate is enabled, it force detaches the
458488
volume without waiting for 6 minutes.
459-
klog.V(4).Infof("node %q has out-of-service taint", attachedVolume.NodeName)
489+
klog.V(2).Infof("node %q has out-of-service taint", attachedVolume.NodeName)
460490
- Testing: Are there any tests for failure mode? If not, describe why.
461491
We have unit tests that cover different combination of pod and node statuses.
462492

463493
* **What steps should be taken if SLOs are not being met to determine the problem?**
464494
In that case, we need to go through the logs and find out the root cause.
495+
Check messages from GC Controller and Attach Detach Controller in the
496+
kube-controller-manager logs as described in the above section.
465497

466498
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
467499
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

0 commit comments

Comments
 (0)