Update non-graceful node shutdown to beta

xing-yang · xing-yang · commit d63205d44166 · 2022-10-03T11:05:28.000-04:00
diff --git a/keps/prod-readiness/sig-storage/2268.yaml b/keps/prod-readiness/sig-storage/2268.yaml
@@ -1,3 +1,5 @@
 kep-number: 2268 
 alpha:
+  approver: "@deads2k"
+beta:
   approver: "@deads2k" 
diff --git a/keps/sig-storage/2268-non-graceful-shutdown/README.md b/keps/sig-storage/2268-non-graceful-shutdown/README.md
@@ -16,8 +16,10 @@ This includes the Summary and Motivation sections.
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
   - [Test Plan](#test-plan)
-  - [Unit tests](#unit-tests)
-  - [E2E tests](#e2e-tests)
+    - [Prerequisite testing updates](#prerequisite-testing-updates)
+    - [Unit tests](#unit-tests)
+    - [Integration tests](#integration-tests)
+    - [E2E tests](#e2e-tests)
   - [Graduation Criteria](#graduation-criteria)
     - [Alpha](#alpha)
     - [Alpha -&gt; Beta Graduation](#alpha---beta-graduation)
@@ -144,15 +146,28 @@ To mitigate this we plan to have a high test coverage and to introduce this enha
 
 ### Test Plan
 
-### Unit tests
+#### Prerequisite testing updates
+
+There are existing tests for Pod GC controller: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/node/pod_gc.go
+
+There are existing tests for attach detach controller. Creating a pod that uses PVCs which will test attach: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/pod/create.go
+Deleting a pod will trigger PVC to be detached: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/pod/delete.go
+
+#### Unit tests
 * Add unit tests to affected components in kube-controller-manager:
   * Add tests in Pod GC Controller for the new logic to clean up pods and the `out-of-service` taint.
   * Add tests in Attachdetach Controller for the changed logic that allow volumes to be forcefully detached without wait.
 
-### E2E tests
+#### Integration tests
+
+Add a test for forcefully terminating pods in https://github.com/kubernetes/kubernetes/blob/master/test/integration/garbagecollector/garbage_collector_test.go.
+
+Add a test for force detach without waiting for 6 minutes in https://github.com/kubernetes/kubernetes/blob/master/test/integration/volume/attach_detach_test.go
+
+#### E2E tests
 *   New E2E tests to validate workloads move successfully to another running node when a node is shutdown.
-    * Feature gate for `NonGracefulFailover` is disabled, feature is not active.
-    * Feature gate for `NonGracefulFailover` is enabled. Add `out-of-service` taint after node is shutdown:
+    * Feature gate for `NodeOutOfServiceVolumeDetach` is disabled, feature is not active.
+    * Feature gate for `NodeOutOfServiceVolumeDetach` is enabled. Add `out-of-service` taint after node is shutdown:
       * Verify workloads are moved to another node successfully.
       * Verify the `out-of-service` taint is removed after the shutdown node is cleaned up.
 * Add stress and scale tests before moving from beta to GA.
@@ -217,7 +232,7 @@ _This section must be completed when targeting alpha to a release._
 
 * **How can this feature be enabled / disabled in a live cluster?**
   - [ ] Feature gate (also fill in values in `kep.yaml`)
-    - Feature gate name: NonGracefulFailover
+    - Feature gate name: NodeOutOfServiceVolumeDetach
     - Components depending on the feature gate: kube-controller-manager
   - [ ] Other
     - Describe the mechanism:
@@ -266,17 +281,24 @@ _This section must be completed when targeting beta graduation to a release._
 * **How can a rollout fail? Can it impact already running workloads?**
   Try to be as paranoid as possible - e.g., what if some components will restart
    mid-rollout?
+  The rollout should not fail. Feature gate only needs to be enabled on `kube-controller-manager`. So it is either enabled or disabled.
 
 * **What specific metrics should inform a rollback?**
+  If for some reason, the user does not want the workload to failover to a
+  different running node after the original node is shutdown and the `out-of-service`  taint is applied, a rollback can be done. I don't see why it is needed though as
+  user can prevent the failover from happening by not applying the `out-of-service`
+  taint.
 
 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
   Describe manual testing that was done and the outcomes.
   Longer term, we may want to require automated upgrade/rollback tests, but we
   are missing a bunch of machinery and tooling and can't do that now.
+  We will manually test upgrade from 1.24 to 1.25 and rollback from 1.25 to 1.24.
 
 * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
 fields of API types, flags, etc.?**
   Even if applying deprecation policies, they may still surprise some users.
+  No.
 
 ### Monitoring Requirements
 
@@ -286,6 +308,10 @@ _This section must be completed when targeting beta graduation to a release._
   Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
   checking if there are objects with field X set) may be a last resort. Avoid
   logs or events for this purpose.
+  An operator can check if the `NodeOutOfServiceVolumeDetach` feature gate is
+  enabled and if there is an `out-of-service` taint on the shutdown node.
+  The usage of this feature requires the manual step of applying a taint
+  so the operator should be the one applying it.
 
 * **What are the SLIs (Service Level Indicators) an operator can use to determine
 the health of the service?**
@@ -294,7 +320,9 @@ the health of the service?**
     - [Optional] Aggregation method:
     - Components exposing the metric:
   - [ ] Other (treat as last resort)
-    - Details:
+    - Details: Check whether the workload moved to a different running node
+      after the original node is shutdown and the `out-of-service` taint
+      is applied on the shutdown node.
 
 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
   At a high level, this usually will be in the form of "high percentile of SLI
@@ -304,6 +332,8 @@ the health of the service?**
   - 99% percentile over day of absolute value from (job creation time minus expected
     job creation time) for cron job <= 10%
   - 99,9% of /health requests per day finish with 200 code
+  The failover should always happen if the feature gate is enabled, the taint
+  is applied, and there are other running nodes.
 
 * **Are there any missing metrics that would be useful to have to improve observability
 of this feature?**
@@ -320,13 +350,17 @@ _This section must be completed when targeting beta graduation to a release._
   optional services that are needed. For example, if this feature depends on
   a cloud provider API, or upon an external software-defined storage or network
   control plane.
+  If the workload is running on a StatefulSet, it depends on the CSI driver.
 
   For each of these, fill in the following—thinking about running existing user workloads
   and creating new ones, as well as about cluster-level services (e.g. DNS):
-  - [Dependency name]
+  - [Dependency name] CSI driver
     - Usage description:
-      - Impact of its outage on the feature:
+      - Impact of its outage on the feature: If the CSI driver is not running,
+        the pod cannot use the persistent volume any more so the workload will
+        not be running properly.
       - Impact of its degraded performance or high-error rates on the feature:
+        Workload does not work properly if the CSI driver is down.
 
 ### Scalability
 
@@ -349,27 +383,32 @@ previous answers based on experience in the field._
     (e.g. update of object X triggers new updates of object Y)
   - periodic API calls to reconcile state (e.g. periodic fetching state,
     heartbeats, leader election, etc.)
+  No.
 
 * **Will enabling / using this feature result in introducing new API types?**
   Describe them, providing:
   - API type
   - Supported number of objects per cluster
   - Supported number of objects per namespace (for namespace-scoped objects)
+  No.
 
 * **Will enabling / using this feature result in any new calls to the cloud
 provider?**
+  No.
 
 * **Will enabling / using this feature result in increasing size or count of
 the existing API objects?**
   Describe them, providing:
   - API type(s):
   - Estimated increase in size: (e.g., new annotation of size 32B)
   - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+  No.
 
 * **Will enabling / using this feature result in increasing time taken by any
 operations covered by [existing SLIs/SLOs]?**
   Think about adding additional work or introducing new steps in between
   (e.g. need to do X to start a container), etc. Please describe the details.
+  No.
 
 * **Will enabling / using this feature result in non-negligible increase of
 resource usage (CPU, RAM, disk, IO, ...) in any components?**
@@ -378,6 +417,7 @@ resource usage (CPU, RAM, disk, IO, ...) in any components?**
   volume), significant amount of data sent and/or received over network, etc.
   This through this both in small and large cases, again with respect to the
   [supported limits].
+  No.
 
 ### Troubleshooting
 
@@ -388,20 +428,40 @@ details). For now, we leave it here.
 _This section must be completed when targeting beta graduation to a release._
 
 * **How does this feature react if the API server and/or etcd is unavailable?**
+  If API server or etcd is not available, we can't get accurate status of node or pod.
+  However the usage of this feature is very manual so an operator can verify
+  before applying the taint.
 
 * **What are other known failure modes?**
   For each of them, fill in the following information by copying the below template:
   - [Failure mode brief description]
     - Detection: How can it be detected via metrics? Stated another way:
       how can an operator troubleshoot without logging into a master or worker node?
+      After applying the `out-of-service` taint, if the workload does not move
+      to a different running node immediately, that is an indicator something
+      might be wrong.
     - Mitigations: What can be done to stop the bleeding, especially for already
       running user workloads?
+      So if the workload does not failover, it behaves the same as when this
+      feature is not enabled. The operator should try to find out why the
+      failover didn't happen.
     - Diagnostics: What are the useful log messages and their required logging
       levels that could help debug the issue?
       Not required until feature graduated to beta.
+      Set log level to at least 4.
+      For example, the following message is in GC Controller if the feature is
+      enabled and `out-of-service` taint is applied. If the pods are forcefully
+      deleted by the GC Controller, this message should show up.
+      klog.V(4).Infof("garbage collecting pod %s that is terminating. Phase [%v]", pod.Name, pod.Status.Phase)
+      There is also a message in Attach Detach Controller that checks the taint.
+      If the taint is applied and feature gate is enabled, it force detaches the
+      volume without waiting for 6 minutes.
+      klog.V(4).Infof("node %q has out-of-service taint", attachedVolume.NodeName)
     - Testing: Are there any tests for failure mode? If not, describe why.
+      We have unit tests that cover different combination of pod and node statuses.
 
 * **What steps should be taken if SLOs are not being met to determine the problem?**
+  In that case, we need to go through the logs and find out the root cause.
 
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
diff --git a/keps/sig-storage/2268-non-graceful-shutdown/kep.yaml b/keps/sig-storage/2268-non-graceful-shutdown/kep.yaml
@@ -20,12 +20,12 @@ see-also:
 replaces:
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta 
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.24"
+latest-milestone: "v1.25"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone: