|
31 | 31 | - [E2E tests](#e2e-tests)
|
32 | 32 | - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
|
33 | 33 | - [Feature enablement and rollback](#feature-enablement-and-rollback)
|
| 34 | + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) |
| 35 | + - [Monitoring Requirements](#monitoring-requirements) |
| 36 | + - [Dependencies](#dependencies) |
| 37 | + - [Scalability](#scalability) |
| 38 | + - [Troubleshooting](#troubleshooting) |
34 | 39 | - [Implementation History](#implementation-history)
|
35 | 40 | <!-- /toc -->
|
36 | 41 |
|
@@ -674,8 +679,200 @@ _This section must be completed when targeting alpha to a release._
|
674 | 679 | disable this feature is to install or uninstall the sidecars, we cannot write
|
675 | 680 | tests for feature enablement/disablement.
|
676 | 681 |
|
| 682 | +### Rollout, Upgrade and Rollback Planning |
| 683 | + |
| 684 | +_This section must be completed when targeting beta graduation to a release._ |
| 685 | + |
| 686 | +* **How can a rollout fail? Can it impact already running workloads?** |
| 687 | + Try to be as paranoid as possible - e.g., what if some components will restart |
| 688 | + mid-rollout? |
| 689 | + This feature does not have a feature gate. It is enabled when the health |
| 690 | + monitoring controller and agent sidecars are deployed with the CSI driver, |
| 691 | + so the only way for a rollout to fail is for the deployment of the health |
| 692 | + monitoring controller or agent sidecars alongside the CSI driver to fail. If |
| 693 | + the health monitoring controller cannot be deployed, no volume condition |
| 694 | + events will be reported on PVCs. If the health monitoring agent cannot |
| 695 | + be deployed, no volume condition events will be reported on Pods. |
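
  As an illustration only (not part of the KEP or the sidecar code), below is a minimal client-go sketch of how an operator or an e2e check could verify after a rollout that volume condition events are actually being reported on PVCs. The namespace and the expected event reason string are assumptions and may differ per driver and sidecar version.

  ```go
  // Illustrative only: verify after a rollout that volume condition events are
  // being reported on PVCs. The namespace and the expected event reason
  // (e.g. "VolumeConditionAbnormal") are assumptions, not guaranteed names.
  package main

  import (
      "context"
      "fmt"

      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/client-go/kubernetes"
      "k8s.io/client-go/tools/clientcmd"
  )

  func main() {
      cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
      if err != nil {
          panic(err)
      }
      cs := kubernetes.NewForConfigOrDie(cfg)

      // List events whose involved object is a PersistentVolumeClaim; the health
      // monitoring controller reports volume conditions as such events.
      events, err := cs.CoreV1().Events("default").List(context.TODO(), metav1.ListOptions{
          FieldSelector: "involvedObject.kind=PersistentVolumeClaim",
      })
      if err != nil {
          panic(err)
      }
      for _, e := range events.Items {
          fmt.Printf("%s/%s: reason=%s message=%q\n",
              e.InvolvedObject.Kind, e.InvolvedObject.Name, e.Reason, e.Message)
      }
  }
  ```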
| 696 | + |
| 697 | +* **What specific metrics should inform a rollback?** |
| 698 | + Currently, an event is recorded on the PVC/Pod when the controller/agent successfully retrieves an abnormal volume condition from the storage system. However, when other errors occur in the controller/agent, they are only logged and not recorded as events. Before moving to beta, the controller/agent should be modified to also record an event when such errors occur (a sketch of this change is below). In the meantime, a rising rate of non-`OK` gRPC status codes in the `csi_sidecar_operations_seconds` metric can inform a rollback. |
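
  A minimal sketch of the proposed change, assuming the sidecars use client-go's `record.EventRecorder`; the `recorder` wiring, the reason string, and the function name are illustrative, not the sidecars' actual code.

  ```go
  // Hypothetical sketch of the proposed beta change: surface controller/agent
  // errors as Warning events instead of only logging them. "recorder", the
  // reason string, and this function are illustrative, not the sidecars' code.
  package monitor

  import (
      v1 "k8s.io/api/core/v1"
      "k8s.io/client-go/tools/record"
  )

  func reportCheckFailure(recorder record.EventRecorder, pvc *v1.PersistentVolumeClaim, err error) {
      // Today such errors are only logged; recording them as events gives
      // operators a signal they can watch when deciding whether to roll back.
      recorder.Eventf(pvc, v1.EventTypeWarning, "VolumeConditionCheckFailed",
          "failed to get volume condition from the storage system: %v", err)
  }
  ```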
| 699 | + |
| 700 | +* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?** |
| 701 | + Describe manual testing that was done and the outcomes. |
| 702 | + Longer term, we may want to require automated upgrade/rollback tests, but we |
| 703 | + are missing a bunch of machinery and tooling and can't do that now. |
| 704 | + Manual upgrade and rollback testing will be done. |
| 705 | + |
| 706 | +* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, |
| 707 | +fields of API types, flags, etc.?** |
| 708 | + Even if applying deprecation policies, they may still surprise some users. |
| 709 | + No. |
| 710 | + |
| 711 | +### Monitoring Requirements |
| 712 | + |
| 713 | +_This section must be completed when targeting beta graduation to a release._ |
| 714 | + |
| 715 | +* **How can an operator determine if the feature is in use by workloads?** |
| 716 | + Ideally, this should be a metric. Operations against the Kubernetes API (e.g., |
| 717 | + checking if there are objects with field X set) may be a last resort. Avoid |
| 718 | + logs or events for this purpose. |
| 719 | + An operator can check the metric `csi_sidecar_operations_seconds`, which |
| 720 | + reports the duration of Container Storage Interface operations together with |
| 721 | + their gRPC status codes. It is reported by the CSI external-health-monitor-controller |
| 722 | + and external-health-monitor-agent sidecars. For the health monitor controller |
| 723 | + sidecar, `csi_sidecar_operations_seconds` measures the `ListVolumes` or |
| 724 | + `GetVolume` RPC. For the health monitor agent sidecar, |
| 725 | + `csi_sidecar_operations_seconds` measures the `NodeGetVolumeStats` RPC. |
| 726 | + After aggregation, the `csi_sidecar_operations_seconds` metric should be |
| 727 | + sliced by process to show metrics for the different sidecars. |
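
  As a rough illustration, the sketch below scrapes a sidecar's metrics endpoint and checks that `csi_sidecar_operations_seconds` is exported; the endpoint address is an assumption that depends on how metrics are exposed in a given deployment.

  ```go
  // Illustrative check that a health-monitor sidecar exports
  // csi_sidecar_operations_seconds. The metrics address is an assumption and
  // depends on how the sidecar's metrics endpoint is exposed in a deployment.
  package main

  import (
      "fmt"
      "net/http"

      "github.com/prometheus/common/expfmt"
  )

  func main() {
      resp, err := http.Get("http://127.0.0.1:8080/metrics") // hypothetical endpoint
      if err != nil {
          panic(err)
      }
      defer resp.Body.Close()

      var parser expfmt.TextParser
      families, err := parser.TextToMetricFamilies(resp.Body)
      if err != nil {
          panic(err)
      }
      if mf, ok := families["csi_sidecar_operations_seconds"]; ok {
          // Each series carries labels such as the gRPC method and status code,
          // which is what allows slicing the metric per sidecar and per RPC.
          fmt.Printf("found %d series for csi_sidecar_operations_seconds\n", len(mf.GetMetric()))
      } else {
          fmt.Println("metric not found: the health-monitor sidecars may not be deployed")
      }
  }
  ```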
| 728 | + |
| 729 | +* **What are the SLIs (Service Level Indicators) an operator can use to determine |
| 730 | +the health of the service?** |
| 731 | + - [ ] Metrics |
| 732 | + - Metric name: csi_sidecar_operations_seconds |
| 733 | + - [Optional] Aggregation method: |
| 734 | + - Components exposing the metric: csi-external-health-monitor-controller and csi-external-health-monitor-agent sidecars |
| 735 | + - [ ] Other (treat as last resort) |
| 736 | + - Details: |
| 737 | + |
| 738 | +* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** |
| 739 | + At a high level, this usually will be in the form of "high percentile of SLI |
| 740 | + per day <= X". It's impossible to provide comprehensive guidance, but at the very |
| 741 | + high level (needs more precise definitions) those may be things like: |
| 742 | + - per-day percentage of API calls finishing with 5XX errors <= 1% |
| 743 | + - 99th percentile over day of absolute value from (job creation time minus expected |
| 744 | + job creation time) for cron job <= 10% |
| 745 | + - 99.9% of /health requests per day finish with 200 code |
| 746 | + |
| 747 | + The metric `csi_sidecar_operations_seconds` includes a gRPC status code label. If |
| 748 | + the status code is `OK`, the call is successful; otherwise, it is not. We can |
| 749 | + look at the ratio of successful to non-successful status codes to determine |
| 750 | + the success/failure ratio (see the sketch below). |
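
  For example, a hedged sketch of how such a ratio could be computed with the Prometheus Go client; the Prometheus address and the exact label name (`grpc_status_code`) are assumptions about the monitoring setup and the metric's labels.

  ```go
  // Illustrative SLI computation: per-day ratio of successful health-monitor RPCs.
  // Assumes a Prometheus server scrapes the sidecars and that the metric's count
  // series carries a grpc_status_code label; both are deployment assumptions.
  package main

  import (
      "context"
      "fmt"
      "time"

      "github.com/prometheus/client_golang/api"
      promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
  )

  func main() {
      client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"}) // hypothetical
      if err != nil {
          panic(err)
      }
      promAPI := promv1.NewAPI(client)

      // Successful calls (grpc_status_code="OK") divided by all calls over one day.
      query := `sum(rate(csi_sidecar_operations_seconds_count{grpc_status_code="OK"}[1d]))
                / sum(rate(csi_sidecar_operations_seconds_count[1d]))`

      ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
      defer cancel()
      result, warnings, err := promAPI.Query(ctx, query, time.Now())
      if err != nil {
          panic(err)
      }
      if len(warnings) > 0 {
          fmt.Println("warnings:", warnings)
      }
      fmt.Println("success ratio:", result)
  }
  ```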
| 751 | + |
| 752 | +* **Are there any missing metrics that would be useful to have to improve observability |
| 753 | +of this feature?** |
| 754 | + Describe the metrics themselves and the reasons why they weren't added (e.g., cost, |
| 755 | + implementation difficulties, etc.). |
| 756 | + |
| 757 | +### Dependencies |
| 758 | + |
| 759 | +_This section must be completed when targeting beta graduation to a release._ |
| 760 | + |
| 761 | +* **Does this feature depend on any specific services running in the cluster?** |
| 762 | + Think about both cluster-level services (e.g. metrics-server) as well |
| 763 | + as node-level agents (e.g. specific version of CRI). Focus on external or |
| 764 | + optional services that are needed. For example, if this feature depends on |
| 765 | + a cloud provider API, or upon an external software-defined storage or network |
| 766 | + control plane. |
| 767 | + |
| 768 | + For each of these, fill in the following—thinking about running existing user workloads |
| 769 | + and creating new ones, as well as about cluster-level services (e.g. DNS): |
| 770 | + - [Dependency name]: installation of csi-external-health-monitor-controller and csi-external-health-monitor-agent sidecars |
| 771 | + - Usage description: |
| 772 | + - Impact of its outage on the feature: Installation of the csi-external-health-monitor-controller and csi-external-health-monitor-agent sidecars is required for the feature to work. If csi-external-health-monitor-controller is not installed, abnormal volume conditions will not be reported as events on PVCs. Similarly, if csi-external-health-monitor-agent is not installed, abnormal volume conditions will not be reported as events on Pods. |
| 773 | + Note that the CSI driver needs to be updated to implement the volume health RPCs in its controller/node plugins. Kubernetes v1.13 is the minimum version for CSI drivers to work at all (https://kubernetes-csi.github.io/docs/introduction.html#kubernetes-releases); however, different CSI drivers have different requirements on supported Kubernetes versions, so users should check the documentation of their CSI driver. If the CSI node plugin on one node has been upgraded to support volume health while the other nodes have not been upgraded, volume health events are only expected on pods running on that one upgraded node. |
| 774 | + - Impact of its degraded performance or high error rates on the feature: If abnormal volume conditions are reported with degraded performance or high error rates, that would affect how soon or how accurately users could manually react to these conditions. |
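
  To make the driver-side dependency concrete, below is a minimal, hypothetical sketch of a CSI node plugin returning a `VolumeCondition` from `NodeGetVolumeStats`. The struct and field names come from the CSI Go bindings (and require a CSI spec version that includes volume health), while `nodeServer` and `checkMountHealthy` are made up for illustration.

  ```go
  // Hypothetical sketch of the driver-side requirement: a CSI node plugin that
  // supports volume health fills in VolumeCondition in NodeGetVolumeStats.
  // "nodeServer" and "checkMountHealthy" are made up; real drivers differ, and
  // the driver must be built against a CSI spec version that has VolumeCondition.
  package driver

  import (
      "context"

      "github.com/container-storage-interface/spec/lib/go/csi"
  )

  // nodeServer sketches only the volume-health part; a real node plugin
  // implements the rest of csi.NodeServer as well.
  type nodeServer struct{}

  // checkMountHealthy stands in for the driver's own health probe, e.g. a statfs
  // on the published path or a call to the storage backend.
  func checkMountHealthy(volumePath string) (abnormal bool, message string) {
      return false, "volume is healthy"
  }

  func (ns *nodeServer) NodeGetVolumeStats(ctx context.Context, req *csi.NodeGetVolumeStatsRequest) (*csi.NodeGetVolumeStatsResponse, error) {
      abnormal, msg := checkMountHealthy(req.GetVolumePath())
      return &csi.NodeGetVolumeStatsResponse{
          // Usage stats omitted for brevity; a real driver reports them too.
          VolumeCondition: &csi.VolumeCondition{
              Abnormal: abnormal,
              Message:  msg,
          },
      }, nil
  }
  ```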
| 775 | + |
| 776 | + |
| 777 | +### Scalability |
| 778 | + |
| 779 | +_For alpha, this section is encouraged: reviewers should consider these questions |
| 780 | +and attempt to answer them._ |
| 781 | + |
| 782 | +_For beta, this section is required: reviewers must answer these questions._ |
| 783 | + |
| 784 | +_For GA, this section is required: approvers should be able to confirm the |
| 785 | +previous answers based on experience in the field._ |
| 786 | + |
| 787 | +* **Will enabling / using this feature result in any new API calls?** |
| 788 | + Describe them, providing: |
| 789 | + - API call type (e.g. PATCH pods): Only events will be created for PVCs or Pods if this feature is enabled. |
| 790 | + - estimated throughput |
| 791 | + - originating component(s) (e.g. Kubelet, Feature-X-controller) |
| 792 | + focusing mostly on: |
| 793 | + - components listing and/or watching resources they didn't before |
| 794 | + The csi-external-health-monitor-controller and csi-external-health-monitor-agent sidecars. |
| 795 | + There is a monitor interval for the controller and one for the agent that controls how often |
| 796 | + volume health is checked. It is configurable, with 1 minute as the default. We will consider |
| 797 | + changing the default to 5 minutes to avoid overloading the K8s API server. |
| 798 | + When scaled out across many nodes, low-frequency checks can still produce a high volume of |
| 799 | + events. To control this, we should use options on the event recorder to limit the QPS per key. |
| 800 | + This way we can collapse keys and have a slow update cadence per key (see the sketch after this list). |
| 801 | + - API calls that may be triggered by changes of some Kubernetes resources |
| 802 | + (e.g. update of object X triggers new updates of object Y) |
| 803 | + - periodic API calls to reconcile state (e.g. periodic fetching state, |
| 804 | + heartbeats, leader election, etc.) |
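
  A minimal sketch of the per-key rate limiting mentioned above, using client-go's event correlator options; the burst/QPS values are illustrative assumptions, not settings the sidecars ship with today.

  ```go
  // Illustrative sketch of per-key event rate limiting with client-go's event
  // correlator. The burst/QPS numbers are assumptions, not the sidecars' current
  // defaults; they throttle how often an event for the same key is emitted.
  package monitor

  import (
      v1 "k8s.io/api/core/v1"
      "k8s.io/client-go/kubernetes"
      "k8s.io/client-go/kubernetes/scheme"
      typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
      "k8s.io/client-go/tools/record"
  )

  // newRateLimitedRecorder returns an event recorder whose correlator limits the
  // event rate per aggregation key, so a flapping volume cannot flood the API server.
  func newRateLimitedRecorder(cs kubernetes.Interface, component string) record.EventRecorder {
      broadcaster := record.NewBroadcasterWithCorrelatorOptions(record.CorrelatorOptions{
          BurstSize: 1,           // allow one event per key immediately...
          QPS:       1.0 / 300.0, // ...then roughly one event per key every 5 minutes
      })
      broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
          Interface: cs.CoreV1().Events(""),
      })
      return broadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: component})
  }
  ```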
| 805 | + |
| 806 | +* **Will enabling / using this feature result in introducing new API types?** |
| 807 | + Describe them, providing: |
| 808 | + - API type: None |
| 809 | + - Supported number of objects per cluster: N/A |
| 810 | + - Supported number of objects per namespace (for namespace-scoped objects): N/A |
| 811 | + |
| 812 | +* **Will enabling / using this feature result in any new calls to the cloud |
| 813 | +provider?** |
| 814 | + No. |
| 815 | + |
| 816 | +* **Will enabling / using this feature result in increasing size or count of |
| 817 | +the existing API objects?** |
| 818 | + Describe them, providing: |
| 819 | + - API type(s): Events |
| 820 | + - Estimated increase in size: (e.g., new annotation of size 32B): |
| 821 | + No increase in the size of existing objects. |
| 822 | + - Estimated amount of new objects: (e.g., new Object X for every existing Pod) |
| 823 | + The controller reports events on PVCs while the agent reports events on Pods. They work independently of each other. It is recommended that the CSI driver not report duplicate information through the controller and the agent. For example, if the controller detects a failure on one volume, it should record just one event on one PVC. If an agent detects a failure, it should record an event on every pod using the affected PVC. |
| 824 | + |
| 825 | + A recovery event will be reported once if the volume condition changes from abnormal back to normal (see the sketch below). |
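
  A minimal, hypothetical sketch of how duplicate events can be avoided by recording only on condition transitions, so one failure yields one event per PVC and recovery yields exactly one more; the tracker type and reason strings are made up for illustration.

  ```go
  // Hypothetical sketch (not the sidecars' actual code): record an event only on
  // condition transitions, so an abnormal volume produces one event per PVC and
  // its recovery produces exactly one more. Reason strings are illustrative.
  package monitor

  import (
      "sync"

      v1 "k8s.io/api/core/v1"
      "k8s.io/client-go/tools/record"
  )

  type conditionTracker struct {
      mu       sync.Mutex
      abnormal map[string]bool // volumeID -> last observed abnormal state
  }

  func newConditionTracker() *conditionTracker {
      return &conditionTracker{abnormal: make(map[string]bool)}
  }

  // observe records an event on the PVC only when the volume condition changes.
  func (t *conditionTracker) observe(recorder record.EventRecorder, pvc *v1.PersistentVolumeClaim, volumeID string, abnormal bool, message string) {
      t.mu.Lock()
      prev, seen := t.abnormal[volumeID]
      t.abnormal[volumeID] = abnormal
      t.mu.Unlock()

      if seen && prev == abnormal {
          return // no transition, no new event
      }
      if abnormal {
          recorder.Event(pvc, v1.EventTypeWarning, "VolumeConditionAbnormal", message)
      } else if seen {
          // Report recovery only if the volume was previously seen as abnormal.
          recorder.Event(pvc, v1.EventTypeNormal, "VolumeConditionNormal", message)
      }
  }
  ```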
| 826 | + |
| 827 | +* **Will enabling / using this feature result in increasing time taken by any |
| 828 | +operations covered by [existing SLIs/SLOs]?** |
| 829 | + Think about adding additional work or introducing new steps in between |
| 830 | + (e.g. need to do X to start a container), etc. Please describe the details. |
| 831 | + This feature periodically queries the storage systems to get the latest volume conditions, so it will have some impact on the performance of operations running on those storage systems (see the monitoring-loop sketch below). |
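
  A minimal sketch of the periodic check loop, assuming a configurable interval; the jitter factor and the `checkVolumes` placeholder are illustrative, not the sidecars' actual implementation.

  ```go
  // Illustrative sketch of the periodic health-check loop. The interval value and
  // checkVolumes placeholder are assumptions; jitter spreads checks from many
  // nodes so they do not hit the storage backend at the same instant.
  package monitor

  import (
      "time"

      "k8s.io/apimachinery/pkg/util/wait"
      "k8s.io/klog/v2"
  )

  // checkVolumes stands in for the controller's ListVolumes/GetVolume calls or
  // the agent's NodeGetVolumeStats calls against the CSI driver.
  func checkVolumes() {
      klog.V(4).Info("checking volume conditions")
  }

  // runMonitor runs the check on the configured interval until stopCh is closed.
  func runMonitor(interval time.Duration, stopCh <-chan struct{}) {
      // Jitter each period by up to 10%.
      wait.JitterUntil(checkVolumes, interval, 0.1, true, stopCh)
  }
  ```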
| 832 | + |
| 833 | +* **Will enabling / using this feature result in non-negligible increase of |
| 834 | +resource usage (CPU, RAM, disk, IO, ...) in any components?** |
| 835 | + Things to keep in mind include: additional in-memory state, additional |
| 836 | + non-trivial computations, excessive access to disks (including increased log |
| 837 | + volume), significant amount of data sent and/or received over network, etc. |
| 838 | + Think through this both in small and large cases, again with respect to the |
| 839 | + [supported limits]. |
| 840 | + This will increase load on the storage systems as it periodically queries them. |
| 841 | + |
| 842 | +### Troubleshooting |
| 843 | + |
| 844 | +The Troubleshooting section currently serves the `Playbook` role. We may consider |
| 845 | +splitting it into a dedicated `Playbook` document (potentially with some monitoring |
| 846 | +details). For now, we leave it here. |
| 847 | + |
| 848 | +_This section must be completed when targeting beta graduation to a release._ |
| 849 | + |
| 850 | +* **How does this feature react if the API server and/or etcd is unavailable?** |
| 851 | + If the API server and/or etcd is unavailable, error messages will be logged and the controller/agent will not be able to report events on PVCs or Pods. |
| 852 | + |
| 853 | +* **What are other known failure modes?** |
| 854 | + For each of them, fill in the following information by copying the below template: |
| 855 | + - [Failure mode brief description] |
| 856 | + - Detection: How can it be detected via metrics? Stated another way: |
| 857 | + how can an operator troubleshoot without logging into a master or worker node? |
| 858 | + - Mitigations: What can be done to stop the bleeding, especially for already |
| 859 | + running user workloads? |
| 860 | + - Diagnostics: What are the useful log messages and their required logging |
| 861 | + levels that could help debug the issue? |
| 862 | + Not required until the feature graduates to beta. |
| 863 | + If there are log messages indicating abnormal volume conditions but no events are reported, we can check the timestamps of the messages to see whether the events have expired based on their TTL or were never reported. If there are problems on the storage systems that are not reported in logs or events, we can check the logs of the storage systems to figure out why. |
| 864 | + - Testing: Are there any tests for failure mode? If not, describe why. |
| 865 | + |
| 866 | +* **What steps should be taken if SLOs are not being met to determine the problem?** |
| 867 | + If SLOs are not being met, an analysis should be done to understand what has caused the problem. Debug-level logging should be enabled to collect verbose logs. Look at the logs to find out what might have caused events to be missed. If they indicate an underlying problem on the storage system, the storage admin can be pulled in to help find the root cause. |
| 868 | + |
| 869 | +[supported limits]: https://git.k8s.io/community/sig-scalability/configs-and-limits/thresholds.md |
| 870 | +[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos |
| 871 | + |
677 | 872 | ## Implementation History
|
678 | 873 |
|
| 874 | +- 20210117: Update KEP for Beta |
| 875 | + |
679 | 876 | - 20191021: KEP updated
|
680 | 877 |
|
681 | 878 | - 20190730: KEP updated
|
|