
Commit e99e6f0

Update volume health KEP for beta
1 parent f70cf5b commit e99e6f0

3 files changed: +203 -3 lines

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+kep-number: 1432
+beta:
+  approver: "@deads2k"

keps/sig-storage/1432-volume-health-monitor/README.md

Lines changed: 197 additions & 0 deletions
@@ -31,6 +31,11 @@
 - [E2E tests](#e2e-tests)
 - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
 - [Feature enablement and rollback](#feature-enablement-and-rollback)
+- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+- [Monitoring Requirements](#monitoring-requirements)
+- [Dependencies](#dependencies)
+- [Scalability](#scalability)
+- [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 <!-- /toc -->

@@ -674,8 +679,200 @@ _This section must be completed when targeting alpha to a release._
 disable this feature is to install or uninstall the sidecars, we cannot write
 tests for feature enablement/disablement.

+### Rollout, Upgrade and Rollback Planning
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can a rollout fail? Can it impact already running workloads?**
+  Try to be as paranoid as possible - e.g., what if some components will restart
+  mid-rollout?
+  This feature does not have a feature gate. It is enabled when the health
+  monitoring controller and agent sidecars are deployed with the CSI driver.
+  So the only way a rollout can fail is if deploying the health monitoring
+  controller or agent sidecars with the CSI driver fails. If the health
+  monitoring controller cannot be deployed, no volume condition events will
+  be reported on PVCs. If the health monitoring agent cannot be deployed, no
+  volume condition events will be reported on Pods.
+
+* **What specific metrics should inform a rollback?**
+  Currently an event is recorded on a PVC or Pod when the controller or agent has
+  successfully retrieved an abnormal volume condition from the storage system.
+  However, when other errors occur in the controller or agent, the errors are only
+  logged, not recorded as events. Before moving to beta, the controller and agent
+  should be modified to record an event when such errors occur.
+
+* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+  Describe manual testing that was done and the outcomes.
+  Longer term, we may want to require automated upgrade/rollback tests, but we
+  are missing a bunch of machinery and tooling and can't do that now.
+  Manual testing will be done.
+
+* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
+  fields of API types, flags, etc.?**
+  Even if applying deprecation policies, they may still surprise some users.
+  No.
+
+### Monitoring Requirements
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can an operator determine if the feature is in use by workloads?**
+  Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
+  checking if there are objects with field X set) may be a last resort. Avoid
+  logs or events for this purpose.
+  An operator can check the metric `csi_sidecar_operations_seconds` (the duration
+  of Container Storage Interface operations, labeled with the gRPC status code).
+  It is reported by the CSI external-health-monitor-controller and
+  external-health-monitor-agent sidecars. For the health monitor controller
+  sidecar, `csi_sidecar_operations_seconds` measures the `ListVolumes` or
+  `GetVolume` RPC. For the health monitor agent sidecar, it measures the
+  `NodeGetVolumeStats` RPC. The `csi_sidecar_operations_seconds` metric should be
+  sliced by process when aggregated, so that the metrics of the different sidecars
+  can be viewed separately.
+
+* **What are the SLIs (Service Level Indicators) an operator can use to determine
+  the health of the service?**
+  - [ ] Metrics
+    - Metric name: csi_sidecar_operations_seconds
+    - [Optional] Aggregation method:
+    - Components exposing the metric: csi-external-health-monitor-controller and csi-external-health-monitor-agent sidecars
+  - [ ] Other (treat as last resort)
+    - Details:
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+  At a high level, this usually will be in the form of "high percentile of SLI
+  per day <= X". It's impossible to provide comprehensive guidance, but at the very
+  high level (needs more precise definitions) those may be things like:
+  - per-day percentage of API calls finishing with 5XX errors <= 1%
+  - 99% percentile over day of absolute value from (job creation time minus expected
+    job creation time) for cron job <= 10%
+  - 99,9% of /health requests per day finish with 200 code
+
+  The metric `csi_sidecar_operations_seconds` includes a gRPC status code label. If
+  the status code is `OK`, the call was successful; otherwise, it was not. We can
+  look at the ratio of successful to non-successful status codes to compute the
+  success/failure ratio, as sketched below.
+
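A minimal sketch of how an operator could compute that ratio with the Prometheus Go client, assuming the sidecar metrics are scraped by a Prometheus server at the address shown and that the histogram exposes the usual `grpc_status_code` label; the address, time window, and label name are assumptions for illustration, not something this KEP prescribes.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Prometheus server scraping the CSI sidecars (address is an assumption).
	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Failure ratio over the last hour: operations that ended with a non-OK
	// gRPC status code divided by all operations. Label names are assumed to
	// match what the CSI sidecars expose.
	query := `sum(rate(csi_sidecar_operations_seconds_count{grpc_status_code!="OK"}[1h]))
	/
	sum(rate(csi_sidecar_operations_seconds_count[1h]))`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println("failure ratio:", result)
}
```

The same expression can of course be evaluated directly in a dashboard or alerting rule; the Go client is only used here to make the ratio concrete.
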
+* **Are there any missing metrics that would be useful to have to improve observability
+  of this feature?**
+  Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
+  implementation difficulties, etc.).
+
+### Dependencies
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **Does this feature depend on any specific services running in the cluster?**
+  Think about both cluster-level services (e.g. metrics-server) as well
+  as node-level agents (e.g. specific version of CRI). Focus on external or
+  optional services that are needed. For example, if this feature depends on
+  a cloud provider API, or upon an external software-defined storage or network
+  control plane.
+
+  For each of these, fill in the following—thinking about running existing user workloads
+  and creating new ones, as well as about cluster-level services (e.g. DNS):
+  - [Dependency name]: installation of the csi-external-health-monitor-controller and
+    csi-external-health-monitor-agent sidecars
+    - Usage description:
+      - Impact of its outage on the feature: Installation of the csi-external-health-monitor-controller
+        and csi-external-health-monitor-agent sidecars is required for the feature to work. If
+        csi-external-health-monitor-controller is not installed, abnormal volume conditions will not be
+        reported as events on PVCs. Similarly, if csi-external-health-monitor-agent is not installed,
+        abnormal volume conditions will not be reported as events on Pods.
+        Note that the CSI driver itself needs to be updated to implement the volume health RPCs in its
+        controller and node plugins (see the sketch after this section). The minimum Kubernetes version
+        is 1.13: https://kubernetes-csi.github.io/docs/introduction.html#kubernetes-releases. K8s v1.13
+        is the minimum supported version for CSI drivers to work; however, different CSI drivers have
+        different requirements on supported Kubernetes versions, so users should check the documentation
+        of their CSI driver. If the CSI node plugin on one node has been upgraded to support volume
+        health while three other nodes have not been upgraded, volume health events will only be seen on
+        pods running on the one upgraded node.
+      - Impact of its degraded performance or high-error rates on the feature: If abnormal volume
+        conditions are reported with degraded performance or at high error rates, that would affect how
+        quickly or how accurately users could manually react to these conditions.
+
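As background for the note above about CSI drivers having to implement the volume health RPCs, here is a minimal sketch of a node plugin filling in `VolumeCondition` in its `NodeGetVolumeStats` response, using the Go bindings from the CSI spec. The `nodeServer` type and the health check are placeholders; a real driver would inspect the mount or the backing storage and implement the rest of the node service as well.

```go
package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// nodeServer is a placeholder for a CSI driver's node plugin; only the RPC
// relevant to volume health is shown.
type nodeServer struct{}

// NodeGetVolumeStats is the node RPC that the external health monitor agent
// polls. A driver that supports volume health fills in VolumeCondition so the
// agent can turn an abnormal condition into events on the affected Pods.
func (ns *nodeServer) NodeGetVolumeStats(ctx context.Context, req *csi.NodeGetVolumeStatsRequest) (*csi.NodeGetVolumeStatsResponse, error) {
	// A real driver would check the mount and the backing device here; this
	// sketch simply reports the volume as healthy.
	return &csi.NodeGetVolumeStatsResponse{
		VolumeCondition: &csi.VolumeCondition{
			Abnormal: false,
			Message:  "volume " + req.GetVolumeId() + " is healthy",
		},
	}, nil
}
```
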
+### Scalability
+
+_For alpha, this section is encouraged: reviewers should consider these questions
+and attempt to answer them._
+
+_For beta, this section is required: reviewers must answer these questions._
+
+_For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field._
+
+* **Will enabling / using this feature result in any new API calls?**
+  Describe them, providing:
+  - API call type (e.g. PATCH pods): Only events will be reported on PVCs or Pods if this feature is enabled.
+  - estimated throughput
+  - originating component(s) (e.g. Kubelet, Feature-X-controller)
+  focusing mostly on:
+  - components listing and/or watching resources they didn't before:
+    the csi-external-health-monitor-controller and csi-external-health-monitor-agent sidecars.
+    There is a monitor interval for the controller and one for the agent that controls how often
+    the volume health is checked. It is configurable, with 1 minute as the default. We will
+    consider changing the default to 5 minutes to avoid overloading the Kubernetes API server.
+    When scaled out across many nodes, even low-frequency checks can produce a high volume of
+    events. To control this, we should use options on the event recorder to limit the QPS per
+    key. This way we can collapse similar events and keep a slow update cadence per key (see the
+    sketch after this question).
+  - API calls that may be triggered by changes of some Kubernetes resources
+    (e.g. update of object X triggers new updates of object Y)
+  - periodic API calls to reconcile state (e.g. periodic fetching state,
+    heartbeats, leader election, etc.)
+
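A minimal sketch of the event recorder throttling mentioned above, using `CorrelatorOptions` from `k8s.io/client-go/tools/record`; the component name and the QPS/burst values are illustrative rather than what the sidecars necessarily ship with.

```go
package monitor

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// newThrottledRecorder returns an EventRecorder whose correlator collapses
// similar events and rate-limits how often events for the same key reach the
// API server.
func newThrottledRecorder(clientset kubernetes.Interface) record.EventRecorder {
	broadcaster := record.NewBroadcasterWithCorrelatorOptions(record.CorrelatorOptions{
		// Allow one event per key immediately, then roughly one every five
		// minutes; the values here are illustrative.
		BurstSize: 1,
		QPS:       1.0 / 300,
	})
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: clientset.CoreV1().Events(""),
	})
	return broadcaster.NewRecorder(scheme.Scheme,
		v1.EventSource{Component: "csi-external-health-monitor-controller"})
}
```

The returned recorder is used like any other (for example `recorder.Event(pvc, v1.EventTypeWarning, "VolumeConditionAbnormal", msg)`), and the correlator decides whether a given event actually reaches the API server.
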
+* **Will enabling / using this feature result in introducing new API types?**
+  Describe them, providing:
+  - API type: No
+  - Supported number of objects per cluster: N/A
+  - Supported number of objects per namespace (for namespace-scoped objects): N/A
+
+* **Will enabling / using this feature result in any new calls to the cloud
+  provider?**
+  No.
+
+* **Will enabling / using this feature result in increasing size or count of
+  the existing API objects?**
+  Describe them, providing:
+  - API type(s): No
+  - Estimated increase in size: (e.g., new annotation of size 32B): No
+  - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+    The controller reports events on PVCs while the agent reports events on Pods; they work
+    independently of each other. It is recommended that a CSI driver not report duplicate
+    information through the controller and the agent. For example, if the controller detects a
+    failure on one volume, it should record just one event on the corresponding PVC. If an agent
+    detects a failure, it should record an event on every pod using the affected PVC.
+
+    A recovery event will be reported once if the volume condition changes from abnormal back to
+    normal.
+
+* **Will enabling / using this feature result in increasing time taken by any
+  operations covered by [existing SLIs/SLOs]?**
+  Think about adding additional work or introducing new steps in between
+  (e.g. need to do X to start a container), etc. Please describe the details.
+  This feature periodically queries the storage systems for the latest volume conditions, so it
+  will have some impact on the performance of other operations running against those storage
+  systems.
+
+* **Will enabling / using this feature result in non-negligible increase of
+  resource usage (CPU, RAM, disk, IO, ...) in any components?**
+  Things to keep in mind include: additional in-memory state, additional
+  non-trivial computations, excessive access to disks (including increased log
+  volume), significant amount of data sent and/or received over network, etc.
+  Think through this both in small and large cases, again with respect to the
+  [supported limits].
+  This will increase load on the storage systems, as the feature periodically queries them.
+
+### Troubleshooting
+
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now, we leave it here.
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How does this feature react if the API server and/or etcd is unavailable?**
+  If the API server and/or etcd is unavailable, error messages will be logged and the
+  controller/agent will not be able to report events on PVCs or Pods.
+
+* **What are other known failure modes?**
+  For each of them, fill in the following information by copying the below template:
+  - [Failure mode brief description]
+    - Detection: How can it be detected via metrics? Stated another way:
+      how can an operator troubleshoot without logging into a master or worker node?
+    - Mitigations: What can be done to stop the bleeding, especially for already
+      running user workloads?
+    - Diagnostics: What are the useful log messages and their required logging
+      levels that could help debug the issue?
+      Not required until feature graduated to beta.
+      If there are log messages indicating abnormal volume conditions but no events are reported,
+      we can check the timestamps of those messages to see whether the events expired based on
+      their TTL or were never reported at all. If there are problems on the storage systems that
+      are not reported in logs or events, we can check the logs of the storage systems to figure
+      out why.
+    - Testing: Are there any tests for failure mode? If not, describe why.
+
+* **What steps should be taken if SLOs are not being met to determine the problem?**
+  If SLOs are not being met, an analysis should be done to understand what has caused the problem.
+  Debug-level logging should be enabled to collect verbose logs. Look at the logs to find out what
+  might have caused events to be missed. If they indicate an underlying problem on the storage
+  system, the storage admin can be pulled in to help find the root cause.
+
+[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
+[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+
 ## Implementation History
 
+- 20210117: Update KEP for Beta
+
 - 20191021: KEP updated
 
 - 20190730: KEP updated

keps/sig-storage/1432-volume-health-monitor/kep.yaml

Lines changed: 3 additions & 3 deletions
@@ -20,11 +20,11 @@ approvers:
 see-also:
 replaces:
 
-latest-milestone: "v1.19"
+latest-milestone: "v1.21"
 milestone:
   alpha: "v1.19"
-  beta: "v1.20"
-  stable: "v1.21"
+  beta: "v1.21"
+  stable: "v1.23"
 
 feature-gates:
 disable-supported: true
