keps/sig-storage/3751-volume-attributes-class/README.md
Lines changed: 64 additions & 13 deletions
@@ -92,8 +92,8 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
92
92
-[ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
93
93
-[ ] (R) Production readiness review completed
94
94
-[ ] (R) Production readiness review approved
95
-
-[] "Implementation History" section is up-to-date for milestone
96
-
-[] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
95
+
-[X] "Implementation History" section is up-to-date for milestone
96
+
-[X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
97
97
-[ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
98
98
99
99
[kubernetes.io]: https://kubernetes.io/
@@ -417,9 +417,9 @@ Please see session "Kubernetes API" above.
417
417
418
418
### 5. Add new operation metrics for ModifyVolume operations
419
419
420
-
Usage metrics:
420
+
A. Count of bound/unbound PVCs per VolumeAttributesClass similar to [StorageClass](https://github.com/kubernetes/kubernetes/blob/666fc23fe4d6c84b1dde2b8d4ebf75fce466d338/pkg/controller/volume/persistentvolume/metrics/metrics.go#L98).
421
421
422
-
Count of pvcs per VolumeAttributesClass similar to [StorageClass](https://github.com/kubernetes/kubernetes/blob/666fc23fe4d6c84b1dde2b8d4ebf75fce466d338/pkg/controller/volume/persistentvolume/metrics/metrics.go#L98).
422
+
Prior to this enhancement, we loop through all PersistentVolume objects, check whether `pv.Status.Phase == v1.VolumeBound`, and increment the appropriate `pv.Spec.StorageClassName` bucket. For these new metrics, when the feature flag is enabled, we also increment the bucket for `pv.Spec.VolumeAttributesClassName` if it is not empty.
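The counting loop described above can be sketched as follows. This is a minimal Python illustration, not the controller's Go code: the `PV` dataclass, `count_bound_pvs`, and the `feature_enabled` parameter are hypothetical stand-ins for the PersistentVolume object, the metrics collector, and the feature gate.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class PV:
    """Hypothetical stand-in for the PersistentVolume fields used here."""
    phase: str                                   # e.g. "Bound", "Available"
    storage_class_name: str
    volume_attributes_class_name: Optional[str] = None


def count_bound_pvs(pvs, feature_enabled):
    """Count bound PVs per StorageClass and, if enabled, per VolumeAttributesClass."""
    by_storage_class = Counter()
    by_vac = Counter()
    for pv in pvs:
        if pv.phase != "Bound":
            continue
        by_storage_class[pv.storage_class_name] += 1
        # Only touch the VAC bucket when the gate is on and the name is set.
        if feature_enabled and pv.volume_attributes_class_name:
            by_vac[pv.volume_attributes_class_name] += 1
    return by_storage_class, by_vac


pvs = [
    PV("Bound", "fast", "gold"),
    PV("Bound", "fast"),
    PV("Available", "slow", "silver"),
]
by_sc, by_vac = count_bound_pvs(pvs, feature_enabled=True)
print(dict(by_sc), dict(by_vac))
```

With the feature flag off, the second counter stays empty and the StorageClass counts are unchanged, which matches the existing behavior.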
Operation metrics from [csiOperationsLatencyMetric](https://github.com/kubernetes-csi/csi-lib-utils/blob/597d128ce3a24d9e3fd5ff5b8d1ff4fd862e543a/metrics/metrics.go#LL250C6-L250C32) for the ModifyVolume operation.
437
+
B. Operation metrics for ControllerModifyVolume
438
+
439
+
The metrics `controller_modify_volume_total` and `controller_modify_volume_errors_total` can be used to detect issues in volume modification.
440
+
441
+
There are operation metrics from [csiOperationsLatencyMetric](https://github.com/kubernetes-csi/csi-lib-utils/blob/597d128ce3a24d9e3fd5ff5b8d1ff4fd862e543a/metrics/metrics.go#LL250C6-L250C32) for the ModifyVolume operation to report latencies.
442
+
443
+
Finally, CSI Driver Plugin maintainers can expose their own metrics.
@@ -724,10 +733,12 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
724
733
- Given a driver that does not support ControllerModifyVolume, the CSI volume should not be modified.
725
734
- If ControllerModifyVolume fails, PVC should have appropriate events.
726
735
736
+
[API Conformance Test PR](https://github.com/kubernetes/kubernetes/pull/121849)
737
+
727
738
##### Stress tests
728
739
729
-
- VAC protection controller with large(define large later) lists of PVCs
730
-
- Creating a large(define large later) amount of PVCs using the same VolumeAttributesClass
740
+
- VAC protection controller with large lists of PVCs (2000)
741
+
- Creating a large number of PVCs (2000) using the same VolumeAttributesClass
731
742
732
743
### Graduation Criteria
733
744
@@ -742,13 +753,13 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
742
753
743
754
#### Beta
744
755
745
-
- Beta in 1.30: Since this feature is an extension of the external-resizer/external-provisioner usage flow, we are going to move this to beta with enhanced e2e and test coverage. Test cases are covered in sessions above: ``e2e tests``, ``Integration tests`` etc.
756
+
- Beta in 1.31: Since this feature is an extension of the external-resizer/external-provisioner usage flow, we are going to move this to beta with enhanced e2e and test coverage. Test cases are covered in the sections above: ``e2e tests``, ``Integration tests`` etc. Controllers must handle the case where the VolumeAttributesClass feature gates are on by default but the beta API itself is disabled on the cluster by default.
746
757
- Involve 3 different CSI drivers to participate in testing
747
758
- Stress test before GA
748
759
749
760
#### GA
750
761
751
-
- GA in 1.31, all major issues in the issue board should be fixed before GA.
762
+
- GA in 1.3x, all major issues in the issue board should be fixed before GA.
752
763
- No users complaining about the new behavior
753
764
754
765
### Upgrade / Downgrade Strategy
@@ -821,7 +832,7 @@ A metric `controller_modify_volume_errors_total` will indicate a problem with th
821
832
822
833
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
823
834
824
-
Upgrade and rollback will be tested when the feature gate will change to beta.
835
+
TODO: Upgrade and rollback will be tested when the feature gate changes to beta.
825
836
826
837
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
827
838
@@ -838,7 +849,7 @@ count of successful and failed ControllerModifyVolume.
838
849
839
850
By inspecting a `controller_modify_volume_total` metric value. If the counter
840
851
is increasing while PVCs are being updated retroactively, the feature is enabled. And if at the same time the
841
-
`controller_modify_volume_total` counter does not increase the feature
852
+
`controller_modify_volume_errors_total` counter does not increase the feature
842
853
works as expected.
843
854
844
855
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
@@ -858,9 +869,12 @@ These goals will help you determine what you need to measure (SLIs) in the next
858
869
question.
859
870
-->
860
871
872
+
- Ratio of `controller_modify_volume_errors_total`/`controller_modify_volume_total` <= 1%. (Exclude errors with `UNAVAILABLE` code which indicate some quota has been exhausted.)
873
+
- CreateVolume `csi_sidecar_operations_seconds_sum` does not increase by more than 5% when feature flags are enabled.
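As a rough illustration of how an operator might evaluate these two SLOs from scraped counter values, consider the sketch below. The function names, the per-code error breakdown, and the sample numbers are assumptions made for the example; the KEP does not specify how errors are labeled or how the values are collected.

```python
def modify_error_ratio(total, errors_by_code, excluded_codes=("UNAVAILABLE",)):
    """Return errors/total, ignoring errors whose gRPC code is excluded."""
    if total == 0:
        return 0.0
    counted = sum(
        n for code, n in errors_by_code.items() if code not in excluded_codes
    )
    return counted / total


def latency_within_budget(baseline_seconds_sum, current_seconds_sum, max_increase=0.05):
    """True if the latency sum grew by at most max_increase over the baseline."""
    return current_seconds_sum <= baseline_seconds_sum * (1 + max_increase)


# Hypothetical sample values, not measurements.
ratio = modify_error_ratio(total=1000, errors_by_code={"INTERNAL": 8, "UNAVAILABLE": 40})
slo_met = ratio <= 0.01 and latency_within_budget(120.0, 125.0)
print(f"error ratio={ratio:.3%}, SLO met={slo_met}")
```

Comparing raw `_sum` values only makes sense when the two measurement windows cover a comparable number of CreateVolume calls; that normalization is left out of the sketch.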
874
+
861
875
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
862
876
863
-
-[] Metrics
877
+
-[X] Metrics
864
878
- Metric name: `controller_modify_volume_total` and `controller_modify_volume_errors_total`
865
879
-[Optional] Aggregation method:
866
880
- Components exposing the metric: external-resizer
@@ -871,6 +885,7 @@ question.
871
885
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
872
886
implementation difficulties, etc.).
873
887
-->
888
+
No.
874
889
875
890
### Dependencies
876
891
@@ -880,7 +895,7 @@ This section must be completed when targeting beta to a release.
880
895
881
896
###### Does this feature depend on any specific services running in the cluster?
882
897
883
-
external-provisioner, external-resizer.
898
+
external-provisioner, external-resizer.
884
899
885
900
### Scalability
886
901
@@ -923,6 +938,11 @@ Yes, the feature may impact CreateVolume. We will measure this impact during bet
923
938
924
939
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
925
940
941
+
Using this feature may result in a non-negligible increase in resource usage if customers batch-modify many volumes at once and the CSI controller Pod has high API priority.
942
+
- external-resizer CPU and memory will see a non-negligible increase if users increase the number of concurrent operations via the `--workers` flag. We follow the strategy of sharing that limit between `ControllerExpandVolume` and `ControllerModifyVolume` RPCs, similar to how external-provisioner functions.
943
+
- The API-Server may see a spike of CPU when processing relevant changes.
944
+
945
+
Stress tests will determine the increase in resource usage at varying numbers of concurrent volume modifications.
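The idea of sharing one concurrency limit between the two RPC types can be sketched with a single worker pool. This is an illustrative Python model only, not the external-resizer implementation: `WORKERS`, `run_operation`, and the simulated latency are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

WORKERS = 2  # hypothetical value standing in for the `--workers` flag

lock = threading.Lock()
state = {"in_flight": 0, "peak": 0}


def run_operation(kind, volume):
    """Pretend to issue one RPC while recording how many run concurrently."""
    with lock:
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
    try:
        threading.Event().wait(0.01)  # simulated RPC latency
        return f"{kind}:{volume}"
    finally:
        with lock:
            state["in_flight"] -= 1


# Both operation types go to the same pool, so together they never
# exceed WORKERS concurrent calls.
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    futures = [pool.submit(run_operation, "expand", f"pv-{i}") for i in range(3)]
    futures += [pool.submit(run_operation, "modify", f"pv-{i}") for i in range(3)]
    results = [f.result() for f in futures]

print(results, "peak concurrency:", state["peak"])
```

Because the limit is shared, raising it increases the ceiling for expand and modify calls together, which is why the resource impact is tied to the `--workers` setting.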
926
946
927
947
### Troubleshooting
928
948
@@ -939,6 +959,15 @@ details). For now, we leave it here.
939
959
940
960
###### How does this feature react if the API server and/or etcd is unavailable?
941
961
962
+
No change from today's volume provisioning workflow if API Server / etcd are unavailable.
963
+
964
+
If the API server and/or etcd is unavailable, there are two scenarios for the volume modification workflow:
965
+
966
+
1. External-resizer detects a volume needing modification before the API server becomes unavailable and calls ControllerModifyVolume. The cloud provider modifies the volume and reports success to external-resizer. External-resizer will be unable to update the PVC object until the API server is back online. An error will be logged.
967
+
968
+
2. External-resizer does NOT detect a volume needing modification before the API server becomes unavailable. Volume modification will not take place until the API server is back online.
969
+
970
+
In both cases the PVC is not updated to reflect the new VolumeAttributesClass until the API server is back online.
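The two scenarios can be modeled with a small toy simulation. Everything here is hypothetical (the `Cluster` dataclass, `sync_once`, and the class names "silver" and "gold"); it only illustrates the ordering of backend modification and PVC update described above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cluster:
    """Toy state: what the PVC records versus what the backend has."""
    api_server_up: bool
    pvc_status_vac: str
    backend_vac: str
    log: list = field(default_factory=list)


def sync_once(cluster: Cluster, detected_target: Optional[str]) -> None:
    """One resizer pass. detected_target is the class the resizer saw before
    the outage, or None if it saw no pending modification."""
    if detected_target is None:
        return  # scenario 2: nothing to do until a change is observed
    if cluster.backend_vac != detected_target:
        cluster.backend_vac = detected_target  # backend call needs no API server
    if cluster.api_server_up:
        cluster.pvc_status_vac = detected_target
    else:
        cluster.log.append("failed to update PVC: API server unavailable")


# Scenario 1: change detected before the outage.
s1 = Cluster(api_server_up=False, pvc_status_vac="silver", backend_vac="silver")
sync_once(s1, "gold")
during_outage = (s1.backend_vac, s1.pvc_status_vac)
s1.api_server_up = True
sync_once(s1, "gold")  # retry once the API server is back

# Scenario 2: change not detected before the outage.
s2 = Cluster(api_server_up=False, pvc_status_vac="silver", backend_vac="silver")
sync_once(s2, None)

print(during_outage, s1.pvc_status_vac, s2.backend_vac)
```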
942
971
943
972
###### What are other known failure modes?
944
973
@@ -955,9 +984,21 @@ For each of them, fill in the following information by copying the below templat
955
984
Not required until feature graduated to beta.
956
985
- Testing: Are there any tests for failure mode? If not, describe why.
957
986
-->
987
+
-->
988
+
- ControllerModifyVolume cannot modify the volume to reflect the new VolumeAttributesClass due to user misconfiguration or cloud provider backend errors/limits. The volume would fall back to a workable default configuration, but external-resizer will requeue with a longer `Infeasible` interval.
989
+
- Detection: See the event on the PVC object. See an increase in `controller_modify_volume_errors_total`.
990
+
- Mitigations: No serious mitigation is needed because the volume would fall back to its previous configuration. The PVC can be edited back to the previous VolumeAttributesClass to stop retried ControllerModifyVolume calls.
991
+
- Diagnostics:
992
+
- Events on PVC which include the associated [ControllerModifyVolume error](https://github.com/container-storage-interface/spec/blob/master/spec.md#controllermodifyvolume-errors) and message
993
+
- external-resizer container logs: Logs similar to "ModifyVolume failed..." (At Log Levels 2&3)
994
+
- Testing: Are there any tests for failure mode? If not, describe why.
995
+
- There are tests that validate that appropriate events/errors propagate. Otherwise
996
+
- Note: See [Modify Design](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3751-volume-attributes-class#modify-pvc) to see flow.
997
+
958
998
959
999
###### What steps should be taken if SLOs are not being met to determine the problem?
960
1000
1001
+
When SLOs are not being met, PVC events should be inspected first. Debug-level logging should be enabled on the appropriate containers (external-resizer for volume modifications, external-provisioner for volume creations, and the relevant CSI driver plugin). If the problem cannot be determined from PVC events, the operator must look at the debug logs to narrow the problem down to the CSI driver plugin or the external sidecars. It may be helpful to check whether the volume was modified on the storage backend. If the problem is in the CSI driver plugin, the operator must reach out to the CSI driver maintainers. The storage admin can be asked for help finding the root cause.
961
1002
962
1003
## Implementation History
963
1004
@@ -971,6 +1012,16 @@ Major milestones might include:
971
1012
- the version of Kubernetes where the KEP graduated to general availability
972
1013
- when the KEP was retired or superseded
973
1014
-->
1015
+
- 2023-06-15 SIG Acceptance of KEP and Agreement on proposed Volume Attributes Class design ([link](https://github.com/kubernetes/enhancements/commit/8929cf618f056e447d0b2bed562af3fc134c8cbb))
1016
+
- 2023-06-26 Original demo of VolumeAttributesClass proof-of-concept
1017
+
- 2023-10-31 VolumeAttributesClass API changes merged in kubernetes/kubernetes
1018
+
- 2023-10-26 Implementation merged in kubernetes-csi/external-provisioner
1019
+
- 2023-11-09 Implementation merged in kubernetes-csi/external-resizer
1020
+
- First available release: Alpha in Kubernetes 1.29