keps/sig-storage/3751-volume-attributes-class/README.md
Lines changed: 61 additions & 11 deletions
@@ -89,7 +89,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
89
89
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
90
90
- [ ] (R) Production readiness review completed
91
91
- [ ] (R) Production readiness review approved
92
-
- [ ] "Implementation History" section is up-to-date for milestone
92
+
- [X] "Implementation History" section is up-to-date for milestone
93
93
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
94
94
- [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
95
95
@@ -414,9 +414,9 @@ Please see session "Kubernetes API" above.
414
414
415
415
### 5. Add new operation metrics for ModifyVolume operations
416
416
417
-
Usage metrics:
417
+
A. Count of bound/unbound PVCs per VolumeAttributesClass similar to [StorageClass](https://github.com/kubernetes/kubernetes/blob/666fc23fe4d6c84b1dde2b8d4ebf75fce466d338/pkg/controller/volume/persistentvolume/metrics/metrics.go#L98).
418
418
419
-
Count of pvcs per VolumeAttributesClass similar to [StorageClass](https://github.com/kubernetes/kubernetes/blob/666fc23fe4d6c84b1dde2b8d4ebf75fce466d338/pkg/controller/volume/persistentvolume/metrics/metrics.go#L98).
419
+
Today, we loop through all PersistentVolume objects, check whether `pv.Status.Phase == v1.VolumeBound`, and increment the appropriate `pv.Spec.StorageClassName` bucket. For these new metrics, when the feature flag is enabled, we also increment the appropriate `pv.Spec.VolumeAttributesClassName` bucket if that field is not empty.
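A minimal Go sketch of that counting pass follows. It assumes a simplified stand-in type rather than the real `v1.PersistentVolume`: `pvInfo`, `counts`, and `countByClass` are hypothetical names used only for illustration, and the feature gate is modeled as a plain boolean.

```go
package main

import "fmt"

// pvInfo is a hypothetical, trimmed-down stand-in for the PersistentVolume
// fields that the metrics collector reads.
type pvInfo struct {
	Bound                     bool    // stands in for pv.Status.Phase == v1.VolumeBound
	StorageClassName          string  // stands in for pv.Spec.StorageClassName
	VolumeAttributesClassName *string // nil when no class is set
}

// counts holds bound and unbound totals keyed by class name.
type counts struct {
	Bound   map[string]int
	Unbound map[string]int
}

func newCounts() counts {
	return counts{Bound: map[string]int{}, Unbound: map[string]int{}}
}

// add increments the bound or unbound bucket for the given class name.
func (c counts) add(name string, bound bool) {
	if bound {
		c.Bound[name]++
	} else {
		c.Unbound[name]++
	}
}

// countByClass walks the volumes once and fills both sets of buckets.
// VolumeAttributesClass buckets are only touched when the feature gate is
// enabled and the volume names a non-empty class.
func countByClass(pvs []pvInfo, vacEnabled bool) (byStorageClass, byVAC counts) {
	byStorageClass, byVAC = newCounts(), newCounts()
	for _, pv := range pvs {
		byStorageClass.add(pv.StorageClassName, pv.Bound)
		if vacEnabled && pv.VolumeAttributesClassName != nil && *pv.VolumeAttributesClassName != "" {
			byVAC.add(*pv.VolumeAttributesClassName, pv.Bound)
		}
	}
	return byStorageClass, byVAC
}

func main() {
	gold := "gold"
	pvs := []pvInfo{
		{Bound: true, StorageClassName: "fast", VolumeAttributesClassName: &gold},
		{Bound: false, StorageClassName: "fast", VolumeAttributesClassName: &gold},
		{Bound: true, StorageClassName: "slow"},
	}
	sc, vac := countByClass(pvs, true)
	fmt.Println("storage class buckets:", sc.Bound, sc.Unbound)
	fmt.Println("VAC buckets:", vac.Bound, vac.Unbound)
}
```

Keeping both tallies in one pass means the extra metric adds no additional list traversal; with the gate off, the VolumeAttributesClass buckets simply stay empty.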
Operation metrics from [csiOperationsLatencyMetric](https://github.com/kubernetes-csi/csi-lib-utils/blob/597d128ce3a24d9e3fd5ff5b8d1ff4fd862e543a/metrics/metrics.go#LL250C6-L250C32) for the ModifyVolume operation.
434
+
B. Operation metrics for ControllerModifyVolume
435
+
436
+
The metrics `controller_modify_volume_total` and `controller_modify_volume_errors_total` can be used to detect issues in volume modification.
437
+
438
+
There are operation metrics from [csiOperationsLatencyMetric](https://github.com/kubernetes-csi/csi-lib-utils/blob/597d128ce3a24d9e3fd5ff5b8d1ff4fd862e543a/metrics/metrics.go#LL250C6-L250C32) for the ModifyVolume operation to report latencies.
439
+
440
+
Finally, CSI Driver Plugin maintainers can expose their own metrics.
- The behavior with feature gate and API turned on/off and mix match
631
637
- The happy path with creating and modifying volume successfully with VolumeAttributesClass
638
+
- [E2E CSI Test PR](https://github.com/kubernetes/kubernetes/pull/124151/)
639
+
- k8s-triage will be linked once the test PR is merged
632
640
633
641
##### e2e tests
634
642
@@ -650,10 +658,12 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
650
658
- Given a driver that does not support ControllerModifyVolume, the CSI volume should not be modified.
651
659
- If ControllerModifyVolume fails, PVC should have appropriate events.
652
660
661
+
[API Conformance Test PR](https://github.com/kubernetes/kubernetes/pull/121849)
662
+
653
663
##### Stress tests
654
664
655
-
- VAC protection controller with large (define large later) lists of PVCs
656
-
- Creating a large (define large later) amount of PVCs using the same VolumeAttributesClass
665
+
- VAC protection controller with large lists of PVCs (2000)
666
+
- Creating a large number of PVCs (2000) using the same VolumeAttributesClass
657
667
658
668
### Graduation Criteria
659
669
@@ -668,13 +678,13 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
668
678
669
679
#### Beta
670
680
671
-
- Beta in 1.30: Since this feature is an extension of the external-resizer/external-provisioner usage flow, we are going to move this to beta with enhanced e2e and test coverage. Test cases are covered in the sections above: ``e2e tests``, ``Integration tests``, etc.
681
+
- Beta in 1.31: Since this feature is an extension of the external-resizer/external-provisioner usage flow, we are going to move this to beta with enhanced e2e and test coverage. Test cases are covered in the sections above: ``e2e tests``, ``Integration tests``, etc.
672
682
- Involve 3 different CSI drivers to participate in testing
673
683
- Stress test before GA
674
684
675
685
#### GA
676
686
677
-
- GA in 1.31, all major issues in the issue board should be fixed before GA.
687
+
- GA in 1.3x, all major issues in the issue board should be fixed before GA.
678
688
- No users complaining about the new behavior
679
689
680
690
### Upgrade / Downgrade Strategy
@@ -747,7 +757,7 @@ A metric `controller_modify_volume_errors_total` will indicate a problem with th
747
757
748
758
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
749
759
750
-
Upgrade and rollback will be tested when the feature gate changes to beta.
760
+
TODO: Upgrade and rollback will be tested when the feature gate changes to beta.
751
761
752
762
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
753
763
@@ -764,7 +774,7 @@ count of successful and failed ControllerModifyVolume.
764
774
765
775
By inspecting a `controller_modify_volume_total` metric value. If the counter
766
776
is increasing while PVCs are being updated retroactively, the feature is enabled. And at the same time, if the
767
-
`controller_modify_volume_total` counter does not increase the feature
777
+
`controller_modify_volume_errors_total` counter does not increase, the feature
768
778
works as expected.
769
779
770
780
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
@@ -784,6 +794,9 @@ These goals will help you determine what you need to measure (SLIs) in the next
784
794
question.
785
795
-->
786
796
797
+
- Ratio of `controller_modify_volume_errors_total`/`controller_modify_volume_total` <= 1%
798
+
- CreateVolume `csi_sidecar_operations_seconds_sum` does not increase by more than 5% when feature flags are enabled.
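The two objectives above reduce to simple arithmetic over scraped metric values. The Go sketch below is illustrative only: `errorRatio`, `relativeIncrease`, and `meetsSLOs` are hypothetical helpers, not part of any sidecar. It assumes the before and after `csi_sidecar_operations_seconds_sum` samples for CreateVolume were taken over comparable workloads.

```go
package main

import "fmt"

const (
	maxErrorRatio      = 0.01 // errors/total must stay at or below 1%
	maxLatencyIncrease = 0.05 // CreateVolume seconds sum may grow by at most 5%
)

// errorRatio returns errorsTotal/total, or 0 when no operation has run yet.
func errorRatio(errorsTotal, total float64) float64 {
	if total == 0 {
		return 0
	}
	return errorsTotal / total
}

// relativeIncrease returns (after-before)/before. With no baseline
// (before == 0) it returns 0, which is an assumption of this sketch.
func relativeIncrease(before, after float64) float64 {
	if before == 0 {
		return 0
	}
	return (after - before) / before
}

// meetsSLOs reports whether both objectives hold for the sampled values.
func meetsSLOs(errorsTotal, total, createBefore, createAfter float64) bool {
	return errorRatio(errorsTotal, total) <= maxErrorRatio &&
		relativeIncrease(createBefore, createAfter) <= maxLatencyIncrease
}

func main() {
	// Sample values are made up for illustration.
	fmt.Println("error ratio:", errorRatio(2, 1000))
	fmt.Println("latency increase:", relativeIncrease(120, 123))
	fmt.Println("SLOs met:", meetsSLOs(2, 1000, 120, 123))
}
```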
799
+
787
800
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
788
801
789
802
- [ ] Metrics
@@ -797,6 +810,7 @@ question.
797
810
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
798
811
implementation difficulties, etc.).
799
812
-->
813
+
No.
800
814
801
815
### Dependencies
802
816
@@ -806,7 +820,7 @@ This section must be completed when targeting beta to a release.
806
820
807
821
###### Does this feature depend on any specific services running in the cluster?
808
822
809
-
external-provisioner, external-resizer.
823
+
external-provisioner, external-resizer.
810
824
811
825
### Scalability
812
826
@@ -849,6 +863,11 @@ Yes, the feature may impact CreateVolume. We will measure this impact during bet
849
863
850
864
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
851
865
866
+
Using this feature may result in a non-negligible increase in resource usage if users batch-modify many volumes at once and raise the default external-resizer `--workers`, `--kube-api-burst`, and `--kube-api-qps` options.
867
+
- external-resizer CPU and memory will see a non-negligible increase if users increase the number of concurrent operations via the `--workers` flag. We follow the strategy of sharing that limit between `ControllerExpandVolume` and `ControllerModifyVolume` RPCs, similar to how external-provisioner functions.
868
+
- If users increase the default `--kube-api-burst` and `--kube-api-qps` of their external-resizer container, the API server may see a CPU spike when processing these changes.
869
+
870
+
Stress tests will determine the increase in resource usage at varying numbers of concurrent volume modifications.
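To illustrate what sharing one concurrency limit between two RPC types means, here is a small Go sketch. It is not the external-resizer implementation: `sharedLimiter` and `peakConcurrency` are hypothetical names, and the operations are empty placeholders standing in for `ControllerExpandVolume` and `ControllerModifyVolume` calls.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedLimiter caps the number of operations in flight across all
// operation types by handing out a fixed number of slots.
type sharedLimiter struct {
	slots chan struct{}
}

func newSharedLimiter(workers int) *sharedLimiter {
	return &sharedLimiter{slots: make(chan struct{}, workers)}
}

// run blocks until a slot is free, executes op, then releases the slot.
func (l *sharedLimiter) run(op func()) {
	l.slots <- struct{}{}
	defer func() { <-l.slots }()
	op()
}

// peakConcurrency pushes n placeholder operations through the limiter and
// returns the highest number observed in flight at the same time.
func peakConcurrency(l *sharedLimiter, n int) int {
	var (
		mu       sync.Mutex
		inFlight int
		peak     int
		wg       sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.run(func() {
				// Record entry; a real worker would issue the RPC here.
				mu.Lock()
				inFlight++
				if inFlight > peak {
					peak = inFlight
				}
				mu.Unlock()

				// Record exit.
				mu.Lock()
				inFlight--
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	limiter := newSharedLimiter(4) // plays the role of the --workers value
	fmt.Println("peak in flight:", peakConcurrency(limiter, 100))
}
```

Because expand and modify requests draw from the same slots, raising the worker count raises the ceiling for both, which is why the resource impact scales with that single setting.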
852
871
853
872
### Troubleshooting
854
873
@@ -865,6 +884,15 @@ details). For now, we leave it here.
865
884
866
885
###### How does this feature react if the API server and/or etcd is unavailable?
867
886
887
+
No change from today's volume provisioning workflow if API Server / etcd are unavailable.
888
+
889
+
If the API server and/or etcd is unavailable, there are two scenarios for the volume modification workflow:
890
+
891
+
1. External-resizer detects a volume needing modification before the API server becomes unavailable and calls ControllerModifyVolume. The cloud provider modifies the volume and reports success to external-resizer. External-resizer cannot update the PVC object until the API server is back online; an error is logged.
892
+
893
+
2. External-resizer does NOT detect a volume needing modification before the API server becomes unavailable. Volume modification will not take place until the API server is back online.
894
+
895
+
In both cases, the PVC is not updated to reflect the new VolumeAttributesClass until the API server is back online.
868
896
869
897
###### What are other known failure modes?
870
898
@@ -881,9 +909,21 @@ For each of them, fill in the following information by copying the below templat
881
909
Not required until feature graduated to beta.
882
910
- Testing: Are there any tests for failure mode? If not, describe why.
883
911
-->
912
+
-->
913
+
- ControllerModifyVolume cannot modify the volume to reflect the new VolumeAttributesClass due to user misconfiguration or cloud provider backend errors/limits. The volume would fall back to a workable default configuration, but external-resizer will requeue with a longer `Infeasible` interval.
914
+
- Detection: See the event on the PVC object. See an increase in `controller_modify_volume_errors_total`.
915
+
- Mitigations: No serious mitigation is needed because the volume would fall back to its previous configuration. Users can edit the PVC back to the previous VolumeAttributesClass to stop ControllerModifyVolume retries.
916
+
- Diagnostics:
917
+
- Events on PVC which include the associated [ControllerModifyVolume error](https://github.com/container-storage-interface/spec/blob/master/spec.md#controllermodifyvolume-errors) and message
918
+
- external-resizer container logs: Logs similar to "ModifyVolume failed..." (At Log Levels 2&3)
919
+
- Testing: Are there any tests for failure mode? If not, describe why.
920
+
- There are tests that validate that appropriate events/errors propagate. Otherwise
921
+
- Note: See [Modify Design](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3751-volume-attributes-class#modify-pvc) to see flow.
922
+
884
923
885
924
###### What steps should be taken if SLOs are not being met to determine the problem?
886
925
926
+
When SLOs are not being met, start by inspecting PVC events. Enable debug-level logging on the appropriate containers (external-resizer for volume modifications, external-provisioner for volume creations, and the relevant CSI driver plugin). If PVC events do not reveal the problem, the operator must use the debug logs to narrow it down to the CSI driver plugin or the external sidecars. It may also help to check whether the volume was modified on the storage backend. If the problem is in the CSI driver plugin, reach out to the CSI driver maintainers. The storage admin can be asked for help finding the root cause.
887
927
888
928
## Implementation History
889
929
@@ -897,6 +937,16 @@ Major milestones might include:
897
937
- the version of Kubernetes where the KEP graduated to general availability
898
938
- when the KEP was retired or superseded
899
939
-->
940
+
- 2023-06-15 SIG Acceptance of KEP and Agreement on proposed Volume Attributes Class design ([link](https://github.com/kubernetes/enhancements/commit/8929cf618f056e447d0b2bed562af3fc134c8cbb))
941
+
- 2023-06-26 Original demo of VolumeAttributesClass proof-of-concept
942
+
- 2023-10-31 VolumeAttributesClass API changes merged in kubernetes/kubernetes
943
+
- 2023-10-26 Implementation merged in kubernetes-csi/external-provisioner
944
+
- 2023-11-09 Implementation merged in kubernetes-csi/external-resizer
945
+
- First available release: Alpha in Kubernetes 1.29