Skip to content

Commit 316f5aa

Browse files
committed
Address comments about GA milestone
1 parent d52c7d4 commit 316f5aa

File tree

2 files changed

+26
-15
lines changed

2 files changed

+26
-15
lines changed

keps/sig-storage/1790-recover-resize-failure/README.md

Lines changed: 25 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -330,6 +330,11 @@ There are already e2e tests running that verify correctness of this feature - ht
330330
- There are already e2e tests running that verify correctness of this feature -
331331
https://testgrid.k8s.io/presubmits-kubernetes-nonblocking#pull-kubernetes-e2e-gce-cos-alpha-features
332332

333+
### GA
334+
335+
- The feature has been extensively tested in CI and deployed with drivers in real world. For example - https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/master/charts/latest/azuredisk-csi-driver/templates/csi-azuredisk-controller.yaml#L166
336+
- The test grid already has tests for the feature.
337+
333338
### Upgrade / Downgrade Strategy
334339

335340
Not Applicable
@@ -413,9 +418,15 @@ after expansion is complete even with older kubelet. No recovery from expansion
413418
### Monitoring requirements
414419

415420
###### How can an operator determine if the feature is in use by workloads?**
416-
Any volume that has been recovered will emit a metric: `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`.
417-
418-
421+
422+
The Recovery feature as such designed does not require if external operator must be able to observe
423+
if the feature is in-use by the workloads. The reason is, proposed changes already enhance
424+
user experience of resizing workflow by providing additional states such as `allocatedResourceStatus`
425+
and `allocatedResources` in pvc's status and there is nothing special about recovery in itself.
426+
427+
Operators can already observe if volume's are being resized by using `pvc.Status.Conditions` and other
428+
metrics documented below.
429+
419430
###### How can someone using this feature know that it is working for their instance?
420431

421432
Since feature requires user interaction, reducing the size of the PVC is only supported
@@ -449,15 +460,10 @@ if this feature-gate is enabled.
449460
both `expand_volume` and `volume_fs_resize` durations. Also the error rate should not increase for
450461
`storage_operation_errors_total` metric.
451462

452-
###### Are there any missing metrics that would be useful to have to improve observability if this feature?
453-
454-
We are planning to add new counter metrics that will record success and failure of recovery operations.
455-
In cases where recovery fails, the counter will forever be increasing until an admin action resolves the error.
463+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
456464

457-
Tentative name of metric is - `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`
458-
459-
The reason of using PV name as a label is - we do not expect this feature to be used in a cluster very often
460-
and hence it should be okay to use name of PVs that were recovered this way.
465+
We think that existing resizing related metrics are sufficient and we do as such think we need to introduce new
466+
metrics for tracking this feature.
461467

462468
### Dependencies
463469

@@ -474,7 +480,7 @@ previous answers based on experience in the field._
474480

475481
###### Will enabling / using this feature result in any new API calls
476482

477-
Potentially yes. If user expands a PVC and expansion fails on the node, then
483+
Yes - if user recovers a volume from failed expansion. When user expands a PVC and expansion fails on the node, then
478484
expansion controller in control-plane must intervene and verify if it is safe
479485
to retry expansion on the kubelet.This requires round-trips between kubelet
480486
and control-plane and hence more API calls.
@@ -484,9 +490,14 @@ previous answers based on experience in the field._
484490

485491
###### Will enabling / using this feature result in any new calls to cloud provider
486492

487-
Potentially yes. Since expansion operation is idempotent and expansion controller
493+
Since expansion operation is idempotent and expansion controller
488494
must verify if it is safe to retry expansion with lower values, there could be
489-
additional calls to CSI drivers (and hence cloudproviders).
495+
additional calls to CSI drivers (and hence cloudproviders), when user attempts recovery from
496+
failed expansion.
497+
498+
On the other hand - previously we retried expansion of failed volumes infinitiely and hence incurred
499+
significant api usage of both k8s and cloudprovider. This enhancement significantly reduces this by
500+
allowing recovery and slower pace of retries for infeasible expansion operations.
490501

491502
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
492503

keps/sig-storage/1790-recover-resize-failure/kep.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,7 @@ see-also:
2121
replaces:
2222
superseded-by:
2323

24-
latest-milestone: "v1.32"
24+
latest-milestone: "v1.34"
2525
stage: "stable"
2626

2727
milestone:

0 commit comments

Comments
 (0)