- The feature has been extensively tested in CI and deployed with drivers in the real world. For example - https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/master/charts/latest/azuredisk-csi-driver/templates/csi-azuredisk-controller.yaml#L166
- The test grid already has tests for the feature.
### Upgrade / Downgrade Strategy
Not Applicable
### Monitoring requirements
###### How can an operator determine if the feature is in use by workloads?
The Recovery feature as designed does not require that an external operator be able to observe
whether the feature is in use by workloads. The reason is that the proposed changes already enhance
the user experience of the resizing workflow by providing additional states such as `allocatedResourceStatus`
and `allocatedResources` in the PVC's status, and there is nothing special about recovery in itself.

Operators can already observe whether volumes are being resized by using `pvc.Status.Conditions` and the other
metrics documented below.
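As an illustrative sketch only - a PVC whose node-side expansion failed and is being recovered might surface status fields along these lines (field names as described above; every value here is hypothetical):

```yaml
# Hypothetical PVC status during recovery from a failed expansion.
# All values are illustrative, not taken from a real cluster.
status:
  capacity:
    storage: 10Gi                # size the filesystem actually has today
  allocatedResources:
    storage: 20Gi                # size the control plane attempted to allocate
  allocatedResourceStatus:
    storage: NodeResizeFailed    # assumed state value for a node-side failure
  conditions:
  - type: Resizing
    status: "True"
```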
###### How can someone using this feature know that it is working for their instance?
Since the feature requires user interaction, reducing the size of the PVC is only supported
if this feature-gate is enabled.
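For illustration - assuming the gate is named `RecoverVolumeExpansionFailure` (the name is an assumption, not stated in this section), it could be switched on via a kubelet configuration fragment such as:

```yaml
# Hypothetical kubelet configuration fragment; the feature-gate name
# RecoverVolumeExpansionFailure is an assumption, not confirmed here.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  RecoverVolumeExpansionFailure: true
```

Depending on the cluster setup, the same gate would likely also need to be enabled on the other components involved in the resize flow (e.g. kube-apiserver).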
both `expand_volume` and `volume_fs_resize` durations. Also, the error rate should not increase for
the `storage_operation_errors_total` metric.
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
We think that the existing resizing-related metrics are sufficient, and as such we do not think we need to introduce new
metrics for tracking this feature.
### Dependencies
###### Will enabling / using this feature result in any new API calls?
Yes - if a user recovers a volume from a failed expansion. When a user expands a PVC and the expansion fails on the node,
the expansion controller in the control plane must intervene and verify whether it is safe
to retry expansion on the kubelet. This requires round-trips between the kubelet
and the control plane, and hence more API calls.
###### Will enabling / using this feature result in any new calls to cloud provider?
Since the expansion operation is idempotent and the expansion controller
must verify whether it is safe to retry expansion with lower values, there could be
additional calls to CSI drivers (and hence cloud providers) when a user attempts recovery from
a failed expansion.

On the other hand - previously we retried expansion of failed volumes infinitely, and hence incurred
significant API usage on both Kubernetes and the cloud provider. This enhancement significantly reduces that by
allowing recovery and a slower pace of retries for infeasible expansion operations.
###### Will enabling / using this feature result in increasing size or count of the existing API objects?