Commit e74df66

Fix documentation about metrics names
1 parent e8fdef1 commit e74df66

1 file changed, +24 -8 lines changed

keps/sig-storage/284-enable-volume-expansion/README.md

@@ -367,7 +367,9 @@ different feature gates that control various aspects of expansion.

  - [x] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: ExpandPersistentVolumes
- - description: Parent feature gate that is required for expansion in general to work.
+ - description: |
+ This feature is required for `pvc.Spec.Resources` to be editable and must be
+ enabled for other expansion related feature gates to work.
  - Components depending on the feature gate:
  - kube-apiserver
  - kubelet
@@ -423,8 +425,12 @@ some kind of terminal error then it may prevent mount operation from succeeding.

  ###### What specific metrics should inform a rollback?

- The `volume_mount` operation failure metric - `storage_operation_errors_total{operation_name=volume_mount}`
- should tell us if expansion operation during volume mount is causing mount failures.
+ The `volume_mount` operation failure metric - `storage_operation_duration_seconds{operation_name=volume_mount, status=fail-unknown}`
+ combined with `storage_operation_duration_seconds{operation_name=volume_fs_resize, status=fail-unknown}` should tell us
+ if expansion is failing on the node and if it is causing mount failures.
+
+ Also `csi_sidecar_operations_seconds` and `csi_operations_seconds` metrics with high failure rates for expansion operation should indicate
+ that expansion is not working in the cluster and hence feature should be rolled back.

  ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
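The rollback signal described in the hunk above (failed `volume_mount` and `volume_fs_resize` operations reported through `storage_operation_duration_seconds`) could be checked roughly as follows. This is a minimal sketch, not part of the commit: it assumes the metric is scraped in Prometheus text exposition format and that a `_count` series is available for it. The `SAMPLE` data and the `failure_ratio` helper are hypothetical.

```python
import re

# Hypothetical sample in Prometheus text exposition format. Real kubelet
# output carries additional labels and many more series.
SAMPLE = """\
storage_operation_duration_seconds_count{operation_name="volume_mount",status="success"} 90
storage_operation_duration_seconds_count{operation_name="volume_mount",status="fail-unknown"} 10
storage_operation_duration_seconds_count{operation_name="volume_fs_resize",status="success"} 3
storage_operation_duration_seconds_count{operation_name="volume_fs_resize",status="fail-unknown"} 1
"""

# Matches `name{labels} value`; label values containing "}" are not handled.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def failure_ratio(text, operation_name,
                  metric="storage_operation_duration_seconds_count"):
    """Return failed / total operations for one operation_name, or None if no samples."""
    failed = 0.0
    total = 0.0
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels")))
        if labels.get("operation_name") != operation_name:
            continue
        value = float(m.group("value"))
        total += value
        if labels.get("status") == "fail-unknown":
            failed += value
    return failed / total if total else None
```

A rising ratio for `volume_fs_resize` together with a rising ratio for `volume_mount` would match the situation the KEP text describes. What threshold justifies a rollback is left to the operator.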

@@ -444,9 +450,14 @@ A PVC that is being expanded should have `pvc.Status.Conditions` set.
  ###### How can someone using this feature know that it is working for their instance?

  - [x] Events
- - Event Reason: External resizer is resizing volume pvc-a71483ed-a5bc-48fa-9151-ca41e7e6634e
+ - Resizing (on PVC)
+ - Event Reason: External resizer is resizing volume pvc-a71483ed-a5bc-48fa-9151-ca41e7e6634e
+ - VolumeResizeSuccessful (on PVC)
+ - Event Reason: Volume resize is successful
+ - FileSystemResizeSuccessful (on PVC)
+ - Event Reason: Volume resize is successful. This event is emitted when resizing finishes on kubelet.
  - [x] API .status
- - Condition name: "Resizing" or "FileSystemResizePending"
+ - Condition name:
  - Other field:
  - [x] Other (treat as last resort)
  - Details: `pvc.Status.Capacity` should reflect user requested size after expansion is complete.
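The last check in the hunk above, that `pvc.Status.Capacity` reflects the requested size once expansion completes, can be sketched as follows. This is an illustration under assumptions, not part of the commit: it takes a PVC as a plain dictionary (for example, parsed from `kubectl get pvc -o json`), handles only the common binary and decimal quantity suffixes, and treats a capacity larger than the request as complete. The helper names `parse_quantity` and `expansion_complete` are hypothetical.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes; milli-units ("m") and exponent
# notation (e.g. "1e3") are deliberately not handled in this sketch.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18,
}


def parse_quantity(q):
    """Convert a quantity string such as '10Gi' to a number of bytes."""
    # Try two-letter suffixes first so "Ki" is never mistaken for a shorter one.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if q.endswith(suffix):
            return Decimal(q[:-len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(q)


def expansion_complete(pvc):
    """True when status.capacity has caught up with the requested size."""
    requested = parse_quantity(pvc["spec"]["resources"]["requests"]["storage"])
    actual = parse_quantity(pvc["status"]["capacity"]["storage"])
    return actual >= requested


# Hypothetical PVC objects, trimmed to the fields used above.
pending = {"spec": {"resources": {"requests": {"storage": "20Gi"}}},
           "status": {"capacity": {"storage": "10Gi"}}}
done = {"spec": {"resources": {"requests": {"storage": "20Gi"}}},
        "status": {"capacity": {"storage": "20480Mi"}}}
```

Comparing parsed byte counts instead of raw strings avoids a false mismatch when the two fields use different units for the same size.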
@@ -468,17 +479,22 @@ Having said that if file system requires expansion during mount then it is obvio
  - Metric name: storage_operation_duration_seconds{operation_name=volume_fs_resize, status=success|fail-unknown}
  - [Optional] Aggregation method: percentile
  - Components exposing the metric: kubelet
- - CSI operation metrics:
+ - CSI operation metrics in controller:
  - Metric name: csi_sidecar_operations_seconds
  - [Optional] Aggregation method: percentile
  - Components exposing the metric: external-resizer
+ - CSI operation metrics in kubelet:
+ - Metric Name: csi_operations_seconds
+ - [Optional] Aggregation method: percentile
+ - Components exposing the metric: kubelet

  - [ ] Other (treat as last resort)
  - Details:

  ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
- We are going to add equivalent of intree storage_operation metrics for volume expansion when
- expansion is performed externally via external-resizer.
+ All the intree operations from control plane emit `storage_operation_duration_seconds{operation_name=expand_volume, status=success|fail-unknown}` metrics but CSI equivalent from external-resizer is `csi_sidecar_operations_seconds` which will be
+ documented as alternative if CSI migration is enabled or driver being used is CSI driver.
+ We don't need to emit new metrics but we do need to document the naming change in metric names.

  ### Dependencies
