Commit 693e55b

Add Monitoring Requirements chapter
1 parent ac09b7c commit 693e55b

File tree

1 file changed

+42
-8
lines changed
  • keps/sig-storage/3756-volume-reconstruction

keps/sig-storage/3756-volume-reconstruction/README.md

Lines changed: 42 additions & 8 deletions
@@ -687,7 +687,9 @@ The feature can be disabled without any issues.
687687

688688
###### What happens if we reenable the feature if it was previously rolled back?
689689

690-
Nothing interesting happens.
690+
Nothing interesting happens. This feature changes how kubelet starts and how it
691+
cleans volume mounts. It has no visible effect on any API object, nor on the
692+
structure of data or the mount table in the host OS.
691693

692694
###### Are there any tests for feature enablement/disablement?
693695

@@ -773,8 +775,6 @@ For GA, this section is required: approvers should be able to confirm the
773775
previous answers based on experience in the field.
774776
-->
775777

776-
TODO whole chapter before GA.
777-
778778
###### How can an operator determine if the feature is in use by workloads?
779779

780780
<!--
@@ -783,6 +783,9 @@ checking if there are objects with field X set) may be a last resort. Avoid
783783
logs or events for this purpose.
784784
-->
785785

786+
They can check if the FeatureGate is enabled on a node, e.g. by monitoring
787+
the `kubernetes_feature_enabled` metric or by reading the kubelet logs.
788+
786789
###### How can someone using this feature know that it is working for their instance?
787790

788791
<!--
@@ -819,18 +822,30 @@ These goals will help you determine what you need to measure (SLIs) in the next
819822
question.
820823
-->
821824

825+
These two metrics are populated during kubelet startup:
826+
827+
* `reconstructed_volumes_total{result="error"}` should be zero. An error here
828+
means that kubelet was not able to reconstruct its cache of mounted volumes
829+
and the appropriate volume plugin was not called to clean up a volume mount.
830+
There could be a leaked file or directory on the filesystem.
831+
832+
* `force_cleaned_failed_volumes_total{result="error"}` should be zero. An error
833+
here means that kubelet was not able to unmount a volume even with all
834+
fallbacks it has. There *is* at least a leaked directory on the filesystem,
835+
and there could also be a leaked mount.
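As a rough illustration of the check described above, the following Python sketch sums the `result="error"` samples of the two startup metrics from a kubelet metrics scrape in the Prometheus text format. It assumes the metric names exactly as written in this chapter; the suffix match is there only to tolerate a possible name prefix in the exported metric, which is an assumption, and the sample input is invented for the example.

```python
import re

# One sample line of the Prometheus text format: name{labels} value [timestamp]
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Metric names as given in this chapter; any exported prefix is an assumption.
WATCHED = ("reconstructed_volumes_total", "force_cleaned_failed_volumes_total")


def error_counts(metrics_text):
    """Return {metric name: sum of samples labelled result="error"}."""
    counts = {name: 0.0 for name in WATCHED}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, labels, value = match.groups()
        if dict(LABEL_RE.findall(labels or "")).get("result") != "error":
            continue
        for watched in WATCHED:
            # Suffix match tolerates a hypothetical subsystem prefix.
            if name.endswith(watched):
                counts[watched] += float(value)
    return counts


# Invented sample scrape, for illustration only.
SAMPLE = """\
# TYPE reconstructed_volumes_total counter
reconstructed_volumes_total{result="success"} 3
reconstructed_volumes_total{result="error"} 1
force_cleaned_failed_volumes_total{result="error"} 0
"""

nonzero = {name: n for name, n in error_counts(SAMPLE).items() if n > 0}
print("metrics with errors:", nonzero)
```

Any non-zero entry corresponds to the failure cases described in the two bullets above and is a signal to inspect the node.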
836+
822837
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
823838

824839
<!--
825840
Pick one more of these and delete the rest.
826841
-->
827842

828-
- [ ] Metrics
843+
- [X] Metrics
829844
- Metric name:
830-
- [Optional] Aggregation method:
831-
- Components exposing the metric:
832-
- [ ] Other (treat as last resort)
833-
- Details:
845+
- `reconstructed_volumes_total`
846+
- `force_cleaned_failed_volumes_total`
847+
- `orphaned_volumes_cleanup_errors_total`
848+
- Components exposing the metric: kubelet
834849

835850
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
836851

@@ -839,6 +854,8 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
839854
implementation difficulties, etc.).
840855
-->
841856

857+
No.
858+
842859
### Dependencies
843860

844861
<!--
@@ -988,6 +1005,23 @@ For each of them, fill in the following information by copying the below templat
9881005

9891006
###### What steps should be taken if SLOs are not being met to determine the problem?
9901007

1008+
Check kubelet logs. There should be errors about a failed volume reconstruction,
1009+
together with the directory where the volume was supposed to be mounted.
1010+
Ensure that:
1011+
1012+
1. There is no Pod that uses the volume on the node.
1013+
2. The directory of the volume is not mounted there.
1014+
3. The directory and all its parents up to `/var/lib/kubelet/pods/<uid>/volumes`
1015+
are removed.
1016+
4. If possible, locate the global mount of the volume (if it exists) in
1017+
`/var/lib/kubelet/plugins/<volume plugin name>` and unmount and remove it.
1018+
The actual directory varies by volume plugin.
1019+
* For CSI volumes, if the CSI driver supports the `NodeStageVolume` CSI call,
1020+
the location is `/var/lib/kubelet/plugins/kubernetes.io/csi/<csi driver name>/<sha256sum of pv.spec.csi.volumeHandle>/globalmount`.
1021+
Otherwise, there is no global mount directory.
1022+
* EmptyDir, Projected, DownwardAPI, Secrets and ConfigMaps do not have a global
1023+
mount directory.
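The path rule for CSI volumes and the "is anything still mounted" check from the steps above can be sketched in Python. This is a minimal sketch, not kubelet code: it assumes the checksum is the hex-encoded SHA-256 of the UTF-8 bytes of `pv.spec.csi.volumeHandle` and the default `/var/lib/kubelet` root, and the function names, driver name, and volume handle are hypothetical.

```python
import hashlib
import posixpath

KUBELET_ROOT = "/var/lib/kubelet"  # root directory used in the paths above


def csi_global_mount_path(driver_name, volume_handle, root=KUBELET_ROOT):
    """Build the expected global mount directory of a staged CSI volume.

    Assumption: the directory name is the hex SHA-256 of the volume handle.
    """
    digest = hashlib.sha256(volume_handle.encode("utf-8")).hexdigest()
    return posixpath.join(
        root, "plugins", "kubernetes.io", "csi", driver_name, digest, "globalmount"
    )


def mounts_under(mount_table_text, directory):
    """Return mount points at or below `directory`.

    `mount_table_text` is text in the /proc/mounts layout, where the second
    whitespace-separated field is the mount point. Escaped characters such
    as \\040 for a space are not decoded in this sketch.
    """
    base = directory.rstrip("/")
    found = []
    for line in mount_table_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        mount_point = fields[1]
        if mount_point == base or mount_point.startswith(base + "/"):
            found.append(mount_point)
    return found


# Hypothetical driver name and volume handle, for illustration only.
print(csi_global_mount_path("csi.example.com", "vol-0123"))
```

On a node, `mounts_under(open("/proc/mounts").read(), "/var/lib/kubelet/pods/<uid>")` should return an empty list before the directories are removed by hand.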
1024+
9911025
## Implementation History
9921026

9931027
* 1.26: Alpha version was implemented as part of
