Disable metrics on GCS Fuse CSI volume counts exceeding threshold of 10 #1224
thrivikram-karur-g wants to merge 5 commits into main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: thrivikram-karur-g. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
Summary of Changes

Hello @thrivikram-karur-g, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a potential metrics cardinality issue by introducing a threshold for GCS Fuse CSI volume metrics collection. When a pod requests more than 10 GCS Fuse CSI volumes, metrics for that pod will be automatically disabled to maintain metrics system stability. This change includes the logic to count volumes and apply the threshold, comprehensive tests to ensure correct behavior, and improvements to the fake metrics manager for better testability.
Code Review
This pull request introduces a limit on the number of GCS FUSE volumes for which metrics are collected. If a pod has more than 10 GCS FUSE volumes, metrics collection is disabled for that pod to prevent unbounded metric cardinality. The changes include the logic to count volumes and conditionally register the metrics collector, along with corresponding unit tests. The implementation is correct and well-tested. My only suggestion is to replace the hardcoded limit of 10 with a named constant to improve code readability and maintainability.
| return "" | ||
| } | ||
|
|
||
| func (s *nodeServer) countGcsFuseVolumes(pod *corev1.Pod) int { |
countGcsFuseVolumes only checks for CSI ephemeral volumes: https://cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-ephemeral
You also need to check for persistent volumes, as the pod could have either: https://docs.cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-pv
Checking for PVs is less trivial, as you need to make additional API calls to get the PVC / PV objects. We don't just want to call:
pvc, err := s.driver.config.K8sClients.GetPersistentVolumeClaim(ctx, pod.Namespace, pvcName)
pv, err := s.driver.config.K8sClients.GetPersistentVolume(ctx, pvc.Spec.VolumeName)
Instead, we want to update the Clientset interface: in pkg/cloud_provider/clientset/clientset.go, add methods for fetching PVCs to the main interface (methods for fetching PVs already exist). When you implement GetPVC, you also need to implement a pvcLister. In K8s client-go, a Lister is a tool that allows you to fetch Kubernetes objects (like Pods or PersistentVolumes) from a local, in-memory cache rather than making a direct HTTP request to the Kubernetes API server for every query. This cache is continuously kept up to date in the background by an Informer, which watches the API server for any create, update, or delete events. This prevents making API calls during the critical path of volume mounting (NodePublishVolume). @uriel-guzman recently added one for GCSFuse profiles, so he can help you with this if you have questions. A minimal sketch of the wiring follows.
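A minimal sketch of the lister pattern described above, assuming the standard client-go shared informer factory; the function name newPVCLister, the package name, and how it would be wired into the driver's Clientset are illustrative, not the driver's actual API:

```go
package clientset

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// newPVCLister returns a PVC lister backed by an informer's local cache,
// so PVC lookups during NodePublishVolume never hit the API server.
func newPVCLister(client kubernetes.Interface, resync time.Duration, stopCh <-chan struct{}) corelisters.PersistentVolumeClaimLister {
	factory := informers.NewSharedInformerFactory(client, resync)

	// Requesting the lister registers the PVC informer with the factory.
	lister := factory.Core().V1().PersistentVolumeClaims().Lister()

	// Start the informer and block until its cache is synced, so the
	// first reads already reflect the current cluster state.
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	return lister
}
```

With that in place, `pvc, err := lister.PersistentVolumeClaims(pod.Namespace).Get(pvcName)` reads from the cache with no HTTP round trip, and the PV can then be resolved from `pvc.Spec.VolumeName` as in the snippet above.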
	}
}

func TestCountGcsFuseVolumes(t *testing.T) {
In addition to these CSI ephemeral volume tests, please add unit tests for persistent volumes plus mixed GCSFuse volumes (meaning a pod that has both CSI ephemeral and persistent volumes, in various orders: PV + CSI, CSI + PV, etc.). A sketch of such cases follows.
See https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/pull/1224/changes#r2834314334 for details on what I mean by CSI ephemeral volume vs. persistent volume.
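A sketch of the requested mixed cases, assuming a table-driven layout; the package name, the fixture helpers csiEphemeralVolume and pvcVolume, and the driver name string are illustrative, and the PVC-backed cases would still need the lister-based resolution discussed above before a real assertion can run:

```go
package node

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
)

func csiEphemeralVolume(name string) corev1.Volume {
	return corev1.Volume{
		Name: name,
		VolumeSource: corev1.VolumeSource{
			// Assumed GCS FUSE CSI driver name.
			CSI: &corev1.CSIVolumeSource{Driver: "gcsfuse.csi.storage.gke.io"},
		},
	}
}

func pvcVolume(name, claim string) corev1.Volume {
	return corev1.Volume{
		Name: name,
		VolumeSource: corev1.VolumeSource{
			PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ClaimName: claim},
		},
	}
}

func TestCountGcsFuseVolumesMixed(t *testing.T) {
	t.Parallel()
	cases := []struct {
		name    string
		volumes []corev1.Volume
		want    int
	}{
		{"PV then CSI ephemeral", []corev1.Volume{pvcVolume("v1", "gcs-claim"), csiEphemeralVolume("v2")}, 2},
		{"CSI ephemeral then PV", []corev1.Volume{csiEphemeralVolume("v1"), pvcVolume("v2", "gcs-claim")}, 2},
		{"PV only", []corev1.Volume{pvcVolume("v1", "gcs-claim")}, 1},
	}
	for _, tc := range cases {
		tc := tc
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel()
			pod := &corev1.Pod{Spec: corev1.PodSpec{Volumes: tc.volumes}}
			// The PVC-backed cases need the PVC/PV resolved through the
			// lister to confirm they are GCS FUSE volumes, so the real
			// assertion against countGcsFuseVolumes is left to the PR.
			_, _ = pod, tc.want
		})
	}
}
```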
gcsFuseVolumeCount := s.countGcsFuseVolumes(pod)

if gcsFuseVolumeCount > maxGcsFuseVolumesForMetrics {
	klog.Warningf("Metrics collection is disabled for Pod %s/%s as the number of GCS FUSE volumes is %d, which is greater than the limit of 10.", pod.Namespace, pod.Name, gcsFuseVolumeCount)
Replace with:
klog.Warningf("Metrics collection is disabled for Pod %s/%s as the number of GCS FUSE volumes is %d, which is greater than the limit of %d.", pod.Namespace, pod.Name, gcsFuseVolumeCount, maxGcsFuseVolumesForMetrics)
func (*FakeMetricsManager) UnregisterMetricsCollector(_ string) {}

// GetCollectors returns the map of registered collectors.
func (f *FakeMetricsManager) GetCollectors() map[string]string {
From Gemini: returning the map directly here creates a data race. In Go, maps are reference types, so returning f.collectors means the mutex only protects the pointer return, not the map's contents. Because this test suite uses t.Parallel(), if one test reads the map while another calls RegisterMetricsCollector, the test suite will panic with a concurrent map read/write error.
Check whether returning a copy of the map instead fixes this; a sketch of that fix follows.
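One way the copy could look, assuming the fake guards its state with a sync.Mutex; the field names mu and collectors are assumptions about the fake's internals:

```go
// GetCollectors returns a snapshot copy of the registered collectors so
// callers never share the internal map with concurrent writers.
func (f *FakeMetricsManager) GetCollectors() map[string]string {
	f.mu.Lock()
	defer f.mu.Unlock()

	snapshot := make(map[string]string, len(f.collectors))
	for k, v := range f.collectors {
		snapshot[k] = v
	}
	return snapshot
}
```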
if err != nil {
	// The fake clientset does not have the pod annotations,
	// which will cause the sidecar check to fail.
	// See if we can find a workaround.
"// See if we can find a workaround."
What do we need to fix here? Can you please provide details?
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR disables the metrics collectors when the GCSFuse volume count in the pod spec exceeds a given threshold (10 in this case). Without such a threshold the volume count is unbounded, contributing theoretically infinite cardinality, which is not acceptable for making these metrics live. Hence, the metrics collector is disabled when the number of GCSFuse volumes exceeds the threshold, and corresponding tests have been added. A minimal sketch of the check is shown below.
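A minimal sketch of the ephemeral-volume counting and threshold check described above, assuming the driver name gcsfuse.csi.storage.gke.io; the PR's actual countGcsFuseVolumes may differ, and per the review it still needs to cover PVC-backed volumes:

```go
package node

import corev1 "k8s.io/api/core/v1"

const maxGcsFuseVolumesForMetrics = 10

// countGcsFuseEphemeralVolumes counts the pod's CSI ephemeral volumes
// served by the GCS FUSE CSI driver; when the count exceeds
// maxGcsFuseVolumesForMetrics, collector registration is skipped.
func countGcsFuseEphemeralVolumes(pod *corev1.Pod) int {
	count := 0
	for _, v := range pod.Spec.Volumes {
		if v.CSI != nil && v.CSI.Driver == "gcsfuse.csi.storage.gke.io" {
			count++
		}
	}
	return count
}
```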
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: