Disable metrics on GCS Fuse CSI volume counts exceeding threshold of 10 #1224

Open
thrivikram-karur-g wants to merge 5 commits into main from vikram-disable-metrics-on-volume-count

@thrivikram-karur-g

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test

/kind feature

/kind flake

What this PR does / why we need it:
This PR disables the metrics collector when the number of gcsfuse volumes in the pod spec exceeds a given threshold (10 in this case). Without such a limit the volume count is unbounded, which gives the metrics theoretically infinite cardinality, and metrics with unbounded cardinality are not accepted for going live. Corresponding tests have been added.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

      Metrics from the GCSFuse CSI driver are not exported if the number of GCS CSI volumes exceeds the current threshold of 10.

@google-oss-prow

@thrivikram-karur-g: The label(s) kind/feature cannot be applied, because the repository doesn't have them.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: thrivikram-karur-g

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist

Summary of Changes

Hello @thrivikram-karur-g, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a potential metrics cardinality issue by introducing a threshold for GCS Fuse CSI volume metrics collection. When a pod requests more than 10 GCS Fuse CSI volumes, metrics for that pod will be automatically disabled to maintain metrics system stability. This change includes the necessary logic to count volumes, apply the threshold, and comprehensive tests to ensure correct behavior, along with improvements to the fake metrics manager for better testability.

Highlights

  • Metrics Collection Threshold: Metrics collection for GCS Fuse CSI volumes will now be disabled for any pod that utilizes more than 10 GCS Fuse CSI volumes. This change prevents unbounded cardinality issues in metrics.
  • Volume Counting Logic: A new helper function, countGcsFuseVolumes, was introduced to accurately determine the number of GCS Fuse CSI volumes within a given pod specification.
  • Enhanced Testing: New unit tests were added to validate the GCS Fuse volume counting logic and to ensure that metrics collectors are registered or disabled correctly based on the new volume count threshold and other conditions.
  • Fake Metrics Manager Improvement: The FakeMetricsManager was updated to track registered and unregistered metrics collectors, significantly improving its utility for testing metrics-related logic.

Changelog

  • pkg/cloud_provider/clientset/fake.go
    • Added AddPodVolumes method to FakeClientset for test setup.
  • pkg/csi_driver/node.go
    • Implemented a check in NodePublishVolume to disable metrics collection if a pod has more than 10 GCS Fuse CSI volumes.
    • Added countGcsFuseVolumes helper function to determine the number of GCS Fuse CSI volumes in a pod.
  • pkg/csi_driver/node_test.go
    • Added necessary imports for new test cases.
    • Introduced TestCountGcsFuseVolumes to validate the GCS Fuse volume counting logic.
    • Added TestNodePublishVolumeAssertMetricsCollectorRegistration to verify metrics collector behavior under different conditions, including the new volume count threshold.
  • pkg/metrics/fake.go
    • Enhanced FakeMetricsManager to internally track registered collectors using a map and mutex.
    • Added NewFakeMetricsManager function for proper initialization.
    • Implemented RegisterMetricsCollector and UnregisterMetricsCollector to record and remove collector registrations.
    • Provided GetCollectors method to retrieve the list of currently registered collectors for testing purposes.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a limit on the number of GCS FUSE volumes for which metrics are collected. If a pod has more than 10 GCS FUSE volumes, metrics collection is disabled for that pod to prevent unbounded metric cardinality. The changes include the logic to count volumes and conditionally register the metrics collector, along with corresponding unit tests. The implementation is correct and well-tested. My only suggestion is to replace the hardcoded limit of 10 with a named constant to improve code readability and maintainability.

return ""
}

func (s *nodeServer) countGcsFuseVolumes(pod *corev1.Pod) int {
Collaborator

@amacaskill amacaskill Feb 20, 2026

countGcsFuseVolumes only checks for CSI ephemeral volumes: https://cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-ephemeral

You also need to check for persistent volumes as the pod could have either: https://docs.cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-pv

Checking for PVs is less trivial, as you need additional API calls to get the PVC and PV objects. We don't want to simply call:

pvc, err := s.driver.config.K8sClients.GetPersistentVolumeClaim(ctx, pod.Namespace, pvcName)
pv, err := s.driver.config.K8sClients.GetPersistentVolume(ctx, pvc.Spec.VolumeName)

Instead, update the Clientset interface: in pkg/cloud_provider/clientset/clientset.go, add methods for fetching PVCs to the main interface (methods for fetching PVs already exist). When you implement GetPVC, you also need to implement a pvcLister. In K8s client-go, a Lister fetches Kubernetes objects (such as Pods or PersistentVolumes) from a local, in-memory cache rather than making a direct HTTP request to the Kubernetes API server for every query. An Informer keeps this cache up to date in the background by watching the API server for create, update, and delete events. This avoids making API calls on the critical path of volume mounting (NodePublishVolume). @uriel-guzman recently added one for GCSFuse profiles, so he can help you with this if you have questions.
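To make the suggestion above concrete, here is a minimal sketch of counting both CSI ephemeral and PV-backed GCSFuse volumes through a cache-backed lookup. It is not the driver's code: `volume`, `claimResolver`, `cachedResolver`, and `DriverForClaim` are hypothetical stand-ins for `corev1.Volume` and a lister-backed Clientset method, the driver name constant is an assumption, and the real function would take a `*corev1.Pod`.

```go
package main

import (
	"errors"
	"fmt"
)

// Assumed CSI driver name for GCSFuse volumes.
const gcsFuseDriverName = "gcsfuse.csi.storage.gke.io"

// volume is a simplified, hypothetical stand-in for corev1.Volume.
type volume struct {
	Name         string
	CSIDriver    string // set for CSI ephemeral volumes
	PVCClaimName string // set for PVC-backed volumes
}

// claimResolver is a hypothetical interface standing in for lister-backed
// PVC and PV lookups; implementations should read from a local cache.
type claimResolver interface {
	// DriverForClaim returns the CSI driver of the PV bound to the claim.
	DriverForClaim(namespace, claimName string) (string, error)
}

var errNotCached = errors.New("claim not found in cache")

// cachedResolver maps "namespace/claim" to a CSI driver name, imitating
// what an informer-populated cache would provide.
type cachedResolver map[string]string

func (c cachedResolver) DriverForClaim(namespace, claimName string) (string, error) {
	driver, ok := c[namespace+"/"+claimName]
	if !ok {
		return "", errNotCached
	}
	return driver, nil
}

// countGcsFuseVolumes counts CSI ephemeral and PV-backed GCSFuse volumes.
func countGcsFuseVolumes(namespace string, vols []volume, r claimResolver) (int, error) {
	count := 0
	for _, v := range vols {
		switch {
		case v.CSIDriver == gcsFuseDriverName:
			count++
		case v.PVCClaimName != "":
			driver, err := r.DriverForClaim(namespace, v.PVCClaimName)
			if err != nil {
				return 0, fmt.Errorf("resolving claim %q: %w", v.PVCClaimName, err)
			}
			if driver == gcsFuseDriverName {
				count++
			}
		}
	}
	return count, nil
}

func main() {
	resolver := cachedResolver{
		"default/bucket-pvc": gcsFuseDriverName,
		"default/disk-pvc":   "pd.csi.storage.gke.io",
	}
	vols := []volume{
		{Name: "pv-backed", PVCClaimName: "bucket-pvc"},
		{Name: "ephemeral", CSIDriver: gcsFuseDriverName},
		{Name: "disk", PVCClaimName: "disk-pvc"},
	}
	n, err := countGcsFuseVolumes("default", vols, resolver)
	fmt.Println(n, err)
}
```

Returning an error when a claim is missing from the cache is one possible design choice; the caller would then need to decide whether an unresolved claim should disable metrics or be ignored.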

}
}

func TestCountGcsFuseVolumes(t *testing.T) {
Collaborator

@amacaskill amacaskill Feb 20, 2026

In addition to these CSI ephemeral volume tests, please add unit tests for persistent volumes and for mixed GCSFuse volumes (meaning a pod that has both CSI ephemeral and persistent volumes in various orders): PV + CSI, CSI + PV, etc.

See https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/pull/1224/changes#r2834314334 for details on what I mean by CSI ephemeral volume vs. persistent volume.

gcsFuseVolumeCount := s.countGcsFuseVolumes(pod)

if gcsFuseVolumeCount > maxGcsFuseVolumesForMetrics {
klog.Warningf("Metrics collection is disabled for Pod %s/%s as the number of GCS FUSE volumes is %d, which is greater than the limit of 10.", pod.Namespace, pod.Name, gcsFuseVolumeCount)
Collaborator

Replace with:

			klog.Warningf("Metrics collection is disabled for Pod %s/%s as the number of GCS FUSE volumes is %d, which is greater than the limit of %d.", pod.Namespace, pod.Name, gcsFuseVolumeCount, maxGcsFuseVolumesForMetrics)
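
As a quick illustration of why the constant should feed the message, the sketch below builds the same warning text from `maxGcsFuseVolumesForMetrics` so the limit and the log line cannot drift apart. The helper `metricsDisabledWarning` is hypothetical and uses `fmt.Sprintf` instead of `klog` to stay self-contained; the value 10 is taken from the PR description.

```go
package main

import "fmt"

// Limit from the PR description; the constant name matches the diff context.
const maxGcsFuseVolumesForMetrics = 10

// metricsDisabledWarning is a hypothetical helper. It returns the warning
// text and true when the volume count is strictly above the limit.
func metricsDisabledWarning(namespace, name string, count int) (string, bool) {
	if count <= maxGcsFuseVolumesForMetrics {
		return "", false
	}
	msg := fmt.Sprintf(
		"Metrics collection is disabled for Pod %s/%s as the number of GCS FUSE volumes is %d, which is greater than the limit of %d.",
		namespace, name, count, maxGcsFuseVolumesForMetrics)
	return msg, true
}

func main() {
	for _, n := range []int{10, 11} {
		if msg, disabled := metricsDisabledWarning("default", "trainer", n); disabled {
			fmt.Println(msg)
		} else {
			fmt.Printf("%d volumes: metrics stay enabled\n", n)
		}
	}
}
```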


func (*FakeMetricsManager) UnregisterMetricsCollector(_ string) {}
// GetCollectors returns the map of registered collectors.
func (f *FakeMetricsManager) GetCollectors() map[string]string {
Collaborator

From gemini: returning the map directly here creates a data race. In Go, maps are reference types, so returning f.collectors means the mutex only protects the return of the reference, not the map's contents. Because this test suite uses t.Parallel(), if one test reads the map while another calls RegisterMetricsCollector, the test binary can crash with a fatal "concurrent map read and map write" error.

Check whether returning a copy of the map fixes this.
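
A minimal sketch of the copy-under-lock approach, for reference. The type `fakeManager` and its methods are hypothetical stand-ins for `FakeMetricsManager`, with simplified signatures; only the `map[string]string` return type is taken from the diff.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeManager is a hypothetical stand-in for FakeMetricsManager.
type fakeManager struct {
	mu         sync.Mutex
	collectors map[string]string
}

func newFakeManager() *fakeManager {
	return &fakeManager{collectors: make(map[string]string)}
}

// register records a collector under the lock (simplified signature).
func (f *fakeManager) register(targetPath, bucket string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.collectors[targetPath] = bucket
}

// getCollectors returns a snapshot copied while holding the lock, so
// callers never share the internal map with concurrent writers.
func (f *fakeManager) getCollectors() map[string]string {
	f.mu.Lock()
	defer f.mu.Unlock()
	snapshot := make(map[string]string, len(f.collectors))
	for k, v := range f.collectors {
		snapshot[k] = v
	}
	return snapshot
}

func main() {
	m := newFakeManager()
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			m.register(fmt.Sprintf("/target/%d", i), "bucket")
			_ = len(m.getCollectors()) // concurrent read of a private copy
		}(i)
	}
	wg.Wait()
	fmt.Println(len(m.getCollectors()))
}
```

Running the real tests with `go test -race` should show whether the race is gone once the accessor returns a copy.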

if err != nil {
// The fake clientset does not have the pod annotations,
// which will cause the sidecar check to fail.
// See if we can find a workaround.
Collaborator

"// See if we can find a workaround."

What do we need to fix here? Can you please provide details?

2 participants