Skip to content

Cache and print devices for debugging future outages #2141

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

cemakd
Copy link
Contributor

@cemakd cemakd commented Aug 4, 2025

This is a copy of PR: #2097

What type of PR is this?
/kind cleanup

What this PR does / why we need it:
This PR adds a cache that periodically (configured to every minute currently) looks at the /dev/disk/by-id/ directory and evaluates the symlinks there. It maintains a cache of the symlink and the real path it points to.

This will help with debugging future filesystem issues. In a past OMG, we found that our insight into changes in symlinks for specific disks hampered our ability to debug. Logging marked the real path of the disk at mount and unmount, but the change in between couldn't be detected.

This PR will print those links every minute, also logging when elements of the cache change.

An example:

I0521 00:23:51.418261      12 cache.go:62] periodic symlink cache read: /dev/disk/by-id/google-persistent-disk-0 -> /dev/sda; /dev/disk/by-id/google-pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb; /dev/disk/by-id/scsi-0Google_PersistentDisk_persistent-disk-0 -> /dev/sda; /dev/disk/by-id/scsi-0Google_PersistentDisk_pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb

The cache will also note if a symlink is broken.

NOTE: Currently this filters out any thing in by-id/ that ends with -part[0-9]*$. This removes partitions, which are noise. Mounting partitions directly isn't well supported in GKE, but we may want to test that in the future.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

changes and printing the full list.

example: periodic symlink cache read:
/dev/disk/by-id/google-persistent-disk-0 -> /dev/sda;
/dev/disk/by-id/google-pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb;
/dev/disk/by-id/scsi-0Google_PersistentDisk_persistent-disk-0 -> /dev/sda;
/dev/disk/by-id/scsi-0Google_PersistentDisk_pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 4, 2025
@k8s-ci-robot k8s-ci-robot requested a review from julianKatz August 4, 2025 18:05
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cemakd
Once this PR has been reviewed and has the lgtm label, please assign msau42 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from tonyzhc August 4, 2025 18:05
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 4, 2025
@tonyzhc
Copy link
Contributor

tonyzhc commented Aug 4, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Aug 4, 2025
Copy link
Contributor

@msau42 msau42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe the testing that has been done for this change?

return newDeviceCacheForNode(period, node), nil
}

func TestDeviceCache(period time.Duration, node *v1.Node) *DeviceCache {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move the test functions to the _test.go file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were not real test function but rather creating a DeviceCache for testing purposes used by other packages. I've renamed them to NewTestDeviceCache to distinguish them from actual test functions.

// of the string (after the last "/") and call AddVolume for that
for _, volume := range node.Status.VolumesInUse {
klog.Infof("Adding volume %s to cache", string(volume))
vID, err := pvNameFromVolumeID(string(volume))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this ignore volume types that are not PD/hyperdisk?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was not ignoring them, I've pushed a commit that checks if the volume is for the driver pd.csi.storage.gke.io. IIUC that should be sufficient to do the filtering.

// Evaluate the symlink
realPath, err := filepath.EvalSymlinks(device.symlink)
if err != nil {
klog.Warningf("Error evaluating symlink for volume %s: %v", volumeID, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be careful of logging this since this is called repeatedly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that the polling period is 10 mins, would it be okay to log this? Otherwise it might be hard to figure out any problems with realpath evaluation?
I'm not sure if that's a realistic problem to log, I can remove this warning log if you think it's not necessary

@@ -275,6 +276,13 @@ func handle() {
klog.Fatalf("Failed to get node info from API server: %v", err.Error())
}

deviceCache, err := linkcache.NewDeviceCacheForNode(ctx, 1*time.Minute, *nodeName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the time period configurable so that we can adjust it if it ends up spamming too many logs? In fact I would probably start the logging at 10 mins initially.

In the future, if we want to turn this into a metric, I would be more comfortable reducing the polling period but removing logs to avoid log spam.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, added a flag disk-cache-sync-period with a default value of 10 minutes when unspecified.

}

// Look at the dir for a symlink that matches the pvName
symlink := filepath.Join(d.dir, "google-"+deviceName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been tested with nvme disks as well? It would be better to reuse functions in https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/pkg/deviceutils/device-utils.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't done specific testing for nvme disks but I've pushed a new commit that utilizes device-utils.go for evaluating the symlinks for a volumeID

@cemakd
Copy link
Contributor Author

cemakd commented Aug 7, 2025

Can you describe the testing that has been done for this change?

I haven't done any additional testing on this PR since I took it over from Julian. As far as I know he was struggling to get the e2e tests passing, which I fixed with a few nil pointers checks.

I'll do some more testing on this and get back to you

Change volumes map to use symlink as key and add volumeID and real
path to the deviceMappings object, this is because there can be
multiple symlinks per volume

Filter out caching and logging info on non pd/hyperdisk volumes

Failure to get real path at add volume time is no longer an error
the real path may be unavailable at this time

Fix the device_windows.go NewDeviceCacheForNode parameters
@cemakd cemakd force-pushed the logs-for-device-mappings branch from 61ea81b to 61aeffd Compare August 7, 2025 20:58
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Aug 7, 2025

@cemakd: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-gcp-compute-persistent-disk-csi-driver-e2e-arm 61aeffd link false /test pull-gcp-compute-persistent-disk-csi-driver-e2e-arm
pull-gcp-compute-persistent-disk-csi-driver-kubernetes-integration 61aeffd link true /test pull-gcp-compute-persistent-disk-csi-driver-kubernetes-integration

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants