-
Notifications
You must be signed in to change notification settings - Fork 167
Cache and print devices for debugging future outages #2141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache and print devices for debugging future outages #2141
Conversation
/ok-to-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you describe the testing that has been done for this change?
pkg/linkcache/devices_linux.go
Outdated
return newDeviceCacheForNode(period, node), nil | ||
} | ||
|
||
func TestDeviceCache(period time.Duration, node *v1.Node) *DeviceCache { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move the test functions to the _test.go file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These were not real test function but rather creating a DeviceCache for testing purposes used by other packages. I've renamed them to NewTestDeviceCache
to distinguish them from actual test functions.
pkg/linkcache/devices_linux.go
Outdated
// of the string (after the last "/") and call AddVolume for that | ||
for _, volume := range node.Status.VolumesInUse { | ||
klog.Infof("Adding volume %s to cache", string(volume)) | ||
vID, err := pvNameFromVolumeID(string(volume)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this ignore volume types that are not PD/hyperdisk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was not ignoring them, I've pushed a commit that checks if the volume is for the driver pd.csi.storage.gke.io
. IIUC that should be sufficient to do the filtering.
pkg/linkcache/devices_linux.go
Outdated
// Evaluate the symlink | ||
realPath, err := filepath.EvalSymlinks(device.symlink) | ||
if err != nil { | ||
klog.Warningf("Error evaluating symlink for volume %s: %v", volumeID, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be careful of logging this since this is called repeatedly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that the polling period is 10 mins, would it be okay to log this? Otherwise it might be hard to figure out any problems with realpath evaluation?
I'm not sure if that's a realistic problem to log, I can remove this warning log if you think it's not necessary
cmd/gce-pd-csi-driver/main.go
Outdated
klog.Fatalf("Failed to get node info from API server: %v", err.Error()) | ||
} | ||
|
||
deviceCache, err := linkcache.NewDeviceCacheForNode(ctx, 1*time.Minute, *nodeName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make the time period configurable so that we can adjust it if it ends up spamming too many logs? In fact I would probably start the logging at 10 mins initially.
In the future, if we want to turn this into a metric, I would be more comfortable reducing the polling period but removing logs to avoid log spam.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, added a flag disk-cache-sync-period
with a default value of 10 minutes when unspecified.
pkg/linkcache/devices_linux.go
Outdated
} | ||
|
||
// Look at the dir for a symlink that matches the pvName | ||
symlink := filepath.Join(d.dir, "google-"+deviceName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has this been tested with nvme disks as well? It would be better to reuse functions in https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/pkg/deviceutils/device-utils.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't done specific testing for nvme disks but I've pushed a new commit that utilizes device-utils.go for evaluating the symlinks for a volumeID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've completed the testing for both NVMe and SCSI disk interface types.
I used the following machine types for the tests:
- SCSI: e2-standard-2
- NVMe: c3-standard-4
Here are the log outputs:
SCSI: The log shows both the google- and scsi- prefixed symlinks successfully resolving to the same device path (/dev/sdb).
Cache contents:
map[
/dev/disk/by-id/google-pvc-b08e725b-7bdf-499a-af8f-9bfba4c9ef69:{volumeID:projects/enginakdemir-joonix/zones/us-central1-c/disks/pvc-b08e725b-7bdf-499a-af8f-9bfba4c9ef69 realPath:/dev/sdb}
/dev/disk/by-id/scsi-0Google_PersistentDisk_pvc-b08e725b-7bdf-499a-af8f-9bfba4c9ef69:{volumeID:projects/enginakdemir-joonix/zones/us-central1-c/disks/pvc-b08e725b-7bdf-499a-af8f-9bfba4c9ef69 realPath:/dev/sdb}
]
NVMe: The log shows the google- prefixed symlink resolving to the NVMe device path (/dev/nvme0n2), while the scsi- symlink has an empty realPath.
Cache contents:
map[
/dev/disk/by-id/google-pvc-289e075f-e731-4e75-9519-4f0e5e4832ae:{volumeID:projects/enginakdemir-joonix/zones/us-central1-c/disks/pvc-289e075f-e731-4e75-9519-4f0e5e4832ae realPath:/dev/nvme0n2}
/dev/disk/by-id/scsi-0Google_PersistentDisk_pvc-289e075f-e731-4e75-9519-4f0e5e4832ae:{volumeID:projects/enginakdemir-joonix/zones/us-central1-c/disks/pvc-289e075f-e731-4e75-9519-4f0e5e4832ae realPath:}
]
I haven't done any additional testing on this PR since I took it over from Julian. As far as I know he was struggling to get the e2e tests passing, which I fixed with a few nil pointers checks. I'll do some more testing on this and get back to you |
61ea81b
to
61aeffd
Compare
pkg/linkcache/devices_linux.go
Outdated
} | ||
|
||
// Look at the status.volumesInUse field. For each, take the last section | ||
// of the string (after the last "/") and call AddVolume for that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment with what the expected syntax is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pkg/linkcache/devices_linux.go
Outdated
// The first token is of the form "kubernetes.io/csi/<driver-name>" or just "<driver-name>". | ||
// We should check if it contains the driver name we are interested in. | ||
if !strings.Contains(tokens[0], driverName) { | ||
klog.V(5).Infof("Skipping volume %q because it is not a %s volume.", volumeName, driverName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be common if a customer has multiple csi drivers. We probably don't need this log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this log line, done
61aeffd
to
ada0422
Compare
changes and printing the full list. example: periodic symlink cache read: /dev/disk/by-id/google-persistent-disk-0 -> /dev/sda; /dev/disk/by-id/google-pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb; /dev/disk/by-id/scsi-0Google_PersistentDisk_persistent-disk-0 -> /dev/sda; /dev/disk/by-id/scsi-0Google_PersistentDisk_pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb
easier to unit test.
ada0422
to
f1b9c24
Compare
Change volumes map to use symlink as key and add volumeID and real path to the deviceMappings object, this is because there can be multiple symlinks per volume Filter out caching and logging info on non pd/hyperdisk volumes Failure to get real path at add volume time is no longer an error the real path may be unavailable at this time Fix the device_windows.go NewDeviceCacheForNode parameters Add clarifying comments and remove excessive logging Move wg.Add out of goroutine to fix -verify presubmit
f1b9c24
to
9460d9d
Compare
/lgm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cemakd, msau42 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@cemakd: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
/lgtm |
/cherry-pick release-1.21 |
@cemakd: new pull request created: #2185 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
This is a copy of PR: #2097
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This PR adds a cache that periodically (configured to every minute currently) looks at the /dev/disk/by-id/ directory and evaluates the symlinks there. It maintains a cache of the symlink and the real path it points to.
This will help with debugging future filesystem issues. In a past OMG, we found that our insight into changes in symlinks for specific disks hampered our ability to debug. Logging marked the real path of the disk at mount and unmount, but the change in between couldn't be detected.
This PR will print those links every minute, also logging when elements of the cache change.
An example:
The cache will also note if a symlink is broken.
NOTE: Currently this filters out any thing in by-id/ that ends with -part[0-9]*$. This removes partitions, which are noise. Mounting partitions directly isn't well supported in GKE, but we may want to test that in the future.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: