Add shouldSkipUninstall to avoid GPU driver teardown on restart #103
Conversation
@cdesiniotis I've moved
Signed-off-by: Karthik Vetrivel <[email protected]>
…sure mount refresh Signed-off-by: Karthik Vetrivel <[email protected]>
… trigger driver reinstall Signed-off-by: Karthik Vetrivel <[email protected]>
cmd/driver-manager/main.go
Outdated
```go
nvidiaModuleConfigFile        = "/drivers/nvidia.conf"
nvidiaUVMModuleConfigFile     = "/drivers/nvidia-uvm.conf"
nvidiaModsetModuleConfigFile  = "/drivers/nvidia-modeset.conf"
nvidiaPeermemModuleConfigFile = "/drivers/nvidia-peermem.conf"
```
Question -- I don't see an extra volume mount introduced in NVIDIA/gpu-operator#1746 for making sure the k8s-driver-manager init container has access to these files. Am I missing something? Since we already have access to /run/nvidia/driver does it make sense to get these files from there?
Great catch, Chris! I naively assumed that the init container had this volume mount.
Updated the paths to read from /run/nvidia/driver/drivers/*.conf instead. One thing to note is that these files are only accessible there after the driver has run at least once (as opposed to mounting the ConfigMap as a volume in the init container). However, that should be fine because readModuleParams() handles missing files, and driverModuleBuildNeeded should return true because no previous driver config should exist.
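The missing-file tolerance described here can be sketched as follows (a minimal illustration assuming a hypothetical readModuleParams helper; the real function in cmd/driver-manager/main.go may differ):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// readModuleParams returns the kernel module parameters stored at path,
// or "" when the file does not exist. A missing file is not an error:
// it simply means no custom parameters were configured for that module.
func readModuleParams(path string) string {
	data, err := os.ReadFile(path)
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(data))
}

func main() {
	// Before the driver has run once, none of these files exist, so both
	// the stored and current params read as empty strings.
	fmt.Printf("%q\n", readModuleParams("/run/nvidia/driver/drivers/nvidia.conf"))
}
```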
One thing to note is that these files are only accessible there after the driver has run at least once
I believe you are right, but is this guaranteed to be the case? I guess these files will still exist (and be accessible to the init container) as long as the prior driver container did not clean up its /run/nvidia/driver bind mount (which should be the case for the "non-clean restart" scenario).
I think it is guaranteed to be the case. The files are only accessible when:
- a prior driver container ran _mount_rootfs()
- AND did not run _unmount_rootfs() (i.e. a non-clean restart, as you mentioned)

This satisfies the intended behavior for non-clean restarts. In the configuration update scenario, the old driver should call _shutdown, meaning the rootfs is unloaded and we do a complete reinstall. It doesn't matter in this case whether we can access the bind mount, because the driver config state file shouldn't exist.
Yeah, taking a look at our shutdown() function, the bind mount will only get cleaned up if we were able to unload the module successfully. So I think we are good. Resolving this thread.
Reviving this one... what happens in the following scenario?
- Driver daemonset is deployed without any custom kernel module params (as in, no /run/nvidia/driver/drivers/*.conf files exist).
- The user creates a custom kernel module params configmap, and configures the operator to use it.
- The operator renders a new version of the driver daemonset. The only change is that now the kernel module params configmap is mounted into the main container.

In this case, we should not take the fast-path (userspace only) install track, since we need to re-load the modules with the custom parameters. But as currently implemented, I believe the fast-path install will take place...
You're correct - the init container reads module params from /run/nvidia/driver/drivers/*.conf, which don't exist yet when the init container runs, so adding a ConfigMap would result in both the stored and current configs reading as empty strings. Unfortunately, this will trigger the fast path.
Two possible fixes: (1) use the operator-computed digest approach that includes ConfigMap resourceVersion in the hash, or (2) mount the kernel module ConfigMap into the init container as well (at /drivers) so it can read the actual parameter files before the main container starts.
cmd/driver-manager/main.go
Outdated
```go
}

// isDesiredDriverLoaded checks if the driver is loaded and the configuration matches
func (dm *DriverManager) isDesiredDriverLoaded() bool {
```
Can we rename the method to shouldUpdateDriverConfig? If this method returns true, we don't skip the uninstall. If it's false, we trigger the fast path.
Updated.
cmd/driver-manager/main.go
Outdated
```go
defaultGracePeriod    = 5 * time.Minute
driverRoot            = "/run/nvidia/driver"
driverPIDFile         = "/run/nvidia/nvidia-driver.pid"
driverConfigStateFile = "/run/nvidia/driver-config.state"
```
nit: should we rename this file to nvidia-driver.state / nvidia-driver.config to align with the nvidia-driver.pid file name?
Updated.
```go
configTemplate := `DRIVER_VERSION=%s
DRIVER_TYPE=%s
KERNEL_VERSION=%s
GPU_DIRECT_RDMA_ENABLED=%v
USE_HOST_MOFED=%v
KERNEL_MODULE_TYPE=%s
NVIDIA_MODULE_PARAMS=%s
NVIDIA_UVM_MODULE_PARAMS=%s
NVIDIA_MODESET_MODULE_PARAMS=%s
NVIDIA_PEERMEM_MODULE_PARAMS=%s
`
```
Question -- Instead of duplicating this logic (of crafting the currently applied config file) in both the k8s-driver-manager and driver container, is there an alternative solution we should consider?
What if we offloaded this to the operator? When rendering the driver daemonset, the operator would craft the currently applied config, compute the digest of it, and pass this digest to the k8s-driver-manager and driver containers via an envvar, i.e. CURRENT_DRIVER_CONFIG=<....>. At runtime, the k8s-driver-manager and driver containers would compare the value of CURRENT_DRIVER_CONFIG with the digest stored in the /run/nvidia/driver/nvidia-driver.state file to determine whether the fast-path should be taken or not. After completing its install, the main driver container would write the current digest to /run/nvidia/driver/nvidia-driver.state.
The operator would compute the value of CURRENT_DRIVER_CONFIG by taking the digest of all the relevant envvars / settings + the resourceVersion of the kernel module parameters config map (if it exists).
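The proposed operator-side digest could be sketched like this (hypothetical: the envvar name CURRENT_DRIVER_CONFIG and the set of inputs come from the proposal above, not from existing code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// driverConfigDigest hashes the relevant driver settings plus the
// resourceVersion of the kernel module params ConfigMap (empty string if
// the ConfigMap does not exist). Any change to any input changes the digest.
func driverConfigDigest(inputs ...string) string {
	h := sha256.New()
	for _, in := range inputs {
		h.Write([]byte(in))
		h.Write([]byte{0}) // separator so ("ab","c") hashes differently from ("a","bc")
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Example value that the operator could pass via CURRENT_DRIVER_CONFIG:
	fmt.Println(driverConfigDigest("550.54.14", "gpu-direct-rdma=true", "configmap-rv-12345"))
}
```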
There might be a few difficulties with this approach.

- The operator cannot know KERNEL_VERSION at render time, right? What if nodes run heterogeneous kernels that update independently?
- The container must check /sys/module/nvidia/refcnt to verify actual kernel state, which I think the operator cannot do since it lacks access to each node's /sys filesystem (fact-check me on this). Even with operator digests, containers still need config-building logic to write state files post-install, so we wouldn't see much code savings.

I think the duplication is worth having the logic live in the containers. Each container can independently validate its own node's actual state (kernel version via uname -r, module state via /sys/module/nvidia/refcnt) without depending on operator coordination.
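The per-node checks mentioned here can be sketched as follows (paths are parameterized for illustration; the real code presumably hardcodes /sys/module/nvidia/refcnt and uses uname or the equivalent syscall):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// moduleLoaded reports whether a kernel module is loaded by checking for
// its refcnt file under /sys/module (e.g. /sys/module/nvidia/refcnt).
func moduleLoaded(refcntPath string) bool {
	_, err := os.Stat(refcntPath)
	return err == nil
}

// kernelVersion returns the running kernel release, as `uname -r` reports it.
func kernelVersion() string {
	out, err := exec.Command("uname", "-r").Output()
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(out))
}

func main() {
	fmt.Println(moduleLoaded("/sys/module/nvidia/refcnt"), kernelVersion())
}
```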
Let me know if I misunderstood.
The container must check /sys/module/nvidia/refcnt to verify actual kernel state
Yes, the containers still need to check whether the nvidia module is loaded; that will always be required. This is slightly orthogonal to the config / state file content, though.
The operator cannot know KERNEL_VERSION at render time, right?
Right, this is the one hole in my proposal. Taking a step back, do we really need to include the KERNEL_VERSION in the state file? Will the nvidia modules ever still be loaded when the kernel version has been updated? If KERNEL_VERSION changes, it means the node was rebooted, and since the nvidia module does not persist across a reboot, it won't be loaded into the kernel anymore -- which means the fast-path install will never be an option.
containers still need config-building logic to write state files post-install
Yes agreed. The driver container will still write to a config / state file post-install to capture the current state. My proposal not only aims to reduce code duplication, but also to reduce how tightly coupled this feature is with the relevant components. As currently implemented, if we ever add a new envvar that needs to be captured in the state file, we would always need to update the k8s-driver-manager code to ensure this envvar gets added to the state file it constructs. With my proposal, changes to k8s-driver-manager wouldn't be needed as long as the operator communicated the desired config through some envvar, like CURRENT_DRIVER_CONFIG.
I like the idea of decoupling the driver state management from the driver manager, but I am not sure we should do away with persisting the KERNEL_VERSION in the state. How would we account for scenarios like live kernel patching or similar mechanisms, where the kernel may be updated without a node reboot?
I am not very familiar with live kernel patching, but from scanning https://docs.kernel.org/livepatch/livepatch.html I would not expect the actual kernel version string (as reported by uname) to change -- live patching allows you to patch certain functions in kernel code without regenerating the entire kernel image. SUSE has some docs that highlight this as well:
To determine the kernel patching status, use the klp -v patches command. The uname command's output does not change for patched kernels.
With that said, even if we determine we need to persist KERNEL_VERSION in the state file, I think my proposal could still work. The state file could be two lines:
<digest of current config>
<kernel version>
The k8s-driver-manager would build this file and compare it with the existing file at /run/nvidia/nvidia-driver.state.
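The two-line state file could be built and compared like this (a sketch of the format proposed above; fastPathAllowed is a hypothetical name for the comparison):

```go
package main

import (
	"fmt"
	"os"
)

// buildState renders the proposed two-line state file:
// line 1: digest of the current config, line 2: kernel version.
func buildState(configDigest, kernelVersion string) string {
	return configDigest + "\n" + kernelVersion + "\n"
}

// fastPathAllowed compares the freshly built state with the file at path
// (e.g. /run/nvidia/nvidia-driver.state written by the previous driver run).
func fastPathAllowed(path, configDigest, kernelVersion string) bool {
	existing, err := os.ReadFile(path)
	if err != nil {
		return false // no prior state: take the full install path
	}
	return string(existing) == buildState(configDigest, kernelVersion)
}

func main() {
	fmt.Println(fastPathAllowed("/run/nvidia/nvidia-driver.state", "abc123", "6.8.0-55-generic"))
}
```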
What if we offloaded this to the operator?
Thinking about this a bit more, I don't think the driver state calculation should be done in the operator layer, as that would be scope creep. I think this logic should reside in the components that make up the driver daemonset. The state management is specific to the driver daemonset, after all.
I would also prefer if we had a single complete representation of the state as opposed to the combination of the <current-config-digest> and <kernel-version>
Signed-off-by: Karthik Vetrivel <[email protected]>
```go
	return ""
}

return strings.Trim(string(utsname.Release[:]), " \r\n\t\u0000\uffff")
```
Let's move this to a private method and call that method from both getKernelVersion and readModuleParams.
Let's name this method SanitizeString, perhaps.
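A sketch of the suggested helper (the name is taken from the review comment; the cutset mirrors the strings.Trim call in the diff above):

```go
package main

import (
	"fmt"
	"strings"
)

// SanitizeString trims whitespace, NUL bytes, and U+FFFF padding, as needed
// when converting fixed-size utsname buffers or file contents to strings.
func SanitizeString(s string) string {
	return strings.Trim(s, " \r\n\t\u0000\uffff")
}

func main() {
	fmt.Printf("%q\n", SanitizeString("6.8.0-generic\x00\x00\n")) // → "6.8.0-generic"
}
```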
This PR is a part of this endeavor:
Relevant PRs: