Conversation

@karthikvetrivel
Member

@karthikvetrivel karthikvetrivel commented Oct 2, 2025

This PR is a part of this endeavor:

GPU Driver container should avoid re-installing drivers on spurious container restarts

Relevant PRs:

@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 68adf6a to 1991b8c Compare October 16, 2025 15:11
@karthikvetrivel karthikvetrivel marked this pull request as ready for review October 16, 2025 15:19
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 900f54b to 1991b8c Compare October 20, 2025 20:53
@karthikvetrivel
Member Author

@cdesiniotis I've moved shouldSkipUninstall so that the operands still release /run/nvidia/driver mounts.

@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 4aa4395 to 5fba425 Compare November 6, 2025 14:09
@karthikvetrivel karthikvetrivel marked this pull request as draft November 6, 2025 14:16
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch 2 times, most recently from ff04036 to b90a2d6 Compare November 20, 2025 19:25
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from b90a2d6 to b0b38ba Compare November 21, 2025 19:03
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from b0b38ba to b4afc5b Compare December 1, 2025 15:21
@karthikvetrivel karthikvetrivel marked this pull request as ready for review December 8, 2025 20:57
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch 2 times, most recently from 5d21d39 to 84ce300 Compare December 8, 2025 21:42
… trigger driver reinstall

Signed-off-by: Karthik Vetrivel <[email protected]>
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 84ce300 to 9ad1573 Compare December 8, 2025 21:45
Comment on lines 54 to 57
nvidiaModuleConfigFile = "/drivers/nvidia.conf"
nvidiaUVMModuleConfigFile = "/drivers/nvidia-uvm.conf"
nvidiaModsetModuleConfigFile = "/drivers/nvidia-modeset.conf"
nvidiaPeermemModuleConfigFile = "/drivers/nvidia-peermem.conf"
Collaborator

Question -- I don't see an extra volume mount introduced in NVIDIA/gpu-operator#1746 for making sure the k8s-driver-manager init container has access to these files. Am I missing something? Since we already have access to /run/nvidia/driver does it make sense to get these files from there?

Member Author

@karthikvetrivel karthikvetrivel Dec 12, 2025

Great catch, Chris! I naively assumed that the init container had this volume mount.

Updated the paths to read from /run/nvidia/driver/drivers/*.conf instead. One thing to note: these files are only accessible there after the driver has run at least once (unlike mounting the ConfigMap as a volume in the init container). That should be fine, though, because readModuleParams() handles missing files and driverModuleBuildNeeded should return true when no previous driver config exists.
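
For illustration, a minimal sketch of the tolerant read I'm describing -- the function name matches the PR, but the body here is only an approximation, not the merged code:

package driver // illustrative package name

import (
    "fmt"
    "os"
    "strings"
)

// readModuleParams treats a missing params file as "no custom parameters"
// rather than an error, so a first run (before the driver has ever populated
// /run/nvidia/driver/drivers) simply yields an empty string.
func readModuleParams(path string) (string, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        if os.IsNotExist(err) {
            return "", nil // file not there yet: no custom module parameters
        }
        return "", fmt.Errorf("reading module params from %s: %w", path, err)
    }
    return strings.TrimSpace(string(data)), nil
}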

Collaborator

One thing to note is that these files are only accessible there after the driver has run at least once

I believe you are right, but is this guaranteed to be the case? I guess these files will still exist (and be accessible to the init container) as long as the prior driver container did not clean up its /run/nvidia/driver bind mount (which should be the case for the "non-clean restart" scenario).

Member Author

I think it is guaranteed to be the case. The files are only accessible when:

  1. A prior driver container ran _mount_rootfs()
  2. AND did not run _unmount_rootfs() (i.e. a non-clean restart, as you mentioned)

This satisfies the intended behavior for non-clean restarts. In the configuration-update scenario, the old driver should call _shutdown, meaning the rootfs is unmounted and we do a complete reinstall. In that case it doesn't matter whether we can access the bind mount or not, because the driver config state file shouldn't exist.
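
To make that gating concrete, a rough sketch (the helper name is hypothetical; the paths follow the constants elsewhere in this PR, and this is not the actual k8s-driver-manager code):

package driver // illustrative package name

import "os"

const (
    driverRoot            = "/run/nvidia/driver"
    driverConfigStateFile = "/run/nvidia/driver-config.state"
)

// previousInstallVisible captures the gating above: files from the prior
// install are only visible if _mount_rootfs() ran and _unmount_rootfs()
// never did (non-clean restart). After a clean shutdown both checks fail,
// so a full reinstall happens regardless.
func previousInstallVisible() bool {
    if _, err := os.Stat(driverConfigStateFile); err != nil {
        return false // no recorded config: treat as a first install
    }
    if entries, err := os.ReadDir(driverRoot); err != nil || len(entries) == 0 {
        return false // bind mount released: nothing left from the prior container
    }
    return true
}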

Collaborator

Yeah, taking a look at our shutdown() function, the bind mount will only get cleaned up if we were able to unload the module successfully. So I think we are good. Resolving this thread.

Collaborator

@cdesiniotis cdesiniotis Jan 7, 2026

Reviving this one... what happens in the following scenario?

  1. Driver daemonset is deployed without any custom kernel module params (as in, no /run/nvidia/driver/drivers/*.conf files exist).
  2. The user creates a custom kernel module params configmap, and configures the operator to use it.
  3. The operator renders a new version of the driver daemonset. The only change is that now the kernel module params configmap is mounted into the main container.

In this case, we should not take the fast-path (userspace only) install track, since we need to re-load the modules with the custom parameters. But as currently implemented, I believe the fast-path install will take place...

Member Author

You're correct - the init container reads module params from /run/nvidia/driver/drivers/*.conf, which don't exist yet when the init container runs, so adding a ConfigMap would leave both the stored and current configs reading as empty strings. Unfortunately, that would trigger the fast path.

Two possible fixes: (1) use the operator-computed digest approach that includes ConfigMap resourceVersion in the hash, or (2) mount the kernel module ConfigMap into the init container as well (at /drivers) so it can read the actual parameter files before the main container starts.

@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from b91efd7 to 554586a Compare December 12, 2025 01:45
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch 2 times, most recently from 1bfb7e8 to 9fe353f Compare December 12, 2025 01:48
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 9fe353f to 8eb03da Compare December 12, 2025 02:09
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 8eb03da to 1ca6732 Compare December 17, 2025 18:41
}

// isDesiredDriverLoaded checks if the driver is loaded and the configuration matches
func (dm *DriverManager) isDesiredDriverLoaded() bool {
Contributor

Can we rename the method to shouldUpdateDriverConfig? If this method returns true, we don't skip the Uninstall. If it's false, we trigger the fast path.

Member Author

Updated.

defaultGracePeriod = 5 * time.Minute
driverRoot = "/run/nvidia/driver"
driverPIDFile = "/run/nvidia/nvidia-driver.pid"
driverConfigStateFile = "/run/nvidia/driver-config.state"
Collaborator

nit: should we rename this file to nvidia-driver.state / nvidia-driver.config to align with the nvidia-driver.pid file name?

Member Author

Updated.

Comment on lines +702 to +709
configTemplate := `DRIVER_VERSION=%s
DRIVER_TYPE=%s
KERNEL_VERSION=%s
GPU_DIRECT_RDMA_ENABLED=%v
USE_HOST_MOFED=%v
KERNEL_MODULE_TYPE=%s
NVIDIA_MODULE_PARAMS=%s
NVIDIA_UVM_MODULE_PARAMS=%s
NVIDIA_MODESET_MODULE_PARAMS=%s
NVIDIA_PEERMEM_MODULE_PARAMS=%s
`
Collaborator

Question -- Instead of duplicating this logic (of crafting the currently applied config file) in both the k8s-driver-manager and driver container, is there an alternative solution we should consider?

What if we offloaded this to the operator? When rendering the driver daemonset, the operator would craft the currently applied config, compute the digest of it, and pass this digest to the k8s-driver-manager and driver containers via an envvar, i.e. CURRENT_DRIVER_CONFIG=<....>. At runtime, the k8s-driver-manager and driver containers would compare the value of CURRENT_DRIVER_CONFIG with the digest stored in the /run/nvidia/driver/nvidia-driver.state file to determine whether the fast-path should be taken or not. After completing its install, the main driver container would write the current digest to /run/nvidia/driver/nvidia-driver.state.

The operator would compute the value of CURRENT_DRIVER_CONFIG by taking the digest of all the relevant envvars / settings + the resourceVersion of the kernel module parameters config map (if it exists).
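
A rough sketch of what that could look like, purely for illustration (the helper names are hypothetical; only CURRENT_DRIVER_CONFIG and the state-file idea come from the proposal above):

package driver // illustrative package name

import (
    "crypto/sha256"
    "encoding/hex"
    "os"
    "strings"
)

// computeConfigDigest sketches the operator side: hash the rendered driver
// settings together with the kernel-module-params ConfigMap resourceVersion
// (empty string if that ConfigMap does not exist).
func computeConfigDigest(settings []string, configMapResourceVersion string) string {
    h := sha256.New()
    for _, s := range settings {
        h.Write([]byte(s + "\n"))
    }
    h.Write([]byte(configMapResourceVersion))
    return hex.EncodeToString(h.Sum(nil))
}

// fastPathAllowed sketches the container side: compare the operator-provided
// digest against whatever the previous install persisted in the state file.
func fastPathAllowed(stateFile string) bool {
    desired := os.Getenv("CURRENT_DRIVER_CONFIG")
    stored, err := os.ReadFile(stateFile)
    if err != nil || desired == "" {
        return false // no prior state or no desired digest: do a full install
    }
    return strings.TrimSpace(string(stored)) == desired
}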

Member Author

There might be a few difficulties in this approach.

  1. The operator cannot know KERNEL_VERSION at render time, right? What if nodes run heterogeneous kernels that update independently?

  2. The container must check /sys/module/nvidia/refcnt to verify actual kernel state, which I think the operator cannot do since it lacks access to each node's /sys filesystem (fact-check me on this). Even with operator digests, the containers still need config-building logic to write state files post-install, so we wouldn't see much code savings.

I think the duplication is worth it so the logic can live in the containers. Each container can independently validate its own node's actual state (kernel version via uname -r, module state via /sys/module/nvidia/refcnt) without depending on operator coordination.

Let me know if I misunderstood.

Collaborator

The container must check /sys/module/nvidia/refcnt to verify actual kernel state

Yes, the containers still need to check whether the nvidia module is loaded or not; that will always be required. This is slightly orthogonal to the config / state file content, though.

The operator cannot know KERNEL_VERSION at render time, right?

Right, this is the one hole in my proposal. Taking a step back, do we really need to include the KERNEL_VERSION in the state file? Will the nvidia modules ever still be loaded when the kernel version has been updated? If the KERNEL_VERSION changes, it means the node was rebooted, and since the nvidia module does not persist across a reboot, it won't be loaded into the kernel anymore -- which means the fast-path install will never be an option.

containers still need config-building logic to write state files post-install

Yes agreed. The driver container will still write to a config / state file post-install to capture the current state. My proposal not only aims to reduce code duplication, but also to reduce how tightly coupled this feature is with the relevant components. As currently implemented, if we ever add a new envvar that needs to be captured in the state file, we would always need to update the k8s-driver-manager code to ensure this envvar gets added to the state file it constructs. With my proposal, changes to k8s-driver-manager wouldn't be needed as long as the operator communicated the desired config through some envvar, like CURRENT_DRIVER_CONFIG.

Contributor

I like the idea of decoupling the driver state management from the driver manager, but I am not sure we should do away with persisting the KERNEL_VERSION in the state. How would we account for scenarios like live kernel patching or similar mechanisms where the kernel may be updated without a node reboot?

Collaborator

I am not very familiar with live kernel patching, but from scanning https://docs.kernel.org/livepatch/livepatch.html I would not expect the actual kernel version string (as reported by uname) to change -- live patching allows you to patch certain functions in kernel code without regenerating the entire kernel image. SUSE has some docs that highlight this as well:

To determine the kernel patching status, use the klp -v patches command. The uname command's output does not change for patched kernels.

With that said, even if we determine we need to persist KERNEL_VERSION in the state file, I think my proposal could still work. The state file could be two lines:

<digest of current config>
<kernel version>

The k8s-driver-manager would build this file and compare it with the existing file at /run/nvidia/nvidia-driver.state.
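
For illustration, a tiny sketch of that two-line comparison (helper names are hypothetical; only the file path and layout come from this thread):

package driver // illustrative package name

import (
    "fmt"
    "os"
    "strings"
)

// buildStateFileContent renders the two-line format sketched above:
// line 1 is the config digest, line 2 the running kernel version.
func buildStateFileContent(configDigest, kernelVersion string) string {
    return fmt.Sprintf("%s\n%s\n", configDigest, kernelVersion)
}

// stateMatches compares freshly built content against what the previous
// install persisted at /run/nvidia/nvidia-driver.state.
func stateMatches(desired string) bool {
    stored, err := os.ReadFile("/run/nvidia/nvidia-driver.state")
    if err != nil {
        return false
    }
    return strings.TrimSpace(string(stored)) == strings.TrimSpace(desired)
}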

Contributor

What if we offloaded this to the operator?

Thinking about this a bit more, I don't think the driver state calculation should be done in the operator layer, as it would be scope creep. I think this logic should reside in the components that make up the driver daemonset. The state management is specific to the driver daemonset, after all.

Contributor

@tariq1890 tariq1890 Jan 7, 2026

I would also prefer a single, complete representation of the state, as opposed to the combination of <current-config-digest> and <kernel-version>.

@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 1ca6732 to 23696d4 Compare January 7, 2026 20:08
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 23696d4 to 2b7ad55 Compare January 7, 2026 22:45
return ""
}

return strings.Trim(string(utsname.Release[:]), " \r\n\t\u0000\uffff")
Contributor

Let's move this to a private method and call it from both getKernelVersion and readModuleParams.

Let's name this method SanitizeString, perhaps.
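
Something along these lines, as a sketch (kept unexported here on the assumption that "private" means an unexported function; the trim set is copied from the existing call):

package driver // illustrative package name

import "strings"

// sanitizeString trims whitespace, the NUL padding that utsname.Release
// carries, and the stray \uffff rune that the existing call strips inline.
func sanitizeString(s string) string {
    return strings.Trim(s, " \r\n\t\u0000\uffff")
}

getKernelVersion would then return sanitizeString(string(utsname.Release[:])), and readModuleParams could run file contents through the same helper.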
