Conversation

@karthikvetrivel
Member

@karthikvetrivel karthikvetrivel commented Oct 2, 2025

This PR is a part of this endeavor:

GPU Driver container should avoid re-installing drivers on spurious container restarts

Relevant PRs:

@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 68adf6a to 1991b8c Compare October 16, 2025 15:11
@karthikvetrivel karthikvetrivel marked this pull request as ready for review October 16, 2025 15:19
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 900f54b to 1991b8c Compare October 20, 2025 20:53
@karthikvetrivel
Member Author

@cdesiniotis I've moved shouldSkipUninstall so that the operands still release /run/nvidia/driver mounts.

@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 4aa4395 to 5fba425 Compare November 6, 2025 14:09
@karthikvetrivel karthikvetrivel marked this pull request as draft November 6, 2025 14:16
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch 2 times, most recently from ff04036 to b90a2d6 Compare November 20, 2025 19:25
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from b90a2d6 to b0b38ba Compare November 21, 2025 19:03
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from b0b38ba to b4afc5b Compare December 1, 2025 15:21
@karthikvetrivel karthikvetrivel marked this pull request as ready for review December 8, 2025 20:57
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch 2 times, most recently from 5d21d39 to 84ce300 Compare December 8, 2025 21:42
… trigger driver reinstall

Signed-off-by: Karthik Vetrivel <[email protected]>
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 84ce300 to 9ad1573 Compare December 8, 2025 21:45
Comment on lines 54 to 57
nvidiaModuleConfigFile = "/drivers/nvidia.conf"
nvidiaUVMModuleConfigFile = "/drivers/nvidia-uvm.conf"
nvidiaModsetModuleConfigFile = "/drivers/nvidia-modeset.conf"
nvidiaPeermemModuleConfigFile = "/drivers/nvidia-peermem.conf"
Collaborator

Question -- I don't see an extra volume mount introduced in NVIDIA/gpu-operator#1746 for making sure the k8s-driver-manager init container has access to these files. Am I missing something? Since we already have access to /run/nvidia/driver does it make sense to get these files from there?

Member Author

@karthikvetrivel karthikvetrivel Dec 12, 2025

Great catch, Chris! I naively assumed that the init container had this volume mount.

Updated the paths to read from /run/nvidia/driver/drivers/*.conf instead. One thing to note: these files are only accessible there after the driver has run at least once (unlike mounting the ConfigMap as a volume in the init container). That should be fine, though, because readModuleParams() handles missing files and driverModuleBuildNeeded should return true when no previous driver config exists.
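
For illustration, a minimal sketch of the tolerant read I'm describing -- the function name matches the PR, but the body here is only an approximation, not the merged code:

package driver // illustrative package name

import (
    "fmt"
    "os"
    "strings"
)

// readModuleParams treats a missing params file as "no custom parameters"
// rather than an error, so a first run (before the driver has ever populated
// /run/nvidia/driver/drivers) simply yields an empty string.
func readModuleParams(path string) (string, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        if os.IsNotExist(err) {
            return "", nil // file not there yet: no custom module parameters
        }
        return "", fmt.Errorf("reading module params from %s: %w", path, err)
    }
    return strings.TrimSpace(string(data)), nil
}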

Collaborator

One thing to note is that these files are only accessible there after the driver has run at least once

I believe you are right, but is this guaranteed to be the case? I guess these files will still exist (and be accessible to the init container) as long as the prior driver container did not clean up its /run/nvidia/driver bind mount (which should be the case for the "non-clean restart" scenario).

Member Author

I think it is guaranteed to be the case. The files are only accessible when:

  1. A prior driver container ran _mount_rootfs()
  2. AND did not run _unmount_rootfs() (i.e. a non-clean restart, as you mentioned)

This satisfies the intended behavior for non-clean restarts. In the configuration-update scenario, the old driver should call _shutdown, meaning the rootfs is unmounted and we do a complete reinstall. In that case it doesn't matter whether we can access the bind mount or not, because the driver config state file shouldn't exist.
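
To make that gating concrete, a rough sketch (the helper name is hypothetical; the paths follow the constants elsewhere in this PR, and this is not the actual k8s-driver-manager code):

package driver // illustrative package name

import "os"

const (
    driverRoot            = "/run/nvidia/driver"
    driverConfigStateFile = "/run/nvidia/driver-config.state"
)

// previousInstallVisible captures the gating above: files from the prior
// install are only visible if _mount_rootfs() ran and _unmount_rootfs()
// never did (non-clean restart). After a clean shutdown both checks fail,
// so a full reinstall happens regardless.
func previousInstallVisible() bool {
    if _, err := os.Stat(driverConfigStateFile); err != nil {
        return false // no recorded config: treat as a first install
    }
    if entries, err := os.ReadDir(driverRoot); err != nil || len(entries) == 0 {
        return false // bind mount released: nothing left from the prior container
    }
    return true
}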

Collaborator

Yeah, taking a look at our shutdown() function, the bind mount will only get cleaned up if we were able to unload the module successfully. So I think we are good. Resolving this thread.

Collaborator

@cdesiniotis cdesiniotis Jan 7, 2026

Reviving this one... what happens in the following scenario?

  1. Driver daemonset is deployed without any custom kernel module params (as in, no /run/nvidia/driver/drivers/*.conf files exist).
  2. The user creates a custom kernel module params configmap, and configures the operator to use it.
  3. The operator renders a new version of the driver daemonset. The only change is that now the kernel module params configmap is mounted into the main container.

In this case, we should not take the fast-path (userspace only) install track, since we need to re-load the modules with the custom parameters. But as currently implemented, I believe the fast-path install will take place...

Member Author

You're correct - the init container reads module params from /run/nvidia/driver/drivers/*.conf, which don't exist yet when the init container runs, so adding a ConfigMap would leave both the stored and current configs reading as empty strings. Unfortunately, that would trigger the fast path.

Two possible fixes: (1) use the operator-computed digest approach that includes ConfigMap resourceVersion in the hash, or (2) mount the kernel module ConfigMap into the init container as well (at /drivers) so it can read the actual parameter files before the main container starts.

@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from b91efd7 to 554586a Compare December 12, 2025 01:45
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch 2 times, most recently from 1bfb7e8 to 9fe353f Compare December 12, 2025 01:48
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 9fe353f to 8eb03da Compare December 12, 2025 02:09
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 8eb03da to 1ca6732 Compare December 17, 2025 18:41
}

// isDesiredDriverLoaded checks if the driver is loaded and the configuration matches
func (dm *DriverManager) isDesiredDriverLoaded() bool {
Contributor

Can we rename the method to shouldUpdateDriverConfig? If this method returns true, we don't skip the Uninstall. If it's false, we trigger the fast path.

Member Author

Updated.

defaultGracePeriod = 5 * time.Minute
driverRoot = "/run/nvidia/driver"
driverPIDFile = "/run/nvidia/nvidia-driver.pid"
driverConfigStateFile = "/run/nvidia/driver-config.state"
Collaborator

nit: should we rename this file to nvidia-driver.state / nvidia-driver.config to align with the nvidia-driver.pid file name?

Member Author

Updated.

Comment on lines +702 to +709
configTemplate := `DRIVER_VERSION=%s
DRIVER_TYPE=%s
KERNEL_VERSION=%s
GPU_DIRECT_RDMA_ENABLED=%v
USE_HOST_MOFED=%v
KERNEL_MODULE_TYPE=%s
NVIDIA_MODULE_PARAMS=%s
NVIDIA_UVM_MODULE_PARAMS=%s
NVIDIA_MODESET_MODULE_PARAMS=%s
NVIDIA_PEERMEM_MODULE_PARAMS=%s
`
Collaborator

Question -- Instead of duplicating this logic (of crafting the currently applied config file) in both the k8s-driver-manager and driver container, is there an alternative solution we should consider?

What if we offloaded this to the operator? When rendering the driver daemonset, the operator would craft the currently applied config, compute the digest of it, and pass this digest to the k8s-driver-manager and driver containers via an envvar, i.e. CURRENT_DRIVER_CONFIG=<....>. At runtime, the k8s-driver-manager and driver containers would compare the value of CURRENT_DRIVER_CONFIG with the digest stored in the /run/nvidia/driver/nvidia-driver.state file to determine whether the fast-path should be taken or not. After completing its install, the main driver container would write the current digest to /run/nvidia/driver/nvidia-driver.state.

The operator would compute the value of CURRENT_DRIVER_CONFIG by taking the digest of all the relevant envvars / settings + the resourceVersion of the kernel module parameters config map (if it exists).
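
A rough sketch of what that could look like, purely for illustration (the helper names are hypothetical; only CURRENT_DRIVER_CONFIG and the state-file idea come from the proposal above):

package driver // illustrative package name

import (
    "crypto/sha256"
    "encoding/hex"
    "os"
    "strings"
)

// computeConfigDigest sketches the operator side: hash the rendered driver
// settings together with the kernel-module-params ConfigMap resourceVersion
// (empty string if that ConfigMap does not exist).
func computeConfigDigest(settings []string, configMapResourceVersion string) string {
    h := sha256.New()
    for _, s := range settings {
        h.Write([]byte(s + "\n"))
    }
    h.Write([]byte(configMapResourceVersion))
    return hex.EncodeToString(h.Sum(nil))
}

// fastPathAllowed sketches the container side: compare the operator-provided
// digest against whatever the previous install persisted in the state file.
func fastPathAllowed(stateFile string) bool {
    desired := os.Getenv("CURRENT_DRIVER_CONFIG")
    stored, err := os.ReadFile(stateFile)
    if err != nil || desired == "" {
        return false // no prior state or no desired digest: do a full install
    }
    return strings.TrimSpace(string(stored)) == desired
}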

Member Author

There might be a few difficulties in this approach.

  1. The operator cannot know KERNEL_VERSION at render time, right? What if nodes run heterogeneous kernels that update independently?

  2. The container must check /sys/module/nvidia/refcnt to verify actual kernel state, which I think the operator cannot do since it lacks access to each node's /sys filesystem (fact-check me on this). Even with operator digests, the containers still need config-building logic to write state files post-install, so we wouldn't see much code savings.

I think the duplication is worth it so the logic can live in the containers. Each container can independently validate its own node's actual state (kernel version via uname -r, module state via /sys/module/nvidia/refcnt) without depending on operator coordination.

Let me know if I misunderstood.

Collaborator

The container must check /sys/module/nvidia/refcnt to verify actual kernel state

Yes, the containers still need to check whether the nvidia module is loaded or not; that will always be required. This is slightly orthogonal to the config / state file content, though.

The operator cannot know KERNEL_VERSION at render time, right?

Right, this is the one hole in my proposal. Taking a step back, do we really need to include the KERNEL_VERSION in the state file? Will the nvidia modules ever still be loaded when the kernel version has been updated? If the KERNEL_VERSION changes, it means the node was rebooted, and since the nvidia module does not persist across a reboot, it won't be loaded into the kernel anymore -- which means the fast-path install will never be an option.

containers still need config-building logic to write state files post-install

Yes agreed. The driver container will still write to a config / state file post-install to capture the current state. My proposal not only aims to reduce code duplication, but also to reduce how tightly coupled this feature is with the relevant components. As currently implemented, if we ever add a new envvar that needs to be captured in the state file, we would always need to update the k8s-driver-manager code to ensure this envvar gets added to the state file it constructs. With my proposal, changes to k8s-driver-manager wouldn't be needed as long as the operator communicated the desired config through some envvar, like CURRENT_DRIVER_CONFIG.

Contributor

I like the idea of decoupling the driver state management from the driver manager, but I am not sure we should do away with persisting the KERNEL_VERSION in the state. How would we account for scenarios like live kernel patching or similar mechanisms where the kernel may be updated without a node reboot?

Collaborator

I am not very familiar with live kernel patching, but from scanning https://docs.kernel.org/livepatch/livepatch.html I would not expect the actual kernel version string (as reported by uname) to change -- live patching allows you to patch certain functions in kernel code without regenerating the entire kernel image. SUSE has some docs that highlight this as well:

To determine the kernel patching status, use the klp -v patches command. The uname command's output does not change for patched kernels.

With that said, even if we determine we need to persist KERNEL_VERSION in the state file, I think my proposal could still work. The state file could be two lines:

<digest of current config>
<kernel version>

The k8s-driver-manager would build this file and compare it with the existing file at /run/nvidia/nvidia-driver.state.
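
For illustration, a tiny sketch of that two-line comparison (helper names are hypothetical; only the file path and layout come from this thread):

package driver // illustrative package name

import (
    "fmt"
    "os"
    "strings"
)

// buildStateFileContent renders the two-line format sketched above:
// line 1 is the config digest, line 2 the running kernel version.
func buildStateFileContent(configDigest, kernelVersion string) string {
    return fmt.Sprintf("%s\n%s\n", configDigest, kernelVersion)
}

// stateMatches compares freshly built content against what the previous
// install persisted at /run/nvidia/nvidia-driver.state.
func stateMatches(desired string) bool {
    stored, err := os.ReadFile("/run/nvidia/nvidia-driver.state")
    if err != nil {
        return false
    }
    return strings.TrimSpace(string(stored)) == strings.TrimSpace(desired)
}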

Contributor

What if we offloaded this to the operator?

Thinking about this a bit more, I don't think the driver state calculation should be done in the operator layer, as it would be scope creep. I think this logic should reside in the components that make up the driver daemonset. The state management is specific to the driver daemonset, after all.

Contributor

@tariq1890 tariq1890 Jan 7, 2026

I would also prefer a single, complete representation of the state, as opposed to the combination of <current-config-digest> and <kernel-version>.

@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 1ca6732 to 23696d4 Compare January 7, 2026 20:08
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 23696d4 to 2b7ad55 Compare January 7, 2026 22:45
return ""
}

return strings.Trim(string(utsname.Release[:]), " \r\n\t\u0000\uffff")
Contributor

Let's move this to a private method and call it from both getKernelVersion and readModuleParams.

Let's name this method SanitizeString, perhaps.
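
Something along these lines, as a sketch (kept unexported here on the assumption that "private" means an unexported function; the trim set is copied from the existing call):

package driver // illustrative package name

import "strings"

// sanitizeString trims whitespace, the NUL padding that utsname.Release
// carries, and the stray \uffff rune that the existing call strips inline.
func sanitizeString(s string) string {
    return strings.Trim(s, " \r\n\t\u0000\uffff")
}

getKernelVersion would then return sanitizeString(string(utsname.Release[:])), and readModuleParams could run file contents through the same helper.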
