
Conversation

karthikvetrivel commented Jan 7, 2026

Description

Prevents unnecessary GPU unbind/rebind operations during rolling updates of the vfio-manager DaemonSet. Currently, k8s-driver-manager unconditionally unbinds all GPUs from vfio-pci on startup, even when the desired state is already vfio-pci. This disrupts active VM workloads using GPU passthrough (KubeVirt, Kata Containers).

This fix checks the node's nvidia.com/gpu.workload.config label to determine whether the node is in VFIO mode (vm-passthrough or vm-vgpu). If so, it verifies that all GPUs are already bound to vfio-pci variants before proceeding with the unbind. If they are already in the desired state, the unbind operation is skipped entirely.
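A minimal, self-contained sketch of the skip check described above (not the PR's actual code: the helper names isBoundToVFIO and shouldSkipUnbind, the sysfs driver lookup, and the "driver name contains vfio" test are assumptions made for illustration):

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

// isBoundToVFIO reports whether the PCI device at addr has a driver whose name
// looks like a vfio-pci variant. The sysfs readlink and the substring test are
// assumptions for this sketch, not necessarily the PR's exact logic.
func isBoundToVFIO(addr string) bool {
    link, err := os.Readlink(filepath.Join("/sys/bus/pci/devices", addr, "driver"))
    if err != nil {
        // No driver symlink means no driver is currently bound.
        return false
    }
    return strings.Contains(filepath.Base(link), "vfio")
}

// shouldSkipUnbind returns true only when every GPU is already bound to a
// vfio-pci variant, so a rolling update does not need to touch the devices.
func shouldSkipUnbind(gpuAddrs []string) bool {
    if len(gpuAddrs) == 0 {
        return false
    }
    for _, addr := range gpuAddrs {
        if !isBoundToVFIO(addr) {
            return false
        }
    }
    return true
}

func main() {
    gpus := []string{"0000:65:00.0"} // PCI address taken from the test logs below
    if shouldSkipUnbind(gpus) {
        fmt.Println("All GPUs already bound to vfio-pci variants, skipping unbind")
    } else {
        fmt.Println("Unbinding vfio-pci driver from all devices")
    }
}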

Testing

Scenario 1: Rolling update with no state change

  • Node label: nvidia.com/gpu.workload.config=vm-passthrough
  • GPU already bound to vfio-pci

Result: Unbind was skipped, GPU remained bound through rolling updates

time=2026-01-07T18:03:45Z level=info msg=All 1 GPUs are on vfio-pci variants
time=2026-01-07T18:03:45Z level=info msg=All GPUs already bound to vfio-pci variants, skipping unbind

Scenario 2: State transition (nvidia → vfio)

  • Node label: nvidia.com/gpu.workload.config=vm-passthrough
  • GPU unbound (no driver), which exercises the same state transition as moving from the NVIDIA driver to the vfio driver

Result: Unbind proceeded as expected

time=2026-01-07T18:37:07Z level=info msg=GPU 0000:65:00.0 is bound to  (not vfio)
time=2026-01-07T18:37:07Z level=info msg=Unbinding vfio-pci driver from all devices

karthikvetrivel marked this pull request as draft January 7, 2026 17:09
karthikvetrivel marked this pull request as ready for review January 7, 2026 20:56
}

func (dm *DriverManager) isVFIOWorkloadConfig() bool {
    workloadConfig, err := dm.kubeClient.GetNodeLabelValue(dm.config.nodeName, gpuWorkloadConfigLabelKey)
Collaborator

There is a hole in this implementation currently. It is possible for the vfio-manager pod to run on nodes that do not have the nvidia.com/gpu.workload.config=vm-passthrough label set. For example, if users have set sandboxWorkloads.defaultWorkload=vm-passthrough in ClusterPolicy, then vfio-manager will get deployed by default on GPU nodes if the nvidia.com/gpu.workload.config label is not present.

For this implementation to be complete, we would need to take the value of sandboxWorkloads.defaultWorkload into account, similar to what is done in the validator: https://github.com/NVIDIA/gpu-operator/blob/2104c0e1ff1893012c3d72f5c09f2c345bc4313c/cmd/nvidia-validator/main.go#L477

What are your thoughts on this alternative solution: in the gpu-operator, when we render the vfio-manager daemonset, we know it will only run for the vm-passthrough use case. Could we not just pass the "workload type" as an env var to the init container? The k8s-driver-manager code would then use the newly introduced env var to determine what to do with GPUs that are already bound to the vfio-pci driver.
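A rough sketch of that alternative, for discussion only (the env var name GPU_WORKLOAD_TYPE and its plumbing are assumptions, not an existing gpu-operator interface):

package main

import (
    "fmt"
    "os"
)

// Hypothetical env var that the gpu-operator would set on the k8s-driver-manager
// init container when rendering the vfio-manager daemonset.
const workloadTypeEnv = "GPU_WORKLOAD_TYPE"

// isVMPassthroughWorkload skips the node-label lookup entirely and relies on the
// operator telling the init container which workload type it was deployed for.
func isVMPassthroughWorkload() bool {
    return os.Getenv(workloadTypeEnv) == "vm-passthrough"
}

func main() {
    if isVMPassthroughWorkload() {
        fmt.Println("vm-passthrough node: leave GPUs already bound to vfio-pci alone")
    } else {
        fmt.Println("not a vm-passthrough node: run the normal unbind path")
    }
}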

        return false
    }

    return strings.HasPrefix(workloadConfig, "vm-")
Collaborator

We only want to return true when the workload config is vm-passthrough (not vm-vgpu)
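In code terms, the suggestion amounts to an exact comparison rather than a prefix match (sketch only; the package and constant names are illustrative):

package driver

// Match vm-passthrough exactly so that vm-vgpu nodes do not skip the unbind.
const gpuWorkloadConfigVMPassthrough = "vm-passthrough"

func isVMPassthrough(workloadConfig string) bool {
    return workloadConfig == gpuWorkloadConfigVMPassthrough
}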
