fix: always reschedule GPU operands after vGPU config by karthikvetrivel · Pull Request #164 · NVIDIA/vgpu-device-manager

karthikvetrivel · 2026-03-04T21:20:33Z

Problem

See #162 for context.

TL;DR:

Before applying a vGPU configuration, updateConfig() shuts down the GPU Operator components sandbox-device-plugin and sandbox-validator by setting their node labels to paused-for-vgpu-change. Previously, rescheduleGPUOperands() was called only at the end of updateConfig(), on the success path. If handleMIGConfiguration() or applyConfig() failed, the function returned early without rescheduling, and the labels stayed in the paused-for-vgpu-change state permanently.

This caused a deadlock: sandbox-validator never started, so it never created the vgpu-manager-ready validation file, and the vgpu-device-manager pod's init container blocked forever, even after the root cause was resolved.

Fix

Moved rescheduleGPUOperands() from an explicit call at the end of updateConfig() into a defer registered immediately after shutdownGPUOperands() succeeds. This guarantees operands are rescheduled on all exit paths (success, MIG config failure, vGPU apply failure).

This follows the same pattern used by the k8s-mig-manager in its setState() method, which reschedules GPU Operator components even when state == failed.

Test plan

Set nvidia.com/vgpu.config=A2-1Q on a non-A2 GPU node to trigger a config failure
Verified vgpu-device-manager shut down sandbox-device-plugin and sandbox-validator
Verified config failed: vGPU type A2-1Q is not supported on GPU
Verified operands rescheduled via defer (log: Restarting all GPU operands previously shutdown in Kubernetes)
Verified nvidia.com/vgpu.config.state=failed set correctly
Verified node labels restored: sandbox-device-plugin=true, sandbox-validator=true
Verified both sandbox pods returned to Running state
Restored config to default and verified it applied successfully

Signed-off-by: Karthik Vetrivel <kvvetrivel@gmail.com>
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>

fix: always reschedule GPU operands after vGPU config

c474036


karthikvetrivel marked this pull request as draft March 4, 2026 21:20

karthikvetrivel wants to merge 1 commit into NVIDIA:main from karthikvetrivel:fix/reschedule-operands-on-failure
