fix: always reschedule GPU operands after vGPU config #164

Draft
karthikvetrivel wants to merge 1 commit into NVIDIA:main from karthikvetrivel:fix/reschedule-operands-on-failure


@karthikvetrivel
Member

Problem

See #162 for context.

TL;DR:

Before applying a vGPU configuration, updateConfig() shuts down GPU Operator components (sandbox-device-plugin, sandbox-validator) by setting their node labels to paused-for-vgpu-change. Previously, rescheduleGPUOperands() was only called at the end of updateConfig() on the success path. If handleMIGConfiguration() or applyConfig() failed, the function returned early without rescheduling, leaving the labels in the paused-for-vgpu-change state permanently.

This caused a deadlock: the sandbox-validator never starts, so it never creates the vgpu-manager-ready validation file, so the vgpu-device-manager pod's init container blocks forever — even after the root cause is resolved.

Fix

Moved rescheduleGPUOperands() from an explicit call at the end of updateConfig() into a defer immediately after shutdownGPUOperands() succeeds. This guarantees operands are rescheduled on all exit paths (success, MIG config failure, vGPU apply failure).

This follows the same pattern used by the k8s-mig-manager in its setState() method, which reschedules GPU Operator components even when state == failed.

Test plan

  • Set nvidia.com/vgpu.config=A2-1Q on a non-A2 GPU node to trigger a config failure
  • Verified vgpu-device-manager shut down sandbox-device-plugin and sandbox-validator
  • Verified config failed: vGPU type A2-1Q is not supported on GPU
  • Verified operands rescheduled via defer (log: Restarting all GPU operands previously shutdown in Kubernetes)
  • Verified nvidia.com/vgpu.config.state=failed set correctly
  • Verified node labels restored: sandbox-device-plugin=true, sandbox-validator=true
  • Verified both sandbox pods returned to Running state
  • Restored config to default — applied successfully

Signed-off-by: Karthik Vetrivel <kvvetrivel@gmail.com>
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
karthikvetrivel marked this pull request as draft March 4, 2026 21:20