fix: always reschedule GPU operands after vGPU config #164
Draft
karthikvetrivel wants to merge 1 commit into NVIDIA:main
Signed-off-by: Karthik Vetrivel <kvvetrivel@gmail.com>
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
Problem
See #162 for context.
TL;DR:
Before applying a vGPU configuration, `updateConfig()` shuts down GPU Operator components (sandbox-device-plugin, sandbox-validator) by setting their node labels to `paused-for-vgpu-change`. Previously, `rescheduleGPUOperands()` was only called at the end of `updateConfig()` on the success path. If `handleMIGConfiguration()` or `applyConfig()` failed, the function returned early without rescheduling, leaving the labels in the `paused-for-vgpu-change` state permanently.

This caused a deadlock: the sandbox-validator never starts, so it never creates the `vgpu-manager-ready` validation file, so the vgpu-device-manager pod's init container blocks forever, even after the root cause is resolved.

Fix
Moved `rescheduleGPUOperands()` from an explicit call at the end of `updateConfig()` into a `defer` immediately after `shutdownGPUOperands()` succeeds. This guarantees operands are rescheduled on all exit paths (success, MIG config failure, vGPU apply failure).
This follows the same pattern used by the k8s-mig-manager in its `setState()` method, which reschedules GPU Operator components even when `state == failed`.

Test plan
- Set `nvidia.com/vgpu.config=A2-1Q` on a non-A2 GPU node to trigger a config failure
- Config fails with `vGPU type A2-1Q is not supported on GPU`
- Operands rescheduled by the `defer` (log: `Restarting all GPU operands previously shutdown in Kubernetes`)
- `nvidia.com/vgpu.config.state=failed` set correctly
- Labels: `sandbox-device-plugin=true`, `sandbox-validator=true`
- `Running` state
- `default`: applied successfully