Wait for parent devices (VFs) before applying vGPU config #2002
Wait for VFs in vGPU Manager Validation
Problem
When switching workloads from vm-passthrough to vm-vgpu, the vgpu-device-manager can start before vgpu-manager finishes creating Virtual Functions (VFs), causing "no parent devices found" errors. This race condition is more pronounced on nodes with many GPUs (8+), where VF creation takes longer.
Root Cause
The vgpu-manager-ready status file is created when nvidia-smi works, but VF creation via sriov-manage -e happens after that. The vgpu-device-manager needs VFs to exist before it can configure vGPU types on them.
The sriov-manage script enables VFs by writing sriov_totalvfs to sriov_numvfs for each GPU sequentially. On systems with 8 GPUs (160 total VFs), this can take 30+ seconds to complete.
Solution
Add waitForVFs() to VGPUManager.validate() in the nvidia-validator. After the nvidia-smi check passes, poll SR-IOV sysfs via the nvpci library until all VFs are created:
- Physical functions are identified with SriovInfo.IsPF()
- Waiting ends once sum(NumVFs) == sum(TotalVFs) across all GPUs
Testing
Verified on DGX A100 (8 GPUs, 160 VFs):
Tested by building a standalone binary that calls waitForVFs() directly, then enabling VFs via sriov_numvfs while the test was running to simulate the vgpu-manager's sriov-manage -e behavior. VFs were enabled by binding each GPU to pci-pf-stub and writing to sriov_numvfs.