@karthikvetrivel karthikvetrivel commented Dec 17, 2025

Wait for VFs in vGPU Manager Validation

Problem

When switching workloads from vm-passthrough to vm-vgpu, the vgpu-device-manager can start before vgpu-manager finishes creating Virtual Functions (VFs), causing "no parent devices found" errors. This race condition is more pronounced on nodes with many GPUs (8+) where VF creation takes longer.

Root Cause

The vgpu-manager-ready status file is created as soon as nvidia-smi succeeds, but VF creation via sriov-manage -e happens afterward, so the ready signal can fire before the VFs exist. The vgpu-device-manager needs the VFs to exist before it can configure vGPU types on them.

The sriov-manage script enables VFs sequentially, one GPU at a time, by writing the value of sriov_totalvfs to sriov_numvfs. On systems with 8 GPUs (160 total VFs), this can take 30+ seconds to complete.

Solution

Add waitForVFs() to VGPUManager.validate() in the nvidia-validator. After the nvidia-smi check passes, poll SR-IOV sysfs via the nvpci library until all VFs are created:

  • Find all NVIDIA PCI devices that are SR-IOV Physical Functions (SriovInfo.IsPF())
  • Poll until sum(NumVFs) == sum(TotalVFs) across all GPUs

Testing

Verified on DGX A100 (8 GPUs, 160 VFs):

time="2025-12-18T00:32:43+02:00" level=info msg="Waiting for VFs: 0/160 enabled across 8 GPU(s)"
time="2025-12-18T00:32:45+02:00" level=info msg="Waiting for VFs: 0/160 enabled across 8 GPU(s)"
time="2025-12-18T00:32:47+02:00" level=info msg="Waiting for VFs: 0/160 enabled across 8 GPU(s)"
time="2025-12-18T00:32:52+02:00" level=info msg="All 160 VF(s) enabled on 8 NVIDIA GPU(s)"

Tested by building a standalone binary that calls waitForVFs() directly, then enabling VFs via sriov_numvfs while the test was running to simulate the vgpu-manager's sriov-manage -e behavior.

VFs were enabled by binding each GPU to pci-pf-stub and writing to sriov_numvfs.

@karthikvetrivel karthikvetrivel force-pushed the fix-vgpu-dm-wait-for-vfs branch 2 times, most recently from 705fc2b to bb23f84 Compare December 17, 2025 21:50
@karthikvetrivel karthikvetrivel force-pushed the fix-vgpu-dm-wait-for-vfs branch from bb23f84 to 8097fde Compare December 17, 2025 22:44
@karthikvetrivel karthikvetrivel marked this pull request as ready for review December 17, 2025 22:55
@karthikvetrivel karthikvetrivel merged commit 4011723 into NVIDIA:main Dec 18, 2025
15 of 16 checks passed
@cdesiniotis cdesiniotis added this to the v26.x milestone Jan 6, 2026
@cdesiniotis cdesiniotis added the bug Issue/PR to expose/discuss/fix a bug label Jan 6, 2026