Create VGPU changes with VFIO Framework #139

Open
JunAr7112 wants to merge 12 commits into NVIDIA:main from JunAr7112:vfio_changes

@JunAr7112
Contributor

No description provided.

@copy-pr-bot

copy-pr-bot bot commented Nov 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Arjun <agadiyar@nvidia.com>
Signed-off-by: Arjun <agadiyar@nvidia.com>
Contributor

@cdesiniotis cdesiniotis left a comment

Thanks @JunAr7112, this is a good start. As we iterate on this and get more familiar with the internals, it may be valuable to create a new internal/vgpu package that hides the vfio vs. mdev framework complexity. We need to think through what the right interface would be, but I imagine we will need methods for 1) getting all vGPU devices, 2) getting all parent devices (the devices on top of which a vGPU device can be created), and 3) creating a vGPU device. The pkg/vgpu/config.go file, which is concerned with getting and setting a particular vGPU config, can invoke these methods without having to know about vfio or mdev.
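
One possible shape for the interface described above, as a rough sketch only. The names Manager, Device, ParentDevice, and fakeManager are hypothetical and do not come from this PR; the in-memory fake exists only to show how a caller such as a config getter could count vGPU types without knowing which framework backs them.

```go
package main

import (
	"errors"
	"fmt"
)

// Device is a hypothetical handle for an existing vGPU device.
type Device struct {
	Type   string // vGPU type name, e.g. "A100-4C"
	Parent string // PCI address of the parent device
}

// ParentDevice is a hypothetical handle for a device that can host vGPUs.
type ParentDevice struct {
	Address string
}

// Manager hides whether vGPUs are backed by mdev or by VFIO.
// The three methods follow the list in the review comment.
type Manager interface {
	GetAllDevices() ([]*Device, error)
	GetAllParentDevices() ([]*ParentDevice, error)
	CreateDevice(parent *ParentDevice, vgpuType string) (*Device, error)
}

// fakeManager is an in-memory stand-in used only to exercise the interface.
type fakeManager struct {
	parents []*ParentDevice
	devices []*Device
}

func (m *fakeManager) GetAllDevices() ([]*Device, error) { return m.devices, nil }

func (m *fakeManager) GetAllParentDevices() ([]*ParentDevice, error) { return m.parents, nil }

func (m *fakeManager) CreateDevice(parent *ParentDevice, vgpuType string) (*Device, error) {
	if parent == nil {
		return nil, errors.New("parent device is nil")
	}
	d := &Device{Type: vgpuType, Parent: parent.Address}
	m.devices = append(m.devices, d)
	return d, nil
}

// countByType shows a caller that depends only on the interface.
func countByType(m Manager) (map[string]int, error) {
	devices, err := m.GetAllDevices()
	if err != nil {
		return nil, fmt.Errorf("listing vGPU devices: %w", err)
	}
	counts := map[string]int{}
	for _, d := range devices {
		counts[d.Type]++
	}
	return counts, nil
}

func main() {
	m := &fakeManager{parents: []*ParentDevice{{Address: "0000:41:00.0"}}}
	parents, _ := m.GetAllParentDevices()
	if _, err := m.CreateDevice(parents[0], "A100-4C"); err != nil {
		panic(err)
	}
	counts, _ := countByType(m)
	fmt.Println(counts)
}
```

With this split, an mdev-backed and a VFIO-backed implementation could each satisfy Manager, and pkg/vgpu/config.go would only ever see the interface.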

@JunAr7112 JunAr7112 force-pushed the vfio_changes branch 5 times, most recently from 8846795 to 5daf473 Compare November 24, 2025 16:54
@JunAr7112 JunAr7112 force-pushed the vfio_changes branch 6 times, most recently from 15a9586 to b1fd32d Compare December 4, 2025 17:17
Signed-off-by: Arjun <agadiyar@nvidia.com>
@JunAr7112 JunAr7112 force-pushed the vfio_changes branch 2 times, most recently from ae21d52 to 50e2185 Compare December 5, 2025 22:43
Signed-off-by: Arjun <agadiyar@nvidia.com>
@JunAr7112 JunAr7112 force-pushed the vfio_changes branch 2 times, most recently from 6bc8da3 to d134551 Compare January 15, 2026 23:32
if ret != nvml.SUCCESS {
  continue
}
vgpuConfig[typeName]++
Contributor

Question -- does the name reported by NVML, e.g. vgpuTypeId.GetName(), align exactly with the name we were using before? (the names stored in vgpuDev.MDEVType)

Contributor Author

I added a helper function to ensure they align exactly.

Contributor

So the name reported by NVML contains the product prefix? For example, is NVML returning NVIDIA A100-4C as the type name for the A100-4C device?

Contributor Author

@JunAr7112 JunAr7112 Mar 20, 2026

Yes, I checked this manually earlier. The name reported by NVML included a prefix, so I added parseVGPUTypeName(rawName string) to make sure that we only get A100-4C.
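
For readers following the thread, here is a minimal illustration of the kind of normalization being discussed. This is not the helper from the PR, whose body is not shown here; it assumes the prefix is exactly the string "NVIDIA " as in the example above, and it passes names without that prefix through unchanged.

```go
package main

import (
	"fmt"
	"strings"
)

// productPrefix is an assumption taken from the "NVIDIA A100-4C" example
// in this thread; the real helper may handle other forms.
const productPrefix = "NVIDIA "

// parseVGPUTypeName is an illustrative stand-in that strips the product
// prefix from a vGPU type name reported by NVML.
func parseVGPUTypeName(rawName string) string {
	return strings.TrimPrefix(strings.TrimSpace(rawName), productPrefix)
}

func main() {
	fmt.Println(parseVGPUTypeName("NVIDIA A100-4C")) // prints "A100-4C"
	fmt.Println(parseVGPUTypeName("A100-4C"))        // prints "A100-4C"
}
```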

vfnum := 0
numVF := int(device.SriovInfo.PhysicalFunction.NumVFs)
for vfnum < numVF {
  vfAddr := filepath.Join(HostPCIDevicesRoot, device.Address, "virtfn"+strconv.Itoa(vfnum), "nvidia")
Contributor

I don't think the nvidia directory should be included in this path. The "path to the VF" is simply /sys/bus/pci/devices/<BDF>/virtfn<N>. Other parts of the code are not intuitive to me because vfAddr includes the nvidia directory.

Contributor Author

local-agadiyar@ipp1-2284:/sys/bus/pci/devices/0000:41:00.0/virtfn0/nvidia$ cat current_vgpu_type
687
local-agadiyar@ipp1-2284:/sys/bus/pci/devices/0000:41:00.0/virtfn0/nvidia$ cat creatable_vgpu_types
ID : vGPU Name

The current_vgpu_type and creatable_vgpu_types files are located in the nvidia folder. Including nvidia in vfAddr means we don't need to append it to another address variable.
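
One way to reconcile the two positions in this thread is to keep the VF path itself free of the nvidia directory and add a small helper for files that live under it. The sketch below is illustrative only: vfPath and vfNvidiaFile are hypothetical names, and hostPCIDevicesRoot is assumed to be /sys/bus/pci/devices as in the reviewer's comment.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
)

// hostPCIDevicesRoot is assumed to match the sysfs root mentioned in the review.
const hostPCIDevicesRoot = "/sys/bus/pci/devices"

// vfPath returns the sysfs path of virtual function vfIndex under the
// physical function at pfAddress, without any driver-specific suffix.
func vfPath(pfAddress string, vfIndex int) string {
	return filepath.Join(hostPCIDevicesRoot, pfAddress, "virtfn"+strconv.Itoa(vfIndex))
}

// vfNvidiaFile returns the path of a file in the VF's nvidia directory,
// for example current_vgpu_type or creatable_vgpu_types.
func vfNvidiaFile(pfAddress string, vfIndex int, name string) string {
	return filepath.Join(vfPath(pfAddress, vfIndex), "nvidia", name)
}

func main() {
	fmt.Println(vfPath("0000:41:00.0", 0))
	fmt.Println(vfNvidiaFile("0000:41:00.0", 0, "current_vgpu_type"))
}
```

Callers that only need the VF get a path that means what its name says, and the nvidia suffix is still appended in exactly one place.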

  return nil, fmt.Errorf("virtual function %d at address %s does not exist", vfnum, vfAddr)
}
parentDevices = append(parentDevices, &ParentDevice{
  NvidiaPCIDevice: device,
Contributor

Question: Shouldn't NvidiaPCIDevice represent the VF (IIUC, device is currently a PF)? If so, then I don't see the need for VirtualFunctionPath as a separate field. If we used the VF here instead, I think that would simplify the code in a few places and make it easier to read. Note that the nvpci.NvidiaPCIDevice type allows you to go from the VF to the backing PF via device.SriovInfo.VirtualFunction.PhysicalFunction.

Contributor Author

@JunAr7112 JunAr7112 Jan 20, 2026

I think the issue here is that nvpci doesn't have a built-in way to get the virtual functions from the physical function. That is why I am storing the physical device (via nvdevices, err := m.nvlib.Nvpci.GetGPUs()) as well as the path to the virtual function.

}
devices := []*Device{}
for _, parentDevice := range parentDevices {
  vgpuTypeNumberBytes, err := os.ReadFile(filepath.Join(parentDevice.VirtualFunctionPath, "current_vgpu_type"))
Contributor

As indicated in a prior comment, if parentDevice was just of type nvpci.NvidiaPCIDevice (and represented the VF), this code would be replaced by:

vgpuTypeNumberBytes, err := os.ReadFile(filepath.Join(parentDevice.NvidiaPCIDevice.Path, "current_vgpu_type"))

Contributor Author

See above comment.

// ParentDevice represents an NVIDIA parent PCI device.
type ParentDevice struct {
  *nvpci.NvidiaPCIDevice
  VirtualFunctionPath string
Contributor

Why is this field needed? Shouldn't NvidiaPCIDevice.Path represent the path to the virtual function?

Contributor Author

See above comment. nvpci.NvidiaPCIDevice is storing the physical function.

@JunAr7112 JunAr7112 force-pushed the vfio_changes branch 4 times, most recently from 3754bb7 to cfffa9f Compare January 20, 2026 18:51
}

type nvlibVGPUConfigManager struct {
  nvlib nvlib.Interface
Contributor

Question -- does nvlib.Interface need to exist anymore?

Contributor Author

No, we are no longer using nvlib.Interface. I don't think we need it for any of the other projects either.

Comment on lines +131 to +148
found := false
for _, vgpuTypeId := range supportedVGPUs {
  rawName, ret := vgpuTypeId.GetName()
  if ret != nvml.SUCCESS {
    continue
  }
  name := parseVGPUTypeName(rawName)
  if name == key {
    found = true
    sanitizedConfig[key] = val
    break
  }
  if name == strippedKey {
    found = true
    sanitizedConfig[strippedKey] = val
    break
  }
}
Contributor

To improve readability, what if we first constructed a map named supportedVgpuTypes, like this:

supportedVgpuTypes := map[string]bool{}
for _, vgpu := range supportedVGPUs {
  name, ret := vgpu.GetName()
  if ret != nvml.SUCCESS {
    continue
  }
  name = parseVGPUTypeName(name)
  supportedVgpuTypes[name] = true
}

Then this for loop would simplify to:

Suggested change (replaces the loop quoted above):

if _, ok := supportedVgpuTypes[key]; ok {
  sanitizedConfig[key] = val
} else if _, ok := supportedVgpuTypes[strippedKey]; ok {
  sanitizedConfig[strippedKey] = val
} else {
  return fmt.Errorf("vGPU type %s is not supported on GPU (index=%d, address=%s)", key, gpu, device.Address)
}

Contributor Author

OK, I broke this into two for loops.

Signed-off-by: Arjun <agadiyar@nvidia.com>
3 participants