gpu passthrough #17

Conversation
Stainless preview builds
This PR will update the hypeman-go studio and the hypeman-cli studio. These are partial results; builds are still running. This comment is auto-generated by GitHub Actions and is kept up to date as you push.
Mesa Description

TL;DR: Implemented comprehensive GPU and PCI device passthrough functionality, enabling virtual machines to directly utilize host hardware.

Why we made these changes: To allow VMs to leverage dedicated hardware resources like GPUs, improving performance for demanding workloads and expanding the capabilities of instances.

What changed?

Validation

Description generated by Mesa.
Performed full review of 9e69646...1f0e661
Analysis
- IOMMU Group Safety (HIGH SEVERITY) - The implementation only checks a single device when binding to VFIO, ignoring other devices in the same IOMMU group. This creates a security vulnerability where unintended devices could be accessible to VMs, potentially allowing data exfiltration or system compromise (see the group-enumeration sketch after this list).
- Unbound Devices in IOMMU Groups (MEDIUM SEVERITY) - The safety check allows devices with no driver bound to pass validation, potentially violating isolation. Groups containing driverless devices should be explicitly rejected unless specifically allowed.
- Driver Override Clearing Issues (MEDIUM SEVERITY) - Using "\n" to clear driver_override is non-standard and may not properly clear the override, causing devices to remain bound to vfio-pci or fail to rebind to their original drivers.
- Fragile Path Management (MEDIUM SEVERITY) - Device directory path uses parent directory traversal, creating an implicit and brittle path structure that could break with path changes.
- Weak Error Handling During Device Unbinding (MEDIUM SEVERITY) - When instances are deleted, errors during device unbinding are only logged as warnings, potentially leaving devices in an inconsistent state without proper recovery.
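For the first two findings, here is a minimal sketch of what a group-level check could look like, assuming the standard sysfs layout (/sys/bus/pci/devices/&lt;addr&gt;/iommu_group/devices); the function names are hypothetical and not taken from this PR:

```go
package devices

import (
	"fmt"
	"os"
	"path/filepath"
)

// listIOMMUGroupDevices returns the PCI addresses of every device that shares
// an IOMMU group with the given device (standard sysfs layout).
func listIOMMUGroupDevices(pciAddress string) ([]string, error) {
	groupDir := filepath.Join("/sys/bus/pci/devices", pciAddress, "iommu_group", "devices")
	entries, err := os.ReadDir(groupDir)
	if err != nil {
		return nil, fmt.Errorf("read iommu group for %s: %w", pciAddress, err)
	}
	addrs := make([]string, 0, len(entries))
	for _, e := range entries {
		addrs = append(addrs, e.Name())
	}
	return addrs, nil
}

// checkGroupSafeForVFIO rejects binding when any sibling device in the group is
// bound to a non-VFIO driver, or has no driver at all (per the findings above).
func checkGroupSafeForVFIO(pciAddress string) error {
	addrs, err := listIOMMUGroupDevices(pciAddress)
	if err != nil {
		return err
	}
	for _, addr := range addrs {
		if addr == pciAddress {
			continue // the device we are about to bind
		}
		link, err := os.Readlink(filepath.Join("/sys/bus/pci/devices", addr, "driver"))
		if err != nil {
			// No driver symlink: driverless sibling in the group.
			return fmt.Errorf("device %s in the same IOMMU group has no driver bound", addr)
		}
		if filepath.Base(link) != "vfio-pci" {
			return fmt.Errorf("device %s in the same IOMMU group is bound to %s", addr, filepath.Base(link))
		}
	}
	return nil
}
```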
Slash Commands:
- /review - Request a full code review
- /review latest - Review only changes since the last review
- /describe - Generate PR description. This will update the PR body or issue comment depending on your configuration
- /help - Get help with Mesa commands and configuration options

24 files reviewed | 0 comments
sjmiller609
left a comment
Mostly nits, but I think worth checking on:
- Any cleanup that should happen at server start, especially for tainted states and per-VM resources?
- Related: any state we save that could always be derived instead of saved?
- Do we actually want/need the ability to create devices on the host? Why not just always use all available devices, a configurable list of devices initialized on server startup, or something along those lines?
@@ -0,0 +1,178 @@
#!/bin/bash
#
# gpu-reset.sh - Reset GPU state after failed passthrough tests or hangs
Note to self: check whether any cleanup logic could go on server start.
Good idea - adding startup reconciliation to this PR. Will detect orphaned AttachedTo states when instances were killed outside the API.
m.mu.RLock()
defer m.mu.RUnlock()
are locks necessary on read actions like list and get?
RLock protects against concurrent directory iteration during creates/deletes. Cheap read lock, will add a comment explaining.
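A minimal sketch of the locking pattern being discussed, assuming the manager guards its metadata directory with a sync.RWMutex; the field and method names here are illustrative, not the PR's actual code:

```go
package devices

import (
	"os"
	"sync"
)

// Illustrative only, not the PR's implementation: manager fields are assumed.
type manager struct {
	mu         sync.RWMutex
	devicesDir string // e.g. <data_dir>/devices (hypothetical layout)
}

// listDeviceIDs holds the read lock for the whole directory walk; writers
// (create/delete) take mu.Lock(), so readers never observe a half-created or
// half-deleted device directory.
func (m *manager) listDeviceIDs() ([]string, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()

	entries, err := os.ReadDir(m.devicesDir)
	if err != nil {
		return nil, err
	}
	ids := make([]string, 0, len(entries))
	for _, e := range entries {
		if e.IsDir() {
			ids = append(ids, e.Name())
		}
	}
	return ids, nil
}
```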
func (m *manager) saveDevice(device *Device) error {
	data, err := json.MarshalIndent(device, "", " ")
	if err != nil {
		return err
	}

	return os.WriteFile(m.paths.DeviceMetadata(device.Id), data, 0644)
Note: check on data being saved. Could some information be derived instead of saved?
BoundToVFIO already derived on read (lines 85, 181, 188). Adding startup reconciliation for AttachedTo - same pattern as CH state lesson.
type Device struct {
	Id          string     `json:"id"`            // cuid2 identifier
	Name        string     `json:"name"`          // user-provided globally unique name
	Type        DeviceType `json:"type"`          // gpu or pci
	PCIAddress  string     `json:"pci_address"`   // e.g., "0000:a2:00.0"
	VendorID    string     `json:"vendor_id"`     // e.g., "10de"
	DeviceID    string     `json:"device_id"`     // e.g., "27b8"
	IOMMUGroup  int        `json:"iommu_group"`   // IOMMU group number
	BoundToVFIO bool       `json:"bound_to_vfio"` // whether device is bound to vfio-pci
	AttachedTo  *string    `json:"attached_to"`   // instance ID if attached, nil otherwise
	CreatedAt   time.Time  `json:"created_at"`
}
I think this is the information getting saved to the metadata file. Is there anything here that we should derive instead of save? The reasoning for derive versus save: if the state can change outside our control, deriving is more reliable. Ran into that with CH states; better to just grab the state from the VMM API instead of mirroring the CH state to the metadata file.
Exactly right. BoundToVFIO derived on read. Adding startup reconciliation to handle stale AttachedTo.
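As a sketch of what deriving BoundToVFIO on read could look like, assuming the standard sysfs driver symlink; the helper name is hypothetical and not from the PR:

```go
package devices

import (
	"os"
	"path/filepath"
)

// isBoundToVFIO derives the bound-to-vfio state from sysfs instead of trusting
// a stored flag: /sys/bus/pci/devices/<addr>/driver is a symlink to the bound
// driver, so its basename tells us who currently owns the device.
func isBoundToVFIO(pciAddress string) bool {
	link, err := os.Readlink(filepath.Join("/sys/bus/pci/devices", pciAddress, "driver"))
	if err != nil {
		return false // no driver bound (or device missing)
	}
	return filepath.Base(link) == "vfio-pci"
}
```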
lib/devices/vfio.go (outdated)

// Give it a moment to exit
time.Sleep(500 * time.Millisecond)
arbitrary sleep?
Yeah, replacing with polling loop (pgrep check with timeout).
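A minimal sketch of such a polling loop with a timeout, assuming the wait is for process exit checked via pgrep (as mentioned above); names and intervals are illustrative:

```go
package devices

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// waitForProcessExit polls pgrep until no process matches the pattern or the
// timeout elapses, replacing a fixed time.Sleep with a bounded wait.
func waitForProcessExit(ctx context.Context, pattern string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		// pgrep exits non-zero when nothing matches, i.e. the process is gone.
		if err := exec.CommandContext(ctx, "pgrep", "-f", pattern).Run(); err != nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("process matching %q still running after %s", pattern, timeout)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(50 * time.Millisecond):
		}
	}
}
```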
// Try systemctl first (works as root)
cmd := exec.Command("systemctl", "stop", "nvidia-persistenced")
Note: review the lifecycle of a GPU device. Is there one-time setup needed for a GPU, and what's the per-VM work we need to do? If the one-time setup is what needs special permissions, could we move it to a cmd/ script run by operators to hook up a GPU to the system?
Great question. One-time setup (register+bind) needs elevated privs, per-VM just passes VFIO group. Will create follow-up for operator tooling separation.
// GetDeviceSysfsPath returns the sysfs path for a PCI device (used by cloud-hypervisor)
func GetDeviceSysfsPath(pciAddress string) string {
	return filepath.Join(sysfsDevicesPath, pciAddress) + "/"
Another path example; not sure if we want to move it or not.
Same reasoning - system path, stays here.
lib/paths/paths.go (outdated)

// devices/
//   {id}/
//     metadata.json
Yeah, I think it makes sense to keep just the data dir stuff in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - lib/paths for data_dir only.
Add foundational types for GPU/PCI device passthrough:
- Device, AvailableDevice, CreateDeviceRequest structs
- Error types (ErrNotFound, ErrInUse, ErrAlreadyExists, etc.)
- Device path helpers in lib/paths
Add low-level device operations:
- discovery.go: Scan PCI bus, detect IOMMU groups, identify GPU devices
- vfio.go: Bind/unbind devices to vfio-pci driver for VM passthrough
Add the main device management logic:
- Manager interface with CRUD operations for devices - CreateDevice, GetDevice, DeleteDevice, ListDevices
- MarkAttached/MarkDetached for instance lifecycle
- BindToVFIO/UnbindFromVFIO for driver management
- Persistence via JSON metadata files
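The commit lists the operations but not their signatures, so the following is only a guess at the interface shape; parameter and return types are assumptions, and Device/CreateDeviceRequest are the structs from the foundational-types commit above:

```go
package devices

import "context"

// Hypothetical shape of the Manager interface described in the commit above;
// the actual signatures in lib/devices may differ.
type Manager interface {
	CreateDevice(ctx context.Context, req CreateDeviceRequest) (*Device, error)
	GetDevice(ctx context.Context, id string) (*Device, error)
	ListDevices(ctx context.Context) ([]*Device, error)
	DeleteDevice(ctx context.Context, id string) error

	// Instance lifecycle.
	MarkAttached(ctx context.Context, deviceID, instanceID string) error
	MarkDetached(ctx context.Context, deviceID string) error

	// Driver management.
	BindToVFIO(ctx context.Context, deviceID string) error
	UnbindFromVFIO(ctx context.Context, deviceID string) error
}
```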
Add support for NVIDIA GPU passthrough in the VM boot chain:
- versions.go: Add Kernel_20251213 with NVIDIA module/driver lib URLs
- initrd.go: Download and extract NVIDIA kernel modules and driver libs
- init_script.go: Load NVIDIA modules at boot, inject driver libs into containers

This enables containers to use CUDA without bundling driver versions.
Add InstanceLivenessChecker adapter to allow the devices package to query instance state without circular imports. Used during startup to detect orphaned device attachments from crashed VMs.
- liveness.go: Adapter implementing devices.InstanceLivenessChecker
- liveness_test.go: Unit tests
- reconcile_test.go: Device reconciliation tests
- types.go: Add Devices field to StoredMetadata and CreateInstanceRequest
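A sketch of the adapter pattern this commit describes, assuming the devices package declares the interface it consumes while the adapter in liveness.go implements it; only the IsInstanceRunning name is grounded in this thread, the exact signature is an assumption:

```go
// In lib/devices: the consumer declares the narrow interface it needs, so the
// devices package never imports the instances package (no import cycle).
package devices

import "context"

type InstanceLivenessChecker interface {
	// IsInstanceRunning reports whether the given instance still has a live VMM,
	// letting startup reconciliation spot orphaned AttachedTo entries.
	IsInstanceRunning(ctx context.Context, instanceID string) (bool, error)
}
```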
Wire up device management throughout the instance lifecycle:
- create.go: Validate devices, auto-bind to VFIO, pass to VM config
- delete.go: Detach devices, auto-unbind from VFIO
- configdisk.go: Add HAS_GPU config flag for GPU instances
- manager.go: Add deviceManager dependency
- providers.go: Add ProvideDeviceManager
- wire.go/wire_gen.go: Wire up DeviceManager in DI
- api.go: Add DeviceManager to ApiService struct
Add REST API for device management and supporting documentation:
API endpoints:
- GET/POST /devices - List and register devices
- GET/DELETE /devices/{id} - Get and delete devices
- GET /devices/available - Discover passthrough-capable devices
- instances.go: Accept devices param in CreateInstance
Documentation:
- GPU.md: GPU passthrough architecture and driver injection
- README.md: Device management usage guide
- scripts/gpu-reset.sh: GPU reset utility
Tests and fixtures:
- gpu_e2e_test.go, gpu_inference_test.go, gpu_module_test.go
- testdata/ollama-cuda/ - CUDA test container
Also adds build-preview-cli Makefile target.
The initrd now includes NVIDIA kernel modules, firmware, and driver libraries (~238MB total). With 512MB VMs, the kernel couldn't unpack the initrd into tmpfs without running out of space. Increase test VM memory from 512MB to 2GB to provide sufficient room for the initrd contents plus normal VM operation.
The HAS_GPU flag was being set unconditionally when any device was attached, regardless of device type. This would trigger NVIDIA module loading in the VM init script even for non-GPU PCI devices. Now iterates through attached devices and checks each device's type, only setting HAS_GPU=1 if at least one device is DeviceTypeGPU.
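A minimal sketch of that per-device-type check, consistent with the configdisk.go excerpt quoted later in this thread; the helper name and import path are placeholders:

```go
package instances

import (
	"context"

	"example.com/hypeman/lib/devices" // import path is a placeholder
)

// hasGPUDevice reports whether any attached device ID resolves to a GPU-type
// device; only then should the config disk set HAS_GPU=1.
func hasGPUDevice(ctx context.Context, dm devices.Manager, deviceIDs []string) bool {
	for _, id := range deviceIDs {
		d, err := dm.GetDevice(ctx, id)
		if err != nil {
			continue // skip devices we can't resolve rather than assuming GPU
		}
		if d.Type == devices.DeviceTypeGPU {
			return true
		}
	}
	return false
}
```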
lib/devices/manager.go (outdated)

			runningInstances[instanceID] = true
		}
	}
}
Bug: False positive warnings for instances without GPU devices
The detectSuspiciousVMMProcesses function uses ListAllInstanceDevices to build the set of known running instances, but ListAllInstanceDevices only returns instances that have devices attached (line 75: if len(inst.Devices) > 0). This causes legitimate cloud-hypervisor processes for instances without GPU devices to be incorrectly flagged as "untracked" with remediation advice to run gpu-reset.sh. Operators following this advice could inadvertently disrupt running VMs that simply don't use GPU passthrough.
…PU devices

detectSuspiciousVMMProcesses was using ListAllInstanceDevices to build the set of known running instances, but that method only returns instances with devices attached. This caused legitimate cloud-hypervisor processes for instances without GPU passthrough to be incorrectly flagged as 'untracked' with misleading advice to run gpu-reset.sh.

Fix: Call IsInstanceRunning directly for each discovered process instead of pre-building a map from ListAllInstanceDevices. This correctly identifies all running instances regardless of device attachment.
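A sketch of that fix as a standalone helper, assuming per-process discovery happens elsewhere; everything except the IsInstanceRunning name is illustrative:

```go
package instances

import (
	"context"
	"log/slog"
)

// vmmProcess pairs a discovered cloud-hypervisor PID with the instance ID it
// claims to belong to (parsed from its command line); the discovery step
// itself is out of scope for this sketch.
type vmmProcess struct {
	PID        int
	InstanceID string
}

// warnUntrackedVMMs flags processes whose instance is not running according to
// the instance store, regardless of whether that instance has devices attached.
func warnUntrackedVMMs(ctx context.Context, log *slog.Logger,
	isInstanceRunning func(ctx context.Context, instanceID string) (bool, error),
	procs []vmmProcess) {
	for _, p := range procs {
		running, err := isInstanceRunning(ctx, p.InstanceID)
		if err != nil || running {
			continue // known/running instance (GPU or not): not suspicious
		}
		log.Warn("cloud-hypervisor process does not match a known running instance",
			"pid", p.PID, "instance_id", p.InstanceID)
	}
}
```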
sjmiller609
left a comment
Looks great!
### IOMMU Requirements

- **IOMMU must be enabled** in BIOS and kernel (`intel_iommu=on` or `amd_iommu=on`)
check + log warning if not available on startup maybe
The following kernel modules must be loaded:

```bash
modprobe vfio_pci
modprobe vfio_iommu_type1
```
Same with this (regarding startup warning for debuggability)
For best GPU performance, enable huge pages on the host:

```bash
echo 1024 > /proc/sys/vm/nr_hugepages
```
Add to the startup warning log and/or the install script?
lib/devices/manager.go (outdated)

// detectSuspiciousVMMProcesses logs warnings about cloud-hypervisor processes
// that don't match known instances. This is log-only (no killing).
func (m *manager) detectSuspiciousVMMProcesses(ctx context.Context, stats *reconcileStats) {
This seems like it's not related to devices and so should probably live in a different module, like the instance module maybe.
lib/system/initrd.go (outdated)

// Add NVIDIA kernel modules (for GPU passthrough support)
if err := m.addNvidiaModules(ctx, rootfsDir, arch); err != nil {
	// Log but don't fail - NVIDIA modules are optional (not available on all architectures)
	fmt.Printf("initrd: skipping NVIDIA modules: %v\n", err)
should this use context logger?
lib/system/initrd.go (outdated)

// Add userspace driver libraries (libcuda.so, libnvidia-ml.so, nvidia-smi, etc.)
// These are injected into containers at boot time - see lib/devices/GPU.md
if err := m.addNvidiaDriverLibs(ctx, rootfsDir, arch); err != nil {
	fmt.Printf("initrd: warning: could not add nvidia driver libs: %v\n", err)
context logger?
lib/system/initrd.go (outdated)

	return fmt.Errorf("extract nvidia driver libs: %w", err)
}

fmt.Printf("initrd: added NVIDIA driver libraries from %s\n", url)
context logger?
BTW, when using the context logger, messages should show up automatically with:
hypeman logs --source=hypeman <VM id>
gpuSection := ""
for _, deviceID := range inst.Devices {
	device, err := m.deviceManager.GetDevice(ctx, deviceID)
	if err == nil && device.Type == devices.DeviceTypeGPU {
Looks like you already noticed, but we should probably not load GPU drivers if it's not a GPU device.
Check and warn on startup if:
- IOMMU is not enabled (no groups in /sys/kernel/iommu_groups)
- VFIO modules not loaded (vfio_pci, vfio_iommu_type1)
- Huge pages not configured (info hint when devices exist)
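A minimal sketch of those startup checks, assuming the standard Linux paths; the function itself is illustrative and log-only:

```go
package devices

import (
	"log/slog"
	"os"
	"strconv"
	"strings"
)

// warnOnMissingGPUPrerequisites logs warnings for missing passthrough
// prerequisites but never fails startup. Paths are the standard Linux
// locations; the function is a sketch, not the PR's implementation.
func warnOnMissingGPUPrerequisites(log *slog.Logger, haveDevices bool) {
	// IOMMU: /sys/kernel/iommu_groups is empty when the IOMMU is disabled.
	if groups, err := os.ReadDir("/sys/kernel/iommu_groups"); err != nil || len(groups) == 0 {
		log.Warn("IOMMU appears disabled; enable intel_iommu=on or amd_iommu=on for passthrough")
	}
	// VFIO modules: visible under /sys/module when loaded.
	for _, mod := range []string{"vfio_pci", "vfio_iommu_type1"} {
		if _, err := os.Stat("/sys/module/" + mod); err != nil {
			log.Warn("VFIO kernel module not loaded", "module", mod)
		}
	}
	// Huge pages: informational hint, only when devices exist.
	if haveDevices {
		if data, err := os.ReadFile("/proc/sys/vm/nr_hugepages"); err == nil {
			if n, _ := strconv.Atoi(strings.TrimSpace(string(data))); n == 0 {
				log.Info("huge pages not configured; consider echo 1024 > /proc/sys/vm/nr_hugepages")
			}
		}
	}
}
```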
This function is about instance lifecycle, not device management. Moving it to the instances module where it belongs. The implementation uses IsInstanceRunning (which queries all instances) rather than ListAllInstanceDevices (which only returns instances with devices) to avoid false positives for non-GPU VMs.
Hey Steven, addressed your feedback:
- Startup warnings for GPU prerequisites (README.md:165, README.md:175, GPU.md:162)
- detectSuspiciousVMMProcesses location (manager.go:656)
- Context loggers in initrd (initrd.go:65, 208, 255)
- HAS_GPU flag (configdisk.go:115)
Replace fmt.Printf calls with proper context loggers so messages appear in structured logs with consistent formatting.
Note
Introduce full GPU/PCI passthrough: device management + APIs, instance integration, initrd NVIDIA support, startup reconciliation, and comprehensive tests/docs.
- Device management (lib/devices): discovery, VFIO bind/unbind, persistence, reconciliation, and liveness checks.
- Device APIs (/devices, /devices/{id}, /devices/available) with OpenAPI and generated client updates.
- Instance integration (VmConfig.Devices); include devices in metadata and config disk.
- GPU instances set HAS_GPU=1.
- Wire DeviceManager through app, DI, and tests; adapt paths for device metadata.

Written by Cursor Bugbot for commit 4d23e73. This will update automatically on new commits.