
Conversation

@rgarcia
Contributor

@rgarcia rgarcia commented Nov 28, 2025

Note

Introduce full GPU/PCI passthrough: device management + APIs, instance integration, initrd NVIDIA support, startup reconciliation, and comprehensive tests/docs.

  • Devices & API:
    • Add device management (lib/devices): discovery, VFIO bind/unbind, persistence, reconciliation, and liveness checks.
    • New REST endpoints (/devices, /devices/{id}, /devices/available) with OpenAPI and generated client updates (see the usage example after this list).
  • Instances:
    • Support attaching devices on create; auto-bind to VFIO and mark attached; auto-unbind on delete.
    • Pass devices to cloud-hypervisor (VmConfig.Devices); include devices in metadata and config disk.
  • System/Initrd:
    • Build initrd with NVIDIA kernel modules and driver libs; init script loads modules, creates device nodes, and injects libs into guest/container when HAS_GPU=1.
    • Kernel/version wiring for NVIDIA assets.
  • Startup/Reconciliation:
    • On API start, reconcile device state, clear orphans, and integrate instance liveness.
  • Wiring & Providers:
    • Wire DeviceManager through app, DI, and tests; adapt paths for device metadata.
  • Tests & Docs:
    • Add extensive GPU E2E, NVML, and inference tests; docs for device/GPU usage and troubleshooting.
  • Schemas/Clients/Deps:
    • Update OpenAPI schemas/clients; add new paths and types; refresh dependencies needed for features.
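As a rough usage sketch of the endpoints above (the base URL, the /instances path, and the exact JSON field names are assumptions, not taken from this PR's OpenAPI spec), registering a discovered GPU and attaching it to a new instance could look like:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	base := "http://localhost:8080" // assumed API address

	// Discover passthrough-capable devices on the host.
	if resp, err := http.Get(base + "/devices/available"); err == nil {
		fmt.Println("available devices:", resp.Status)
		resp.Body.Close()
	}

	// Register one device; the request shape (name + pci_address) is inferred
	// from the Device struct in this PR and may not match the real schema.
	body := []byte(`{"name": "gpu0", "pci_address": "0000:a2:00.0"}`)
	if resp, err := http.Post(base+"/devices", "application/json", bytes.NewReader(body)); err == nil {
		fmt.Println("register device:", resp.Status)
		resp.Body.Close()
	}

	// Create an instance with the device attached; per the summary above, the
	// API binds it to vfio-pci and passes it to cloud-hypervisor via VmConfig.Devices.
	inst := []byte(`{"devices": ["gpu0"]}`)
	if resp, err := http.Post(base+"/instances", "application/json", bytes.NewReader(inst)); err == nil {
		fmt.Println("create instance:", resp.Status)
		resp.Body.Close()
	}
}
```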

Written by Cursor Bugbot for commit 4d23e73. This will update automatically on new commits.

@rgarcia rgarcia requested a review from sjmiller609 November 28, 2025 13:57
@github-actions

github-actions bot commented Nov 28, 2025

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: gpu passthrough
hypeman-go studio

Code was not generated because there was a fatal error.

hypeman-cli studio

Code was not generated because there was a fatal error.

⏳ These are partial results; builds are still running.


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Last updated: 2025-12-16 16:49:58 UTC

@mesa-dot-dev

mesa-dot-dev bot commented Nov 28, 2025

Mesa Description

TL;DR

Implemented comprehensive GPU and PCI device passthrough functionality, enabling virtual machines to directly utilize host hardware.

Why we made these changes

To allow VMs to leverage dedicated hardware resources like GPUs, improving performance for demanding workloads and expanding the capabilities of instances.

What changed?

  • API Endpoints: Added new API handlers (cmd/api/api/devices.go) for managing devices, including listing, discovering, creating, retrieving, and deleting.
  • Device Management (lib/devices):
    • Introduced a Manager for handling passthrough devices, including binding/unbinding via VFIOBinder.
    • Added utilities for PCI device discovery, validation, and detailed information retrieval.
    • Defined core data structures and error types for device management.
    • Included a gpu-reset.sh script for NVIDIA GPU state recovery.
  • Instance Integration:
    • Modified CreateInstance (cmd/api/api/instances.go, lib/instances/create.go) to parse device references, validate, and automatically bind/attach devices to VMs.
    • Updated DeleteInstance (lib/instances/delete.go) to detach and unbind devices during VM cleanup.
    • Extended StoredMetadata and CreateInstanceRequest (lib/instances/types.go) to track attached devices.
  • Dependency Injection: Updated cmd/api/wire.go, cmd/api/wire_gen.go, and lib/providers/providers.go to integrate the new DeviceManager.
  • OpenAPI Specification: openapi.yaml was updated with new schemas and endpoints for device management, and the InstanceCreate schema now supports specifying devices.
  • Testing: Added a new end-to-end test (lib/devices/gpu_e2e_test.go) for GPU passthrough validation and extensive unit tests for device management utilities (lib/devices/manager_test.go).

Validation

  • End-to-end test TestGPUPassthrough validates NVIDIA GPU passthrough, including discovery, registration, VM creation, in-guest verification, and proper driver binding/unbinding.
  • Unit tests cover device name/PCI address validation, device type determination, sysfs path construction, and error handling.
  • TestExecInstanceNonTTY updated for new log retrieval mechanism.

Description generated by Mesa.

@mesa-dot-dev mesa-dot-dev bot left a comment

Performed full review of 9e69646...1f0e661

Analysis

  1. IOMMU Group Safety (HIGH SEVERITY) - The implementation only checks a single device when binding to VFIO, ignoring other devices in the same IOMMU group. This creates a security vulnerability where unintended devices could be accessible to VMs, potentially allowing data exfiltration or system compromise. A sketch of a group-wide check appears after this list.

  2. Unbound Devices in IOMMU Groups (MEDIUM SEVERITY) - The safety check allows devices with no driver bound to pass validation, potentially violating isolation. Groups containing driverless devices should be explicitly rejected unless specifically allowed.

  3. Driver Override Clearing Issues (MEDIUM SEVERITY) - Using "\n" to clear driver_override is non-standard and may not properly clear the override, causing devices to remain bound to vfio-pci or fail to rebind to their original drivers.

  4. Fragile Path Management (MEDIUM SEVERITY) - Device directory path uses parent directory traversal, creating an implicit and brittle path structure that could break with path changes.

  5. Weak Error Handling During Device Unbinding (MEDIUM SEVERITY) - When instances are deleted, errors during device unbinding are only logged as warnings, potentially leaving devices in an inconsistent state without proper recovery.
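As a concrete illustration of finding 1, a group-wide safety check is typically a short sysfs walk. The sketch below is illustrative (function and parameter names are not from this PR) and also covers finding 2 via an explicit allowUnbound flag:

```go
package devices

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkIOMMUGroup is an illustrative sketch (not the code under review): it
// refuses passthrough unless every other device in the target's IOMMU group
// is already bound to vfio-pci, and optionally rejects driverless siblings.
func checkIOMMUGroup(pciAddress string, allowUnbound bool) error {
	groupDir := filepath.Join("/sys/bus/pci/devices", pciAddress, "iommu_group", "devices")
	entries, err := os.ReadDir(groupDir)
	if err != nil {
		return fmt.Errorf("read iommu group for %s: %w", pciAddress, err)
	}
	for _, e := range entries {
		sibling := e.Name()
		if sibling == pciAddress {
			continue
		}
		target, err := os.Readlink(filepath.Join("/sys/bus/pci/devices", sibling, "driver"))
		if os.IsNotExist(err) {
			if allowUnbound {
				continue
			}
			return fmt.Errorf("device %s in the same IOMMU group has no driver bound", sibling)
		}
		if err != nil {
			return fmt.Errorf("read driver for %s: %w", sibling, err)
		}
		if filepath.Base(target) != "vfio-pci" {
			return fmt.Errorf("device %s in the same IOMMU group is bound to %s, not vfio-pci",
				sibling, filepath.Base(target))
		}
	}
	return nil
}
```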

24 files reviewed | 0 comments

Collaborator

@sjmiller609 sjmiller609 left a comment

Mostly nits, but I think worth checking on:

  • Is there any cleanup that should happen at server start, especially for tainted states and per-VM resources?
  • Related: is there any state we save that could always be derived instead?
  • Do we actually want/need the ability to create devices on the host? Why not just always use all available devices, or a configurable list of devices initialized on server startup, or something along those lines?

@@ -0,0 +1,178 @@
#!/bin/bash
#
# gpu-reset.sh - Reset GPU state after failed passthrough tests or hangs
Collaborator

Note to self, compare if any cleanup logic could go on server start.

Contributor Author

Good idea - adding startup reconciliation to this PR. Will detect orphaned AttachedTo states when instances were killed outside the API.

Comment on lines +62 to +103
m.mu.RLock()
defer m.mu.RUnlock()
Collaborator

are locks necessary on read actions like list and get?

Contributor Author

RLock protects against concurrent directory iteration during creates/deletes. Cheap read lock, will add a comment explaining.
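For reference, the locking pattern described in this reply is roughly the following. This is an illustrative sketch with its own toy type, not the PR's manager:

```go
package sketch

import (
	"os"
	"path/filepath"
	"sync"
)

// store is a toy stand-in for the device manager: List iterates an on-disk
// metadata directory under a read lock, so concurrent creates/deletes (which
// take the write lock) cannot change the directory mid-iteration.
type store struct {
	mu  sync.RWMutex
	dir string
}

func (s *store) List() ([]string, error) {
	s.mu.RLock() // cheap shared lock; blocks only writers
	defer s.mu.RUnlock()

	entries, err := os.ReadDir(s.dir)
	if err != nil {
		return nil, err
	}
	ids := make([]string, 0, len(entries))
	for _, e := range entries {
		ids = append(ids, e.Name())
	}
	return ids, nil
}

func (s *store) Delete(id string) error {
	s.mu.Lock() // exclusive lock for mutations
	defer s.mu.Unlock()
	return os.RemoveAll(filepath.Join(s.dir, id))
}
```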

Comment on lines +365 to +753
func (m *manager) saveDevice(device *Device) error {
data, err := json.MarshalIndent(device, "", " ")
if err != nil {
return err
}

return os.WriteFile(m.paths.DeviceMetadata(device.Id), data, 0644)
Collaborator

Note: check on data being saved. Could some information be derived instead of saved?

Contributor Author

BoundToVFIO already derived on read (lines 85, 181, 188). Adding startup reconciliation for AttachedTo - same pattern as CH state lesson.

Comment on lines +17 to +28
type Device struct {
Id string `json:"id"` // cuid2 identifier
Name string `json:"name"` // user-provided globally unique name
Type DeviceType `json:"type"` // gpu or pci
PCIAddress string `json:"pci_address"` // e.g., "0000:a2:00.0"
VendorID string `json:"vendor_id"` // e.g., "10de"
DeviceID string `json:"device_id"` // e.g., "27b8"
IOMMUGroup int `json:"iommu_group"` // IOMMU group number
BoundToVFIO bool `json:"bound_to_vfio"` // whether device is bound to vfio-pci
AttachedTo *string `json:"attached_to"` // instance ID if attached, nil otherwise
CreatedAt time.Time `json:"created_at"`
}
Collaborator

I think this is the information getting saved to the metadata file. Is there anything here that we should derive instead of save? The reasoning for derive versus save: if the state can change outside of our control, deriving is more reliable. We ran into that with CH states, where it was better to just grab the state from the VMM API instead of mirroring the CH state into the metadata file.

Contributor Author

Exactly right. BoundToVFIO derived on read. Adding startup reconciliation to handle stale AttachedTo.
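In sysfs terms, "derived on read" can be as simple as checking the device's driver symlink. A minimal sketch (the helper name is illustrative, not the PR's actual code):

```go
package devices

import (
	"os"
	"path/filepath"
)

// isBoundToVFIO derives the binding state from sysfs instead of trusting
// persisted metadata: the device's driver entry is a symlink to the bound
// driver, so its basename says whether vfio-pci currently owns the device.
func isBoundToVFIO(pciAddress string) (bool, error) {
	target, err := os.Readlink(filepath.Join("/sys/bus/pci/devices", pciAddress, "driver"))
	if os.IsNotExist(err) {
		return false, nil // no driver bound at all
	}
	if err != nil {
		return false, err
	}
	return filepath.Base(target) == "vfio-pci", nil
}
```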

Comment on lines 194 to 195
// Give it a moment to exit
time.Sleep(500 * time.Millisecond)
Collaborator

arbitrary sleep?

Contributor Author

Yeah, replacing with polling loop (pgrep check with timeout).
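A minimal sketch of that replacement, assuming the check shells out to pgrep as mentioned (the helper name and poll interval are illustrative):

```go
package devices

import (
	"fmt"
	"os/exec"
	"time"
)

// waitForProcessExit polls pgrep until no process matches the pattern or the
// timeout elapses, replacing a fixed 500ms sleep with a bounded wait.
func waitForProcessExit(pattern string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		// pgrep exits non-zero when nothing matches, i.e. the process is gone.
		if err := exec.Command("pgrep", "-f", pattern).Run(); err != nil {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("process matching %q still running after %s", pattern, timeout)
}
```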

Comment on lines +175 to +179
// Try systemctl first (works as root)
cmd := exec.Command("systemctl", "stop", "nvidia-persistenced")
Collaborator

Note: review the lifecycle of a GPU device. What one-time setup is needed for a GPU, and what is the per-VM work we need to do? If the one-time setup is what needs special permissions, could we move it to a cmd/ script run by operators to hook a GPU up to the system?

Contributor Author

Great question. One-time setup (register+bind) needs elevated privs, per-VM just passes VFIO group. Will create follow-up for operator tooling separation.


// GetDeviceSysfsPath returns the sysfs path for a PCI device (used by cloud-hypervisor)
func GetDeviceSysfsPath(pciAddress string) string {
return filepath.Join(sysfsDevicesPath, pciAddress) + "/"
Collaborator

Another path example; not sure if we want to move it or not.

Contributor Author

Same reasoning - system path, stays here.

Comment on lines 29 to 31
// devices/
// {id}/
// metadata.json
Collaborator

yeah I think it makes sense for just the data dir stuff in this file

Contributor Author

Agreed - lib/paths for data_dir only.

Rafael Garcia added 7 commits December 13, 2025 21:43
Add foundational types for GPU/PCI device passthrough:
- Device, AvailableDevice, CreateDeviceRequest structs
- Error types (ErrNotFound, ErrInUse, ErrAlreadyExists, etc.)
- Device path helpers in lib/paths
Add low-level device operations:
- discovery.go: Scan PCI bus, detect IOMMU groups, identify GPU devices
- vfio.go: Bind/unbind devices to vfio-pci driver for VM passthrough
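For context on what vfio.go has to do, one standard sysfs sequence for handing a PCI device to vfio-pci looks roughly like this (a sketch of the common driver_override flow, not this PR's actual implementation):

```go
package devices

import (
	"fmt"
	"os"
	"path/filepath"
)

// bindToVFIO sketches the usual driver_override flow: declare the intended
// driver, unbind the current one (if any), then ask the kernel to re-probe.
func bindToVFIO(pciAddress string) error {
	devDir := filepath.Join("/sys/bus/pci/devices", pciAddress)

	// 1. Tell the kernel which driver should claim this device next.
	if err := os.WriteFile(filepath.Join(devDir, "driver_override"), []byte("vfio-pci"), 0200); err != nil {
		return fmt.Errorf("set driver_override: %w", err)
	}

	// 2. Unbind from the current driver, if one is bound.
	if _, err := os.Stat(filepath.Join(devDir, "driver")); err == nil {
		if err := os.WriteFile(filepath.Join(devDir, "driver", "unbind"), []byte(pciAddress), 0200); err != nil {
			return fmt.Errorf("unbind current driver: %w", err)
		}
	}

	// 3. Re-probe so vfio-pci (per the override) picks the device up.
	if err := os.WriteFile("/sys/bus/pci/drivers_probe", []byte(pciAddress), 0200); err != nil {
		return fmt.Errorf("drivers_probe: %w", err)
	}
	return nil
}
```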
Add the main device management logic:
- Manager interface with CRUD operations for devices
- CreateDevice, GetDevice, DeleteDevice, ListDevices
- MarkAttached/MarkDetached for instance lifecycle
- BindToVFIO/UnbindFromVFIO for driver management
- Persistence via JSON metadata files
Add support for NVIDIA GPU passthrough in the VM boot chain:
- versions.go: Add Kernel_20251213 with NVIDIA module/driver lib URLs
- initrd.go: Download and extract NVIDIA kernel modules and driver libs
- init_script.go: Load NVIDIA modules at boot, inject driver libs into containers

This enables containers to use CUDA without bundling driver versions.
Add InstanceLivenessChecker adapter to allow the devices package to query
instance state without circular imports. Used during startup to detect
orphaned device attachments from crashed VMs.

- liveness.go: Adapter implementing devices.InstanceLivenessChecker
- liveness_test.go: Unit tests
- reconcile_test.go: Device reconciliation tests
- types.go: Add Devices field to StoredMetadata and CreateInstanceRequest
Wire up device management throughout the instance lifecycle:
- create.go: Validate devices, auto-bind to VFIO, pass to VM config
- delete.go: Detach devices, auto-unbind from VFIO
- configdisk.go: Add HAS_GPU config flag for GPU instances
- manager.go: Add deviceManager dependency
- providers.go: Add ProvideDeviceManager
- wire.go/wire_gen.go: Wire up DeviceManager in DI
- api.go: Add DeviceManager to ApiService struct
Add REST API for device management and supporting documentation:

API endpoints:
- GET/POST /devices - List and register devices
- GET/DELETE /devices/{id} - Get and delete devices
- GET /devices/available - Discover passthrough-capable devices
- instances.go: Accept devices param in CreateInstance

Documentation:
- GPU.md: GPU passthrough architecture and driver injection
- README.md: Device management usage guide
- scripts/gpu-reset.sh: GPU reset utility

Tests and fixtures:
- gpu_e2e_test.go, gpu_inference_test.go, gpu_module_test.go
- testdata/ollama-cuda/ - CUDA test container

Also adds build-preview-cli Makefile target.
Rafael Garcia added 2 commits December 14, 2025 04:08
The initrd now includes NVIDIA kernel modules, firmware, and driver
libraries (~238MB total). With 512MB VMs, the kernel couldn't unpack
the initrd into tmpfs without running out of space.

Increase test VM memory from 512MB to 2GB to provide sufficient room
for the initrd contents plus normal VM operation.
@sjmiller609 sjmiller609 self-requested a review December 15, 2025 20:22
The HAS_GPU flag was being set unconditionally when any device was
attached, regardless of device type. This would trigger NVIDIA module
loading in the VM init script even for non-GPU PCI devices.

Now iterates through attached devices and checks each device's type,
only setting HAS_GPU=1 if at least one device is DeviceTypeGPU.
runningInstances[instanceID] = true
}
}
}

Bug: False positive warnings for instances without GPU devices

The detectSuspiciousVMMProcesses function uses ListAllInstanceDevices to build the set of known running instances, but ListAllInstanceDevices only returns instances that have devices attached (line 75: if len(inst.Devices) > 0). This causes legitimate cloud-hypervisor processes for instances without GPU devices to be incorrectly flagged as "untracked" with remediation advice to run gpu-reset.sh. Operators following this advice could inadvertently disrupt running VMs that simply don't use GPU passthrough.


…PU devices

detectSuspiciousVMMProcesses was using ListAllInstanceDevices to build the
set of known running instances, but that method only returns instances with
devices attached. This caused legitimate cloud-hypervisor processes for
instances without GPU passthrough to be incorrectly flagged as 'untracked'
with misleading advice to run gpu-reset.sh.

Fix: Call IsInstanceRunning directly for each discovered process instead of
pre-building a map from ListAllInstanceDevices. This correctly identifies
all running instances regardless of device attachment.
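Roughly the shape of the fix described above, as a sketch (the Liveness interface and the procs map stand in for the real instance manager and process discovery):

```go
package instances

import (
	"context"
	"log/slog"
)

// Liveness is a stand-in for the instance manager's liveness query.
type Liveness interface {
	IsInstanceRunning(ctx context.Context, instanceID string) bool
}

// detectSuspiciousVMMProcesses sketches the corrected approach: every
// discovered cloud-hypervisor process is checked against IsInstanceRunning,
// so instances without GPU devices are no longer flagged as untracked.
// procs maps PID -> instance ID parsed from the process command line
// ("" if none could be determined); how it is built is out of scope here.
func detectSuspiciousVMMProcesses(ctx context.Context, lv Liveness, procs map[int]string) {
	for pid, instanceID := range procs {
		if instanceID != "" && lv.IsInstanceRunning(ctx, instanceID) {
			continue // tracked and running: not suspicious
		}
		slog.WarnContext(ctx, "untracked cloud-hypervisor process",
			"pid", pid, "instance_id", instanceID)
	}
}
```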
Collaborator

@sjmiller609 sjmiller609 left a comment

Looks great!

Comment on lines +163 to +165
### IOMMU Requirements

- **IOMMU must be enabled** in BIOS and kernel (`intel_iommu=on` or `amd_iommu=on`)
Collaborator

check + log warning if not available on startup maybe

Comment on lines +171 to +175
The following kernel modules must be loaded:
```bash
modprobe vfio_pci
modprobe vfio_iommu_type1
```
Collaborator

Same with this (regarding startup warning for debuggability)

Comment on lines +158 to +162
For best GPU performance, enable huge pages on the host:

```bash
echo 1024 > /proc/sys/vm/nr_hugepages
```
Collaborator

add to warning startup log and / or install script?

Comment on lines 654 to 656
// detectSuspiciousVMMProcesses logs warnings about cloud-hypervisor processes
// that don't match known instances. This is log-only (no killing).
func (m *manager) detectSuspiciousVMMProcesses(ctx context.Context, stats *reconcileStats) {
Collaborator

This seems like it's not related to devices and so should probably live in a different module, like the instance module maybe.

// Add NVIDIA kernel modules (for GPU passthrough support)
if err := m.addNvidiaModules(ctx, rootfsDir, arch); err != nil {
// Log but don't fail - NVIDIA modules are optional (not available on all architectures)
fmt.Printf("initrd: skipping NVIDIA modules: %v\n", err)
Collaborator

should this use context logger?

// Add userspace driver libraries (libcuda.so, libnvidia-ml.so, nvidia-smi, etc.)
// These are injected into containers at boot time - see lib/devices/GPU.md
if err := m.addNvidiaDriverLibs(ctx, rootfsDir, arch); err != nil {
fmt.Printf("initrd: warning: could not add nvidia driver libs: %v\n", err)
Collaborator

context logger?

return fmt.Errorf("extract nvidia driver libs: %w", err)
}

fmt.Printf("initrd: added NVIDIA driver libraries from %s\n", url)
Collaborator

context logger?

Collaborator

btw when using context logger, should show up automatically with

hypeman logs --source=hypeman <VM id>

gpuSection := ""
for _, deviceID := range inst.Devices {
device, err := m.deviceManager.GetDevice(ctx, deviceID)
if err == nil && device.Type == devices.DeviceTypeGPU {
Collaborator

Looks like you already noticed, but we should probably not load GPU drivers if it's not a GPU device.

Rafael Garcia added 2 commits December 16, 2025 16:15
Check and warn on startup if:
- IOMMU is not enabled (no groups in /sys/kernel/iommu_groups)
- VFIO modules not loaded (vfio_pci, vfio_iommu_type1)
- Huge pages not configured (info hint when devices exist)
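A sketch of what those startup checks might look like (the sysfs/procfs paths are the standard locations named in the docs above; the function name and log wording are illustrative):

```go
package devices

import (
	"log/slog"
	"os"
	"path/filepath"
	"strings"
)

// warnOnMissingPassthroughPrereqs logs (but does not fail) when the host is
// missing GPU passthrough prerequisites, mirroring the checks described above.
func warnOnMissingPassthroughPrereqs() {
	// IOMMU: with intel_iommu=on / amd_iommu=on there is at least one group.
	if groups, err := os.ReadDir("/sys/kernel/iommu_groups"); err != nil || len(groups) == 0 {
		slog.Warn("IOMMU appears disabled; enable it in BIOS and kernel cmdline (intel_iommu=on or amd_iommu=on)")
	}

	// VFIO modules: loaded modules show up under /sys/module.
	for _, mod := range []string{"vfio_pci", "vfio_iommu_type1"} {
		if _, err := os.Stat(filepath.Join("/sys/module", mod)); err != nil {
			slog.Warn("VFIO kernel module not loaded", "module", mod, "hint", "modprobe "+mod)
		}
	}

	// Huge pages: only a hint, since they are a performance tuning knob.
	if data, err := os.ReadFile("/proc/sys/vm/nr_hugepages"); err == nil &&
		strings.TrimSpace(string(data)) == "0" {
		slog.Info("huge pages not configured; consider: echo 1024 > /proc/sys/vm/nr_hugepages")
	}
}
```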
This function is about instance lifecycle, not device management.
Moving it to the instances module where it belongs.

The implementation uses IsInstanceRunning (which queries all instances)
rather than ListAllInstanceDevices (which only returns instances with
devices) to avoid false positives for non-GPU VMs.
@rgarcia
Contributor Author

rgarcia commented Dec 16, 2025

hey steven, addressed your feedback:

startup warnings for gpu prerequisites (README.md:165, README.md:175, GPU.md:162)

  • good call, added startup warnings for iommu and vfio modules. also added a hint for huge pages when devices are registered. a260091

detectSuspiciousVMMProcesses location (manager.go:656)

  • yeah you're right, moved this to lib/instances/liveness.go where it belongs. also fixes the false positive bug from cursor bot since it now properly checks all instances, not just ones with devices. 8d610e9

context loggers in initrd (initrd.go:65, 208, 255)

  • switched all the fmt.Printf calls to context loggers. since this is server-level logging (not per-instance), it'll just go to the main hypeman log but with proper structured format. 05fa5a4

HAS_GPU flag (configdisk.go:115)

  • already fixed in cc0efea, now only sets HAS_GPU=1 when there's actually a gpu device attached 👍

Replace fmt.Printf calls with proper context loggers so messages
appear in structured logs with consistent formatting.
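The before/after here is roughly the following, assuming a slog-style logger carried in the context; the loggerFromContext helper and the package name are placeholders for however hypeman actually does this:

```go
package initrd

import (
	"context"
	"log/slog"
)

// Before: fmt.Printf("initrd: skipping NVIDIA modules: %v\n", err)
// After (sketch): pull a structured logger from the context so the message
// lands in the main hypeman log with consistent fields.
func logSkipNvidiaModules(ctx context.Context, err error) {
	logger := loggerFromContext(ctx)
	logger.WarnContext(ctx, "initrd: skipping NVIDIA modules", "error", err)
}

// loggerFromContext is a stand-in for however hypeman stores its logger in
// the context; falling back to the default logger keeps the sketch runnable.
func loggerFromContext(ctx context.Context) *slog.Logger {
	if l, ok := ctx.Value(loggerKey{}).(*slog.Logger); ok {
		return l
	}
	return slog.Default()
}

type loggerKey struct{}
```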
@rgarcia rgarcia merged commit 4b0c8f3 into main Dec 16, 2025
3 of 4 checks passed
@rgarcia rgarcia deleted the devices branch December 16, 2025 16:39