
Conversation

@rgarcia
Contributor

@rgarcia rgarcia commented Nov 28, 2025

Note

Introduce full GPU/PCI passthrough: device management + APIs, instance integration, initrd NVIDIA support, startup reconciliation, and comprehensive tests/docs.

  • Devices & API:
    • Add device management (lib/devices): discovery, VFIO bind/unbind, persistence, reconciliation, and liveness checks.
    • New REST endpoints (/devices, /devices/{id}, /devices/available) with OpenAPI and generated client updates (see the usage example after this list).
  • Instances:
    • Support attaching devices on create; auto-bind to VFIO and mark attached; auto-unbind on delete.
    • Pass devices to cloud-hypervisor (VmConfig.Devices); include devices in metadata and config disk.
  • System/Initrd:
    • Build initrd with NVIDIA kernel modules and driver libs; init script loads modules, creates device nodes, and injects libs into guest/container when HAS_GPU=1.
    • Kernel/version wiring for NVIDIA assets.
  • Startup/Reconciliation:
    • On API start, reconcile device state, clear orphans, and integrate instance liveness.
  • Wiring & Providers:
    • Wire DeviceManager through app, DI, and tests; adapt paths for device metadata.
  • Tests & Docs:
    • Add extensive GPU E2E, NVML, and inference tests; docs for device/GPU usage and troubleshooting.
  • Schemas/Clients/Deps:
    • Update OpenAPI schemas/clients; add new paths and types; refresh dependencies needed for features.
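As a rough usage sketch of the endpoints above (the base URL, the /instances path, and the exact JSON field names are assumptions, not taken from this PR's OpenAPI spec), registering a discovered GPU and attaching it to a new instance could look like:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	base := "http://localhost:8080" // assumed API address

	// Discover passthrough-capable devices on the host.
	if resp, err := http.Get(base + "/devices/available"); err == nil {
		fmt.Println("available devices:", resp.Status)
		resp.Body.Close()
	}

	// Register one device; the request shape (name + pci_address) is inferred
	// from the Device struct in this PR and may not match the real schema.
	body := []byte(`{"name": "gpu0", "pci_address": "0000:a2:00.0"}`)
	if resp, err := http.Post(base+"/devices", "application/json", bytes.NewReader(body)); err == nil {
		fmt.Println("register device:", resp.Status)
		resp.Body.Close()
	}

	// Create an instance with the device attached; per the summary above, the
	// API binds it to vfio-pci and passes it to cloud-hypervisor via VmConfig.Devices.
	inst := []byte(`{"devices": ["gpu0"]}`)
	if resp, err := http.Post(base+"/instances", "application/json", bytes.NewReader(inst)); err == nil {
		fmt.Println("create instance:", resp.Status)
		resp.Body.Close()
	}
}
```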

Written by Cursor Bugbot for commit 4d23e73. This will update automatically on new commits.

@rgarcia rgarcia requested a review from sjmiller609 November 28, 2025 13:57
@github-actions

github-actions bot commented Nov 28, 2025

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: gpu passthrough
hypeman-go studio

Code was not generated because there was a fatal error.

hypeman-cli studio

Code was not generated because there was a fatal error.

⏳ These are partial results; builds are still running.


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Last updated: 2025-12-16 16:49:58 UTC

@mesa-dot-dev

mesa-dot-dev bot commented Nov 28, 2025

Mesa Description

TL;DR

Implemented comprehensive GPU and PCI device passthrough functionality, enabling virtual machines to directly utilize host hardware.

Why we made these changes

To allow VMs to leverage dedicated hardware resources like GPUs, improving performance for demanding workloads and expanding the capabilities of instances.

What changed?

  • API Endpoints: Added new API handlers (cmd/api/api/devices.go) for managing devices, including listing, discovering, creating, retrieving, and deleting.
  • Device Management (lib/devices):
    • Introduced a Manager for handling passthrough devices, including binding/unbinding via VFIOBinder.
    • Added utilities for PCI device discovery, validation, and detailed information retrieval.
    • Defined core data structures and error types for device management.
    • Included a gpu-reset.sh script for NVIDIA GPU state recovery.
  • Instance Integration:
    • Modified CreateInstance (cmd/api/api/instances.go, lib/instances/create.go) to parse device references, validate, and automatically bind/attach devices to VMs.
    • Updated DeleteInstance (lib/instances/delete.go) to detach and unbind devices during VM cleanup.
    • Extended StoredMetadata and CreateInstanceRequest (lib/instances/types.go) to track attached devices.
  • Dependency Injection: Updated cmd/api/wire.go, cmd/api/wire_gen.go, and lib/providers/providers.go to integrate the new DeviceManager.
  • OpenAPI Specification: openapi.yaml was updated with new schemas and endpoints for device management, and the InstanceCreate schema now supports specifying devices.
  • Testing: Added a new end-to-end test (lib/devices/gpu_e2e_test.go) for GPU passthrough validation and extensive unit tests for device management utilities (lib/devices/manager_test.go).

Validation

  • End-to-end test TestGPUPassthrough validates NVIDIA GPU passthrough, including discovery, registration, VM creation, in-guest verification, and proper driver binding/unbinding.
  • Unit tests cover device name/PCI address validation, device type determination, sysfs path construction, and error handling.
  • TestExecInstanceNonTTY updated for new log retrieval mechanism.

Description generated by Mesa.

@mesa-dot-dev mesa-dot-dev bot left a comment

Performed full review of 9e69646...1f0e661

Analysis

  1. IOMMU Group Safety (HIGH SEVERITY) - The implementation only checks a single device when binding to VFIO, ignoring other devices in the same IOMMU group. This creates a security vulnerability where unintended devices could be accessible to VMs, potentially allowing data exfiltration or system compromise. A sketch of a group-wide check appears after this list.

  2. Unbound Devices in IOMMU Groups (MEDIUM SEVERITY) - The safety check allows devices with no driver bound to pass validation, potentially violating isolation. Groups containing driverless devices should be explicitly rejected unless specifically allowed.

  3. Driver Override Clearing Issues (MEDIUM SEVERITY) - Using "\n" to clear driver_override is non-standard and may not properly clear the override, causing devices to remain bound to vfio-pci or fail to rebind to their original drivers.

  4. Fragile Path Management (MEDIUM SEVERITY) - Device directory path uses parent directory traversal, creating an implicit and brittle path structure that could break with path changes.

  5. Weak Error Handling During Device Unbinding (MEDIUM SEVERITY) - When instances are deleted, errors during device unbinding are only logged as warnings, potentially leaving devices in an inconsistent state without proper recovery.
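As a concrete illustration of finding 1, a group-wide safety check is typically a short sysfs walk. The sketch below is illustrative (function and parameter names are not from this PR) and also covers finding 2 via an explicit allowUnbound flag:

```go
package devices

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkIOMMUGroup is an illustrative sketch (not the code under review): it
// refuses passthrough unless every other device in the target's IOMMU group
// is already bound to vfio-pci, and optionally rejects driverless siblings.
func checkIOMMUGroup(pciAddress string, allowUnbound bool) error {
	groupDir := filepath.Join("/sys/bus/pci/devices", pciAddress, "iommu_group", "devices")
	entries, err := os.ReadDir(groupDir)
	if err != nil {
		return fmt.Errorf("read iommu group for %s: %w", pciAddress, err)
	}
	for _, e := range entries {
		sibling := e.Name()
		if sibling == pciAddress {
			continue
		}
		target, err := os.Readlink(filepath.Join("/sys/bus/pci/devices", sibling, "driver"))
		if os.IsNotExist(err) {
			if allowUnbound {
				continue
			}
			return fmt.Errorf("device %s in the same IOMMU group has no driver bound", sibling)
		}
		if err != nil {
			return fmt.Errorf("read driver for %s: %w", sibling, err)
		}
		if filepath.Base(target) != "vfio-pci" {
			return fmt.Errorf("device %s in the same IOMMU group is bound to %s, not vfio-pci",
				sibling, filepath.Base(target))
		}
	}
	return nil
}
```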

24 files reviewed | 0 comments

Collaborator

@sjmiller609 sjmiller609 left a comment

Mostly nits, but I think worth checking on:

  • Is there any cleanup that should happen at server start, especially for tainted states and per-VM resources?
  • Related: is there any state we save that could always be derived instead?
  • Do we actually want/need the ability to create devices on the host? Why not just always use all available devices, or a configurable list of devices initialized on server startup, or something along those lines?

@@ -0,0 +1,178 @@
#!/bin/bash
#
# gpu-reset.sh - Reset GPU state after failed passthrough tests or hangs
Collaborator

Note to self, compare if any cleanup logic could go on server start.

Contributor Author

Good idea - adding startup reconciliation to this PR. Will detect orphaned AttachedTo states when instances were killed outside the API.

Comment on lines +62 to +103
m.mu.RLock()
defer m.mu.RUnlock()
Collaborator

are locks necessary on read actions like list and get?

Contributor Author

RLock protects against concurrent directory iteration during creates/deletes. Cheap read lock, will add a comment explaining.
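For reference, the locking pattern described in this reply is roughly the following. This is an illustrative sketch with its own toy type, not the PR's manager:

```go
package sketch

import (
	"os"
	"path/filepath"
	"sync"
)

// store is a toy stand-in for the device manager: List iterates an on-disk
// metadata directory under a read lock, so concurrent creates/deletes (which
// take the write lock) cannot change the directory mid-iteration.
type store struct {
	mu  sync.RWMutex
	dir string
}

func (s *store) List() ([]string, error) {
	s.mu.RLock() // cheap shared lock; blocks only writers
	defer s.mu.RUnlock()

	entries, err := os.ReadDir(s.dir)
	if err != nil {
		return nil, err
	}
	ids := make([]string, 0, len(entries))
	for _, e := range entries {
		ids = append(ids, e.Name())
	}
	return ids, nil
}

func (s *store) Delete(id string) error {
	s.mu.Lock() // exclusive lock for mutations
	defer s.mu.Unlock()
	return os.RemoveAll(filepath.Join(s.dir, id))
}
```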

Comment on lines +365 to +753
func (m *manager) saveDevice(device *Device) error {
data, err := json.MarshalIndent(device, "", " ")
if err != nil {
return err
}

return os.WriteFile(m.paths.DeviceMetadata(device.Id), data, 0644)
Collaborator

Note: check on data being saved. Could some information be derived instead of saved?

Contributor Author

BoundToVFIO already derived on read (lines 85, 181, 188). Adding startup reconciliation for AttachedTo - same pattern as CH state lesson.

Comment on lines +17 to +28
type Device struct {
Id string `json:"id"` // cuid2 identifier
Name string `json:"name"` // user-provided globally unique name
Type DeviceType `json:"type"` // gpu or pci
PCIAddress string `json:"pci_address"` // e.g., "0000:a2:00.0"
VendorID string `json:"vendor_id"` // e.g., "10de"
DeviceID string `json:"device_id"` // e.g., "27b8"
IOMMUGroup int `json:"iommu_group"` // IOMMU group number
BoundToVFIO bool `json:"bound_to_vfio"` // whether device is bound to vfio-pci
AttachedTo *string `json:"attached_to"` // instance ID if attached, nil otherwise
CreatedAt time.Time `json:"created_at"`
}
Collaborator

I think this is the information getting saved to the metadata file. Is there anything here that we should derive instead of save? The reasoning for derive versus save: if the state can change outside of our control, deriving is more reliable. We ran into that with CH states, where it was better to just grab the state from the VMM API instead of mirroring the CH state into the metadata file.

Contributor Author

Exactly right. BoundToVFIO derived on read. Adding startup reconciliation to handle stale AttachedTo.
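In sysfs terms, "derived on read" can be as simple as checking the device's driver symlink. A minimal sketch (the helper name is illustrative, not the PR's actual code):

```go
package devices

import (
	"os"
	"path/filepath"
)

// isBoundToVFIO derives the binding state from sysfs instead of trusting
// persisted metadata: the device's driver entry is a symlink to the bound
// driver, so its basename says whether vfio-pci currently owns the device.
func isBoundToVFIO(pciAddress string) (bool, error) {
	target, err := os.Readlink(filepath.Join("/sys/bus/pci/devices", pciAddress, "driver"))
	if os.IsNotExist(err) {
		return false, nil // no driver bound at all
	}
	if err != nil {
		return false, err
	}
	return filepath.Base(target) == "vfio-pci", nil
}
```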

Comment on lines 194 to 195
// Give it a moment to exit
time.Sleep(500 * time.Millisecond)
Collaborator

arbitrary sleep?

Contributor Author

Yeah, replacing with polling loop (pgrep check with timeout).
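A minimal sketch of that replacement, assuming the check shells out to pgrep as mentioned (the helper name and poll interval are illustrative):

```go
package devices

import (
	"fmt"
	"os/exec"
	"time"
)

// waitForProcessExit polls pgrep until no process matches the pattern or the
// timeout elapses, replacing a fixed 500ms sleep with a bounded wait.
func waitForProcessExit(pattern string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		// pgrep exits non-zero when nothing matches, i.e. the process is gone.
		if err := exec.Command("pgrep", "-f", pattern).Run(); err != nil {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("process matching %q still running after %s", pattern, timeout)
}
```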

Comment on lines +175 to +179
// Try systemctl first (works as root)
cmd := exec.Command("systemctl", "stop", "nvidia-persistenced")
Collaborator

Note: review the lifecycle of a GPU device. What one-time setup is needed for a GPU, and what is the per-VM work we need to do? If the one-time setup is what needs special permissions, could we move it to a cmd/ script run by operators to hook a GPU up to the system?

Contributor Author

Great question. One-time setup (register+bind) needs elevated privs, per-VM just passes VFIO group. Will create follow-up for operator tooling separation.


// GetDeviceSysfsPath returns the sysfs path for a PCI device (used by cloud-hypervisor)
func GetDeviceSysfsPath(pciAddress string) string {
return filepath.Join(sysfsDevicesPath, pciAddress) + "/"
Collaborator

Another path example; not sure if we want to move it or not.

Contributor Author

Same reasoning - system path, stays here.

Comment on lines 29 to 31
// devices/
// {id}/
// metadata.json
Collaborator

yeah I think it makes sense for just the data dir stuff in this file

Contributor Author

Agreed - lib/paths for data_dir only.

Rafael Garcia added 7 commits December 13, 2025 21:43
Add foundational types for GPU/PCI device passthrough:
- Device, AvailableDevice, CreateDeviceRequest structs
- Error types (ErrNotFound, ErrInUse, ErrAlreadyExists, etc.)
- Device path helpers in lib/paths
Add low-level device operations:
- discovery.go: Scan PCI bus, detect IOMMU groups, identify GPU devices
- vfio.go: Bind/unbind devices to vfio-pci driver for VM passthrough
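For context on what vfio.go has to do, one standard sysfs sequence for handing a PCI device to vfio-pci looks roughly like this (a sketch of the common driver_override flow, not this PR's actual implementation):

```go
package devices

import (
	"fmt"
	"os"
	"path/filepath"
)

// bindToVFIO sketches the usual driver_override flow: declare the intended
// driver, unbind the current one (if any), then ask the kernel to re-probe.
func bindToVFIO(pciAddress string) error {
	devDir := filepath.Join("/sys/bus/pci/devices", pciAddress)

	// 1. Tell the kernel which driver should claim this device next.
	if err := os.WriteFile(filepath.Join(devDir, "driver_override"), []byte("vfio-pci"), 0200); err != nil {
		return fmt.Errorf("set driver_override: %w", err)
	}

	// 2. Unbind from the current driver, if one is bound.
	if _, err := os.Stat(filepath.Join(devDir, "driver")); err == nil {
		if err := os.WriteFile(filepath.Join(devDir, "driver", "unbind"), []byte(pciAddress), 0200); err != nil {
			return fmt.Errorf("unbind current driver: %w", err)
		}
	}

	// 3. Re-probe so vfio-pci (per the override) picks the device up.
	if err := os.WriteFile("/sys/bus/pci/drivers_probe", []byte(pciAddress), 0200); err != nil {
		return fmt.Errorf("drivers_probe: %w", err)
	}
	return nil
}
```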
Add the main device management logic:
- Manager interface with CRUD operations for devices
- CreateDevice, GetDevice, DeleteDevice, ListDevices
- MarkAttached/MarkDetached for instance lifecycle
- BindToVFIO/UnbindFromVFIO for driver management
- Persistence via JSON metadata files
Add support for NVIDIA GPU passthrough in the VM boot chain:
- versions.go: Add Kernel_20251213 with NVIDIA module/driver lib URLs
- initrd.go: Download and extract NVIDIA kernel modules and driver libs
- init_script.go: Load NVIDIA modules at boot, inject driver libs into containers

This enables containers to use CUDA without bundling driver versions.
Add InstanceLivenessChecker adapter to allow the devices package to query
instance state without circular imports. Used during startup to detect
orphaned device attachments from crashed VMs.

- liveness.go: Adapter implementing devices.InstanceLivenessChecker
- liveness_test.go: Unit tests
- reconcile_test.go: Device reconciliation tests
- types.go: Add Devices field to StoredMetadata and CreateInstanceRequest
Wire up device management throughout the instance lifecycle:
- create.go: Validate devices, auto-bind to VFIO, pass to VM config
- delete.go: Detach devices, auto-unbind from VFIO
- configdisk.go: Add HAS_GPU config flag for GPU instances
- manager.go: Add deviceManager dependency
- providers.go: Add ProvideDeviceManager
- wire.go/wire_gen.go: Wire up DeviceManager in DI
- api.go: Add DeviceManager to ApiService struct
Add REST API for device management and supporting documentation:

API endpoints:
- GET/POST /devices - List and register devices
- GET/DELETE /devices/{id} - Get and delete devices
- GET /devices/available - Discover passthrough-capable devices
- instances.go: Accept devices param in CreateInstance

Documentation:
- GPU.md: GPU passthrough architecture and driver injection
- README.md: Device management usage guide
- scripts/gpu-reset.sh: GPU reset utility

Tests and fixtures:
- gpu_e2e_test.go, gpu_inference_test.go, gpu_module_test.go
- testdata/ollama-cuda/ - CUDA test container

Also adds build-preview-cli Makefile target.
Rafael Garcia added 2 commits December 14, 2025 04:08
The initrd now includes NVIDIA kernel modules, firmware, and driver
libraries (~238MB total). With 512MB VMs, the kernel couldn't unpack
the initrd into tmpfs without running out of space.

Increase test VM memory from 512MB to 2GB to provide sufficient room
for the initrd contents plus normal VM operation.
@sjmiller609 sjmiller609 self-requested a review December 15, 2025 20:22
The HAS_GPU flag was being set unconditionally when any device was
attached, regardless of device type. This would trigger NVIDIA module
loading in the VM init script even for non-GPU PCI devices.

Now iterates through attached devices and checks each device's type,
only setting HAS_GPU=1 if at least one device is DeviceTypeGPU.
runningInstances[instanceID] = true
}
}
}

Bug: False positive warnings for instances without GPU devices

The detectSuspiciousVMMProcesses function uses ListAllInstanceDevices to build the set of known running instances, but ListAllInstanceDevices only returns instances that have devices attached (line 75: if len(inst.Devices) > 0). This causes legitimate cloud-hypervisor processes for instances without GPU devices to be incorrectly flagged as "untracked" with remediation advice to run gpu-reset.sh. Operators following this advice could inadvertently disrupt running VMs that simply don't use GPU passthrough.


…PU devices

detectSuspiciousVMMProcesses was using ListAllInstanceDevices to build the
set of known running instances, but that method only returns instances with
devices attached. This caused legitimate cloud-hypervisor processes for
instances without GPU passthrough to be incorrectly flagged as 'untracked'
with misleading advice to run gpu-reset.sh.

Fix: Call IsInstanceRunning directly for each discovered process instead of
pre-building a map from ListAllInstanceDevices. This correctly identifies
all running instances regardless of device attachment.
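Roughly the shape of the fix described above, as a sketch (the Liveness interface and the procs map stand in for the real instance manager and process discovery):

```go
package instances

import (
	"context"
	"log/slog"
)

// Liveness is a stand-in for the instance manager's liveness query.
type Liveness interface {
	IsInstanceRunning(ctx context.Context, instanceID string) bool
}

// detectSuspiciousVMMProcesses sketches the corrected approach: every
// discovered cloud-hypervisor process is checked against IsInstanceRunning,
// so instances without GPU devices are no longer flagged as untracked.
// procs maps PID -> instance ID parsed from the process command line
// ("" if none could be determined); how it is built is out of scope here.
func detectSuspiciousVMMProcesses(ctx context.Context, lv Liveness, procs map[int]string) {
	for pid, instanceID := range procs {
		if instanceID != "" && lv.IsInstanceRunning(ctx, instanceID) {
			continue // tracked and running: not suspicious
		}
		slog.WarnContext(ctx, "untracked cloud-hypervisor process",
			"pid", pid, "instance_id", instanceID)
	}
}
```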
Collaborator

@sjmiller609 sjmiller609 left a comment

Looks great!

Comment on lines +163 to +165
### IOMMU Requirements

- **IOMMU must be enabled** in BIOS and kernel (`intel_iommu=on` or `amd_iommu=on`)
Collaborator

check + log warning if not available on startup maybe

Comment on lines +171 to +175
The following kernel modules must be loaded:
```bash
modprobe vfio_pci
modprobe vfio_iommu_type1
```
Collaborator

Same with this (regarding startup warning for debuggability)

Comment on lines +158 to +162
For best GPU performance, enable huge pages on the host:

```bash
echo 1024 > /proc/sys/vm/nr_hugepages
```
Collaborator

add to warning startup log and / or install script?

Comment on lines 654 to 656
// detectSuspiciousVMMProcesses logs warnings about cloud-hypervisor processes
// that don't match known instances. This is log-only (no killing).
func (m *manager) detectSuspiciousVMMProcesses(ctx context.Context, stats *reconcileStats) {
Collaborator

This seems like it's not related to devices and so should probably live in a different module, like the instance module maybe.

// Add NVIDIA kernel modules (for GPU passthrough support)
if err := m.addNvidiaModules(ctx, rootfsDir, arch); err != nil {
// Log but don't fail - NVIDIA modules are optional (not available on all architectures)
fmt.Printf("initrd: skipping NVIDIA modules: %v\n", err)
Collaborator

should this use context logger?

// Add userspace driver libraries (libcuda.so, libnvidia-ml.so, nvidia-smi, etc.)
// These are injected into containers at boot time - see lib/devices/GPU.md
if err := m.addNvidiaDriverLibs(ctx, rootfsDir, arch); err != nil {
fmt.Printf("initrd: warning: could not add nvidia driver libs: %v\n", err)
Collaborator

context logger?

return fmt.Errorf("extract nvidia driver libs: %w", err)
}

fmt.Printf("initrd: added NVIDIA driver libraries from %s\n", url)
Collaborator

context logger?

Collaborator

btw when using context logger, should show up automatically with

hypeman logs --source=hypeman <VM id>

gpuSection := ""
for _, deviceID := range inst.Devices {
device, err := m.deviceManager.GetDevice(ctx, deviceID)
if err == nil && device.Type == devices.DeviceTypeGPU {
Collaborator

Looks like you already noticed, but we should probably not load GPU drivers if it's not a GPU device.

Rafael Garcia added 2 commits December 16, 2025 16:15
Check and warn on startup if:
- IOMMU is not enabled (no groups in /sys/kernel/iommu_groups)
- VFIO modules not loaded (vfio_pci, vfio_iommu_type1)
- Huge pages not configured (info hint when devices exist)
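A sketch of what those startup checks might look like (the sysfs/procfs paths are the standard locations named in the docs above; the function name and log wording are illustrative):

```go
package devices

import (
	"log/slog"
	"os"
	"path/filepath"
	"strings"
)

// warnOnMissingPassthroughPrereqs logs (but does not fail) when the host is
// missing GPU passthrough prerequisites, mirroring the checks described above.
func warnOnMissingPassthroughPrereqs() {
	// IOMMU: with intel_iommu=on / amd_iommu=on there is at least one group.
	if groups, err := os.ReadDir("/sys/kernel/iommu_groups"); err != nil || len(groups) == 0 {
		slog.Warn("IOMMU appears disabled; enable it in BIOS and kernel cmdline (intel_iommu=on or amd_iommu=on)")
	}

	// VFIO modules: loaded modules show up under /sys/module.
	for _, mod := range []string{"vfio_pci", "vfio_iommu_type1"} {
		if _, err := os.Stat(filepath.Join("/sys/module", mod)); err != nil {
			slog.Warn("VFIO kernel module not loaded", "module", mod, "hint", "modprobe "+mod)
		}
	}

	// Huge pages: only a hint, since they are a performance tuning knob.
	if data, err := os.ReadFile("/proc/sys/vm/nr_hugepages"); err == nil &&
		strings.TrimSpace(string(data)) == "0" {
		slog.Info("huge pages not configured; consider: echo 1024 > /proc/sys/vm/nr_hugepages")
	}
}
```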
This function is about instance lifecycle, not device management.
Moving it to the instances module where it belongs.

The implementation uses IsInstanceRunning (which queries all instances)
rather than ListAllInstanceDevices (which only returns instances with
devices) to avoid false positives for non-GPU VMs.
@rgarcia
Contributor Author

rgarcia commented Dec 16, 2025

hey steven, addressed your feedback:

startup warnings for gpu prerequisites (README.md:165, README.md:175, GPU.md:162)

  • good call, added startup warnings for iommu and vfio modules. also added a hint for huge pages when devices are registered. a260091

detectSuspiciousVMMProcesses location (manager.go:656)

  • yeah you're right, moved this to lib/instances/liveness.go where it belongs. also fixes the false positive bug from cursor bot since it now properly checks all instances, not just ones with devices. 8d610e9

context loggers in initrd (initrd.go:65, 208, 255)

  • switched all the fmt.Printf calls to context loggers. since this is server-level logging (not per-instance), it'll just go to the main hypeman log but with proper structured format. 05fa5a4

HAS_GPU flag (configdisk.go:115)

  • already fixed in cc0efea, now only sets HAS_GPU=1 when there's actually a gpu device attached 👍

Replace fmt.Printf calls with proper context loggers so messages
appear in structured logs with consistent formatting.
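The before/after here is roughly the following, assuming a slog-style logger carried in the context; the loggerFromContext helper and the package name are placeholders for however hypeman actually does this:

```go
package initrd

import (
	"context"
	"log/slog"
)

// Before: fmt.Printf("initrd: skipping NVIDIA modules: %v\n", err)
// After (sketch): pull a structured logger from the context so the message
// lands in the main hypeman log with consistent fields.
func logSkipNvidiaModules(ctx context.Context, err error) {
	logger := loggerFromContext(ctx)
	logger.WarnContext(ctx, "initrd: skipping NVIDIA modules", "error", err)
}

// loggerFromContext is a stand-in for however hypeman stores its logger in
// the context; falling back to the default logger keeps the sketch runnable.
func loggerFromContext(ctx context.Context) *slog.Logger {
	if l, ok := ctx.Value(loggerKey{}).(*slog.Logger); ok {
		return l
	}
	return slog.Default()
}

type loggerKey struct{}
```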
@rgarcia rgarcia merged commit 4b0c8f3 into main Dec 16, 2025
3 of 4 checks passed
@rgarcia rgarcia deleted the devices branch December 16, 2025 16:39