Commit 4b0c8f3

rgarcia and Rafael Garcia authored
gpu passthrough (#17)
* feat(devices): add lib/devices package types, errors, and paths

  Add foundational types for GPU/PCI device passthrough:
  - Device, AvailableDevice, CreateDeviceRequest structs
  - Error types (ErrNotFound, ErrInUse, ErrAlreadyExists, etc.)
  - Device path helpers in lib/paths

* feat(devices): add PCI device discovery and VFIO binding

  Add low-level device operations:
  - discovery.go: Scan PCI bus, detect IOMMU groups, identify GPU devices
  - vfio.go: Bind/unbind devices to vfio-pci driver for VM passthrough

* feat(devices): add device manager core

  Add the main device management logic:
  - Manager interface with CRUD operations for devices
  - CreateDevice, GetDevice, DeleteDevice, ListDevices
  - MarkAttached/MarkDetached for instance lifecycle
  - BindToVFIO/UnbindFromVFIO for driver management
  - Persistence via JSON metadata files

* feat(system): add kernel/initrd NVIDIA GPU support

  Add support for NVIDIA GPU passthrough in the VM boot chain:
  - versions.go: Add Kernel_20251213 with NVIDIA module/driver lib URLs
  - initrd.go: Download and extract NVIDIA kernel modules and driver libs
  - init_script.go: Load NVIDIA modules at boot, inject driver libs into containers

  This enables containers to use CUDA without bundling driver versions.

* feat(instances): add instance liveness checker for device reconciliation

  Add InstanceLivenessChecker adapter to allow the devices package to query instance state without circular imports. Used during startup to detect orphaned device attachments from crashed VMs.

  - liveness.go: Adapter implementing devices.InstanceLivenessChecker
  - liveness_test.go: Unit tests
  - reconcile_test.go: Device reconciliation tests
  - types.go: Add Devices field to StoredMetadata and CreateInstanceRequest

* feat(instances): integrate devices with instance lifecycle

  Wire up device management throughout the instance lifecycle:
  - create.go: Validate devices, auto-bind to VFIO, pass to VM config
  - delete.go: Detach devices, auto-unbind from VFIO
  - configdisk.go: Add HAS_GPU config flag for GPU instances
  - manager.go: Add deviceManager dependency
  - providers.go: Add ProvideDeviceManager
  - wire.go/wire_gen.go: Wire up DeviceManager in DI
  - api.go: Add DeviceManager to ApiService struct

* feat(api): add devices API endpoints and documentation

  Add REST API for device management and supporting documentation:

  API endpoints:
  - GET/POST /devices - List and register devices
  - GET/DELETE /devices/{id} - Get and delete devices
  - GET /devices/available - Discover passthrough-capable devices
  - instances.go: Accept devices param in CreateInstance

  Documentation:
  - GPU.md: GPU passthrough architecture and driver injection
  - README.md: Device management usage guide
  - scripts/gpu-reset.sh: GPU reset utility

  Tests and fixtures:
  - gpu_e2e_test.go, gpu_inference_test.go, gpu_module_test.go
  - testdata/ollama-cuda/ - CUDA test container

  Also adds build-preview-cli Makefile target.

* test: increase VM memory to 2GB to accommodate large initrd

  The initrd now includes NVIDIA kernel modules, firmware, and driver libraries (~238MB total). With 512MB VMs, the kernel couldn't unpack the initrd into tmpfs without running out of space.

  Increase test VM memory from 512MB to 2GB to provide sufficient room for the initrd contents plus normal VM operation.

* remove slop test

* remove outdated comment

* markattached bug

* remove preview script

* fix(configdisk): only set HAS_GPU=1 for actual GPU devices

  The HAS_GPU flag was being set unconditionally when any device was attached, regardless of device type. This would trigger NVIDIA module loading in the VM init script even for non-GPU PCI devices.

  Now iterates through attached devices and checks each device's type, only setting HAS_GPU=1 if at least one device is DeviceTypeGPU.

* fix(devices): prevent false positive warnings for instances without GPU devices

  detectSuspiciousVMMProcesses was using ListAllInstanceDevices to build the set of known running instances, but that method only returns instances with devices attached. This caused legitimate cloud-hypervisor processes for instances without GPU passthrough to be incorrectly flagged as 'untracked' with misleading advice to run gpu-reset.sh.

  Fix: Call IsInstanceRunning directly for each discovered process instead of pre-building a map from ListAllInstanceDevices. This correctly identifies all running instances regardless of device attachment.

* devices: add startup validation warnings for GPU prerequisites

  Check and warn on startup if:
  - IOMMU is not enabled (no groups in /sys/kernel/iommu_groups)
  - VFIO modules not loaded (vfio_pci, vfio_iommu_type1)
  - Huge pages not configured (info hint when devices exist)

* instances: move detectSuspiciousVMMProcesses to liveness.go

  This function is about instance lifecycle, not device management. Moving it to the instances module where it belongs.

  The implementation uses IsInstanceRunning (which queries all instances) rather than ListAllInstanceDevices (which only returns instances with devices) to avoid false positives for non-GPU VMs.

* system: use context loggers in initrd building

  Replace fmt.Printf calls with proper context loggers so messages appear in structured logs with consistent formatting.

---------

Co-authored-by: Rafael Garcia <[email protected]>
1 parent bd634e5 commit 4b0c8f3
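
The fix(configdisk) entry in the commit message describes setting HAS_GPU=1 only when at least one attached device is a GPU. The configdisk.go change itself is not shown in this view, so the following is only a minimal sketch of that check. The `Device` struct shape, the `DeviceTypeGeneric` constant, and the `hasGPU` helper are assumptions made for illustration; only the name `DeviceTypeGPU` comes from the commit message.

```go
package main

import "fmt"

// DeviceType and Device are simplified stand-ins for the types in lib/devices.
// Their real definitions are not visible in this commit view.
type DeviceType string

const (
	DeviceTypeGPU     DeviceType = "gpu"
	DeviceTypeGeneric DeviceType = "pci" // hypothetical non-GPU type
)

type Device struct {
	Name string
	Type DeviceType
}

// hasGPU (hypothetical helper) reports whether at least one attached
// device is a GPU. An empty or nil slice yields false.
func hasGPU(attached []Device) bool {
	for _, d := range attached {
		if d.Type == DeviceTypeGPU {
			return true
		}
	}
	return false
}

func main() {
	nicOnly := []Device{{Name: "nic0", Type: DeviceTypeGeneric}}
	mixed := []Device{
		{Name: "nic0", Type: DeviceTypeGeneric},
		{Name: "gpu0", Type: DeviceTypeGPU},
	}

	// Only the mixed set should cause HAS_GPU=1 to be written.
	fmt.Println("nicOnly:", hasGPU(nicOnly))
	fmt.Println("mixed:", hasGPU(mixed))
}
```

With a check of this shape, attaching only a non-GPU PCI device would leave the flag unset, so the VM init script would not attempt to load NVIDIA modules.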

46 files changed, +7224 -168 lines changed

Makefile

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 SHELL := /bin/bash
-.PHONY: oapi-generate generate-vmm-client generate-wire generate-all dev build test install-tools gen-jwt download-ch-binaries download-ch-spec ensure-ch-binaries build-caddy-binaries build-caddy ensure-caddy-binaries release-prep clean
+.PHONY: oapi-generate generate-vmm-client generate-wire generate-all dev build test install-tools gen-jwt download-ch-binaries download-ch-spec ensure-ch-binaries build-caddy-binaries build-caddy ensure-caddy-binaries build-preview-cli release-prep clean
 
 # Directory where local binaries will be installed
 BIN_DIR ?= $(CURDIR)/bin

cmd/api/api/api.go

Lines changed: 4 additions & 0 deletions
@@ -2,6 +2,7 @@ package api
 
 import (
 	"github.com/onkernel/hypeman/cmd/api/config"
+	"github.com/onkernel/hypeman/lib/devices"
 	"github.com/onkernel/hypeman/lib/images"
 	"github.com/onkernel/hypeman/lib/ingress"
 	"github.com/onkernel/hypeman/lib/instances"
@@ -17,6 +18,7 @@ type ApiService struct {
 	InstanceManager instances.Manager
 	VolumeManager volumes.Manager
 	NetworkManager network.Manager
+	DeviceManager devices.Manager
 	IngressManager ingress.Manager
 }
 
@@ -29,6 +31,7 @@ func New(
 	instanceManager instances.Manager,
 	volumeManager volumes.Manager,
 	networkManager network.Manager,
+	deviceManager devices.Manager,
 	ingressManager ingress.Manager,
 ) *ApiService {
 	return &ApiService{
@@ -37,6 +40,7 @@ func New(
 		InstanceManager: instanceManager,
 		VolumeManager: volumeManager,
 		NetworkManager: networkManager,
+		DeviceManager: deviceManager,
 		IngressManager: ingressManager,
 	}
 }

cmd/api/api/api_test.go

Lines changed: 4 additions & 1 deletion
@@ -9,6 +9,7 @@ import (
 	"time"
 
 	"github.com/onkernel/hypeman/cmd/api/config"
+	"github.com/onkernel/hypeman/lib/devices"
 	"github.com/onkernel/hypeman/lib/images"
 	"github.com/onkernel/hypeman/lib/instances"
 	mw "github.com/onkernel/hypeman/lib/middleware"
@@ -34,11 +35,12 @@ func newTestService(t *testing.T) *ApiService {
 
 	systemMgr := system.NewManager(p)
 	networkMgr := network.NewManager(p, cfg, nil)
+	deviceMgr := devices.NewManager(p)
 	volumeMgr := volumes.NewManager(p, 0, nil) // 0 = unlimited storage
 	limits := instances.ResourceLimits{
 		MaxOverlaySize: 100 * 1024 * 1024 * 1024, // 100GB
 	}
-	instanceMgr := instances.NewManager(p, imageMgr, systemMgr, networkMgr, volumeMgr, limits, nil, nil)
+	instanceMgr := instances.NewManager(p, imageMgr, systemMgr, networkMgr, deviceMgr, volumeMgr, limits, nil, nil)
 
 	// Register cleanup for orphaned Cloud Hypervisor processes
 	t.Cleanup(func() {
@@ -50,6 +52,7 @@ func newTestService(t *testing.T) *ApiService {
 		ImageManager: imageMgr,
 		InstanceManager: instanceMgr,
 		VolumeManager: volumeMgr,
+		DeviceManager: deviceMgr,
 	}
 }
 

cmd/api/api/devices.go

Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
+package api
+
+import (
+	"context"
+	"errors"
+
+	"github.com/onkernel/hypeman/lib/devices"
+	"github.com/onkernel/hypeman/lib/oapi"
+)
+
+// ListDevices returns all registered devices
+func (s *ApiService) ListDevices(ctx context.Context, request oapi.ListDevicesRequestObject) (oapi.ListDevicesResponseObject, error) {
+	deviceList, err := s.DeviceManager.ListDevices(ctx)
+	if err != nil {
+		return oapi.ListDevices500JSONResponse{
+			Code: "internal_error",
+			Message: err.Error(),
+		}, nil
+	}
+
+	result := make([]oapi.Device, len(deviceList))
+	for i, d := range deviceList {
+		result[i] = deviceToOAPI(d)
+	}
+
+	return oapi.ListDevices200JSONResponse(result), nil
+}
+
+// ListAvailableDevices discovers passthrough-capable devices on the host
+func (s *ApiService) ListAvailableDevices(ctx context.Context, request oapi.ListAvailableDevicesRequestObject) (oapi.ListAvailableDevicesResponseObject, error) {
+	available, err := s.DeviceManager.ListAvailableDevices(ctx)
+	if err != nil {
+		return oapi.ListAvailableDevices500JSONResponse{
+			Code: "internal_error",
+			Message: err.Error(),
+		}, nil
+	}
+
+	result := make([]oapi.AvailableDevice, len(available))
+	for i, d := range available {
+		result[i] = availableDeviceToOAPI(d)
+	}
+
+	return oapi.ListAvailableDevices200JSONResponse(result), nil
+}
+
+// CreateDevice registers a new device for passthrough
+func (s *ApiService) CreateDevice(ctx context.Context, request oapi.CreateDeviceRequestObject) (oapi.CreateDeviceResponseObject, error) {
+	var name string
+	if request.Body.Name != nil {
+		name = *request.Body.Name
+	}
+	req := devices.CreateDeviceRequest{
+		Name: name,
+		PCIAddress: request.Body.PciAddress,
+	}
+
+	device, err := s.DeviceManager.CreateDevice(ctx, req)
+	if err != nil {
+		switch {
+		case errors.Is(err, devices.ErrInvalidName):
+			return oapi.CreateDevice400JSONResponse{
+				Code: "invalid_name",
+				Message: err.Error(),
+			}, nil
+		case errors.Is(err, devices.ErrInvalidPCIAddress):
+			return oapi.CreateDevice400JSONResponse{
+				Code: "invalid_pci_address",
+				Message: err.Error(),
+			}, nil
+		case errors.Is(err, devices.ErrDeviceNotFound):
+			return oapi.CreateDevice404JSONResponse{
+				Code: "device_not_found",
+				Message: err.Error(),
+			}, nil
+		case errors.Is(err, devices.ErrAlreadyExists), errors.Is(err, devices.ErrNameExists):
+			return oapi.CreateDevice409JSONResponse{
+				Code: "conflict",
+				Message: err.Error(),
+			}, nil
+		default:
+			return oapi.CreateDevice500JSONResponse{
+				Code: "internal_error",
+				Message: err.Error(),
+			}, nil
+		}
+	}
+
+	return oapi.CreateDevice201JSONResponse(deviceToOAPI(*device)), nil
+}
+
+// GetDevice returns a device by ID or name
+func (s *ApiService) GetDevice(ctx context.Context, request oapi.GetDeviceRequestObject) (oapi.GetDeviceResponseObject, error) {
+	device, err := s.DeviceManager.GetDevice(ctx, request.Id)
+	if err != nil {
+		if errors.Is(err, devices.ErrNotFound) {
+			return oapi.GetDevice404JSONResponse{
+				Code: "not_found",
+				Message: "device not found",
+			}, nil
+		}
+		return oapi.GetDevice500JSONResponse{
+			Code: "internal_error",
+			Message: err.Error(),
+		}, nil
+	}
+
+	return oapi.GetDevice200JSONResponse(deviceToOAPI(*device)), nil
+}
+
+// DeleteDevice unregisters a device
+func (s *ApiService) DeleteDevice(ctx context.Context, request oapi.DeleteDeviceRequestObject) (oapi.DeleteDeviceResponseObject, error) {
+	err := s.DeviceManager.DeleteDevice(ctx, request.Id)
+	if err != nil {
+		switch {
+		case errors.Is(err, devices.ErrNotFound):
+			return oapi.DeleteDevice404JSONResponse{
+				Code: "not_found",
+				Message: "device not found",
+			}, nil
+		case errors.Is(err, devices.ErrInUse):
+			return oapi.DeleteDevice409JSONResponse{
+				Code: "in_use",
+				Message: "device is attached to an instance",
+			}, nil
+		default:
+			return oapi.DeleteDevice500JSONResponse{
+				Code: "internal_error",
+				Message: err.Error(),
+			}, nil
+		}
+	}
+
+	return oapi.DeleteDevice204Response{}, nil
+}
+
+// Helper functions
+
+func deviceToOAPI(d devices.Device) oapi.Device {
+	deviceType := oapi.DeviceType(d.Type)
+	return oapi.Device{
+		Id: d.Id,
+		Name: &d.Name,
+		Type: deviceType,
+		PciAddress: d.PCIAddress,
+		VendorId: d.VendorID,
+		DeviceId: d.DeviceID,
+		IommuGroup: d.IOMMUGroup,
+		BoundToVfio: d.BoundToVFIO,
+		AttachedTo: d.AttachedTo,
+		CreatedAt: d.CreatedAt,
+	}
+}
+
+func availableDeviceToOAPI(d devices.AvailableDevice) oapi.AvailableDevice {
+	return oapi.AvailableDevice{
+		PciAddress: d.PCIAddress,
+		VendorId: d.VendorID,
+		DeviceId: d.DeviceID,
+		VendorName: &d.VendorName,
+		DeviceName: &d.DeviceName,
+		IommuGroup: d.IOMMUGroup,
+		CurrentDriver: d.CurrentDriver,
+	}
+}
+
+

cmd/api/api/instances.go

Lines changed: 7 additions & 0 deletions
@@ -96,6 +96,12 @@ func (s *ApiService) CreateInstance(ctx context.Context, request oapi.CreateInst
 		networkEnabled = *request.Body.Network.Enabled
 	}
 
+	// Parse devices (GPU passthrough)
+	var deviceRefs []string
+	if request.Body.Devices != nil {
+		deviceRefs = *request.Body.Devices
+	}
+
 	// Parse volumes
 	var volumes []instances.VolumeAttachment
 	if request.Body.Volumes != nil {
@@ -139,6 +145,7 @@ func (s *ApiService) CreateInstance(ctx context.Context, request oapi.CreateInst
 		Vcpus: vcpus,
 		Env: env,
 		NetworkEnabled: networkEnabled,
+		Devices: deviceRefs,
 		Volumes: volumes,
 	}
 

cmd/api/main.go

Lines changed: 12 additions & 0 deletions
@@ -172,6 +172,18 @@ func run() error {
 	}
 	logger.Info("Network manager initialized")
 
+	// Reconcile device state (clears orphaned attachments from crashed VMs)
+	// Set up liveness checker so device reconciliation can accurately detect orphaned attachments
+	logger.Info("Reconciling device state...")
+	livenessChecker := instances.NewLivenessChecker(app.InstanceManager)
+	if livenessChecker != nil {
+		app.DeviceManager.SetLivenessChecker(livenessChecker)
+	}
+	if err := app.DeviceManager.ReconcileDevices(app.Ctx); err != nil {
+		logger.Error("failed to reconcile device state", "error", err)
+		return fmt.Errorf("reconcile device state: %w", err)
+	}
+
 	// Initialize ingress manager (starts Caddy daemon and DNS server for dynamic upstreams)
 	logger.Info("Initializing ingress manager...")
 	if err := app.IngressManager.Initialize(app.Ctx); err != nil {
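
The startup code in cmd/api/main.go hands the device manager a liveness checker built from the instance manager before reconciling. According to the commit message, this adapter exists so the devices package can query instance state without importing the instances package. The single-file sketch below shows the general shape of that pattern under stated assumptions: the consumer declares a small interface, and the other side supplies an adapter. The method set of `InstanceLivenessChecker`, and the `deviceManager`, `instanceStore`, and `livenessAdapter` types, are hypothetical. The real signatures (which likely take a `context.Context`) are not shown in this view.

```go
package main

import (
	"fmt"
	"sort"
)

// InstanceLivenessChecker is declared on the "devices" side. The name comes
// from the commit message; this one-method shape is an assumption.
type InstanceLivenessChecker interface {
	IsInstanceRunning(instanceID string) bool
}

// deviceManager is a hypothetical stand-in for the device manager.
type deviceManager struct {
	attachedTo map[string]string // device name -> instance ID
	liveness   InstanceLivenessChecker
}

func (m *deviceManager) SetLivenessChecker(c InstanceLivenessChecker) {
	m.liveness = c
}

// Reconcile clears attachments whose instance is no longer running and
// returns the released device names in sorted order. Without a checker
// it changes nothing, since liveness cannot be determined.
func (m *deviceManager) Reconcile() []string {
	var released []string
	if m.liveness == nil {
		return released
	}
	for dev, inst := range m.attachedTo {
		if !m.liveness.IsInstanceRunning(inst) {
			delete(m.attachedTo, dev) // deleting during range is safe in Go
			released = append(released, dev)
		}
	}
	sort.Strings(released)
	return released
}

// instanceStore is a hypothetical stand-in for the instance manager.
type instanceStore struct {
	running map[string]bool
}

// livenessAdapter lives on the "instances" side and satisfies the
// interface above, so the dependency points in one direction only.
type livenessAdapter struct {
	store *instanceStore
}

func (a livenessAdapter) IsInstanceRunning(id string) bool {
	return a.store.running[id]
}

func main() {
	store := &instanceStore{running: map[string]bool{"vm-a": true}}
	m := &deviceManager{attachedTo: map[string]string{
		"gpu0": "vm-a",
		"gpu1": "vm-crashed",
	}}
	m.SetLivenessChecker(livenessAdapter{store: store})

	fmt.Println("released:", m.Reconcile())
	fmt.Println("still attached:", len(m.attachedTo))
}
```

In this sketch the device attached to the crashed instance is released while the one attached to the running instance is kept, which is the orphan cleanup the startup comment describes.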

cmd/api/wire.go

Lines changed: 3 additions & 0 deletions
@@ -9,6 +9,7 @@ import (
 	"github.com/google/wire"
 	"github.com/onkernel/hypeman/cmd/api/api"
 	"github.com/onkernel/hypeman/cmd/api/config"
+	"github.com/onkernel/hypeman/lib/devices"
 	"github.com/onkernel/hypeman/lib/images"
 	"github.com/onkernel/hypeman/lib/ingress"
 	"github.com/onkernel/hypeman/lib/instances"
@@ -27,6 +28,7 @@ type application struct {
 	ImageManager images.Manager
 	SystemManager system.Manager
 	NetworkManager network.Manager
+	DeviceManager devices.Manager
 	InstanceManager instances.Manager
 	VolumeManager volumes.Manager
 	IngressManager ingress.Manager
@@ -44,6 +46,7 @@ func initializeApp() (*application, func(), error) {
 		providers.ProvideImageManager,
 		providers.ProvideSystemManager,
 		providers.ProvideNetworkManager,
+		providers.ProvideDeviceManager,
 		providers.ProvideInstanceManager,
 		providers.ProvideVolumeManager,
 		providers.ProvideIngressManager,

cmd/api/wire_gen.go

Lines changed: 8 additions & 5 deletions
Some generated files are not rendered by default.
