@sjmiller609
Collaborator

No description provided.

@github-actions

github-actions bot commented Jan 5, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: Add vGPU support

Edit this comment to update it. It will appear in the SDK's changelogs.

hypeman-typescript studio · code · diff

Your SDK built successfully.
generate ⚠️ build ✅ lint ✅ test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/322e702cedb32a7b5119490d8b124481f18fbd6d/dist.tar.gz
hypeman-go studio · code · diff

Your SDK built successfully.
generate ⚠️ lint ✅ test ✅

go get github.com/stainless-sdks/hypeman-go@c068f9b7358502cd722e6cc30a654c07044c0d53
hypeman-cli studio · conflict

⏳ These are partial results; builds are still running.


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Last updated: 2026-01-05 22:28:29 UTC

if err != nil {
	continue
}
instances, err := strconv.Atoi(strings.TrimSpace(string(data)))

The error from DestroyMdev is being silently ignored during cleanup. Consider logging this error to aid debugging when orphaned mdevs fail to clean up.

Suggested change
instances, err := strconv.Atoi(strings.TrimSpace(string(data)))
if err := DestroyMdev(mdev.UUID); err != nil {
	// Log but continue - best effort cleanup
	fmt.Fprintf(os.Stderr, "failed to destroy orphaned mdev %s: %v\n", mdev.UUID, err)
	continue
}

for i, p := range parts {
	if strings.HasPrefix(p, "0000:") && i+1 < len(parts) && parts[i+1] == uuid {
		vfAddress = p
		break

Consider wrapping this error with additional context about which VF was being targeted - this will help debugging when mdev creation fails in production.

Suggested change
break
if err := os.WriteFile(createPath, []byte(mdevUUID), 0200); err != nil {
	return nil, fmt.Errorf("create mdev on VF %s: %w", targetVF, err)
}

}

// getProfileNameFromType resolves internal type (nvidia-556) to profile name (L40S-1Q)
func getProfileNameFromType(profileType, vfAddress string) string {

The error from mdevctl undefine is silently discarded. While this is "best effort", if it fails unexpectedly (e.g., mdevctl not installed), subsequent sysfs removal might also fail. Consider logging the error at debug level if mdevctl is available but fails.

log.ErrorContext(ctx, "failed to create mdev", "profile", req.GPU.Profile, "error", err)
return nil, fmt.Errorf("create vGPU mdev for profile %s: %w", req.GPU.Profile, err)
}
gpuProfile = req.GPU.Profile

There's a potential race condition here: if multiple instances request the same profile concurrently, both could succeed at CreateMdev targeting the same available VF before either completes. Consider adding a mutex or using file-based locking around mdev creation to prevent this.

})
}

return vfs, nil

This calls ListMdevDevices() for every VF during discovery, which could result in O(n*m) operations where n is VFs and m is mdevs. Consider listing mdevs once and building a lookup map to improve performance on hosts with many VFs.

Collaborator Author

There is a performance issue when I call the resources endpoint, so this is probably why. Needs investigating.
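The lookup-map idea from the review comment could look roughly like this. The `mdevDevice` and `vfInfo` shapes and the `markVFs` function are simplified assumptions, not the PR's actual types. The mdev list is fetched once and indexed by VF address, so the per-VF check becomes a map lookup and the total work is O(n+m).

```go
package main

import "fmt"

// Hypothetical, simplified shapes; the PR's real types are not shown here.
type mdevDevice struct {
	UUID      string
	VFAddress string
}

type vfInfo struct {
	Address string
	HasMdev bool
}

// markVFs lists mdevs a single time, builds an index keyed by VF address,
// then annotates every VF with a constant-time lookup.
func markVFs(vfAddresses []string, listMdevs func() ([]mdevDevice, error)) ([]vfInfo, error) {
	mdevs, err := listMdevs()
	if err != nil {
		return nil, fmt.Errorf("list mdevs: %w", err)
	}
	byVF := make(map[string]struct{}, len(mdevs))
	for _, m := range mdevs {
		byVF[m.VFAddress] = struct{}{}
	}
	vfs := make([]vfInfo, 0, len(vfAddresses))
	for _, addr := range vfAddresses {
		_, used := byVF[addr]
		vfs = append(vfs, vfInfo{Address: addr, HasMdev: used})
	}
	return vfs, nil
}

func main() {
	list := func() ([]mdevDevice, error) {
		return []mdevDevice{{UUID: "example-uuid", VFAddress: "0000:82:00.5"}}, nil
	}
	vfs, err := markVFs([]string{"0000:82:00.4", "0000:82:00.5"}, list)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, vf := range vfs {
		fmt.Printf("%s hasMdev=%v\n", vf.Address, vf.HasMdev)
	}
}
```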

}

switch mode {
case devices.GPUModeVGPU:

DetectHostGPUMode() internally calls DiscoverAvailableDevices() which does filesystem I/O. Then for vGPU mode, getVGPUStatus() calls DiscoverVFs() doing more I/O. Consider caching the mode detection result or combining detection with status retrieval to reduce redundant syscalls on every /resources API call.

Base automatically changed from resources to main January 5, 2026 22:05