-
Notifications
You must be signed in to change notification settings - Fork 0
Add vGPU support #52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add vGPU support #52
Conversation
✱ Stainless preview buildsThis PR will update the Edit this comment to update it. It will appear in the SDK's changelogs. ✅ hypeman-typescript studio · code · diff
✅ hypeman-go studio · code · diff
⏳ These are partial results; builds are still running. This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push. |
| if err != nil { | ||
| continue | ||
| } | ||
| instances, err := strconv.Atoi(strings.TrimSpace(string(data))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error from DestroyMdev is being silently ignored during cleanup. Consider logging this error to aid debugging when orphaned mdevs fail to clean up.
| instances, err := strconv.Atoi(strings.TrimSpace(string(data))) | |
| if err := DestroyMdev(mdev.UUID); err != nil { | |
| // Log but continue - best effort cleanup | |
| fmt.Fprintf(os.Stderr, "failed to destroy orphaned mdev %s: %v\n", mdev.UUID, err) | |
| continue | |
| } |
| for i, p := range parts { | ||
| if strings.HasPrefix(p, "0000:") && i+1 < len(parts) && parts[i+1] == uuid { | ||
| vfAddress = p | ||
| break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider wrapping this error with additional context about which VF was being targeted - this will help debugging when mdev creation fails in production.
| break | |
| if err := os.WriteFile(createPath, []byte(mdevUUID), 0200); err != nil { | |
| return nil, fmt.Errorf("create mdev on VF %s: %w", targetVF, err) | |
| } |
| } | ||
|
|
||
| // getProfileNameFromType resolves internal type (nvidia-556) to profile name (L40S-1Q) | ||
| func getProfileNameFromType(profileType, vfAddress string) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error from mdevctl undefine is silently discarded. While this is "best effort", if it fails unexpectedly (e.g., mdevctl not installed), subsequent sysfs removal might also fail. Consider logging the error at debug level if mdevctl is available but fails.
| log.ErrorContext(ctx, "failed to create mdev", "profile", req.GPU.Profile, "error", err) | ||
| return nil, fmt.Errorf("create vGPU mdev for profile %s: %w", req.GPU.Profile, err) | ||
| } | ||
| gpuProfile = req.GPU.Profile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a potential race condition here: if multiple instances request the same profile concurrently, both could succeed at CreateMdev targeting the same available VF before either completes. Consider adding a mutex or using file-based locking around mdev creation to prevent this.
| }) | ||
| } | ||
|
|
||
| return vfs, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This calls ListMdevDevices() for every VF during discovery, which could result in O(n*m) operations where n is VFs and m is mdevs. Consider listing mdevs once and building a lookup map to improve performance on hosts with many VFs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is some performance issue when I try to call the resources endpoint so this is probably why, needs investigating
| } | ||
|
|
||
| switch mode { | ||
| case devices.GPUModeVGPU: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DetectHostGPUMode() internally calls DiscoverAvailableDevices() which does filesystem I/O. Then for vGPU mode, getVGPUStatus() calls DiscoverVFs() doing more I/O. Consider caching the mode detection result or combining detection with status retrieval to reduce redundant syscalls on every /resources API call.
No description provided.