feat(scheduler): add node nouse gpuuuid function #1206

Open
ZhengW22 wants to merge 2 commits into Project-HAMi:master from ZhengW22:master

@ZhengW22

What type of PR is this?

What this PR does / why we need it:
This PR adds the capability to disable GPUs at the node level by applying annotations to nodes. GPUs matching the specified UUIDs will no longer be allocated to any pods.

The implementation works by setting the used count of the corresponding node GPUs to their maximum capacity when calculating nodeUsage, effectively occupying those resources. This approach maintains compatibility with scheduling logic for different types of GPU cards.
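
The "occupy the disabled GPUs" idea in this description can be sketched as follows. This is only a minimal illustration under stated assumptions: the `deviceUsage` struct, its fields, and the `occupyDisabled` helper are hypothetical names invented for this example and are not taken from the HAMi code base.

```go
package main

import "fmt"

// deviceUsage is a hypothetical, simplified usage record for one GPU.
// It is not HAMi's actual type.
type deviceUsage struct {
	ID       string
	Count    int32 // maximum number of shares the device can serve
	Used     int32 // shares currently allocated
	TotalMem int32 // total device memory, in MiB
	UsedMem  int32 // memory currently allocated, in MiB
}

// occupyDisabled marks every device whose ID appears in disabled as fully
// used, so a capacity check such as Used < Count can never succeed for it.
func occupyDisabled(devices []deviceUsage, disabled map[string]bool) {
	for i := range devices {
		if disabled[devices[i].ID] {
			devices[i].Used = devices[i].Count
			devices[i].UsedMem = devices[i].TotalMem
		}
	}
}

func main() {
	devices := []deviceUsage{
		{ID: "GPU-AAA", Count: 10, TotalMem: 16384},
		{ID: "GPU-BBB", Count: 10, TotalMem: 16384},
	}
	occupyDisabled(devices, map[string]bool{"GPU-AAA": true})
	for _, d := range devices {
		fmt.Printf("%s used=%d/%d mem=%d/%d\n", d.ID, d.Used, d.Count, d.UsedMem, d.TotalMem)
	}
}
```

Because the device stays in the list and only its counters change, vendor-specific fit checks would not need to know about the annotation, which is presumably the compatibility benefit this description refers to.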

Which issue(s) this PR fixes:
No.

Special notes for your reviewer:
No.

Does this PR introduce a user-facing change?:
No.

@hami-robot hami-robot bot requested a review from chaunceyjiang July 15, 2025 08:30
@hami-robot hami-robot bot requested a review from ouyangluwei163 July 15, 2025 08:30
@github-actions github-actions bot added the kind/feature new function label Jul 15, 2025
@hami-robot hami-robot bot added the size/L label Jul 15, 2025
@ZhengW22
Author

I recreated PR #1154 to add the sign-off info.

@archlitchi
Member

Please fix the go-lint issues.

@ZhengW22
Author

~> make verify
hack/verify-all.sh

+ bash hack/../hack/verify-staticcheck.sh
Using golangci-lint version:
golangci-lint has version 2.2.1 built with go1.24.4 from (unknown, modified: ?, mod sum: "h1:01r5ueY3oq8gtqgA5TGtBcS+LYZ/dEzZ59/AN1NsT2E=") on (unknown)
0 issues.
Congratulations! All Go source files have passed staticcheck.
+ bash hack/../hack/verify-license.sh
++ dirname hack/../hack/verify-license.sh
+ REPO_ROOT=hack/../hack/..
+ cd hack/../hack/..
++ which addlicense
+ [[ /home/s123zz123/Code/Library/go/bin/addlicense == '' ]]
++ which addlicense
+ ADDLICENSE_BIN=/home/s123zz123/Code/Library/go/bin/addlicense
++ /home/s123zz123/Code/Library/go/bin/addlicense -check -ignore 'benchmarks/' -ignore 'charts/' -ignore 'docs/' -ignore 'docker/' -ignore 'examples/' -ignore 'lib/' -ignore 'libvgpu/' -ignore 'third_party/' -ignore 'vendor/' -ignore '_output/' -ignore '.github/' -ignore '/.md' -ignore '**/.yaml' -ignore '/*.yml' -ignore '/*.json' -ignore '.idea/**' .
+ missing_license_header_files=
+ [[ -n '' ]]
+ echo 'Congratulations! All files have passed license header check.'
Congratulations! All files have passed license header check.
+ bash hack/../hack/verify-import-aliases.sh
checking-imports:
    /home/s123zz123/Code/fork/HAMi/cmd
    /home/s123zz123/Code/fork/HAMi/cmd/device-plugin
    /home/s123zz123/Code/fork/HAMi/cmd/device-plugin/nvidia
    /home/s123zz123/Code/fork/HAMi/cmd/scheduler
    /home/s123zz123/Code/fork/HAMi/cmd/vGPUmonitor
    /home/s123zz123/Code/fork/HAMi/cmd/vGPUmonitor/noderpc
    /home/s123zz123/Code/fork/HAMi/cmd/vGPUmonitor/testcollector
    /home/s123zz123/Code/fork/HAMi/pkg
    /home/s123zz123/Code/fork/HAMi/pkg/device
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice/nvinternal
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice/nvinternal/cdi
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice/nvinternal/info
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice/nvinternal/mig
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice/nvinternal/plugin
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice/nvinternal/plugin/manager
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice/nvinternal/rm
    /home/s123zz123/Code/fork/HAMi/pkg/device/ascend
    /home/s123zz123/Code/fork/HAMi/pkg/device/cambricon
    /home/s123zz123/Code/fork/HAMi/pkg/device/common
    /home/s123zz123/Code/fork/HAMi/pkg/device/enflame
    /home/s123zz123/Code/fork/HAMi/pkg/device/hygon
    /home/s123zz123/Code/fork/HAMi/pkg/device/iluvatar
    /home/s123zz123/Code/fork/HAMi/pkg/device/kunlun
    /home/s123zz123/Code/fork/HAMi/pkg/device/metax
    /home/s123zz123/Code/fork/HAMi/pkg/device/mthreads
    /home/s123zz123/Code/fork/HAMi/pkg/device/nvidia
    /home/s123zz123/Code/fork/HAMi/pkg/k8sutil
    /home/s123zz123/Code/fork/HAMi/pkg/monitor
    /home/s123zz123/Code/fork/HAMi/pkg/monitor/nvidia
    /home/s123zz123/Code/fork/HAMi/pkg/monitor/nvidia/v0
    /home/s123zz123/Code/fork/HAMi/pkg/monitor/nvidia/v1
    /home/s123zz123/Code/fork/HAMi/pkg/oci
    /home/s123zz123/Code/fork/HAMi/pkg/scheduler
    /home/s123zz123/Code/fork/HAMi/pkg/scheduler/config
    /home/s123zz123/Code/fork/HAMi/pkg/scheduler/policy
    /home/s123zz123/Code/fork/HAMi/pkg/scheduler/routes
    /home/s123zz123/Code/fork/HAMi/pkg/util
    /home/s123zz123/Code/fork/HAMi/pkg/util/client
    /home/s123zz123/Code/fork/HAMi/pkg/util/client/testdata
    /home/s123zz123/Code/fork/HAMi/pkg/util/flag
    /home/s123zz123/Code/fork/HAMi/pkg/util/nodelock
    /home/s123zz123/Code/fork/HAMi/pkg/version

Passed import-aliases verification.

I have already fixed all the issues.

@archlitchi
Member

CC @Shouren

@codecov

codecov bot commented Jul 16, 2025

Codecov Report

❌ Patch coverage is 89.28571% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/scheduler/nodes.go | 89.28% | 2 Missing and 1 partial ⚠️ |

| Flag | Coverage Δ |
|---|---|
| unittests | 66.41% <89.28%> (+0.06%) ⬆️ |

| Files with missing lines | Coverage Δ |
|---|---|
| pkg/scheduler/nodes.go | 86.36% <89.28%> (+1.36%) ⬆️ |

... and 1 file with indirect coverage changes


@ZhengW22
Author

Do we need 100% code coverage for new code?

@wawa0210
Member

> Do we need 100% code coverage for new code?

We recommend focusing on unit test coverage; at a minimum, the new code needs to be covered.

@lengrongfu
Member

@ZhengW22
Author

That function relies on a pod annotation and only works with NVIDIA GPUs. My commit uses a node annotation and supports all current vendors. In terms of exclusion scope, that function makes a single pod avoid specific GPUs by UUID, whereas my commit makes the whole node exclude specific GPUs by UUID.

@ZhengW22
Author

Hello, could someone help review this PR?

@wawa0210
Member

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a feature to disable specific GPUs on a node via annotations, which is a valuable addition. The implementation is generic and supports multiple vendors, which is great.

However, I've found a few issues:

  • There's a discrepancy between the PR description and the implementation. The description states that disabled GPUs are marked as fully utilized, while the code removes them from the scheduler's view. Please clarify the intended behavior and update the description if necessary.
  • There appears to be a critical compilation error due to the use of undefined constants from the metax package.
  • The parsing of UUIDs from annotations could be more robust.
  • The unit tests for the new functionality are minimal and should be expanded to cover more scenarios.

I've left detailed comments on these points. Please address them to ensure the quality and correctness of the feature.

Comment on lines +40 to +48
var vendorUUIDMap = map[string][]string{
	nvidia.GPUNoUseUUID:         {nvidia.NvidiaGPUDevice},
	cambricon.MLUNoUseUUID:      {cambricon.CambriconMLUDevice},
	hygon.DCUNoUseUUID:          {hygon.HygonDCUDevice},
	iluvatar.IluvatarNoUseUUID:  {iluvatar.IluvatarGPUDevice},
	enflame.EnflameNoUseUUID:    {enflame.EnflameGPUDevice},
	mthreads.MthreadsNoUseUUID:  {mthreads.MthreadsGPUDevice},
	metax.MetaxNoUseUUID:        {metax.MetaxGPUDevice, metax.MetaxSGPUDevice},
}
Contributor

critical

The vendorUUIDMap uses constants from various device packages. However, metax.MetaxNoUseUUID and metax.MetaxSGPUDevice do not seem to be defined in pkg/device/metax/device.go. This will cause a compilation error. Please ensure all constants used here are defined in their respective packages.

Similarly, enflame.EnflameNoUseUUID is used, but its definition is not visible in the PR context. Please verify it exists.

docs/config.md Outdated

* `nvidia.com/nouse-gpuuuid`:

String type, supports multiple device IDs. Multiple devices should be separated by comma, ie: "GPU-XXX"
Contributor

medium

The example ie: "GPU-XXX" only shows a single device ID, but the description says it supports multiple. It would be more helpful to provide an example with multiple IDs to match the description and the Chinese documentation.

Suggested change
- String type, supports multiple device IDs. Multiple devices should be separated by comma, ie: "GPU-XXX"
+ String type, supports multiple device IDs. Multiple devices should be separated by comma, ie: "GPU-AAA,GPU-BBB"

Comment on lines +101 to +103
for annokKey, vendor := range vendorUUIDMap {
	klog.V(5).Infof("Current annokey is %s, and vendor is %v", annokKey, vendor)
	if value, ok := nodeInfo.Node.Annotations[annokKey]; ok {
Contributor

medium

There is a typo in the variable name annokKey. It should be annoKey. The log message also contains a typo annokey.

Suggested change
- for annokKey, vendor := range vendorUUIDMap {
- 	klog.V(5).Infof("Current annokey is %s, and vendor is %v", annokKey, vendor)
- 	if value, ok := nodeInfo.Node.Annotations[annokKey]; ok {
+ for annoKey, vendor := range vendorUUIDMap {
+ 	klog.V(5).Infof("Current annoKey is %s, and vendor is %v", annoKey, vendor)
+ 	if value, ok := nodeInfo.Node.Annotations[annoKey]; ok {

Comment on lines +329 to +367
func Test_rmDeviceByNodeAnnotation(t *testing.T) {
	id1 := "60151478-4709-4242-a8c1-a944252d194b"
	type args struct {
		nodeInfo *util.NodeInfo
	}
	tests := []struct {
		name string
		args args
		want []util.DeviceInfo
	}{
		{
			name: "Test remove device",
			args: args{
				nodeInfo: &util.NodeInfo{
					Node:    &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{nvidia.GPUNoUseUUID: id1}}},
					Devices: []util.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
				},
			},
			want: []util.DeviceInfo{},
		},
		{
			name: "Test no removing device",
			args: args{
				nodeInfo: &util.NodeInfo{
					Node:    &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{"test-key": ""}}},
					Devices: []util.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
				},
			},
			want: []util.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
		},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := rmDeviceByNodeAnnotation(tt.args.nodeInfo); !reflect.DeepEqual(got, tt.want) {
				t.Errorf("rmDeviceByNodeAnnotation() = %v, want %v", got, tt.want)
			}
		})
	}
}
Contributor

medium

The test cases for rmDeviceByNodeAnnotation are not comprehensive enough. Please consider adding more tests to cover the following scenarios:

  • Multiple UUIDs in the annotation value.
  • Multiple devices on the node, with some being removed and some not.
  • Multiple annotations for different vendors on the same node.
  • Malformed annotation values (e.g., with extra spaces or empty parts like "uuid1,,uuid2").
  • A case where a UUID matches but the device vendor does not.
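
The malformed-value scenarios above suggest a tolerant parser for the annotation value. The following is a minimal sketch under assumptions: `parseUUIDList` is a hypothetical helper name, not a function from this PR, and it only illustrates trimming whitespace and skipping empty parts.

```go
package main

import (
	"fmt"
	"strings"
)

// parseUUIDList splits a comma-separated annotation value into UUIDs,
// trimming surrounding whitespace and dropping empty parts.
// The function name is hypothetical; it is not taken from the PR.
func parseUUIDList(value string) []string {
	parts := strings.Split(value, ",")
	ids := make([]string, 0, len(parts))
	for _, p := range parts {
		if id := strings.TrimSpace(p); id != "" {
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	// Inputs mirror the scenarios listed in the review comment.
	for _, v := range []string{"uuid1,uuid2", "uuid1, uuid2 ", "uuid1,,uuid2", ""} {
		fmt.Printf("%q -> %q\n", v, parseUUIDList(v))
	}
}
```

Each input in `main` maps naturally onto one table-driven test case, so the same values could be reused in `Test_rmDeviceByNodeAnnotation`.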

@ZhengW22
Author

ZhengW22 commented Aug 5, 2025

@Shouren Hello, I added the node nouse-id docs and unit test cases. If you have spare time, please help me check them.

"github.com/Project-HAMi/HAMi/pkg/device/nvidia"
)

var vendorUUIDMap = map[string][]string{
Collaborator

@ZhengW22 This is a map from an annotation key to vendor names, so the variable name vendorUUIDMap confuses me.

Author

Should I rename the map from vendorUUIDMap to vendorNoUseAnnoKeyMap?

}

func rmDeviceByNodeAnnotation(nodeInfo *util.NodeInfo) []util.DeviceInfo {
disableGPUUUIDVendorMap := make(map[string][]string)
Collaborator

@ZhengW22 I prefer a map[vendorName]map[uuid]bool, which avoids potential UUID conflicts between vendors.
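
A nested vendor-to-UUID set of the kind suggested here might look like the sketch below. The `deviceInfo` and `disabledSet` types and their methods are hypothetical stand-ins written for this illustration, not the scheduler's real types.

```go
package main

import "fmt"

// deviceInfo is a hypothetical stand-in for the scheduler's device record.
type deviceInfo struct {
	Vendor string
	ID     string
}

// disabledSet maps vendor name -> set of disabled UUIDs.
type disabledSet map[string]map[string]bool

// add records that uuid is disabled for vendor, creating the inner set lazily.
func (s disabledSet) add(vendor, uuid string) {
	if s[vendor] == nil {
		s[vendor] = make(map[string]bool)
	}
	s[vendor][uuid] = true
}

// filter returns the devices that are not disabled for their own vendor.
func (s disabledSet) filter(devices []deviceInfo) []deviceInfo {
	kept := make([]deviceInfo, 0, len(devices))
	for _, d := range devices {
		// Indexing a nil inner map is safe in Go and yields false.
		if !s[d.Vendor][d.ID] {
			kept = append(kept, d)
		}
	}
	return kept
}

func main() {
	s := disabledSet{}
	s.add("vendor-a", "id-1")
	devices := []deviceInfo{
		{Vendor: "vendor-a", ID: "id-1"}, // removed: vendor and UUID both match
		{Vendor: "vendor-b", ID: "id-1"}, // kept: same UUID, different vendor
	}
	fmt.Println(s.filter(devices))
}
```

Keying the outer map by vendor means an identical UUID string reported by two vendors is never conflated, which is the conflict this comment is concerned about.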

@Shouren
Collaborator

Shouren commented Aug 6, 2025

> @Shouren Hello, I added the node nouse-id docs and unit test cases. If you have spare time, please help me check them.

@ZhengW22 The docs in the docs directory are moving to the website repo, and changes to this directory are no longer allowed, so please submit a PR to the website repo to update the docs.

@ZhengW22
Author

ZhengW22 commented Aug 8, 2025

> > @Shouren Hello, I added the node nouse-id docs and unit test cases. If you have spare time, please help me check them.
>
> @ZhengW22 The docs in the docs directory are moving to the website repo, and changes to this directory are no longer allowed, so please submit a PR to the website repo to update the docs.

Do you mean I should delete the .md changes in this project? Also, I can only find the English version of the documents in https://github.com/Project-HAMi/website. Does this mean that I only need to submit the English version of the documentation?

@ZhengW22
Author

ZhengW22 commented Aug 8, 2025

@Shouren Hello, I have modified the code logic according to your suggestions.

@Shouren
Collaborator

Shouren commented Aug 11, 2025

> > > @Shouren Hello, I added the node nouse-id docs and unit test cases. If you have spare time, please help me check them.
> >
> > @ZhengW22 The docs in the docs directory are moving to the website repo, and changes to this directory are no longer allowed, so please submit a PR to the website repo to update the docs.
>
> Do you mean I should delete the .md changes in this project? Also, I can only find the English version of the documents in https://github.com/Project-HAMi/website. Does this mean that I only need to submit the English version of the documentation?

@Nimbus318 Can you tell @ZhengW22 how to add the Chinese version of the docs in the website repo?

@Nimbus318
Contributor

Nimbus318 commented Aug 11, 2025

@ZhengW22 You need to update both the corresponding English docs in the docs/userguide directory and the corresponding Chinese docs in the i18n/zh/docusaurus-plugin-content-docs/current/userguide directory.
After these changes, they will appear on https://project-hami.io/docs/next/ and will be included in the versioned docs for future releases.
No changes are needed for other already versioned directories.

@ZhengW22
Author

@Shouren Hello, I have created a new docs PR in the website project:
Project-HAMi/website#100

@Shouren
Collaborator

Shouren commented Aug 20, 2025

> @Shouren Hello, I have created a new docs PR in the website project: Project-HAMi/website#100

@ZhengW22 Please remove the docs from this PR, and I will check it later.

@ZhengW22
Author

@Shouren Hello, I have already removed the doc file.

@ZhengW22
Author

@wawa0210 Hello, please help review this PR.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a feature to disable GPUs at the node level using annotations, which is a valuable addition. However, my review has identified a few issues.

First, the PR is very large and bundles several unrelated changes. Besides the main feature, it includes a significant amount of new code for the NVIDIA device plugin and OCI runtime handling, as well as constants for scheduler policies. These additions are not mentioned in the PR description, making it difficult to review and understand their purpose and impact. It's highly recommended to split these unrelated changes into separate pull requests to follow best practices for code review.

Second, there's a discrepancy between the PR description and the implementation. The description states that disabled GPUs are handled by 'setting the used count... to their maximum capacity', but the code actually filters them out from the scheduler's list of available devices. Please update the description to match the implementation.

My review comments below focus on the 'disable GPU' feature. I've found a bug in the implementation and a major inconsistency in the added documentation that should be addressed.

Comment on lines +78 to +96
## Device UUID Selection

You can specify which GPU devices to use or exclude using annotations:

```yaml
apiVersion: v1
kind: Pod
metadata:
name: poddemo
annotations:
# Use specific GPU devices (comma-separated list)
enflame.com/use-gpuuuid: "node1-enflame-0,node1-enflame-1"
# Or exclude specific GPU devices (comma-separated list)
enflame.com/nouse-gpuuuid: "node1-enflame-2,node1-enflame-3"
spec:
# ... rest of pod spec
```

> **NOTE:** The device ID format is `{node-name}-enflame-{index}`. You can find the available device IDs in the node status.
Contributor

high

This "Device UUID Selection" section seems to describe a different feature than what is implemented in this pull request. The documentation describes pod-level annotations (enflame.com/use-gpuuuid, enflame.com/nouse-gpuuuid) for selecting GPUs, while the PR implements a node-level annotation for disabling GPUs from being scheduled. This is confusing. The documentation should be updated to accurately describe the new node-level feature. Additionally, the implementation in pkg/scheduler/nodes.go does not seem to include support for Enflame devices in the vendorNoUseAnnoKeyMap, so this feature wouldn't work for Enflame devices as documented.

} else {
	m.nodes[nodeID] = nodeInfo
}
nodeInfo.Devices = rmDeviceByNodeAnnotation(m.nodes[nodeID])
Contributor

high

There's a bug here. The nodeInfo.Devices is updated, but nodeInfo is the function argument. If the node already exists in m.nodes, this change will not be persisted in the node manager's state because you are modifying the argument, not the value stored in the map m.nodes. The change should be applied to m.nodes[nodeID].Devices.

Suggested change
- nodeInfo.Devices = rmDeviceByNodeAnnotation(m.nodes[nodeID])
+ m.nodes[nodeID].Devices = rmDeviceByNodeAnnotation(m.nodes[nodeID])
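
The difference between the two assignments can be reproduced in isolation. The sketch below is built on assumptions: `nodeInfo`, `nodeManager`, `filter`, and the way an existing cache entry is updated are simplified, hypothetical stand-ins (the real `addNode` merge logic is not shown in this thread), and the `fixed` flag exists only to compare both variants side by side.

```go
package main

import "fmt"

// nodeInfo and nodeManager are hypothetical, trimmed-down stand-ins.
type nodeInfo struct {
	Devices []string
}

type nodeManager struct {
	nodes map[string]*nodeInfo
}

// filter returns the node's devices without the disabled ID.
func filter(n *nodeInfo, disabled string) []string {
	kept := make([]string, 0, len(n.Devices))
	for _, id := range n.Devices {
		if id != disabled {
			kept = append(kept, id)
		}
	}
	return kept
}

// addNode caches info under id, then filters out "GPU-AAA".
// With fixed == false it assigns to the argument, as in the reviewed line;
// with fixed == true it assigns to the cached entry, as in the suggestion.
func (m *nodeManager) addNode(id string, info *nodeInfo, fixed bool) {
	if stored, ok := m.nodes[id]; ok {
		stored.Devices = info.Devices // assumed update of the existing entry
	} else {
		m.nodes[id] = info
	}
	if fixed {
		m.nodes[id].Devices = filter(m.nodes[id], "GPU-AAA")
	} else {
		info.Devices = filter(m.nodes[id], "GPU-AAA")
	}
}

func main() {
	for _, fixed := range []bool{false, true} {
		m := &nodeManager{nodes: map[string]*nodeInfo{"node1": {}}}
		m.addNode("node1", &nodeInfo{Devices: []string{"GPU-AAA", "GPU-BBB"}}, fixed)
		fmt.Printf("fixed=%v cached devices=%v\n", fixed, m.nodes["node1"].Devices)
	}
}
```

When the node is new, the argument and the cached entry are the same pointer, so both variants behave identically; the difference only shows up for a node that is already in the cache.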

}
newDeviceMap := make(map[string][]device.DeviceInfo)
for deviceName, deviceList := range nodeInfo.Devices {
	newDeviceList := make([]device.DeviceInfo, 0)
Contributor

medium

For better performance, it's a good practice to pre-allocate the slice with a known capacity to avoid multiple re-allocations when appending elements in the loop. You can initialize newDeviceList with a capacity equal to the length of deviceList.

Suggested change
- newDeviceList := make([]device.DeviceInfo, 0)
+ newDeviceList := make([]device.DeviceInfo, 0, len(deviceList))
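
As a small illustration of the pre-allocation suggestion, the sketch below filters a slice into a result whose capacity is reserved up front. `filterOut` is a hypothetical helper working on plain strings rather than `device.DeviceInfo`.

```go
package main

import "fmt"

// filterOut copies every ID except skip into a new slice whose capacity is
// reserved up front, so append never has to grow the backing array.
func filterOut(ids []string, skip string) []string {
	kept := make([]string, 0, len(ids))
	for _, id := range ids {
		if id != skip {
			kept = append(kept, id)
		}
	}
	return kept
}

func main() {
	ids := []string{"GPU-AAA", "GPU-BBB", "GPU-CCC"}
	kept := filterOut(ids, "GPU-BBB")
	// len reflects what survived the filter; cap reflects the reservation.
	fmt.Println(kept, len(kept), cap(kept))
}
```

The trade-off is that the capacity covers the worst case even when most devices are removed; for the handful of GPUs on a typical node that overhead is likely negligible.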

@hami-robot hami-robot bot added size/L and removed size/XXL labels Jan 29, 2026
@ZhengW22
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a feature to disable specific GPUs on a node using annotations. The implementation filters out the disabled GPUs from the scheduler's node cache. While the overall approach is sound, I've found a bug in how disabled GPU UUIDs are processed when there are spaces in the annotation value. I've also included suggestions to improve the test coverage and fix a redundant test case. Lastly, please note that the implementation differs from the PR description, which states that it works by maxing out the GPU's used resources; the code actually removes the device from the scheduler's view.

if vendorWithDisableGPUUUIDMap[vendor] == nil {
	vendorWithDisableGPUUUIDMap[vendor] = make(map[string]bool)
}
vendorWithDisableGPUUUIDMap[vendor][disableGPUUUID] = true
Contributor

high

There is a bug here. You are using disableGPUUUID as the map key, which is the raw value from splitting the annotation string. However, you should be using id, which is the result of strings.TrimSpace(disableGPUUUID). If the annotation value has spaces around the UUID (e.g., "uuid1, uuid2"), this will cause the lookup to fail.

Suggested change
- vendorWithDisableGPUUUIDMap[vendor][disableGPUUUID] = true
+ vendorWithDisableGPUUUIDMap[vendor][id] = true

}
}

func Test_rmDeviceByNodeAnnotation(t *testing.T) {
Contributor

medium

The current tests don't cover cases where the comma-separated UUID list in the annotation contains spaces (e.g., "uuid1, uuid2"). This could hide a bug where strings.TrimSpace is not being used correctly when populating the map of disabled UUIDs. Please consider adding a test case to cover this scenario to make the tests more robust.

Comment on lines +411 to +421
			name: "Test removing metax device, case2",
			args: args{
				nodeInfo: &device.NodeInfo{
					Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{metax.MetaxNoUseUUID: id1}}},
					// Devices: []device.DeviceInfo{{DeviceVendor: metax.MetaxSGPUDevice, ID: id1}},
					Devices: map[string][]device.DeviceInfo{metax.MetaxGPUDevice: []device.DeviceInfo{{DeviceVendor: metax.MetaxGPUDevice, ID: id1}}},
				},
			},
			want: map[string][]device.DeviceInfo{metax.MetaxGPUDevice: []device.DeviceInfo{}},
		},
	}
Contributor

medium

This test case, Test removing metax device, case2, is a duplicate of Test removing metax device, case1. Based on the vendorNoUseAnnoKeyMap and the commented-out code, the intention was likely to test the removal of a metax.MetaxSGPUDevice. The current test does not cover this. I've suggested a corrected test case to properly test this scenario.

		{
			name: "Test removing metax sgpu device",
			args: args{
				nodeInfo: &device.NodeInfo{
					Node:    &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{metax.MetaxNoUseUUID: id1}}},
					Devices: map[string][]device.DeviceInfo{metax.MetaxSGPUDevice: {{DeviceVendor: metax.MetaxSGPUDevice, ID: id1}}},
				},
			},
			want: map[string][]device.DeviceInfo{metax.MetaxSGPUDevice: []device.DeviceInfo{}},
		},

@ZhengW22
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a feature to disable specific GPUs on a node using annotations. The implementation filters out the disabled devices from the scheduler's cache when a node is added or updated. This is achieved by introducing a new function rmDeviceByNodeAnnotation and a map vendorNoUseAnnoKeyMap to handle vendor-specific annotations. The changes are accompanied by a comprehensive set of unit tests.

My review has identified a couple of areas for improvement:

  • A potential nil pointer dereference in rmDeviceByNodeAnnotation if it were ever called with a nil nodeInfo.
  • Some commented-out dead code in the new tests that should be cleaned up.

Also, please note that the implementation in the code (removing devices from the scheduler's cache) differs from the approach described in the PR description ("setting the used count... to their maximum capacity"). The implemented approach seems more effective, but it would be good to align the description with the code for future reference.

}

func rmDeviceByNodeAnnotation(nodeInfo *device.NodeInfo) map[string][]device.DeviceInfo {
vendorWithDisableGPUUUIDMap := make(map[string]map[string]bool)
Contributor

high

The function rmDeviceByNodeAnnotation does not check if nodeInfo is nil. If a nil nodeInfo is passed, it could cause a panic at nodeInfo.Node. Although the current call site in addNode seems to prevent this, adding a nil check at the beginning of the function is a good practice for robustness and to make the function safer for future use.

	if nodeInfo == nil {
		return nil
	}
	vendorWithDisableGPUUUIDMap := make(map[string]map[string]bool)

Comment on lines +347 to +416
// Devices: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}, {DeviceVendor: nvidia.NvidiaGPUDevice, ID: id2}},
Devices: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}, {DeviceVendor: nvidia.NvidiaGPUDevice, ID: id2}}},
},
},
want: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{}},
},
{
name: "Test remove one device",
args: args{
nodeInfo: &device.NodeInfo{
Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{nvidia.GPUNoUseUUID: id1}}},
Devices: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}}},
},
},
want: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{}},
},
{
name: "Test remove two devices",
args: args{
nodeInfo: &device.NodeInfo{
Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{nvidia.GPUNoUseUUID: strings.Join([]string{id1, id2}, ",")}}},
// Devices: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}, {DeviceVendor: nvidia.NvidiaGPUDevice, ID: id2}},
Devices: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}, {DeviceVendor: nvidia.NvidiaGPUDevice, ID: id2}}},
},
},
want: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{}},
},
{
name: "Test remove one device and keep one device",
args: args{
nodeInfo: &device.NodeInfo{
Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{nvidia.GPUNoUseUUID: strings.Join([]string{id2}, ",")}}},
// Devices: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}, {DeviceVendor: nvidia.NvidiaGPUDevice, ID: id2}},
Devices: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}, {DeviceVendor: nvidia.NvidiaGPUDevice, ID: id2}}},
},
},
// want: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
want: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}}},
},
{
name: "Test no removing device, case1",
args: args{
nodeInfo: &device.NodeInfo{
Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{"test-key": ""}}},
// Devices: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
Devices: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}}},
},
},
// want: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
want: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}}},
},
{
name: "Test no removing device, case2",
args: args{
nodeInfo: &device.NodeInfo{
Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{nvidia.GPUNoUseUUID: id2}}},
// Devices: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
Devices: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}}},
},
},
// want: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
want: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}}},
},
{
name: "Test removing metax device, case1",
args: args{
nodeInfo: &device.NodeInfo{
Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{metax.MetaxNoUseUUID: id1}}},
// Devices: []device.DeviceInfo{{DeviceVendor: metax.MetaxGPUDevice, ID: id1}},
Devices: map[string][]device.DeviceInfo{metax.MetaxGPUDevice: []device.DeviceInfo{{DeviceVendor: metax.MetaxGPUDevice, ID: id1}}},
Contributor

medium

There are several commented-out lines of code within the new test cases (e.g., lines 347, 368, 379, 383, 415). These lines appear to be remnants from previous refactoring. To improve code clarity and maintainability, please remove this dead code.

@ZhengW22
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a useful feature for disabling specific GPUs on a node via annotations. The implementation is clean and is supported by a good set of unit tests. The core logic in rmDeviceByNodeAnnotation correctly filters out devices based on the nouse-gpuuuid annotations for various vendors.

However, there is a discrepancy between the implementation and the pull request description. The description states that the feature works by setting the used count of disabled GPUs to their maximum capacity. The actual code removes these GPUs from the scheduler's cached list of devices for the node. While the outcome is similar (the GPU is not schedulable), the mechanism is different. Please update the PR description to accurately reflect the implementation, which will help future developers understand the code.

@ZhengW22
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a feature to disable specific GPUs on a node using annotations. The implementation works by filtering out the disabled devices from the scheduler's cache, preventing them from being allocated. This is a clean and effective approach. The code is well-tested with various scenarios.

I have a couple of suggestions for improvement:

  • The PR description seems to describe a different implementation strategy (marking devices as fully used). It would be beneficial to update it to accurately reflect the current implementation, which removes devices from the cache.
  • I've pointed out a design consideration regarding the new vendorNoUseAnnoKeyMap that could improve modularity and maintainability in the long run.

Overall, this is a solid contribution that adds valuable functionality.

Comment on lines +37 to +44
var vendorNoUseAnnoKeyMap = map[string][]string{
    nvidia.GPUNoUseUUID: {nvidia.NvidiaGPUDevice},
    cambricon.MLUNoUseUUID: {cambricon.CambriconMLUDevice},
    hygon.DCUNoUseUUID: {hygon.HygonDCUDevice},
    mthreads.MthreadsNoUseUUID: {mthreads.MthreadsGPUDevice},
    metax.MetaxNoUseUUID: {metax.MetaxGPUDevice, metax.MetaxSGPUDevice},
    kunlun.KunlunNoUseUUID: {kunlun.KunlunGPUDevice},
}
Contributor

medium

While this map works, it introduces a tight coupling between the scheduler package and all specific device packages (nvidia, cambricon, etc.). Every time a new device type with a 'no-use' annotation is added, this central map will need to be modified.

A more decoupled and maintainable approach would be to have each device implementation provide this information. For example, you could add a method to the device.Devices interface:

// In pkg/device/devices.go
type Devices interface {
    // ...
    GetNoUseAnnotation() (key string, vendors []string)
}

The scheduler could then build this map dynamically at startup by iterating over the registered device implementations. This would make the system more extensible.
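As a rough sketch of how that could look, the example below builds the map from a list of registered implementations. Everything here is an assumption for illustration: `NoUseAnnotator` stands in for the relevant part of the `Devices` interface, `fakeDevice` is a test double, and the annotation keys and vendor names are placeholders rather than HAMi's real constants.

```go
package main

import "fmt"

// NoUseAnnotator is a hypothetical slice of the Devices interface that
// carries only the method proposed in the review comment.
type NoUseAnnotator interface {
	GetNoUseAnnotation() (key string, vendors []string)
}

// fakeDevice is an illustrative implementation, not a real HAMi device type.
type fakeDevice struct {
	key     string
	vendors []string
}

func (f fakeDevice) GetNoUseAnnotation() (string, []string) {
	return f.key, f.vendors
}

// buildNoUseAnnoKeyMap collects annotation key -> vendor names from every
// registered device. Devices that return an empty key are skipped, which
// lets a device type opt out of the feature.
func buildNoUseAnnoKeyMap(devs []NoUseAnnotator) map[string][]string {
	m := make(map[string][]string, len(devs))
	for _, d := range devs {
		key, vendors := d.GetNoUseAnnotation()
		if key == "" {
			continue
		}
		m[key] = append(m[key], vendors...)
	}
	return m
}

func main() {
	// Placeholder keys and vendor names, chosen only for the example.
	registered := []NoUseAnnotator{
		fakeDevice{key: "example.com/nouse-gpuuuid", vendors: []string{"VendorA"}},
		fakeDevice{key: "example.org/nouse-gpuuuid", vendors: []string{"VendorB", "VendorB-Shared"}},
	}
	fmt.Println(buildNoUseAnnoKeyMap(registered))
}
```

With this shape, adding a new device type would not require touching the scheduler package; the new implementation would only need to return its own key and vendor names.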

args: args{
nodeInfo: &device.NodeInfo{
Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{metax.MetaxNoUseUUID: id1}}},
// Devices: []device.DeviceInfo{{DeviceVendor: metax.MetaxGPUDevice, ID: id1}},
Contributor

medium

This commented-out line appears to be a remnant from development and can be safely removed to improve code clarity.

@ZhengW22
Author

@wawa0210 Hello, I have already fixed all critical problems.

@ZhengW22
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a useful feature to disable specific GPUs on a node via annotations. The implementation is clean and integrates well with the existing node management logic. The accompanying tests are thorough and cover important edge cases.

I've noticed a small discrepancy between the PR description and the code. The description states that disabled GPUs are marked as fully utilized, but the implementation filters them out from the scheduler's device list entirely. The implemented approach is cleaner and more efficient, so this is great!

I have a couple of suggestions for minor improvements to enhance performance and maintainability.

Comment on lines +109 to +122
newDeviceMap := make(map[string][]device.DeviceInfo)
for deviceName, deviceList := range nodeInfo.Devices {
    newDeviceList := make([]device.DeviceInfo, 0, len(deviceList))
    for _, d := range deviceList {
        if disableGPUUUIDMap, ok := vendorWithDisableGPUUUIDMap[d.DeviceVendor]; ok {
            if disabled := disableGPUUUIDMap[d.ID]; disabled {
                klog.V(5).Infof("Disable gpu uuid is : %s", d.ID)
                continue
            }
        }
        newDeviceList = append(newDeviceList, d)
    }
    newDeviceMap[deviceName] = newDeviceList
}
Contributor

medium

The logic for filtering devices is correct. However, for better performance and readability, the check for whether a vendor has devices to disable can be hoisted out of the inner loop. Since d.DeviceVendor will be the same as deviceName for all devices in deviceList, you can check for vendorWithDisableGPUUUIDMap[deviceName] once before iterating through the deviceList.

newDeviceMap := make(map[string][]device.DeviceInfo, len(nodeInfo.Devices))
for deviceName, deviceList := range nodeInfo.Devices {
    disableGPUUUIDMap, ok := vendorWithDisableGPUUUIDMap[deviceName]
    if !ok {
        newDeviceMap[deviceName] = deviceList
        continue
    }
    newDeviceList := make([]device.DeviceInfo, 0, len(deviceList))
    for _, d := range deviceList {
        if disabled := disableGPUUUIDMap[d.ID]; disabled {
            klog.V(5).Infof("Disable gpu uuid is : %s", d.ID)
            continue
        }
        newDeviceList = append(newDeviceList, d)
    }
    newDeviceMap[deviceName] = newDeviceList
}

Comment on lines +426 to +428
if got := rmDeviceByNodeAnnotation(tt.args.nodeInfo); !reflect.DeepEqual(got, tt.want) {
    t.Errorf("rmDeviceByNodeAnnotation() = %v, want %v", got, tt.want)
}
Contributor

medium

For consistency with other tests in this file (e.g., Test_addNode_ListNodes), it's better to use assert.DeepEqual for comparing the expected and actual results. This improves the maintainability and uniformity of the test suite.

   got := rmDeviceByNodeAnnotation(tt.args.nodeInfo)
   assert.DeepEqual(t, tt.want, got)

Signed-off-by: ZhengW22 <s123zz123@gmail.com>
@ZhengW22
Author

ZhengW22 commented Feb 2, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable feature that allows disabling specific GPUs on a node via annotations. The implementation is clean and effective, filtering out the specified GPUs from the scheduler's node cache to prevent them from being allocated. The logic is sound, and the accompanying test cases are comprehensive, covering various scenarios to ensure correctness. I have one minor suggestion to clean up the test code.

args: args{
nodeInfo: &device.NodeInfo{
Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{metax.MetaxNoUseUUID: id1}}},
// Devices: []device.DeviceInfo{{DeviceVendor: metax.MetaxGPUDevice, ID: id1}},
Contributor

medium

This commented-out line appears to be a remnant from development. It should be removed to improve code clarity and maintainability.

@ZhengW22
Author

ZhengW22 commented Feb 2, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a feature to disable specific GPUs on a node using annotations. The implementation filters out the disabled GPUs from the scheduler's device list when a node is added or updated. The overall approach is sound and the new logic is covered by unit tests.

However, I found a critical issue where constants from the metax package are used but not defined, which will cause a compilation failure. I've left a specific comment on this.

I also noticed a minor discrepancy between the PR description and the implementation. The description states that disabled GPUs are marked as fully used, while the code actually removes them from the list of available devices. Please consider updating the description to accurately reflect the implementation.

Finally, I've included a suggestion to simplify a conditional check for better code clarity.

cambricon.MLUNoUseUUID: {cambricon.CambriconMLUDevice},
hygon.DCUNoUseUUID: {hygon.HygonDCUDevice},
mthreads.MthreadsNoUseUUID: {mthreads.MthreadsGPUDevice},
metax.MetaxNoUseUUID: {metax.MetaxGPUDevice, metax.MetaxSGPUDevice},
Contributor

critical

The constants metax.MetaxNoUseUUID and metax.MetaxSGPUDevice are used here, but they are not defined in the pkg/device/metax/device.go file provided in the context. This will lead to a compilation error. Please ensure these constants are defined in the metax package. For consistency with other device packages, MetaxNoUseUUID could be defined as "metax-tech.com/nouse-gpuuuid".

Author

This is a Gemini false positive. Both constants exist: they are defined in pkg/device/metax/protocol.go and pkg/device/metax/sdevice.go, because the metax contributors do not keep all constants in a single file.

Comment on lines +114 to +117
if disabled := disableGPUUUIDMap[d.ID]; disabled {
    klog.V(5).Infof("Disable gpu uuid is : %s", d.ID)
    continue
}
Contributor

medium

This conditional check can be simplified. In Go, when checking a boolean value from a map, you can directly use the result of the map access in an if statement. If the key doesn't exist, the expression will evaluate to false, which is the desired behavior here.

Suggested change
- if disabled := disableGPUUUIDMap[d.ID]; disabled {
-     klog.V(5).Infof("Disable gpu uuid is : %s", d.ID)
-     continue
- }
+ if disableGPUUUIDMap[d.ID] {
+     klog.V(5).Infof("Disable gpu uuid is : %s", d.ID)
+     continue
+ }
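
The map behaviour this suggestion relies on can be checked in isolation. The sketch below uses a hypothetical helper `isDisabled` and placeholder UUIDs; it shows that indexing a `map[string]bool` with a missing key, or indexing a nil map, yields the zero value `false` without panicking.

```go
package main

import "fmt"

// isDisabled is a hypothetical helper that reads the map directly.
// A missing key (or a nil map) yields false, the zero value for bool.
func isDisabled(disabled map[string]bool, id string) bool {
	return disabled[id]
}

func main() {
	disabled := map[string]bool{"GPU-a": true}

	fmt.Println(isDisabled(disabled, "GPU-a")) // key present and true
	fmt.Println(isDisabled(disabled, "GPU-b")) // key absent

	var nilMap map[string]bool
	fmt.Println(isDisabled(nilMap, "GPU-a")) // reading from a nil map is safe
}
```

The same property is why the outer `ok` check on the vendor map is only an optimisation in the filtering loop: a missing vendor yields a nil inner map, and reading from it still returns `false`.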

@ZhengW22 ZhengW22 requested a review from Shouren February 2, 2026 06:19
…duling

Signed-off-by: ZhengW22 <s123zz123@gmail.com>
@ZhengW22
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a feature to disable specific GPUs on a node using annotations. The implementation correctly filters out the disabled devices from the node's device list. The changes are logical and well-tested.

I have a couple of suggestions for improvement:

  1. In pkg/device/ascend/device.go, an unused constant AscendDevice is introduced and should be removed.
  2. In pkg/scheduler/nodes.go, there's an opportunity to refactor duplicated code in rmDeviceByNodeAnnotation to improve maintainability.

Additionally, the PR description mentions that the implementation works by setting the used count to max capacity. However, the code actually removes the disabled devices from the node's device list. It would be beneficial to update the description to accurately reflect the implementation.

Comment on lines +43 to +44
// Ascend device vendor identifier.
AscendDevice = "Ascend"
Contributor

medium

The constant AscendDevice and its comment are new additions but appear to be unused in the codebase. To avoid dead code and confusion, they should be removed.

Comment on lines +89 to +127
// Process known vendor annotations
for annoKey, vendors := range vendorNoUseAnnoKeyMap {
    klog.V(5).Infof("Current annokey is %s, and vendor is %v", annoKey, vendors)
    if value, ok := nodeInfo.Node.Annotations[annoKey]; ok {
        disableGPUUUIDList := strings.Split(value, ",")
        klog.V(5).Infof("Disable gpu uuid list is: %v", disableGPUUUIDList)
        for _, disableGPUUUID := range disableGPUUUIDList {
            if id := strings.TrimSpace(disableGPUUUID); id != "" {
                for _, vendor := range vendors {
                    if vendorWithDisableGPUUUIDMap[vendor] == nil {
                        vendorWithDisableGPUUUIDMap[vendor] = make(map[string]bool)
                    }
                    vendorWithDisableGPUUUIDMap[vendor][id] = true
                }
            }
        }
    }
}
// Process Ascend device annotations dynamically
// Ascend devices use format: hami.io/no-use-{CommonWord}-uuid
for annoKey, value := range nodeInfo.Node.Annotations {
    if strings.HasPrefix(annoKey, ascend.AscendNoUseUUIDPrefix) && strings.HasSuffix(annoKey, ascend.AscendNoUseUUIDSuffix) {
        klog.V(5).Infof("Processing Ascend annotation: %s", annoKey)
        disableGPUUUIDList := strings.Split(value, ",")
        klog.V(5).Infof("Disable Ascend device uuid list is: %v", disableGPUUUIDList)
        // Extract the device type from the annotation key
        // Format: hami.io/no-use-{DeviceType}-uuid
        deviceType := strings.TrimPrefix(annoKey, ascend.AscendNoUseUUIDPrefix)
        deviceType = strings.TrimSuffix(deviceType, ascend.AscendNoUseUUIDSuffix)
        for _, disableGPUUUID := range disableGPUUUIDList {
            if id := strings.TrimSpace(disableGPUUUID); id != "" {
                if vendorWithDisableGPUUUIDMap[deviceType] == nil {
                    vendorWithDisableGPUUUIDMap[deviceType] = make(map[string]bool)
                }
                vendorWithDisableGPUUUIDMap[deviceType][id] = true
            }
        }
    }
}
Contributor

medium

The logic for processing annotations for known vendors (lines 90-106) and for Ascend devices (lines 109-127) is very similar, leading to code duplication. This can be refactored to improve maintainability.

Consider extracting the common logic into a helper function. For example, a function addDisabledUUIDs could handle splitting the UUID string, trimming spaces, and populating the vendorWithDisableGPUUUIDMap. This would make the code cleaner and reduce redundancy.
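
One possible shape for such a helper is sketched below. The signature of `addDisabledUUIDs` is an assumption, since the review names the function but does not specify it. The prefix and suffix constants are inferred from the `hami.io/no-use-{DeviceType}-uuid` comment in the diff and are not copied from the ascend package, and `ascendDeviceType` and the vendor and device names in `main` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed values, inferred from the "hami.io/no-use-{DeviceType}-uuid"
// comment in the diff. The real constants live in the ascend package.
const (
	assumedNoUsePrefix = "hami.io/no-use-"
	assumedNoUseSuffix = "-uuid"
)

// addDisabledUUIDs splits a comma-separated annotation value, trims
// whitespace, skips empty entries, and marks each UUID as disabled for
// every vendor passed in. Inner maps are created on first use.
func addDisabledUUIDs(dst map[string]map[string]bool, value string, vendors ...string) {
	for _, raw := range strings.Split(value, ",") {
		id := strings.TrimSpace(raw)
		if id == "" {
			continue
		}
		for _, vendor := range vendors {
			if dst[vendor] == nil {
				dst[vendor] = make(map[string]bool)
			}
			dst[vendor][id] = true
		}
	}
}

// ascendDeviceType extracts {DeviceType} from a matching annotation key.
// The second result is false when the key does not have the assumed shape.
func ascendDeviceType(annoKey string) (string, bool) {
	if !strings.HasPrefix(annoKey, assumedNoUsePrefix) || !strings.HasSuffix(annoKey, assumedNoUseSuffix) {
		return "", false
	}
	t := strings.TrimPrefix(annoKey, assumedNoUsePrefix)
	return strings.TrimSuffix(t, assumedNoUseSuffix), true
}

func main() {
	disabled := make(map[string]map[string]bool)

	// Known-vendor path: one annotation value applied to several vendors.
	addDisabledUUIDs(disabled, "GPU-a, GPU-b,,", "VendorA", "VendorA-Shared")

	// Ascend-style path: the vendor name is derived from the annotation key.
	if deviceType, ok := ascendDeviceType("hami.io/no-use-ExampleNPU-uuid"); ok {
		addDisabledUUIDs(disabled, " NPU-0 ", deviceType)
	}

	fmt.Println(disabled)
}
```

Both loops in rmDeviceByNodeAnnotation would then reduce to a lookup followed by a single call to the helper, leaving only the key matching different between the two paths.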

@FouoF
Contributor

FouoF commented Feb 11, 2026

Can you rebase your PR onto the latest master? The CI fails because it fetches only the 10 most recent commits, so the original base commit is no longer reachable.

@ZhengW22
Author

Can you rebase your PR onto the latest master? The CI fails because it fetches only the 10 most recent commits, so the original base commit is no longer reachable.

I have already fixed this problem.
