Amd device plugin #748
a64d69f to
5038ad8
Compare
%global gover 1.31.0.8
%global rpmver %{gover}

Name: %{_cross_os}amd-k8s-device-plugin
Let's keep this consistent with the upstream naming
Suggested change:
- Name: %{_cross_os}amd-k8s-device-plugin
+ Name: %{_cross_os}rocm-k8s-device-plugin
Makes sense. Will update.
releases-url = "https://github.com/ROCm/k8s-device-plugin/releases"

[[package.metadata.build-package.external-files]]
url = "https://github.com/ROCm/k8s-device-plugin/archive/v1.31.0.8.tar.gz"
This is marked pre-release. Is there a "latest" release, or are all of them just marked pre-release?
All releases of the ROCm k8s-device-plugin (https://github.com/ROCm/k8s-device-plugin/releases?page=1) are marked as pre-release on their GitHub. v1.31.0.8 is the latest available release.
packages/libdrm/Cargo.toml
Outdated
releases-url = "https://dri.freedesktop.org/libdrm/"

[[package.metadata.build-package.external-files]]
url = "https://dri.freedesktop.org/libdrm/libdrm-2.4.123.tar.xz"
Looks like latest is 2.4.128
Will update.
packages/libdrm/libdrm.spec
Outdated
%files
%{_cross_attribution_file}
%{_cross_libdir}/*.so.*
Can we list these out?
[Service]
# Verify AMD GPU is detected
ExecStartPre=/usr/bin/ghostdog amd-gpu-present
The way this works for the other device plugin we have is that preconfigured.target is blocked if the driver is not loaded; that way the kubelet never starts, and neither does the device plugin. In this case, the driver could fail to load while kubelet still starts and becomes ready, and then this check fails, leaving the node in a degraded state. I'd prefer we deal with GPU detection and driver loading as a separate unit. The assumption should be that we can't get as far as starting kubelet without the drivers being loaded.
Should I create a load-amd-kernel-modules.service in the kernel-kit (similar to how NVIDIA does it) that blocks preconfigured.target if the driver isn't loaded? And then simplify the device plugin service in core-kit to just check for the kubelet socket, since we'd be guaranteed the driver is loaded by that point?
Also, for ghostdog match-driver amd - since AMD only has one driver type, should I just remove that entirely and use systemd's ConditionPathExists instead?
Discussed offline with @yeazelm - here's the plan:
Following the Neuron pattern, I'll create:
- [email protected]: adds conditions to check AMD GPU presence
- load-amd-modules.service: loads AMD kernel modules, RequiredBy=drivers.target
These will be added to the kernel package (similar to how Neuron does it in kernel-6.12.spec), so driver loading happens before drivers.target completes. This ensures kubelet never starts if the AMD driver fails to load.
Will also simplify the device plugin service in core-kit to remove driver checks since they'll be guaranteed to be loaded by that point.
sources/ghostdog/src/main.rs
Outdated
}
}

fn match_amd_driver(driver_flavor: &str) -> Result<()> {
I don't think we should mix semantics here: match-driver is for matching a chosen driver with the hardware present, but in this case you are having it check whether the driver is loaded. Would it not be easier to just use a path check in systemd if we want /sys/class/kfd to exist before proceeding? I don't think we want ghostdog match-driver concerning itself with an already loaded driver; it's about providing a boolean answer to whether that driver choice makes sense for the hardware present.
5038ad8 to
a48a8c0
Compare
Signed-off-by: Gaurav Sharma <[email protected]>
cbc0cb5 to
432bae0
Compare
Signed-off-by: Gaurav Sharma <[email protected]>
Add amd-gpu-present subcommand to ghostdog to detect AMD Instinct MI355X GPUs (device ID 75a3) via PCI device scanning.
Signed-off-by: Gaurav Sharma <[email protected]>
432bae0 to
919378b
Compare
Description of changes:
This PR adds support for AMD GPU detection and device plugin functionality to Bottlerocket:
- amd-gpu-present subcommand to ghostdog for detecting AMD GPUs via PCI
- match-driver amd rocm support to ghostdog to validate that the amdgpu kernel module and ROCm KFD driver are properly loaded before starting the device plugin

The implementation supports AMD Instinct MI355X GPUs (device ID 75a3).
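For reviewers less familiar with sysfs-based PCI detection, here is a minimal sketch of how a check like amd-gpu-present could work. It is not the code in this PR: the function names (`parse_hex_id`, `read_id`, `amd_gpu_present`) and the choice to scan `/sys/bus/pci/devices` are assumptions made for illustration. The only values taken from this PR are the device ID 75a3; 0x1002 is the PCI vendor ID assigned to AMD.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// PCI vendor ID assigned to AMD.
const AMD_VENDOR_ID: u16 = 0x1002;
/// Device ID given in this PR for the AMD Instinct MI355X.
const MI355X_DEVICE_ID: u16 = 0x75a3;

/// Parse a sysfs hex attribute such as "0x1002\n" into a u16.
/// (Hypothetical helper, not part of ghostdog.)
fn parse_hex_id(raw: &str) -> Option<u16> {
    let trimmed = raw.trim();
    let digits = trimmed.strip_prefix("0x").unwrap_or(trimmed);
    u16::from_str_radix(digits, 16).ok()
}

/// Read one hex attribute file ("vendor" or "device") from a PCI device directory.
/// Unreadable or malformed attributes are treated as "no match".
fn read_id(device_dir: &Path, attribute: &str) -> Option<u16> {
    let raw = fs::read_to_string(device_dir.join(attribute)).ok()?;
    parse_hex_id(&raw)
}

/// Return true if any device under `pci_root` reports the AMD vendor ID
/// together with the MI355X device ID.
fn amd_gpu_present(pci_root: &Path) -> io::Result<bool> {
    for entry in fs::read_dir(pci_root)? {
        let dir = entry?.path();
        if read_id(&dir, "vendor") == Some(AMD_VENDOR_ID)
            && read_id(&dir, "device") == Some(MI355X_DEVICE_ID)
        {
            return Ok(true);
        }
    }
    Ok(false)
}
```

On a real node this would be called as `amd_gpu_present(Path::new("/sys/bus/pci/devices"))`. Taking the root directory as a parameter keeps the scan testable against a temporary directory. Note that the sketch only answers "is matching hardware present"; it deliberately does not look at whether a driver is loaded.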
Testing done:
Launched AMD MI355X node and verified the device plugin detects GPUs correctly.
● amd-k8s-device-plugin.service - Start AMD kubernetes device plugin
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/amd-k8s-device-plugin.service; enabled; preset: enabled)
    Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
             └─00-aws-config.conf
     Active: active (running) since Thu 2025-11-13 09:46:39 UTC; 1 day 8h ago
 Invocation: 3a02d24942e14ec0aaa74440e9ed9e11
    Process: 47305 ExecStartPre=/usr/bin/sleep 0.1 (code=exited, status=0/SUCCESS)
    Process: 47346 ExecStartPre=/usr/bin/test -S /var/lib/kubelet/device-plugins/kubelet.sock (code=exited, status=0/SUCCESS)
   Main PID: 47349 (amd-device-plug)
      Tasks: 14 (limit: 629145)
     Memory: 26.8M (peak: 30.7M)
        CPU: 3.110s
     CGroup: /system.slice/amd-k8s-device-plugin.service
             └─47349 /usr/bin/amd-device-plugin -logtostderr=true -stderrthreshold=INFO -v=5

Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440712 47349 amdgpu.go:261] Devices map: map[0000:51:00.0:map[card:0 computePartitionType:spx devID:12832063173580113071 memoryPartitionType:nps1 nodeId:2 numaNode:0 renderD:128] 0000:52:00.0:map[card:8 computePartitionType:spx devID:10305807447086475360 memoryPartitionType:nps1 nodeId:3 numaNode:0 renderD:136] 0000:62:00.0:map[card:16 computePartitionType:spx devID:610766110054045314 memoryPartitionType:nps1 nodeId:4 numaNode:0 renderD:144] 0000:63:00.0:map[card:24 computePartitionType:spx devID:12108960588440933464 memoryPartitionType:nps1 nodeId:5 numaNode:0 renderD:152] 0000:73:00.0:map[card:32 computePartitionType:spx devID:9006080295669176546 memoryPartitionType:nps1 nodeId:6 numaNode:1 renderD:160] 0000:74:00.0:map[card:40 computePartitionType:spx devID:18073956319122529846 memoryPartitionType:nps1 nodeId:7 numaNode:1 renderD:168] 0000:84:00.0:map[card:48 computePartitionType:spx devID:7569327791852919328 memoryPartitionType:nps1 nodeId:8 numaNode:1 renderD:176] 0000:85:00.0:map[card:56 computePartitionType:spx devID:10504069062564729379 memoryPartitionType:nps1 nodeId:9 numaNode:1 renderD:184]]
Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440736 47349 amdgpu.go:278] Partition counts: map[spx_nps1:8]
Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440740 47349 plugin.go:254] Watching GPU with bus ID: 0000:74:00.0 NUMA Node: [1]
Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440746 47349 plugin.go:254] Watching GPU with bus ID: 0000:84:00.0 NUMA Node: [1]
Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440750 47349 plugin.go:254] Watching GPU with bus ID: 0000:85:00.0 NUMA Node: [1]
Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440752 47349 plugin.go:254] Watching GPU with bus ID: 0000:51:00.0 NUMA Node: [0]
Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440756 47349 plugin.go:254] Watching GPU with bus ID: 0000:52:00.0 NUMA Node: [0]
Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440758 47349 plugin.go:254] Watching GPU with bus ID: 0000:62:00.0 NUMA Node: [0]
Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440761 47349 plugin.go:254] Watching GPU with bus ID: 0000:63:00.0 NUMA Node: [0]
Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440764 47349 plugin.go:254] Watching GPU with bus ID: 0000:73:00.0 NUMA Node: [1]

Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.