mgsharm commented Nov 14, 2025

Description of changes:

This PR adds support for AMD GPU detection and device plugin functionality to Bottlerocket:

  • libdrm package: Added libdrm 2.4.123 as one of the device plugin dependencies. Reference:
    link
  • amd-k8s-device-plugin package: Added AMD K8s device plugin v1.31.0.8 from ROCm for GPU resource management
  • AMD GPU detection: Added amd-gpu-present subcommand to ghostdog for detecting AMD GPUs via PCI
  • AMD GPU driver validation: Added match-driver amd rocm support to ghostdog to validate that the amdgpu kernel module and ROCm KFD driver are properly loaded before starting the device plugin
  • Node readiness logic: The device plugin service now validates both GPU presence and driver availability, preventing the node from joining the cluster if drivers are not loaded correctly

The implementation supports AMD Instinct MI355X GPUs (device ID 75a3).
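
The PCI-based detection described above can be sketched roughly as follows. This is a minimal illustration and not ghostdog's actual implementation: it assumes the standard Linux sysfs layout, where each entry under /sys/bus/pci/devices has `vendor` and `device` files containing hex IDs such as `0x1002`. The names `parse_hex_id`, `is_supported_amd_gpu`, and `amd_gpu_present` are hypothetical.

```rust
use std::fs;
use std::path::Path;

/// PCI vendor ID for AMD/ATI, as used in `lspci -d 1002:75a3`.
const AMD_VENDOR_ID: u16 = 0x1002;
/// Device IDs treated as supported; only the MI355X ID named in this PR.
const SUPPORTED_DEVICE_IDS: &[u16] = &[0x75a3];

/// Parse a sysfs hex ID such as "0x1002\n" into a number.
fn parse_hex_id(raw: &str) -> Option<u16> {
    let trimmed = raw.trim();
    let digits = trimmed.strip_prefix("0x").unwrap_or(trimmed);
    u16::from_str_radix(digits, 16).ok()
}

/// Hypothetical helper: true if the vendor/device pair is a supported AMD GPU.
fn is_supported_amd_gpu(vendor: &str, device: &str) -> bool {
    match (parse_hex_id(vendor), parse_hex_id(device)) {
        (Some(v), Some(d)) => v == AMD_VENDOR_ID && SUPPORTED_DEVICE_IDS.contains(&d),
        _ => false,
    }
}

/// Hypothetical helper: scan a sysfs-style PCI device directory and report
/// whether at least one supported GPU is present. An unreadable directory
/// counts as "no GPU found".
fn amd_gpu_present(pci_devices_dir: &Path) -> bool {
    let Ok(entries) = fs::read_dir(pci_devices_dir) else {
        return false;
    };
    entries.flatten().any(|entry| {
        let dir = entry.path();
        let vendor = fs::read_to_string(dir.join("vendor")).unwrap_or_default();
        let device = fs::read_to_string(dir.join("device")).unwrap_or_default();
        is_supported_amd_gpu(&vendor, &device)
    })
}
```

A caller would pass `Path::new("/sys/bus/pci/devices")` and map the boolean to an exit status, which is what lets a command of this kind be chained with `&&` or used in `ExecStartPre=`. Taking the directory as a parameter keeps the scan testable against a fake directory tree.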

Testing done:

Launched an AMD MI355X node and verified that the device plugin detects the GPUs correctly.

  • lspci output
  bash-5.1# lspci -d 1002:75a3
  51:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
  52:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
  62:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
  63:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
  73:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
  74:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
  84:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
  85:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
  • Driver validation passing
  bash-5.1# ghostdog amd-gpu-present && echo "GPU detected"
  GPU detected
  bash-5.1# ghostdog match-driver amd rocm && echo "Driver loaded"
  Driver loaded
  • amd-k8s-device-plugin service running on AMD MI355X instance
  ● amd-k8s-device-plugin.service - Start AMD kubernetes device plugin
       Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/amd-k8s-device-plugin.service; enabled; preset:
  enabled)
      Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
               └─00-aws-config.conf
       Active: active (running) since Thu 2025-11-13 09:46:39 UTC; 1 day 8h ago
   Invocation: 3a02d24942e14ec0aaa74440e9ed9e11
      Process: 47305 ExecStartPre=/usr/bin/sleep 0.1 (code=exited, status=0/SUCCESS)
      Process: 47346 ExecStartPre=/usr/bin/test -S /var/lib/kubelet/device-plugins/kubelet.sock (code=exited, status=0/SUCCESS)
     Main PID: 47349 (amd-device-plug)
        Tasks: 14 (limit: 629145)
       Memory: 26.8M (peak: 30.7M)
          CPU: 3.110s
       CGroup: /system.slice/amd-k8s-device-plugin.service
               └─47349 /usr/bin/amd-device-plugin -logtostderr=true -stderrthreshold=INFO -v=5

  Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440712   47349 amdgpu.go:261]
  Devices map: map[0000:51:00.0:map[card:0 computePartitionType:spx devID:12832063173580113071 memoryPartitionType:nps1 nodeId:2 numaNode:0
   renderD:128]0000:52:00.0:map[card:8 computePartitionType:spx devID:10305807447086475360 memoryPartitionType:nps1 nodeId:3 numaNode:0
  renderD:136] 0000:62:00.0:map[card:16 computePartitionType:spx devID:610766110054045314 memoryPartitionType:nps1 nodeId:4 numaNode:0
  renderD:144] 0000:63:00.0:map[card:24 computePartitionType:spx devID:12108960588440933464 memoryPartitionType:nps1 nodeId:5 numaNode:0
  renderD:152] 0000:73:00.0:map[card:32 computePartitionType:spx devID:9006080295669176546 memoryPartitionType:nps1 nodeId:6 numaNode:1
  renderD:160] 0000:74:00.0:map[card:40 computePartitionType:spx devID:18073956319122529846 memoryPartitionType:nps1 nodeId:7 numaNode:1
  renderD:168] 0000:84:00.0:map[card:48 computePartitionType:spx devID:7569327791852919328 memoryPartitionType:nps1 nodeId:8 numaNode:1
  renderD:176] 0000:85:00.0:map[card:56 computePartitionType:spx devID:10504069062564729379 memoryPartitionType:nps1 nodeId:9 numaNode:1
  renderD:184]]
  Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440736   47349 amdgpu.go:278]
  Partition counts: map[spx_nps1:8]
  Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440740   47349 plugin.go:254]
  Watching GPU with bus ID: 0000:74:00.0 NUMA Node: [1]
  Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440746   47349 plugin.go:254]
  Watching GPU with bus ID: 0000:84:00.0 NUMA Node: [1]
  Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440750   47349 plugin.go:254]
  Watching GPU with bus ID: 0000:85:00.0 NUMA Node: [1]
  Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440752   47349 plugin.go:254]
  Watching GPU with bus ID: 0000:51:00.0 NUMA Node: [0]
  Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440756   47349 plugin.go:254]
  Watching GPU with bus ID: 0000:52:00.0 NUMA Node: [0]
  Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440758   47349 plugin.go:254]
  Watching GPU with bus ID: 0000:62:00.0 NUMA Node: [0]
  Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440761   47349 plugin.go:254]
  Watching GPU with bus ID: 0000:63:00.0 NUMA Node: [0]
  Nov 13 09:46:42 ip-172-31-54-189.us-west-2.compute.internal amd-device-plugin[47349]: I1113 09:46:39.440764   47349 plugin.go:254]
  Watching GPU with bus ID: 0000:73:00.0 NUMA Node: [1]
  • GPU capacity advertised for AMD MI355X instance
  ➜ kubectl get nodes -o json | jq '.items[].status.capacity'
  {
    "amd.com/gpu": "8",
    "cpu": "192",
    "ephemeral-storage": "15471392Ki",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "4206782968Ki",
    "pods": "110"
  }
  {
    "cpu": "2",
    "ephemeral-storage": "81854Mi",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "8003780Ki",
    "pods": "29"
  }

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

%global gover 1.31.0.8
%global rpmver %{gover}

Name: %{_cross_os}amd-k8s-device-plugin
Contributor

Let's keep this consistent with the upstream naming

Suggested change
- Name: %{_cross_os}amd-k8s-device-plugin
+ Name: %{_cross_os}rocm-k8s-device-plugin

Contributor Author

Makes sense. Will update.

releases-url = "https://github.com/ROCm/k8s-device-plugin/releases"

[[package.metadata.build-package.external-files]]
url = "https://github.com/ROCm/k8s-device-plugin/archive/v1.31.0.8.tar.gz"
Contributor

This is marked pre-release. Is there a "latest" release, or are all of them just marked pre-release?

Contributor Author

All releases of the ROCm k8s-device-plugin (https://github.com/ROCm/k8s-device-plugin/releases?page=1) are marked as pre-release on their GitHub. v1.31.0.8 is the latest available release.

releases-url = "https://dri.freedesktop.org/libdrm/"

[[package.metadata.build-package.external-files]]
url = "https://dri.freedesktop.org/libdrm/libdrm-2.4.123.tar.xz"
Contributor

Looks like latest is 2.4.128

Contributor Author

Will update.


%files
%{_cross_attribution_file}
%{_cross_libdir}/*.so.*
Contributor

Can we list these out?


[Service]
# Verify AMD GPU is detected
ExecStartPre=/usr/bin/ghostdog amd-gpu-present
Contributor

For the other device plugin we have, preconfigured.target is blocked if the driver is not loaded, so the kubelet never starts and neither does the device plugin. Here, you could end up with the driver failing to load, kubelet starting and becoming ready, and then this check failing, which leaves the node in a degraded state. I'd prefer we dealt with GPU detection and driver loading as a separate unit. The assumption should be that we can't reach kubelet startup without the drivers being loaded.

Contributor Author

Should I create a load-amd-kernel-modules.service in the kernel-kit (similar to how NVIDIA does it) that blocks preconfigured.target if the driver isn't loaded? And then simplify the device plugin service in core-kit to just check for the kubelet socket, since we'd be guaranteed the driver is loaded by that point?

Also, for ghostdog match-driver amd - since AMD only has one driver type, should I just remove that entirely and use systemd's ConditionPathExists instead?

Contributor Author

Discussed offline with @yeazelm - here's the plan:

Following the Neuron pattern, I'll create:

  1. [email protected] - adds conditions to check AMD GPU presence
  2. load-amd-modules.service - loads AMD kernel modules, RequiredBy=drivers.target

These will be added to the kernel package (similar to how Neuron does it in kernel-6.12.spec), so driver loading happens before drivers.target completes. This ensures kubelet never starts if the AMD driver fails to load.

Will also simplify the device plugin service in core-kit to remove driver checks since they'll be guaranteed to be loaded by that point.

}
}

fn match_amd_driver(driver_flavor: &str) -> Result<()> {
Contributor

I don't think we should mix semantics here: match-driver is for matching a chosen driver with the hardware present, but in this case you are having it check whether the driver is loaded. Would it not be easier to just use a path check in systemd if we want /sys/class/kfd to exist before proceeding? I don't think we want ghostdog match-driver concerning itself with an already loaded driver; it's about providing a boolean answer to whether that driver choice makes sense for the hardware present.
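
One way to picture the distinction drawn in this comment, as a hedged sketch rather than ghostdog's code: a "match" function that only answers whether a driver choice fits the hardware present, and a separate "loaded" check that is nothing more than a path-existence test. The function names, the `PciId` alias, and the `class/kfd` location under a sysfs root are assumptions made for illustration.

```rust
use std::path::Path;

/// PCI (vendor, device) pair, e.g. (0x1002, 0x75a3) for the MI355X.
type PciId = (u16, u16);

/// "Match" semantics: does the chosen driver make sense for this hardware?
/// It inspects only the hardware inventory, never loaded-module state.
fn driver_matches_hardware(driver: &str, present: &[PciId]) -> bool {
    match driver {
        "amd" => present
            .iter()
            .any(|&(vendor, device)| vendor == 0x1002 && device == 0x75a3),
        // Unknown driver choices never match in this sketch.
        _ => false,
    }
}

/// "Loaded" semantics, kept separate: a plain path-existence check.
/// The `class/kfd` location is an assumption taken from the comment above.
fn driver_loaded(sysfs_root: &Path) -> bool {
    sysfs_root.join("class/kfd").exists()
}
```

The second function is deliberately trivial: it is the kind of check the comment proposes leaving to a systemd path condition, so that match-driver stays a pure question about hardware and driver choice.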

mgsharm force-pushed the amd-device-plugin branch 3 times, most recently from cbc0cb5 to 432bae0, on November 17, 2025 07:48
Add amd-gpu-present subcommand to ghostdog to detect
AMD Instinct MI355X GPUs (device ID 75a3) via PCI device scanning.

Signed-off-by: Gaurav Sharma <[email protected]>