smarttechlabs-projects/strix-halo-amdgpu-blacklisted

ROCm Boot Fix for AMD Strix Halo

A diagnostic and repair script that removes the amdgpu kernel module blacklist that prevents ROCm from initializing on AMD Strix Halo APUs (the Ryzen AI Max and Max+ 300 series). Tested on a system with a Ryzen AI Max+ 395.

About the Ryzen AI Max 300 Series

The AMD Ryzen AI Max 300 series (codenamed "Strix Halo") is AMD's high-performance mobile APU lineup, combining CPU, GPU, and NPU in a single package with a unified memory architecture. The Ryzen AI Max+ 395 is the flagship configuration:

| Component | Specification |
| --- | --- |
| CPU | 16 Zen 5 cores (32 threads) |
| iGPU | Radeon 8060S - RDNA 3.5 architecture, 40 CUs, gfx1151 target |
| NPU | XDNA 2 architecture, 50 TOPS AI performance |
| Memory | Unified memory architecture - CPU, GPU, and NPU share up to 128 GB of system RAM |

The unified memory architecture is particularly significant for AI/ML workloads, as the iGPU can access the full system memory pool without PCIe bandwidth limitations. This makes ROCm functionality essential for leveraging the GPU compute capabilities for frameworks like PyTorch, TensorFlow, and ONNX Runtime.

Problem Statement

After installing ROCm on systems with AMD Strix Halo integrated graphics, users may encounter a situation where:

  • rocminfo returns no agents or fails entirely
  • /dev/kfd (Kernel Fusion Driver) device node is missing
  • /dev/dri/renderD* nodes are absent
  • GPU compute workloads fail to initialize

Root Cause: The amdgpu kernel module gets blacklisted, preventing the GPU driver from loading at boot.


Understanding Kernel Module Blacklisting

What is Module Blacklisting?

Linux kernel modules are pieces of code that can be loaded into the kernel on demand to extend functionality (drivers, filesystems, etc.). Blacklisting prevents a module from loading automatically, even if the hardware it supports is present.

Blacklist configurations are stored in /etc/modprobe.d/ as .conf files with entries like:

blacklist amdgpu

Why Blacklisting is Used

Blacklisting is a legitimate system administration technique for enforcing a specific hardware configuration. NVIDIA and CUDA installation paths can blacklist competing GPU drivers (amdgpu, radeon, and nouveau) to create a predictable, single-vendor GPU environment. This avoids driver conflicts and simplifies debugging, but it becomes a problem on systems where you actually want to use the AMD GPU.

A parallel example is nouveau, the open-source NVIDIA driver: NVIDIA's proprietary installer blacklists it so that it cannot claim the GPU before the proprietary driver does. This is standard practice and works well on dedicated NVIDIA systems, but it causes problems on hybrid setups or when switching GPU vendors.

Why Does amdgpu Get Blacklisted?

Several scenarios can cause the amdgpu module to be blacklisted:

| Cause | Description |
| --- | --- |
| NVIDIA Driver Installation | Proprietary NVIDIA drivers (via apt, runfiles, or CUDA installers) often blacklist competing GPU drivers to prevent conflicts |
| Legacy Driver Conflicts | Systems with both integrated AMD graphics and discrete GPUs may have conflicting driver requirements |
| Ubuntu Pro/Livepatch | Some enterprise configurations blacklist modules for stability |
| Manual Intervention | Previous troubleshooting attempts may have added blacklist entries |
| Installer Bugs | Some ROCm or driver installer versions incorrectly create blacklist files |
| initramfs Persistence | Even after removing blacklist files, old configurations persist in the initial RAM filesystem |

Blacklist File Locations

/etc/modprobe.d/                    # Primary configuration directory
├── blacklist.conf                  # General blacklist (check for amdgpu entries)
├── blacklist-amdgpu.conf           # Dedicated amdgpu blacklist (if exists)
├── nvidia-installer-*.conf         # NVIDIA installer generated
├── nvidia-graphics-drivers.conf    # Ubuntu NVIDIA package
└── *.conf                          # Any file can contain blacklist directives
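
Configuration files are not the only place a blacklist can come from: the kernel command line can carry one too, via modprobe.blacklist= (read by modprobe) or module_blacklist= (read by the kernel). The following is a minimal sketch for spotting either form; the function name cmdline_blacklists is hypothetical and not part of the script.

```shell
# Print every kernel command-line token that blacklists the given module.
# The command line is read from stdin so the function works on any text.
cmdline_blacklists() {
    local module="$1" token list
    # Put each whitespace-separated token on its own line.
    tr -s ' \t' '\n' | while IFS= read -r token || [ -n "$token" ]; do
        case "$token" in
            modprobe.blacklist=*|module_blacklist=*)
                # The value is a comma-separated list of module names.
                list=",${token#*=},"
                case "$list" in
                    *",$module,"*) printf '%s\n' "$token" ;;
                esac
                ;;
        esac
    done
}

# Usage on a live system:
#   cmdline_blacklists amdgpu < /proc/cmdline
```

If this prints anything, removing files from /etc/modprobe.d/ will not be enough; the bootloader configuration would need to change as well.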

Diagnosing the Issue

Step 1: Check if amdgpu Module is Loaded

lsmod | grep amdgpu

Expected output (working system):

amdgpu              12345678  0
drm_ttm_helper         1234  1 amdgpu
ttm                   56789  1 amdgpu
drm_exec               1234  1 amdgpu
gpu_sched             12345  1 amdgpu
drm_buddy              1234  1 amdgpu
drm_display_helper    12345  1 amdgpu
i2c_algo_bit           1234  1 amdgpu

If empty: The module is not loaded.
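
For use inside a script, an exact match on the module name is less ambiguous than grepping lsmod output, which also matches modules that merely list amdgpu as a user. A minimal sketch, assuming the standard /proc/modules layout (module name in the first field); module_loaded and its optional second argument are hypothetical helpers added for illustration.

```shell
# Succeed if the named module appears in a /proc/modules-style listing.
# The optional second argument overrides the file to read (useful for testing).
module_loaded() {
    local module="$1" modules_file="${2:-/proc/modules}"
    [ -r "$modules_file" ] || return 1
    # The module name is the first whitespace-separated field on each line.
    awk -v m="$module" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}

# Usage:
#   if module_loaded amdgpu; then echo "amdgpu is loaded"; else echo "amdgpu is NOT loaded"; fi
```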

Step 2: Check for Blacklist Entries

# Search all modprobe config files for amdgpu references
grep -r "amdgpu" /etc/modprobe.d/

# Check specifically for blacklist directives
grep -r "blacklist.*amdgpu" /etc/modprobe.d/

Problem indicator:

/etc/modprobe.d/blacklist-amdgpu.conf:blacklist amdgpu
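
The grep above only covers /etc/modprobe.d/ and also matches commented-out lines. The sketch below widens the search to the other directories listed in modprobe.d(5), skips comments, and additionally catches the `install amdgpu /bin/false` idiom, which blocks a module without using the blacklist keyword. The function name find_module_blocks is hypothetical, and the default directory list is an assumption that may need adjusting for your distribution.

```shell
# List active (non-comment) directives that stop a module from loading.
# Arguments: module name, then optionally the directories to scan.
find_module_blocks() {
    local module="$1" dir file
    shift
    # Default search path per modprobe.d(5) when no directories are given.
    # On merged-/usr systems /lib and /usr/lib are the same directory,
    # so a hit there may be reported twice.
    [ "$#" -gt 0 ] || set -- /etc/modprobe.d /run/modprobe.d \
        /usr/local/lib/modprobe.d /usr/lib/modprobe.d /lib/modprobe.d
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        for file in "$dir"/*.conf; do
            [ -f "$file" ] || continue
            # Match "blacklist <module>" or "install <module> /bin/false|true"
            # at the start of a line, ignoring leading whitespace.
            grep -HnE "^[[:space:]]*(blacklist[[:space:]]+${module}([[:space:]]|\$)|install[[:space:]]+${module}[[:space:]]+/bin/(false|true))" "$file"
        done
    done
    return 0
}

# Usage:
#   find_module_blocks amdgpu
```

Each hit is printed as file:line:directive, which makes it clear which file has to be edited or removed.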

Step 3: Verify Device Nodes Exist

# ROCm compute device (Kernel Fusion Driver)
ls -la /dev/kfd

# GPU render nodes
ls -la /dev/dri/render*

Missing nodes indicate the driver is not loaded.
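
The same check can be scripted so that it reports each node explicitly instead of relying on ls error output. This is a minimal sketch; node_status and check_gpu_nodes are hypothetical names, not functions from fix_rocm_boot.sh.

```shell
# Report whether a path exists as a character device node.
node_status() {
    if [ -c "$1" ]; then
        echo "present: $1"
    else
        echo "missing: $1"
    fi
}

# Check the KFD node plus every render node (the glob may match nothing).
check_gpu_nodes() {
    local node found=0
    node_status /dev/kfd
    for node in /dev/dri/renderD*; do
        [ -e "$node" ] || continue
        node_status "$node"
        found=1
    done
    [ "$found" -eq 1 ] || echo "missing: /dev/dri/renderD*"
}

# Usage:
#   check_gpu_nodes
```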

Step 4: Check Kernel Messages

# View amdgpu-related kernel messages
sudo dmesg | grep -i amdgpu

# Check for module loading errors
sudo dmesg | grep -i "module.*blacklist\|amdgpu.*error"

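Step 5: Verify Module Would Load (Dry Run)

# Show what modprobe would do, without loading anything
modprobe --dry-run --verbose amdgpu
# Loading by name ignores blacklist entries; --use-blacklist applies them
modprobe --dry-run --verbose --use-blacklist amdgpu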

The Fix: What This Script Does

The fix_rocm_boot.sh script performs these operations:

Phase 1: Remove Blacklist (Lines 26-32)

rm /etc/modprobe.d/blacklist-amdgpu.conf

Deletes the blacklist configuration file that prevents amdgpu from loading.

Phase 2: Configure Driver Options (Lines 35-46)

Creates /etc/modprobe.d/amdgpu.conf with settings suited to Strix Halo:

| Option | Value | Purpose |
| --- | --- | --- |
| dc | 1 | Enable Display Core - modern display engine for HDMI/DP |
| dpm | 1 | Enable Dynamic Power Management - power states and thermal control |
| si_support | 0 | Disable Southern Islands (GCN 1.0) - not needed for RDNA 3.5 |
| cik_support | 0 | Disable Sea Islands (GCN 2.0) - not needed for RDNA 3.5 |

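The two legacy options only control whether amdgpu claims Southern Islands and Sea Islands GPUs. Strix Halo is neither, so disabling them costs nothing and rules out a driver conflict over such hardware.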
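
The options from this phase map onto a single `options` line in modprobe.d syntax. The sketch below shows one way to write such a file; it is an illustration of the format, not a copy of what fix_rocm_boot.sh writes, and write_amdgpu_options is a hypothetical helper name. Note that si_support and cik_support may not exist as parameters on kernels built without SI/CIK support in amdgpu.

```shell
# Write an amdgpu options file. The target path is a parameter so the
# function can be tried against a scratch file before touching /etc.
write_amdgpu_options() {
    local target="$1"
    cat > "$target" <<'EOF'
# amdgpu driver options (sketch; the script's exact file may differ)
options amdgpu dc=1 dpm=1 si_support=0 cik_support=0
EOF
}

# Usage (as root):
#   write_amdgpu_options /etc/modprobe.d/amdgpu.conf
# Once the module is loaded, the effective values can be read back, e.g.:
#   cat /sys/module/amdgpu/parameters/dc
```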

Phase 3: Enable Boot Loading (Lines 49-54)

Creates /etc/modules-load.d/amdgpu.conf:

amdgpu

This ensures the module loads early in the boot process, before display managers or user services start.

Phase 4: Update initramfs (Lines 57-59)

update-initramfs -u

Critical step: The initial RAM filesystem (initramfs) is a temporary root filesystem loaded at boot. It contains:

  • Essential kernel modules
  • Module configuration (including blacklists)
  • Early boot scripts

Without updating initramfs, the old blacklist configuration remains embedded and continues to prevent module loading, even after the source file is deleted.
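
To see whether a stale blacklist is still embedded, the image's file listing can be filtered for modprobe configuration files. A minimal sketch, assuming a Debian/Ubuntu system where lsinitramfs (from initramfs-tools) is available; embedded_modprobe_confs is a hypothetical helper that only filters a listing supplied on stdin.

```shell
# Filter an initramfs file listing (one path per line on stdin) down to
# the modprobe configuration files embedded in the image.
embedded_modprobe_confs() {
    grep -E '(^|/)modprobe\.d/[^/]+\.conf$' || true
}

# Usage on Debian/Ubuntu:
#   lsinitramfs "/boot/initrd.img-$(uname -r)" | embedded_modprobe_confs
```

The listing shows file names only. If a suspicious file such as blacklist-amdgpu.conf appears, unpacking the image (for example with unmkinitramfs) would be needed to read its contents.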

Phase 5: Load Module Immediately (Lines 62-74)

modprobe amdgpu

Loads the driver right away so the fix can be tested without a reboot.

Phases 6-7: Verification (Lines 77-108)

Validates the fix by checking:

  • /dev/kfd existence (ROCm compute support)
  • /dev/dri/render* nodes (GPU access)
  • Module loaded in lsmod
  • rocminfo output (ROCm stack verification)

Usage

Prerequisites

  • Ubuntu 22.04/24.04 or compatible distribution
  • ROCm installed (rocminfo in PATH)
  • Root/sudo access

Running the Script

# Make executable
chmod +x fix_rocm_boot.sh

# Run with root privileges
sudo ./fix_rocm_boot.sh

Expected Output

=== ROCm Boot Fix Script ===

[INFO] Step 1: Removing amdgpu blacklist...
[INFO] Removed /etc/modprobe.d/blacklist-amdgpu.conf
[INFO] Step 2: Creating amdgpu driver options...
[INFO] Created /etc/modprobe.d/amdgpu.conf
[INFO] Step 3: Configuring amdgpu to load at boot...
[INFO] Created /etc/modules-load.d/amdgpu.conf
[INFO] Step 4: Updating initramfs (this may take a moment)...
[INFO] Initramfs updated
[INFO] Step 5: Loading amdgpu module now...
[INFO] amdgpu module loaded successfully
[INFO] Step 6: Verifying GPU devices...

[INFO] /dev/kfd exists - ROCm compute support available
crw-rw---- 1 root render 234, 0 Dec 10 12:00 /dev/kfd

[INFO] Render nodes found:
crw-rw----+ 1 root render 226, 128 Dec 10 12:00 /dev/dri/renderD128

[INFO] Loaded amdgpu modules:
amdgpu              15728640  0

[INFO] Step 7: Testing ROCm...

ROCm Runtime Version: 6.x.x
...

=== Summary ===
[INFO] Configuration changes applied:
  - Removed: /etc/modprobe.d/blacklist-amdgpu.conf
  - Created: /etc/modprobe.d/amdgpu.conf
  - Created: /etc/modules-load.d/amdgpu.conf
  - Updated: initramfs

[INFO] ROCm should now work! Test with: rocminfo
[INFO] Done!

Post-Installation Verification

Verify ROCm Stack

# List ROCm agents (should show your GPU)
rocminfo

# Check OpenCL devices
clinfo

# For PyTorch users
python3 -c "import torch; print(torch.cuda.is_available())"  # Uses HIP backend

If you are running Ubuntu with a desktop environment, another good sign is that you can now change the display resolution, which would not be possible without the driver in place.

Verify Persistent Configuration

After reboot:

# Confirm module loads at boot
lsmod | grep amdgpu

# Confirm devices exist
ls /dev/kfd /dev/dri/render*

# Confirm no blacklist remains
grep -r "blacklist.*amdgpu" /etc/modprobe.d/

Troubleshooting

Issue: /dev/kfd Still Missing After Reboot

  1. Check kernel support:

    grep CONFIG_HSA_AMD /boot/config-$(uname -r)
    # Should show: CONFIG_HSA_AMD=y or =m
  2. Verify user permissions:

    # Add user to render and video groups
    sudo usermod -aG render,video "$USER"
    # Log out and back in
  3. Check for conflicting drivers:

    lsmod | grep -E "radeon|nvidia"
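
Before changing group membership as in step 2, it can help to see which groups are actually missing. A small sketch; missing_gpu_groups is a hypothetical helper that takes the output of `id -nG` as a single argument.

```shell
# Print which of the groups "render" and "video" are absent from a
# space-separated group list (for example the output of `id -nG`).
missing_gpu_groups() {
    local have=" $1 " group missing=""
    for group in render video; do
        case "$have" in
            *" $group "*) ;;
            *) missing="$missing $group" ;;
        esac
    done
    # Trim the leading space before printing.
    printf '%s\n' "${missing# }"
}

# Usage:
#   missing_gpu_groups "$(id -nG)"
```

An empty result means the current user already belongs to both groups. Group membership affects access to the device nodes, not whether they exist, so this matters mostly once /dev/kfd is present but unusable.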

Issue: Module Loads but rocminfo Fails

  1. Check HSA status:

    cat /sys/class/kfd/kfd/topology/nodes/*/properties
  2. Verify ROCm installation:

    apt list --installed | grep rocm
    dpkg -l | grep amdgpu-dkms
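
The topology properties from step 1 are long, and the field most relevant here is gfx_target_version. The sketch below extracts it and decodes it into a gfx target name, assuming the encoding major * 10000 + minor * 100 + stepping (under that assumption a value of 110501 decodes to gfx1151). Both function names are hypothetical, and the field may be absent on older kernels.

```shell
# Decode a KFD gfx_target_version value into a gfx target name.
# Assumed encoding: major * 10000 + minor * 100 + stepping.
decode_gfx_target() {
    local v="$1"
    printf 'gfx%d%x%x\n' "$((v / 10000))" "$(((v / 100) % 100))" "$((v % 100))"
}

# Print the decoded target for every GPU node in a topology directory.
# The directory argument defaults to the live sysfs path.
list_gfx_targets() {
    local root="${1:-/sys/class/kfd/kfd/topology/nodes}" props v
    for props in "$root"/*/properties; do
        [ -r "$props" ] || continue
        v=$(awk '$1 == "gfx_target_version" { print $2 }' "$props")
        # CPU-only nodes report 0 (or omit the field); skip them.
        [ -n "$v" ] && [ "$v" -gt 0 ] && decode_gfx_target "$v"
    done
    return 0
}

# Usage:
#   list_gfx_targets
```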

Issue: Display Issues After Running Script

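The script enables Display Core (dc=1). If you experience display problems, reload the driver with DC disabled. Unloading amdgpu typically fails while a graphical session is using the GPU, so you may need to stop the display manager first:

# Temporarily disable DC for debugging
sudo modprobe -r amdgpu
sudo modprobe amdgpu dc=0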

Files Modified

| File | Action | Purpose |
| --- | --- | --- |
| /etc/modprobe.d/blacklist-amdgpu.conf | Removed | Eliminate blacklist preventing driver load |
| /etc/modprobe.d/amdgpu.conf | Created | Set driver options for Strix Halo |
| /etc/modules-load.d/amdgpu.conf | Created | Ensure module loads at boot |
| /boot/initrd.img-* | Updated | Embed new configuration in boot image |

Technical Background

AMD GPU Driver Architecture

┌─────────────────────────────────────────────────────────────┐
│                      User Space                             │
├─────────────────────────────────────────────────────────────┤
│  ROCm Runtime  │  OpenCL  │  HIP  │  PyTorch/TensorFlow     │
├─────────────────────────────────────────────────────────────┤
│                    libdrm / libhsakmt                       │
├─────────────────────────────────────────────────────────────┤
│                      Kernel Space                           │
├─────────────────────────────────────────────────────────────┤
│                    amdgpu.ko (DRM driver)                   │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────┐ │
│  │    DC    │  │   DPM    │  │   KFD    │  │  GPU Sched   │ │
│  │ Display  │  │  Power   │  │ Compute  │  │   Workload   │ │
│  │  Core    │  │   Mgmt   │  │  Driver  │  │   Manager    │ │
│  └──────────┘  └──────────┘  └──────────┘  └──────────────┘ │
├─────────────────────────────────────────────────────────────┤
│                        Hardware                             │
│            AMD Strix Halo (RDNA 3.5 iGPU)                   │
└─────────────────────────────────────────────────────────────┘

Key Components

  • amdgpu.ko: Unified kernel driver for all modern AMD GPUs (GCN, RDNA)
  • KFD (Kernel Fusion Driver): HSA-compatible compute interface, exposes /dev/kfd
  • DC (Display Core): Modern display engine for HDMI 2.1, DP, eDP
  • DPM (Dynamic Power Management): Power states, clocking, thermal management

Strix Halo Specifics

AMD Strix Halo (Ryzen AI Max 300 series), in its flagship configuration, features:

  • RDNA 3.5 integrated graphics (Radeon 8060S)
  • 40 Compute Units
  • ROCm support via gfx1151 target
  • Requires amdgpu driver (not legacy radeon)
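
To confirm that ROCm reports the expected target, the gfx names can be pulled out of rocminfo output. A minimal sketch; gfx_targets is a hypothetical helper that filters whatever text it receives on stdin, so it makes no assumption about rocminfo's exact layout beyond the target name appearing in it.

```shell
# Extract the unique gfx target names from rocminfo-style output on stdin.
gfx_targets() {
    grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Usage:
#   rocminfo | gfx_targets
```

On the system described here the result should include gfx1151; an empty result suggests ROCm found no GPU agent.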

License

MIT License - See LICENSE for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request

Author

Juergen Fey - SmartTechlabs.de. 12-2025. Created to resolve ROCm initialization issues on AMD Strix Halo systems.

About

Even on a system with an AMD GPU, a blacklist entry can prevent the GPU driver from loading at boot. This script fixes that.
