[Issue]: CI nightly hip programs fail to compile via pip install on RHEL 8.10 baremetal. #4271

@pbhandar-amd

Problem Description

You cannot compile any HIP programs if you install the latest nightly on RHEL 8.10 via the pip install method. Here is the error you will get:

clang++: error: unable to execute command: Segmentation fault
clang++: error: amdgcn-link command failed due to signal (use -v to see invocation)
AMD clang version 22.0.0git (https://github.com/ROCm/llvm-project.git 4adeabb0862ea8119c143d9f8256475b0b687217+PATCHED:f3b5643f91ad4def7b92cd48247bc11f1f39fb5c)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/rocm/.venv/lib/python3.11/site-packages/_rocm_sdk_core/lib/llvm/bin
clang++: error: unable to execute command: Segmentation fault
clang++: note: diagnostic msg: Error generating preprocessed source(s).
gmake[2]: *** [CMakeFiles/hip_hello_world.dir/build.make:75: CMakeFiles/hip_hello_world.dir/main.hip.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/hip_hello_world.dir/all] Error 2
gmake: *** [Makefile:136: all] Error 2

I initially filed this as ticket ROCM-20033 and assigned it to the compiler team. Sam from the compiler team debugged the issue, and here is what he said:

We reproduced the issue on the RHEL 8.10 machine. The crash happens because the pip-packaged clang-offload-bundler binary has a non-standard ELF layout: its base address is at 0x3ff000 instead of the normal 0x400000, and its first LOAD segment is read-write instead of read-only. This is caused by patchelf modifying the binary RPATH during pip package creation in the TheRock build system.

The RHEL 8.10 kernel (4.18) has a bug in its ELF loader that cannot handle this layout, so execve fails with EEXIST and the process crashes with SIGSEGV. Newer kernels (5.x+) handle it fine, which is why it works in Docker on an Ubuntu host but not on a RHEL 8.10 host.

As a workaround, users can invoke the binary through the dynamic linker directly: /lib64/ld-linux-x86-64.so.2 /clang-offload-bundler. To fix this properly, the TheRock packaging should either pre-allocate space in the ELF dynamic section so patchelf does not need to insert a new segment, or ship wrapper scripts for older kernels.

Hi Parag, I took a look at this and it seems like the issue is coming from TheRock's pip packaging scripts rather than the compiler itself. The patchelf post-processing and the exe stub generator were written by Stella Laurenzo, so it might be worth reaching out to her to see how best to get this ticket reassigned to the TheRock team.
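For anyone who wants to check the ELF layout Sam describes on their own install, here is a rough sketch. It assumes binutils' readelf is available. The helper name first_load and the BUNDLER variable are made up for illustration, and the directory in the usage comment is taken from the InstalledDir line in the error output above, so it may differ on other setups.

```shell
# first_load: read `readelf -lW` output on stdin and print the virtual
# address and permission flags of the first LOAD program header.
# (Hypothetical helper, not part of ROCm or TheRock.)
first_load() {
    awk '$1 == "LOAD" {
        # Wide-format columns: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align.
        # Flags such as "R E" split into several fields, so join fields 7..NF-1.
        flags = ""
        for (i = 7; i < NF; i++) flags = flags $i
        print $3, flags
        exit
    }'
}

# Usage (assumed path, adjust to your venv):
#   BUNDLER=/home/rocm/.venv/lib/python3.11/site-packages/_rocm_sdk_core/lib/llvm/bin/clang-offload-bundler
#   readelf -lW "$BUNDLER" | first_load
```

According to the analysis above, the affected binary should report an address of 0x3ff000 with read-write flags, while an unpatched executable would typically start at 0x400000 and be read-only.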

Operating System

RHEL 8.10

CPU

12th Gen Intel(R) Core(TM) i7-12700K

GPU

Navi31 XTX

ROCm Version

ROCm 7.13 (Nightly)

ROCm Component

No response

Steps to Reproduce

  1. Install RHEL 8.10 on a baremetal system. Running RHEL 8.10 in a Docker container does not reproduce the error.
  2. Install amdgpu version 30.30, which was the latest amdgpu release when this issue was created.
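Since a container shares the host's kernel, what matters for step 1 is the kernel of the machine itself, not the distribution inside the container. A quick way to tell whether a machine is likely to reproduce the problem is to look at the kernel major version. This is only a sketch: the function names are made up, and the cutoff of 5 comes from the "5.x+" remark in the compiler team's analysis, not from a verified kernel commit.

```shell
# kernel_major RELEASE: print the major version of a kernel release string,
# e.g. "4.18.0-553.el8_10.x86_64" -> "4". (Hypothetical helper.)
kernel_major() {
    # Drop everything from the first dot onward.
    printf '%s\n' "${1%%.*}"
}

# is_affected_kernel RELEASE: succeed if the major version is below 5.
# The threshold is an assumption based on the report that 5.x+ kernels work.
is_affected_kernel() {
    [ "$(kernel_major "$1")" -lt 5 ]
}

# Usage:
#   if is_affected_kernel "$(uname -r)"; then
#       echo "kernel $(uname -r) may hit the ELF loader problem"
#   fi
```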

Steps to install amdgpu version 30.30

# Install prerequisites
sudo dnf update --releasever=8.10 --exclude=\*release\*
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
sudo rpm -ivh epel-release-latest-8.noarch.rpm
sudo dnf config-manager --enable codeready-builder-for-rhel-8-x86_64-rpms
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
# Restart machine

# Install amdgpu 30.30
sudo dnf install https://repo.radeon.com/amdgpu-install/7.2/rhel/8/amdgpu-install-7.2.70200-1.el8.noarch.rpm
sudo dnf clean all
sudo dnf install "kernel-headers-$(uname -r)" "kernel-devel-$(uname -r)"
sudo dnf install amdgpu-dkms
# Restart machine again.

Install latest ROCm nightly

sudo dnf install -y python3.11 python3.11-pip libatomic
python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/ "rocm[libraries,devel]"
rocm-sdk init

Download rocm-examples and set environment variables

sudo dnf install -y wget sudo python3 gcc-c++ git cmake glfw-devel vulkan-headers vulkan-loader-devel vulkan-validation-layers mesa-libGL-devel gcc-toolset-11 ninja-build
git clone https://github.com/ROCm/rocm-examples.git -b "release/therock-7.11"
cd rocm-examples/HIP-Basic/hello_world

export ROCM_PATH=/home/rocm/.venv/lib64/python3.11/site-packages/_rocm_sdk_devel
export LD_LIBRARY_PATH=/home/rocm/.venv/lib64/python3.11/site-packages/_rocm_sdk_devel/lib
export HIP_PLATFORM=amd
export HIP_PATH=/home/rocm/.venv/lib64/python3.11/site-packages/_rocm_sdk_devel
export HIP_CLANG_PATH=/home/rocm/.venv/lib64/python3.11/site-packages/_rocm_sdk_devel/llvm/bin
export HIP_DEVICE_LIB_PATH=/home/rocm/.venv/lib64/python3.11/site-packages/_rocm_sdk_devel/lib/llvm/amdgcn/bitcode
export PATH="$ROCM_PATH/bin:$HIP_CLANG_PATH:$PATH"

source /opt/rh/gcc-toolset-11/enable
export CXX=/opt/rh/gcc-toolset-11/root/usr/bin/g++
export CC=/opt/rh/gcc-toolset-11/root/usr/bin/gcc
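The exports above hard-code /home/rocm/.venv. If your virtual environment lives somewhere else, a sketch like the following can derive the site-packages directory from the interpreter instead. The helper name site_packages_dir is hypothetical. It uses Python's standard sysconfig module, and it assumes the _rocm_sdk_devel directory sits directly under site-packages as in the paths above.

```shell
# site_packages_dir [PYTHON]: print the platform-specific site-packages
# directory of the given interpreter (defaults to python3).
# (Hypothetical helper, not part of the rocm-sdk tooling.)
site_packages_dir() {
    "${1:-python3}" -c 'import sysconfig; print(sysconfig.get_paths()["platlib"])'
}

# Usage, inside the activated venv:
#   SITE="$(site_packages_dir python)"
#   export ROCM_PATH="$SITE/_rocm_sdk_devel"
#   export HIP_PATH="$ROCM_PATH"
```

The remaining variables can then be built from ROCM_PATH in the same way as in the listing above.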

Build the hello_world example

cmake -B build -DROCM_ROOT=$ROCM_PATH .
cmake --build build

You will run into the error above.
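To try the dynamic-linker workaround quoted in the Problem Description without changing the build, one option is to put a small wrapper script in place of the affected binary. This is an untested sketch, not something TheRock ships: the function name and the ".real" suffix are made up, and the loader path is the one given in the compiler team's comment. One caveat is that a tool started this way may see the loader as its own executable path, which could matter if it looks up files relative to itself.

```shell
# write_loader_wrapper REAL_BINARY WRAPPER_PATH [LOADER]
# Write an executable sh script at WRAPPER_PATH that starts REAL_BINARY
# through the dynamic linker. (Hypothetical helper.)
write_loader_wrapper() {
    real_bin=$1
    wrapper=$2
    loader=${3:-/lib64/ld-linux-x86-64.so.2}
    # Unquoted heredoc: $loader and $real_bin expand now, \$@ stays literal.
    cat > "$wrapper" <<EOF
#!/bin/sh
# Generated wrapper: run the real binary via the dynamic linker so the
# kernel's ELF loader never handles the patched binary directly.
exec "$loader" "$real_bin" "\$@"
EOF
    chmod +x "$wrapper"
}

# Usage (directory taken from the InstalledDir line in the error output):
#   cd /home/rocm/.venv/lib/python3.11/site-packages/_rocm_sdk_core/lib/llvm/bin
#   mv clang-offload-bundler clang-offload-bundler.real
#   write_loader_wrapper "$PWD/clang-offload-bundler.real" "$PWD/clang-offload-bundler"
```

After that, rerunning the cmake build steps above should show whether the segmentation fault goes away on the affected host.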

Let me know if you need a baremetal RHEL 8.10 system to reproduce this error. I can set one up for you.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
