[Issue]: CI nightly hip programs fail to compile via pip install on RHEL 8.10 baremetal. #4271
Description
Problem Description
If you install the latest nightly on RHEL 8.10 via the pip install method, you cannot compile any HIP programs. The build fails with the following error:
clang++: error: unable to execute command: Segmentation fault
clang++: error: amdgcn-link command failed due to signal (use -v to see invocation)
AMD clang version 22.0.0git (https://github.com/ROCm/llvm-project.git 4adeabb0862ea8119c143d9f8256475b0b687217+PATCHED:f3b5643f91ad4def7b92cd48247bc11f1f39fb5c)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/rocm/.venv/lib/python3.11/site-packages/_rocm_sdk_core/lib/llvm/bin
clang++: error: unable to execute command: Segmentation fault
clang++: note: diagnostic msg: Error generating preprocessed source(s).
gmake[2]: *** [CMakeFiles/hip_hello_world.dir/build.make:75: CMakeFiles/hip_hello_world.dir/main.hip.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/hip_hello_world.dir/all] Error 2
gmake: *** [Makefile:136: all] Error 2
I initially filed this as ticket ROCM-20033 and assigned it to the compiler team. Sam from the compiler team debugged the issue; here is what he said:
We reproduced the issue on the RHEL 8.10 machine. The crash happens because the pip-packaged clang-offload-bundler binary has a non-standard ELF layout: its base address is at 0x3ff000 instead of the normal 0x400000, and its first LOAD segment is read-write instead of read-only. This is caused by patchelf modifying the binary's RPATH during pip package creation in the TheRock build system. The RHEL 8.10 kernel (4.18) has a bug in its ELF loader and cannot handle this layout, so execve fails with EEXIST and the process crashes with SIGSEGV. Newer kernels (5.x+) handle it fine, which is why it works in Docker on an Ubuntu host but not on a RHEL 8.10 host.
As a workaround, users can invoke the binary through the dynamic linker directly: /lib64/ld-linux-x86-64.so.2 /clang-offload-bundler.
To fix this properly, the TheRock packaging should either pre-allocate space in the ELF dynamic section so patchelf does not need to insert a new segment, or ship wrapper scripts for older kernels.
Hi Parag, I took a look at this and it seems like the issue is coming from TheRock's pip packaging scripts rather than the compiler itself. The patchelf post-processing and the exe stub generator were written by Stella Laurenzo, so it might be worth reaching out to her to see how best to get this ticket reassigned to the TheRock team.
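For anyone who wants to confirm the ELF layout described in the compiler team's analysis on their own install, here is a minimal sketch that reads the first LOAD segment of a binary and prints its virtual address and permission flags. It assumes a little-endian ELF64 file. The helper names first_load_segment and describe are made up for this example and are not part of any ROCm tooling. `readelf -lW` on the binary should show the same information.

```python
import struct

PT_LOAD = 1
PF_X, PF_W, PF_R = 0x1, 0x2, 0x4


def first_load_segment(data: bytes):
    """Return (p_vaddr, p_flags) of the first PT_LOAD program header."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[4] == 2 means ELFCLASS64, e_ident[5] == 1 means little-endian.
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles little-endian ELF64")
    # Elf64_Ehdr: e_phoff at 0x20 (8 bytes), e_phentsize at 0x36, e_phnum at 0x38.
    (e_phoff,) = struct.unpack_from("<Q", data, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        # Elf64_Phdr starts with p_type (4), p_flags (4), p_offset (8), p_vaddr (8).
        p_type, p_flags, _p_offset, p_vaddr = struct.unpack_from("<IIQQ", data, off)
        if p_type == PT_LOAD:
            return p_vaddr, p_flags
    raise ValueError("no PT_LOAD segment found")


def describe(vaddr: int, flags: int) -> str:
    """Format the segment the way it is discussed in this issue."""
    perms = "".join(
        ch if flags & bit else "-"
        for ch, bit in (("R", PF_R), ("W", PF_W), ("X", PF_X))
    )
    return f"first LOAD vaddr=0x{vaddr:x} flags={perms}"
```

Usage would be something like `print(describe(*first_load_segment(open(path, "rb").read())))`, where path points at the installed clang-offload-bundler. According to the analysis above, the affected binary should report vaddr=0x3ff000 with the W flag set.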
Operating System
RHEL 8.10
CPU
12th Gen Intel(R) Core(TM) i7-12700K
GPU
Navi31 XTX
ROCm Version
ROCm 7.13 (Nightly)
ROCm Component
No response
Steps to Reproduce
- Install RHEL 8.10 on a bare-metal system. Running RHEL 8.10 in a Docker container does not reproduce the error.
- Install amdgpu version 30.30, which was the latest amdgpu release when this issue was filed.
Steps to install amdgpu version 30.30
# Install prerequisites
sudo dnf update --releasever=8.10 --exclude=\*release\*
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
sudo rpm -ivh epel-release-latest-8.noarch.rpm
sudo dnf config-manager --enable codeready-builder-for-rhel-8-x86_64-rpms
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
# Restart machine
# Install amdgpu 30.30
sudo dnf install https://repo.radeon.com/amdgpu-install/7.2/rhel/8/amdgpu-install-7.2.70200-1.el8.noarch.rpm
sudo dnf clean all
sudo dnf install "kernel-headers-$(uname -r)" "kernel-devel-$(uname -r)"
sudo dnf install amdgpu-dkms
# Restart machine again.
- Install the latest ROCm nightly
sudo dnf install -y python3.11 python3.11-pip libatomic
python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/ "rocm[libraries,devel]"
rocm-sdk init
- Download rocm-examples and set environment variables
sudo dnf install -y wget sudo python3 gcc-c++ git cmake glfw-devel vulkan-headers vulkan-loader-devel vulkan-validation-layers mesa-libGL-devel gcc-toolset-11 ninja-build
git clone https://github.com/ROCm/rocm-examples.git -b "release/therock-7.11"
cd rocm-examples/HIP-Basic/hello_world
export ROCM_PATH=/home/rocm/.venv/lib64/python3.11/site-packages/_rocm_sdk_devel
export LD_LIBRARY_PATH=/home/rocm/.venv/lib64/python3.11/site-packages/_rocm_sdk_devel/lib
export HIP_PLATFORM=amd
export HIP_PATH=/home/rocm/.venv/lib64/python3.11/site-packages/_rocm_sdk_devel
export HIP_CLANG_PATH=/home/rocm/.venv/lib64/python3.11/site-packages/_rocm_sdk_devel/llvm/bin
export HIP_DEVICE_LIB_PATH=/home/rocm/.venv/lib64/python3.11/site-packages/_rocm_sdk_devel/lib/llvm/amdgcn/bitcode
export PATH="$ROCM_PATH/bin:$HIP_CLANG_PATH:$PATH"
source /opt/rh/gcc-toolset-11/enable
export CXX=/opt/rh/gcc-toolset-11/root/usr/bin/g++
export CC=/opt/rh/gcc-toolset-11/root/usr/bin/gcc
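The exports above hard-code /home/rocm/.venv. If your venv lives elsewhere, the same variables can be derived from the active interpreter. The sketch below assumes the _rocm_sdk_devel directory layout used in this report. The names rocm_env and export_lines are made up for this example.

```python
import sysconfig


def rocm_env(site_packages: str) -> dict:
    """Build the ROCm/HIP variables used in this report for a given site-packages dir."""
    root = f"{site_packages}/_rocm_sdk_devel"  # layout assumed from the exports above
    return {
        "ROCM_PATH": root,
        "LD_LIBRARY_PATH": f"{root}/lib",
        "HIP_PLATFORM": "amd",
        "HIP_PATH": root,
        "HIP_CLANG_PATH": f"{root}/llvm/bin",
        "HIP_DEVICE_LIB_PATH": f"{root}/lib/llvm/amdgcn/bitcode",
    }


def export_lines(env: dict) -> list:
    """Render the variables as shell export statements."""
    return [f'export {name}="{value}"' for name, value in env.items()]


def active_site_packages() -> str:
    """Platform-specific site-packages of the running interpreter (lib64 on RHEL)."""
    return sysconfig.get_paths()["platlib"]
```

Running `print("\n".join(export_lines(rocm_env(active_site_packages()))))` inside the activated venv prints lines that can be pasted into the shell. PATH is left out on purpose because it depends on the existing value.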
- Build the hello_world example
cmake -B build -DROCM_ROOT=$ROCM_PATH .
cmake --build build
You will run into the error above.
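As a possible stopgap along the lines of the workaround quoted earlier, here is a rough sketch of a wrapper that starts a binary through the dynamic linker on older kernels and runs it directly otherwise. The "major version below 5" cutoff is taken from the compiler team's comment and has not been verified independently. The loader path is the one quoted in that comment. The function names are made up for this example, and this is not how TheRock ships its tools.

```python
import platform
import re
import subprocess

# Loader path as quoted in the workaround; other distros may use a different one.
LOADER = "/lib64/ld-linux-x86-64.so.2"


def kernel_major(release: str) -> int:
    """Extract the major version from a kernel release string such as '4.18.0-...'."""
    match = re.match(r"(\d+)\.", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1))


def build_command(binary: str, args: list, release: str, loader: str = LOADER) -> list:
    """Prefix the command with the loader on kernels older than 5.x (assumed cutoff)."""
    if kernel_major(release) < 5:
        return [loader, binary, *args]
    return [binary, *args]


def run(binary: str, args: list) -> int:
    """Run the binary with the workaround applied if needed; return its exit code."""
    cmd = build_command(binary, args, platform.release())
    return subprocess.run(cmd, check=False).returncode
```

One caveat: when a program is started this way, /proc/self/exe points at the loader and not at the program, so tools that locate their resources relative to their own executable may behave differently. I have not tested whether that matters for clang-offload-bundler.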
Let me know if you need a bare-metal RHEL 8.10 system to reproduce this error. I can set one up for you.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response