-
Notifications
You must be signed in to change notification settings - Fork 128
Description
Summary
When PlatformState::getStatFunc() fails to find a kernel for a specific deviceId, ihipLaunchKernel() incorrectly casts the raw hostFunction pointer (a function stub/code address) as a hipFunction_t (a DeviceFunc* object pointer). This causes a crash when the code later attempts to access DeviceFunc::kernel() on what is actually machine code, not a valid object.
Affected Versions
- ROCm 7.1.1 (confirmed)
- ROCm 7.2.0 (confirmed by source code review - same bug exists)
Hardware
- 2x AMD Instinct MI50 (Vega20) GPUs
- Architecture:
gfx906:sramecc+:xnack- - GPUs on different NUMA nodes
Bug Location
File: hipamd/src/hip_platform.cpp
Function: ihipLaunchKernel()
hipError_t ihipLaunchKernel(const void* hostFunction, dim3 gridDim, dim3 blockDim, void** args,
size_t sharedMemBytes, hipStream_t stream, hipEvent_t startEvent,
hipEvent_t stopEvent, int flags) {
// ...
hipFunction_t func = nullptr;
int deviceId = hip::Stream::DeviceId(stream);
hipError_t hip_error =
PlatformState::instance().getStatFunc(&func, hostFunction, deviceId);
if ((hip_error != hipSuccess) || (func == nullptr)) {
// BUG: assumes hostFunction IS a hipFunction_t if lookup fails
// But hostFunction is a CODE ADDRESS (function stub), not a DeviceFunc* object!
func = reinterpret_cast<hipFunction_t>(const_cast<void *>(hostFunction));
}
// ...
return ihipModuleLaunchKernel(func, ...); // Crashes here
}The comment says "assume its hip function type if we did not get a valid output from static func lookup" — but this assumption is unsafe and incorrect.
Root Cause Analysis
When RCCL (or any HIP library) registers kernels via __hipRegisterFunction(), the kernels are registered with deferred loading enabled by default (HIP_ENABLE_DEFERRED_LOADING=1). This means:
- Kernels are NOT pre-built for all devices at registration time
Function::getStatFunc()attempts to build the kernel on-demand viaBuildProgram(deviceId)- If
BuildProgram()fails for a specific device,getStatFunc()returns an error ihipLaunchKernel()then incorrectly interpretshostFunctionas already being ahipFunction_t
What hostFunction actually is:
- A pointer to the host-side function stub (machine code)
- Example:
0x7ffff7eae768points to x86 instructions forncclDevKernel_Generic_4
What hipFunction_t should be:
- A pointer to a
DeviceFuncC++ object - Contains
name_,kernel_members
When the code later calls DeviceFunc::asFunction(func)->kernel(), it interprets machine code bytes as a DeviceFunc object, reads garbage from the kernel_ offset (e.g., 0x43000), and crashes when accessing kernel->signature().
GDB Trace Evidence
Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007fffeb6a7f33 in amd::Kernel::signature (this=0x43000)
at /longer_pathname.../hipamd/src/hip_module.cpp:179
#0 amd::Kernel::signature (this=0x43000)
ROCm/clr#1 hip::ihipLaunchKernel_validate (f=0x7ffff7eae768, ...)
ROCm/clr#2 hip::ihipModuleLaunchKernel (f=0x7ffff7eae768, ...)
ROCm/clr#3 hip::ihipLaunchKernel (hostFunction=0x7ffff7eae768, ...)
ROCm/clr#4 hipLaunchKernel (...)
ROCm/clr#5 ncclLaunchKernel (...)
Examining f=0x7ffff7eae768:
(gdb) x/10i 0x7ffff7eae768
0x7ffff7eae768 <ncclDevKernel_Generic_4(...)>: push %rbp
0x7ffff7eae769: mov %rsp,%rbp
...
This confirms f points to executable code, not a DeviceFunc* object.
Reproducer
Multi-GPU RCCL test (crashes):
#include <rccl/rccl.h>
#include <hip/hip_runtime.h>
int main() {
int nDev = 2;
ncclComm_t comms[2];
int devs[2] = {0, 1};
// This triggers kernel launch on device 1, which crashes
ncclCommInitAll(comms, nDev, devs);
// ... AllReduce operations crash with SIGSEGV
}Single-GPU RCCL test on either device 0 OR device 1 individually: WORKS
Workaround Attempted
Setting HIP_ENABLE_DEFERRED_LOADING=0 changes the failure mode:
Cannot retrieve Static function, error: 218
Aborted (core dumped)
Error 218 = hipErrorNoBinaryForGpu, which suggests the kernel binary may not exist for the target device. However, the crash in the default deferred loading mode is the more severe bug — it should return an error, not crash.
Suggested Fix
Instead of blindly casting hostFunction to hipFunction_t, the code should return an error:
hipError_t hip_error =
PlatformState::instance().getStatFunc(&func, hostFunction, deviceId);
if ((hip_error != hipSuccess) || (func == nullptr)) {
// DON'T cast hostFunction - it's not a valid hipFunction_t!
LogPrintfError("Failed to get function for device %d: error %d", deviceId, hip_error);
return hipErrorInvalidDeviceFunction;
}Or, if there's a legitimate case where hostFunction could be a pre-registered hipFunction_t, add validation:
if ((hip_error != hipSuccess) || (func == nullptr)) {
// Validate that hostFunction is actually a DeviceFunc* before casting
if (!PlatformState::instance().isValidDynFunc(hostFunction)) {
return hipErrorInvalidDeviceFunction;
}
func = reinterpret_cast<hipFunction_t>(const_cast<void *>(hostFunction));
}Impact
This bug prevents multi-GPU RCCL operations from working on MI50 (gfx906) systems with ROCm 7.1.1, causing hard crashes instead of graceful error handling.