Skip to content

ihipLaunchKernel: Unsafe cast of hostFunction to hipFunction_t when getStatFunc fails causes SIGSEGV #2805

@dbsanfte

Description

@dbsanfte

Summary

When PlatformState::getStatFunc() fails to find a kernel for a specific deviceId, ihipLaunchKernel() incorrectly casts the raw hostFunction pointer (a function stub/code address) as a hipFunction_t (a DeviceFunc* object pointer). This causes a crash when the code later attempts to access DeviceFunc::kernel() on what is actually machine code, not a valid object.

Affected Versions

  • ROCm 7.1.1 (confirmed)
  • ROCm 7.2.0 (confirmed by source code review - same bug exists)

Hardware

  • 2x AMD Instinct MI50 (Vega20) GPUs
  • Architecture: gfx906:sramecc+:xnack-
  • GPUs on different NUMA nodes

Bug Location

File: hipamd/src/hip_platform.cpp
Function: ihipLaunchKernel()

hipError_t ihipLaunchKernel(const void* hostFunction, dim3 gridDim, dim3 blockDim, void** args,
                            size_t sharedMemBytes, hipStream_t stream, hipEvent_t startEvent,
                            hipEvent_t stopEvent, int flags) {
  // ...
  hipFunction_t func = nullptr;
  int deviceId = hip::Stream::DeviceId(stream);

  hipError_t hip_error =
      PlatformState::instance().getStatFunc(&func, hostFunction, deviceId);
  if ((hip_error != hipSuccess) || (func == nullptr)) {
    // BUG: assumes hostFunction IS a hipFunction_t if lookup fails
    // But hostFunction is a CODE ADDRESS (function stub), not a DeviceFunc* object!
    func = reinterpret_cast<hipFunction_t>(const_cast<void *>(hostFunction));
  }
  // ...
  return ihipModuleLaunchKernel(func, ...);  // Crashes here
}

The comment says "assume its hip function type if we did not get a valid output from static func lookup" — but this assumption is unsafe and incorrect.

Root Cause Analysis

When RCCL (or any HIP library) registers kernels via __hipRegisterFunction(), the kernels are registered with deferred loading enabled by default (HIP_ENABLE_DEFERRED_LOADING=1). This means:

  1. Kernels are NOT pre-built for all devices at registration time
  2. Function::getStatFunc() attempts to build the kernel on-demand via BuildProgram(deviceId)
  3. If BuildProgram() fails for a specific device, getStatFunc() returns an error
  4. ihipLaunchKernel() then incorrectly interprets hostFunction as already being a hipFunction_t

What hostFunction actually is:

  • A pointer to the host-side function stub (machine code)
  • Example: 0x7ffff7eae768 points to x86 instructions for ncclDevKernel_Generic_4

What hipFunction_t should be:

  • A pointer to a DeviceFunc C++ object
  • Contains name_, kernel_ members

When the code later calls DeviceFunc::asFunction(func)->kernel(), it interprets machine code bytes as a DeviceFunc object, reads garbage from the kernel_ offset (e.g., 0x43000), and crashes when accessing kernel->signature().

GDB Trace Evidence

Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007fffeb6a7f33 in amd::Kernel::signature (this=0x43000)
    at /longer_pathname.../hipamd/src/hip_module.cpp:179

#0  amd::Kernel::signature (this=0x43000)
ROCm/clr#1  hip::ihipLaunchKernel_validate (f=0x7ffff7eae768, ...)
ROCm/clr#2  hip::ihipModuleLaunchKernel (f=0x7ffff7eae768, ...)
ROCm/clr#3  hip::ihipLaunchKernel (hostFunction=0x7ffff7eae768, ...)
ROCm/clr#4  hipLaunchKernel (...)
ROCm/clr#5  ncclLaunchKernel (...)

Examining f=0x7ffff7eae768:

(gdb) x/10i 0x7ffff7eae768
   0x7ffff7eae768 <ncclDevKernel_Generic_4(...)>:  push %rbp
   0x7ffff7eae769:  mov %rsp,%rbp
   ...

This confirms f points to executable code, not a DeviceFunc* object.

Reproducer

Multi-GPU RCCL test (crashes):

#include <rccl/rccl.h>
#include <hip/hip_runtime.h>

int main() {
    int nDev = 2;
    ncclComm_t comms[2];
    int devs[2] = {0, 1};
    
    // This triggers kernel launch on device 1, which crashes
    ncclCommInitAll(comms, nDev, devs);
    
    // ... AllReduce operations crash with SIGSEGV
}

Single-GPU RCCL test on either device 0 OR device 1 individually: WORKS

Workaround Attempted

Setting HIP_ENABLE_DEFERRED_LOADING=0 changes the failure mode:

Cannot retrieve Static function, error: 218
Aborted (core dumped)

Error 218 = hipErrorNoBinaryForGpu, which suggests the kernel binary may not exist for the target device. However, the crash in the default deferred loading mode is the more severe bug — it should return an error, not crash.

Suggested Fix

Instead of blindly casting hostFunction to hipFunction_t, the code should return an error:

hipError_t hip_error =
    PlatformState::instance().getStatFunc(&func, hostFunction, deviceId);
if ((hip_error != hipSuccess) || (func == nullptr)) {
    // DON'T cast hostFunction - it's not a valid hipFunction_t!
    LogPrintfError("Failed to get function for device %d: error %d", deviceId, hip_error);
    return hipErrorInvalidDeviceFunction;
}

Or, if there's a legitimate case where hostFunction could be a pre-registered hipFunction_t, add validation:

if ((hip_error != hipSuccess) || (func == nullptr)) {
    // Validate that hostFunction is actually a DeviceFunc* before casting
    if (!PlatformState::instance().isValidDynFunc(hostFunction)) {
        return hipErrorInvalidDeviceFunction;
    }
    func = reinterpret_cast<hipFunction_t>(const_cast<void *>(hostFunction));
}

Impact

This bug prevents multi-GPU RCCL operations from working on MI50 (gfx906) systems with ROCm 7.1.1, causing hard crashes instead of graceful error handling.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions