ihipLaunchKernel: Unsafe cast of hostFunction to hipFunction_t when getStatFunc fails causes SIGSEGV

## Summary

When `PlatformState::getStatFunc()` fails to find a kernel for a specific `deviceId`, `ihipLaunchKernel()` incorrectly casts the raw `hostFunction` pointer (a function stub/code address) as a `hipFunction_t` (a `DeviceFunc*` object pointer). This causes a crash when the code later attempts to access `DeviceFunc::kernel()` on what is actually machine code, not a valid object.

## Affected Versions

- ROCm 7.1.1 (confirmed)
- ROCm 7.2.0 (confirmed by source code review - same bug exists)

## Hardware

- 2x AMD Instinct MI50 (Vega20) GPUs
- Architecture: `gfx906:sramecc+:xnack-`
- GPUs on different NUMA nodes

## Bug Location

**File:** `hipamd/src/hip_platform.cpp`  
**Function:** `ihipLaunchKernel()`

```cpp
hipError_t ihipLaunchKernel(const void* hostFunction, dim3 gridDim, dim3 blockDim, void** args,
                            size_t sharedMemBytes, hipStream_t stream, hipEvent_t startEvent,
                            hipEvent_t stopEvent, int flags) {
  // ...
  hipFunction_t func = nullptr;
  int deviceId = hip::Stream::DeviceId(stream);

  hipError_t hip_error =
      PlatformState::instance().getStatFunc(&func, hostFunction, deviceId);
  if ((hip_error != hipSuccess) || (func == nullptr)) {
    // BUG: assumes hostFunction IS a hipFunction_t if lookup fails
    // But hostFunction is a CODE ADDRESS (function stub), not a DeviceFunc* object!
    func = reinterpret_cast<hipFunction_t>(const_cast<void *>(hostFunction));
  }
  // ...
  return ihipModuleLaunchKernel(func, ...);  // Crashes here
}
```

The source comment at this point says "assume its hip function type if we did not get a valid output from static func lookup", but this assumption is **unsafe and incorrect**: a failed lookup does not imply that `hostFunction` is a `hipFunction_t`.

## Root Cause Analysis

When RCCL (or any HIP library) registers kernels via `__hipRegisterFunction()`, the kernels are registered with deferred loading enabled by default (`HIP_ENABLE_DEFERRED_LOADING=1`). This means:

1. Kernels are NOT pre-built for all devices at registration time
2. `Function::getStatFunc()` attempts to build the kernel on-demand via `BuildProgram(deviceId)`
3. If `BuildProgram()` fails for a specific device, `getStatFunc()` returns an error
4. `ihipLaunchKernel()` then incorrectly interprets `hostFunction` as already being a `hipFunction_t`

**What `hostFunction` actually is:**
- A pointer to the **host-side function stub** (machine code)
- Example: `0x7ffff7eae768` points to x86 instructions for `ncclDevKernel_Generic_4`

**What `hipFunction_t` should be:**
- A pointer to a `DeviceFunc` C++ object
- Contains `name_`, `kernel_` members

When the code later calls `DeviceFunc::asFunction(func)->kernel()`, it interprets machine code bytes as a `DeviceFunc` object, reads garbage from the `kernel_` offset (e.g., `0x43000`), and crashes when accessing `kernel->signature()`.
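The failure path described above can be modelled outside the runtime. The following is a minimal sketch, not the HIP implementation: `MockDeviceFunc`, `MockRegistry`, `Status`, and `launchChecked` are hypothetical stand-ins invented for illustration. It assumes a host stub is registered once and a device function exists per device only if the build for that device succeeded. The point is that a failed per-device lookup is propagated as an error and the stub address is never reinterpreted as an object pointer.

```cpp
#include <cstddef>
#include <cstdio>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these are the real HIP runtime types.
enum class Status { Success, InvalidDeviceFunction };

struct MockDeviceFunc {
  const char* name;
};

// One registered host stub; a nullptr slot means nothing was built for that device.
struct MockStatFunc {
  std::vector<MockDeviceFunc*> perDevice;
};

class MockRegistry {
 public:
  void registerStub(const void* hostFunction, std::vector<MockDeviceFunc*> perDevice) {
    stubs_[hostFunction] = MockStatFunc{std::move(perDevice)};
  }

  // Per-device lookup: fails if the stub is unknown or has no function for deviceId.
  Status getStatFunc(MockDeviceFunc** out, const void* hostFunction, int deviceId) const {
    *out = nullptr;
    auto it = stubs_.find(hostFunction);
    if (it == stubs_.end()) return Status::InvalidDeviceFunction;
    const auto& funcs = it->second.perDevice;
    if (deviceId < 0 || static_cast<std::size_t>(deviceId) >= funcs.size() ||
        funcs[deviceId] == nullptr) {
      return Status::InvalidDeviceFunction;
    }
    *out = funcs[deviceId];
    return Status::Success;
  }

 private:
  std::unordered_map<const void*, MockStatFunc> stubs_;
};

// Launch path that propagates the lookup failure instead of casting the stub address.
Status launchChecked(const MockRegistry& reg, const void* hostFunction, int deviceId) {
  MockDeviceFunc* func = nullptr;
  Status s = reg.getStatFunc(&func, hostFunction, deviceId);
  if (s != Status::Success || func == nullptr) {
    std::fprintf(stderr, "no device function for device %d\n", deviceId);
    return Status::InvalidDeviceFunction;
  }
  return Status::Success;  // a real runtime would enqueue the kernel here
}
```

In this model, registering a stub with a function for device 0 and a `nullptr` slot for device 1 makes `launchChecked` succeed on device 0 and return `Status::InvalidDeviceFunction` on device 1, which is the behaviour this report argues the runtime should have.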

## GDB Trace Evidence

```
Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007fffeb6a7f33 in amd::Kernel::signature (this=0x43000)
    at /longer_pathname.../hipamd/src/hip_module.cpp:179

#0  amd::Kernel::signature (this=0x43000)
#1  hip::ihipLaunchKernel_validate (f=0x7ffff7eae768, ...)
#2  hip::ihipModuleLaunchKernel (f=0x7ffff7eae768, ...)
#3  hip::ihipLaunchKernel (hostFunction=0x7ffff7eae768, ...)
#4  hipLaunchKernel (...)
#5  ncclLaunchKernel (...)
```

Examining `f=0x7ffff7eae768`:
```
(gdb) x/10i 0x7ffff7eae768
   0x7ffff7eae768 <ncclDevKernel_Generic_4(...)>:  push %rbp
   0x7ffff7eae769:  mov %rsp,%rbp
   ...
```

This confirms `f` points to **executable code**, not a `DeviceFunc*` object.

## Reproducer

Multi-GPU RCCL test (crashes):
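```cpp
#include <rccl/rccl.h>
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    const int nDev = 2;
    ncclComm_t comms[nDev];
    int devs[nDev] = {0, 1};

    // This triggers kernel launch on device 1, which crashes
    ncclResult_t res = ncclCommInitAll(comms, nDev, devs);
    if (res != ncclSuccess) {
        // Not reached in the failing case: the process dies with SIGSEGV first
        std::fprintf(stderr, "ncclCommInitAll failed: %s\n", ncclGetErrorString(res));
        return 1;
    }

    // ... AllReduce operations crash with SIGSEGV

    for (int i = 0; i < nDev; ++i) {
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```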

Single-GPU RCCL test on either device 0 OR device 1 individually: **WORKS**

## Workaround Attempted

Setting `HIP_ENABLE_DEFERRED_LOADING=0` changes the failure mode:
```
Cannot retrieve Static function, error: 218
Aborted (core dumped)
```

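In the HIP headers, error 218 is `hipErrorInvalidKernelFile` (`hipErrorNoBinaryForGpu` is 209), which suggests the kernel code object could not be loaded for the target device, possibly because no compatible binary exists for it. However, the crash in the default deferred loading mode is the more severe bug: the launch should return an error, not crash.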

## Suggested Fix

Instead of blindly casting `hostFunction` to `hipFunction_t`, the code should return an error:

```cpp
hipError_t hip_error =
    PlatformState::instance().getStatFunc(&func, hostFunction, deviceId);
if ((hip_error != hipSuccess) || (func == nullptr)) {
    // DON'T cast hostFunction - it's not a valid hipFunction_t!
    LogPrintfError("Failed to get function for device %d: error %d", deviceId, hip_error);
    return hipErrorInvalidDeviceFunction;
}
```

Or, if there's a legitimate case where `hostFunction` could be a pre-registered `hipFunction_t`, add validation:
```cpp
if ((hip_error != hipSuccess) || (func == nullptr)) {
    // Validate that hostFunction is actually a DeviceFunc* before casting
    if (!PlatformState::instance().isValidDynFunc(hostFunction)) {
        return hipErrorInvalidDeviceFunction;
    }
    func = reinterpret_cast<hipFunction_t>(const_cast<void *>(hostFunction));
}
```

## Impact

This bug prevents multi-GPU RCCL operations from working on MI50 (gfx906) systems with ROCm 7.1.1, causing hard crashes instead of graceful error handling.

