[MLIR][ROCDL] Truncate ROCDL DeviceFunctionCall results to i32 when gpu::amd::Runtime is HIP #158220

@GwokHiujin

Description:

Currently, during the MLIR ROCDL dialect translation to LLVM IR, device function calls are emitted to return i64 values directly:

static llvm::Value *createDimGetterFunctionCall(llvm::IRBuilderBase &builder,
                                                Operation *op, StringRef fnName,
                                                int parameter) {
  llvm::Module *module = builder.GetInsertBlock()->getModule();
  llvm::FunctionType *functionType = llvm::FunctionType::get(
      llvm::Type::getInt64Ty(module->getContext()), // return type.
      llvm::Type::getInt32Ty(module->getContext()), // parameter type.
      false);                                       // no variadic arguments.
  llvm::Function *fn = dyn_cast<llvm::Function>(
      module->getOrInsertFunction(fnName, functionType).getCallee());
  llvm::Value *fnOp0 = llvm::ConstantInt::get(
      llvm::Type::getInt32Ty(module->getContext()), parameter);
  auto *call = builder.CreateCall(fn, ArrayRef<llvm::Value *>(fnOp0));
  if (auto rangeAttr = op->getAttrOfType<LLVM::ConstantRangeAttr>("range")) {
    // Zero-extend to 64 bits because the GPU dialect uses 32-bit bounds but
    // these ockl functions are defined to be 64-bits
    call->addRangeRetAttr(llvm::ConstantRange(rangeAttr.getLower().zext(64),
                                              rangeAttr.getUpper().zext(64)));
  }
  return call;
}

This causes an issue when compiling HIP code through the Source -> MLIR -> LLVM IR -> Binary path. When such an i64 builtin result participates in a use-def chain, the other operands in the chain get sign-extended (sext) to i64, leading to unnecessary 64-bit arithmetic. For example, given the following HIP kernel:

#include <hip/hip_runtime.h>

__global__ void axpy(int a, int b, int *x, int *y) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  printf("Idx: %d", idx);
  y[idx] = a * x[idx] + b;
}

MLIR lowering produces LLVM IR like:

...
%13 = call i32 @llvm.amdgcn.workgroup.id.x()
%14 = sext i32 %13 to i64
%15 = call i64 @__ockl_get_local_size(i32 0)
%16 = trunc i64 %15 to i32
%17 = sext i32 %16 to i64
%18 = mul i64 %14, %17
%19 = call i32 @llvm.amdgcn.workitem.id.x()
%20 = sext i32 %19 to i64
%21 = add i64 %18, %20
...

However, compiling the same HIP kernel with a command like clang++ --offload-device-only -emit-llvm -O0 a.hip -S -o - produces LLVM IR that truncates the device function results to i32, keeping the arithmetic 32-bit:

...
%31 = call i64 @__ockl_get_group_id(i32 noundef 0) #13
%32 = trunc i64 %31 to i32
...
%35 = call i64 @__ockl_get_local_size(i32 noundef 0) #13
%36 = trunc i64 %35 to i32
%37 = mul i32 %32, %36
...
%40 = call i64 @__ockl_get_local_id(i32 noundef 0) #13
%41 = trunc i64 %40 to i32
%42 = add i32 %37, %41
...

I understand that in some contexts, such as when gpu::amd::Runtime is OpenCL, returning i64 may be correct, since the OpenCL spec requires these queries to return size_t (which is 64-bit). For HIP, however, truncation is generally desirable: i64 arithmetic is more expensive, which can be critical for kernel performance, and on AMD hardware the local size fits in 32 bits anyway.

Clang already performs this truncation for HIP, so the MLIR lowering should ideally match. Additionally, without this change, setting indexBitwidth = 32 in the MLIR -> LLVM IR lowering pipeline, e.g.:

pm.addNestedPass<gpu::GPUModuleOp>(createLowerGpuOpsToROCDLOpsPass(
    options.chipset, /*kIndexBitwidth=*/32,
    options.use_bare_ptr_memref_call_conv,
    gpu::amd::Runtime::HIP));

will cause translation errors due to mismatched operand types.

Proposal:

When gpu::amd::Runtime == HIP, truncate the device function call result to i32 in ROCDLToLLVMIRTranslation, e.g.:

return builder.CreateTrunc(
    builder.CreateCall(fn, ArrayRef<llvm::Value *>(fnOp0)),
    llvm::Type::getInt32Ty(module->getContext()));
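
For concreteness, here is a minimal sketch of what the helper could look like with the truncation folded in. The extra runtime parameter is only an assumption for illustration; how the gpu::amd::Runtime choice actually reaches the translation is an open design question, so treat this as a sketch rather than the final patch:

// Sketch only: the `runtime` parameter is hypothetical -- the existing helper
// does not take it. Assumes gpu::amd::Runtime from
// mlir/Conversion/GPUToROCDL/Runtimes.h is visible here.
static llvm::Value *createDimGetterFunctionCall(llvm::IRBuilderBase &builder,
                                                Operation *op, StringRef fnName,
                                                int parameter,
                                                gpu::amd::Runtime runtime) {
  llvm::Module *module = builder.GetInsertBlock()->getModule();
  // The __ockl_* helpers keep their declared i64 return type; only the value
  // at the call site is narrowed.
  llvm::FunctionType *functionType = llvm::FunctionType::get(
      llvm::Type::getInt64Ty(module->getContext()), // return type.
      llvm::Type::getInt32Ty(module->getContext()), // parameter type.
      false);                                       // no variadic arguments.
  llvm::Function *fn = dyn_cast<llvm::Function>(
      module->getOrInsertFunction(fnName, functionType).getCallee());
  llvm::Value *fnOp0 = llvm::ConstantInt::get(
      llvm::Type::getInt32Ty(module->getContext()), parameter);
  auto *call = builder.CreateCall(fn, ArrayRef<llvm::Value *>(fnOp0));
  if (auto rangeAttr = op->getAttrOfType<LLVM::ConstantRangeAttr>("range")) {
    // Zero-extend to 64 bits because the GPU dialect uses 32-bit bounds but
    // these ockl functions are defined to be 64-bits.
    call->addRangeRetAttr(llvm::ConstantRange(rangeAttr.getLower().zext(64),
                                              rangeAttr.getUpper().zext(64)));
  }
  if (runtime == gpu::amd::Runtime::HIP) {
    // HIP: truncate to i32 so the surrounding index arithmetic stays 32-bit,
    // matching the IR clang emits for HIP device code.
    return builder.CreateTrunc(call,
                               llvm::Type::getInt32Ty(module->getContext()));
  }
  // OpenCL (and any other runtime): keep the i64 result.
  return call;
}

This keeps the declared signature of the __ockl_* functions unchanged and only narrows the result at the call site, which is also what the clang-generated IR above does.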
