Description:
Currently, during MLIR ROCDL Dialect lowering to LLVM IR, device function calls are translated to return i64 values directly:
llvm-project/mlir/lib/Target/LLVMIR/Dialect/ROCDL/ROCDLToLLVMIRTranslation.cpp
Lines 33 to 53 in 4bb250d
static llvm::Value *createDimGetterFunctionCall(llvm::IRBuilderBase &builder,
                                                Operation *op, StringRef fnName,
                                                int parameter) {
  llvm::Module *module = builder.GetInsertBlock()->getModule();
  llvm::FunctionType *functionType = llvm::FunctionType::get(
      llvm::Type::getInt64Ty(module->getContext()), // return type.
      llvm::Type::getInt32Ty(module->getContext()), // parameter type.
      false);                                       // no variadic arguments.
  llvm::Function *fn = dyn_cast<llvm::Function>(
      module->getOrInsertFunction(fnName, functionType).getCallee());
  llvm::Value *fnOp0 = llvm::ConstantInt::get(
      llvm::Type::getInt32Ty(module->getContext()), parameter);
  auto *call = builder.CreateCall(fn, ArrayRef<llvm::Value *>(fnOp0));
  if (auto rangeAttr = op->getAttrOfType<LLVM::ConstantRangeAttr>("range")) {
    // Zero-extend to 64 bits because the GPU dialect uses 32-bit bounds but
    // these ockl functions are defined to be 64-bits
    call->addRangeRetAttr(llvm::ConstantRange(rangeAttr.getLower().zext(64),
                                              rangeAttr.getUpper().zext(64)));
  }
  return call;
}
This causes an issue when compiling HIP code using the Source -> MLIR -> LLVM IR -> Binary path. If a GPU builtin function result participates in a use-def chain, other operands in the chain are sign-extended (sext) to i64, leading to unnecessary 64-bit arithmetic. For example, given the following HIP kernel:
#include <hip/hip_runtime.h>
__global__ void axpy(int a, int b, int *x, int *y) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
printf("Idx: %d", idx);
y[idx] = a * x[idx] + b;
}
MLIR lowering produces LLVM IR like:
...
%13 = call i32 @llvm.amdgcn.workgroup.id.x()
%14 = sext i32 %13 to i64
%15 = call i64 @__ockl_get_local_size(i32 0)
%16 = trunc i64 %15 to i32
%17 = sext i32 %16 to i64
%18 = mul i64 %14, %17
%19 = call i32 @llvm.amdgcn.workitem.id.x()
%20 = sext i32 %19 to i64
%21 = add i64 %18, %20
...
However, compiling the same HIP kernel with a command like clang++ --offload-device-only -emit-llvm -O0 a.hip -S -o - produces LLVM IR that truncates the device function results to i32, keeping the arithmetic in 32-bit:
...
%31 = call i64 @__ockl_get_group_id(i32 noundef 0) #13
%32 = trunc i64 %31 to i32
...
%35 = call i64 @__ockl_get_local_size(i32 noundef 0) #13
%36 = trunc i64 %35 to i32
%37 = mul i32 %32, %36
...
%40 = call i64 @__ockl_get_local_id(i32 noundef 0) #13
%41 = trunc i64 %40 to i32
%42 = add i32 %37, %41
...
I understand that in some contexts, such as when gpu::amd::Runtime is OpenCL, returning i64 may be correct, since the OpenCL spec defines these queries in terms of size_t (which is 64-bit). But for HIP, truncation is generally desirable: i64 arithmetic is more expensive than i32 arithmetic, which can be critical for kernel performance, and on AMD hardware the local size fits in 32 bits anyway.
Clang already performs this truncation for HIP, so I think the MLIR lowering should ideally match. Also, without this change, trying to set indexBitwidth = 32 in the MLIR -> LLVM IR lowering pipeline, e.g.:
pm.addNestedPass<gpu::GPUModuleOp>(createLowerGpuOpsToROCDLOpsPass(
    options.chipset, /*kIndexBitwidth=*/32,
    options.use_bare_ptr_memref_call_conv,
    gpu::amd::Runtime::HIP));
will cause translation errors due to mismatched operand types.
Proposal:
When gpu::amd::Runtime is HIP, truncate the device function call result to i32 in ROCDLToLLVMIRTranslation, e.g.:
return builder.CreateTrunc(
    builder.CreateCall(fn, ArrayRef<llvm::Value *>(fnOp0)),
    llvm::Type::getInt32Ty(module->getContext()));
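For concreteness, here is a minimal sketch of what the helper could look like with the truncation folded in. The truncateToI32 parameter is hypothetical: the current code has no such argument, and how the gpu::amd::Runtime choice would actually reach the translation interface is left open here.

// A minimal sketch, assuming the HIP/OpenCL choice is threaded into the
// translation as a boolean. The `truncateToI32` parameter is hypothetical.
static llvm::Value *createDimGetterFunctionCall(llvm::IRBuilderBase &builder,
                                                Operation *op, StringRef fnName,
                                                int parameter,
                                                bool truncateToI32) {
  llvm::Module *module = builder.GetInsertBlock()->getModule();
  llvm::FunctionType *functionType = llvm::FunctionType::get(
      llvm::Type::getInt64Ty(module->getContext()), // return type.
      llvm::Type::getInt32Ty(module->getContext()), // parameter type.
      false);                                       // no variadic arguments.
  llvm::Function *fn = dyn_cast<llvm::Function>(
      module->getOrInsertFunction(fnName, functionType).getCallee());
  llvm::Value *fnOp0 = llvm::ConstantInt::get(
      llvm::Type::getInt32Ty(module->getContext()), parameter);
  auto *call = builder.CreateCall(fn, ArrayRef<llvm::Value *>(fnOp0));
  if (auto rangeAttr = op->getAttrOfType<LLVM::ConstantRangeAttr>("range")) {
    // The range attribute still describes the i64 return value of the call,
    // so it is attached before any truncation.
    call->addRangeRetAttr(llvm::ConstantRange(rangeAttr.getLower().zext(64),
                                              rangeAttr.getUpper().zext(64)));
  }
  if (!truncateToI32)
    return call; // OpenCL path: keep the i64 result.
  // HIP path: narrow the result so downstream arithmetic stays 32-bit.
  return builder.CreateTrunc(call,
                             llvm::Type::getInt32Ty(module->getContext()));
}

Returning i32 on the HIP path would also match what the 32-bit index lowering (kIndexBitwidth = 32) expects, which should resolve the operand-type mismatch mentioned above.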