[MLIR][ROCDL] Truncate ROCDL DeviceFunctionCall results to i32 when gpu::amd::Runtime is HIP #158220

@GwokHiujin

Description:

Currently, during the MLIR ROCDL dialect translation to LLVM IR, device function calls are emitted to return i64 values directly:

static llvm::Value *createDimGetterFunctionCall(llvm::IRBuilderBase &builder,
                                                Operation *op, StringRef fnName,
                                                int parameter) {
  llvm::Module *module = builder.GetInsertBlock()->getModule();
  llvm::FunctionType *functionType = llvm::FunctionType::get(
      llvm::Type::getInt64Ty(module->getContext()), // return type.
      llvm::Type::getInt32Ty(module->getContext()), // parameter type.
      false);                                       // no variadic arguments.
  llvm::Function *fn = dyn_cast<llvm::Function>(
      module->getOrInsertFunction(fnName, functionType).getCallee());
  llvm::Value *fnOp0 = llvm::ConstantInt::get(
      llvm::Type::getInt32Ty(module->getContext()), parameter);
  auto *call = builder.CreateCall(fn, ArrayRef<llvm::Value *>(fnOp0));
  if (auto rangeAttr = op->getAttrOfType<LLVM::ConstantRangeAttr>("range")) {
    // Zero-extend to 64 bits because the GPU dialect uses 32-bit bounds but
    // these ockl functions are defined to be 64-bits
    call->addRangeRetAttr(llvm::ConstantRange(rangeAttr.getLower().zext(64),
                                              rangeAttr.getUpper().zext(64)));
  }
  return call;
}

This causes an issue when compiling HIP code through the Source -> MLIR -> LLVM IR -> Binary path. When such an i64 builtin result participates in a use-def chain, the other operands in the chain get sign-extended (sext) to i64, leading to unnecessary 64-bit arithmetic. For example, given the following HIP kernel:

#include <hip/hip_runtime.h>

__global__ void axpy(int a, int b, int *x, int *y) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  printf("Idx: %d", idx);
  y[idx] = a * x[idx] + b;
}

MLIR lowering produces LLVM IR like:

...
%13 = call i32 @llvm.amdgcn.workgroup.id.x()
%14 = sext i32 %13 to i64
%15 = call i64 @__ockl_get_local_size(i32 0)
%16 = trunc i64 %15 to i32
%17 = sext i32 %16 to i64
%18 = mul i64 %14, %17
%19 = call i32 @llvm.amdgcn.workitem.id.x()
%20 = sext i32 %19 to i64
%21 = add i64 %18, %20
...

However, compiling the same HIP kernel with a command like clang++ --offload-device-only -emit-llvm -O0 a.hip -S -o - produces LLVM IR that truncates the device function results to i32, keeping the arithmetic 32-bit:

...
%31 = call i64 @__ockl_get_group_id(i32 noundef 0) #13
%32 = trunc i64 %31 to i32
...
%35 = call i64 @__ockl_get_local_size(i32 noundef 0) #13
%36 = trunc i64 %35 to i32
%37 = mul i32 %32, %36
...
%40 = call i64 @__ockl_get_local_id(i32 noundef 0) #13
%41 = trunc i64 %40 to i32
%42 = add i32 %37, %41
...

I understand that in some contexts, such as when gpu::amd::Runtime is OpenCL, returning i64 may be correct, since the OpenCL spec requires these queries to return size_t (which is 64-bit). For HIP, however, truncation is generally desirable: i64 arithmetic is more expensive, which can be critical for kernel performance, and on AMD hardware the local size fits in 32 bits anyway.

Clang already performs this truncation for HIP, so the MLIR lowering should ideally match. Additionally, without this change, setting indexBitwidth = 32 in the MLIR -> LLVM IR lowering pipeline, e.g.:

pm.addNestedPass<gpu::GPUModuleOp>(createLowerGpuOpsToROCDLOpsPass(
    options.chipset, /*kIndexBitwidth=*/32,
    options.use_bare_ptr_memref_call_conv,
    gpu::amd::Runtime::HIP));

will cause translation errors due to mismatched operand types.

Proposal:

When gpu::amd::Runtime == HIP, truncate the device function call result to i32 in ROCDLToLLVMIRTranslation, e.g.:

return builder.CreateTrunc(
    builder.CreateCall(fn, ArrayRef<llvm::Value *>(fnOp0)),
    llvm::Type::getInt32Ty(module->getContext()));
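
For concreteness, here is a minimal sketch of what the helper could look like with the truncation folded in. The extra runtime parameter is only an assumption for illustration; how the gpu::amd::Runtime choice actually reaches the translation is an open design question, so treat this as a sketch rather than the final patch:

// Sketch only: the `runtime` parameter is hypothetical -- the existing helper
// does not take it. Assumes gpu::amd::Runtime from
// mlir/Conversion/GPUToROCDL/Runtimes.h is visible here.
static llvm::Value *createDimGetterFunctionCall(llvm::IRBuilderBase &builder,
                                                Operation *op, StringRef fnName,
                                                int parameter,
                                                gpu::amd::Runtime runtime) {
  llvm::Module *module = builder.GetInsertBlock()->getModule();
  // The __ockl_* helpers keep their declared i64 return type; only the value
  // at the call site is narrowed.
  llvm::FunctionType *functionType = llvm::FunctionType::get(
      llvm::Type::getInt64Ty(module->getContext()), // return type.
      llvm::Type::getInt32Ty(module->getContext()), // parameter type.
      false);                                       // no variadic arguments.
  llvm::Function *fn = dyn_cast<llvm::Function>(
      module->getOrInsertFunction(fnName, functionType).getCallee());
  llvm::Value *fnOp0 = llvm::ConstantInt::get(
      llvm::Type::getInt32Ty(module->getContext()), parameter);
  auto *call = builder.CreateCall(fn, ArrayRef<llvm::Value *>(fnOp0));
  if (auto rangeAttr = op->getAttrOfType<LLVM::ConstantRangeAttr>("range")) {
    // Zero-extend to 64 bits because the GPU dialect uses 32-bit bounds but
    // these ockl functions are defined to be 64-bits.
    call->addRangeRetAttr(llvm::ConstantRange(rangeAttr.getLower().zext(64),
                                              rangeAttr.getUpper().zext(64)));
  }
  if (runtime == gpu::amd::Runtime::HIP) {
    // HIP: truncate to i32 so the surrounding index arithmetic stays 32-bit,
    // matching the IR clang emits for HIP device code.
    return builder.CreateTrunc(call,
                               llvm::Type::getInt32Ty(module->getContext()));
  }
  // OpenCL (and any other runtime): keep the i64 result.
  return call;
}

This keeps the declared signature of the __ockl_* functions unchanged and only narrows the result at the call site, which is also what the clang-generated IR above does.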
