47 changes: 42 additions & 5 deletions llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
@@ -1017,15 +1017,52 @@ struct AAAMDGPUNoAGPR

const char AAAMDGPUNoAGPR::ID = 0;

static unsigned getMaxNumPreloadArgs(const Function &F, const DataLayout &DL,
                                     const GCNSubtarget &ST) {
  unsigned Offset = 0;
  unsigned ArgsToPreload = 0;
  for (const auto &Arg : F.args()) {
    if (Arg.hasByRefAttr())
Contributor: As a follow-up, we really should be handling the byref case in the future.

      break;

    Type *Ty = Arg.getType();
    Align ArgAlign = DL.getABITypeAlign(Ty);
    auto Size = DL.getTypeAllocSize(Ty);
Contributor: The allocation size doesn't directly map to the number of registers. Really this should look at what ComputeValueVTs does to get the number of registers.

I think it would be better to be more precise (and maybe even make the inreg a hard requirement to respect).

Member Author: I think the alloc size is correct with respect to the number of user SGPRs that are initially allocated for preloading, since it maps directly onto how the data looks in the kernarg segment, and the number of registers used for preloading is derived from that.

> I think it would be better to be more precise (and maybe even make the inreg a hard requirement to respect)

I sort of agree, but we have to decide what to do when frontends like Triton just add inreg to every argument. Should we remove it in cases where we cannot preload the argument? Print a warning if we cannot preload?

I'm leaning towards the first option, where we remove inreg from arguments that won't actually be preloaded, somewhere like AMDGPULowerKernelArguments, after all the attributes are finalized, etc.

Contributor: It's only correct for trivially legal types. It will not be for anything exotic, or for aggregates. You either need to use ComputeValueVTs or only accept trivial types that map to N registers.

Member Author: What we are calculating is the size in memory, i.e. from the start offset to the ending explicit offset. Isn't this method correct for anything that is not byref?

Do you think we should also consider the attributes that are inferred by the AMDGPUAttributor itself when marking things for preloading here? I.e., add inreg at the very end?

Contributor: Ideally kernel arguments would only use byref (except for these inreg arguments). Non-byref arguments are really byref, and will be lowered to byref.

If you're making the values inreg, they should be treated more like ordinary calling convention register arguments. It's probably simplest to just filter out types that aren't trivially register types. That's the case for all implicit kernel arguments anyway.

    Offset = alignTo(Offset, ArgAlign);
    if (((Offset + Size) / 4) > ST.getMaxNumUserSGPRs())
      break;

    Offset += Size;
    ArgsToPreload++;
  }

  return ArgsToPreload;
}
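For readers without an LLVM checkout, here is a minimal self-contained sketch of the loop above, with the DataLayout queries replaced by per-argument size/alignment pairs. The example signature and the 16-SGPR budget are assumptions for illustration, not values taken from a real subtarget:

// Standalone sketch of getMaxNumPreloadArgs: pack leading arguments at
// their ABI alignment and stop once the prefix would exceed the user-SGPR
// budget (each SGPR holds 4 bytes of kernarg data).
#include <cstdint>
#include <vector>

struct ArgInfo {
  uint64_t Size;  // stands in for DL.getTypeAllocSize(Ty)
  uint64_t Align; // stands in for DL.getABITypeAlign(Ty)
};

static uint64_t alignTo(uint64_t Value, uint64_t Align) {
  return (Value + Align - 1) / Align * Align;
}

static unsigned maxPreloadArgs(const std::vector<ArgInfo> &Args,
                               unsigned MaxUserSGPRs) {
  uint64_t Offset = 0;
  unsigned ArgsToPreload = 0;
  for (const ArgInfo &A : Args) {
    Offset = alignTo(Offset, A.Align);
    if ((Offset + A.Size) / 4 > MaxUserSGPRs)
      break;
    Offset += A.Size;
    ++ArgsToPreload;
  }
  return ArgsToPreload;
}

int main() {
  // i16 plus two global pointers: offsets 0, 8, 16 -> 24 bytes -> 6 SGPRs,
  // so all three arguments fit in a 16-SGPR budget.
  std::vector<ArgInfo> Args = {{2, 2}, {8, 8}, {8, 8}};
  return maxPreloadArgs(Args, /*MaxUserSGPRs=*/16) == 3 ? 0 : 1;
}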

static void addPreloadKernArgHint(Function &F, TargetMachine &TM) {
  const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
-  for (unsigned I = 0;
-       I < F.arg_size() &&
-       I < std::min(KernargPreloadCount.getValue(), ST.getMaxNumUserSGPRs());
-       ++I) {
+  if (!ST.hasKernargPreload())
+    return;
+
+  // Enable kernarg preloading by default on GFX940+.
+  size_t PreloadCount;
+  if (KernargPreloadCount.getNumOccurrences()) {
+    // Override default behavior if CL option is present.
+    PreloadCount = std::min<size_t>(KernargPreloadCount, F.arg_size());
+  } else {
+    // Defaults with no CL option.
+    if (ST.defaultEnabledKernargPreload())
+      PreloadCount =
+          getMaxNumPreloadArgs(F, F.getParent()->getDataLayout(), ST);
+    else
+      PreloadCount = 0;
+  }
+
+  for (size_t I = 0; I < PreloadCount; ++I) {
    Argument &Arg = *F.getArg(I);
    // Check for incompatible attributes.
-    if (Arg.hasByRefAttr() || Arg.hasNestAttr())
+    if (Arg.getType()->isAggregateType() || Arg.hasByRefAttr() ||
+        Arg.hasNestAttr())
      break;

    Arg.addAttr(Attribute::InReg);
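A condensed sketch of the count-selection policy above, with the cl::opt plumbing stripped out; pickPreloadCount and its parameters are illustrative names, not part of the patch:

// Sketch: an explicit -amdgpu-kernarg-preload-count always wins; otherwise
// targets that default-enable preloading (GFX940+) take as many leading
// arguments as fit in user SGPRs, and everything else preloads none.
#include <algorithm>
#include <cstddef>
#include <optional>

static std::size_t pickPreloadCount(std::optional<std::size_t> CLCount,
                                    std::size_t NumArgs, bool DefaultEnabled,
                                    std::size_t MaxArgsThatFit) {
  if (CLCount) // the CL option overrides any default
    return std::min(*CLCount, NumArgs);
  return DefaultEnabled ? MaxArgsThatFit : 0;
}

int main() {
  // No CL option, default-on target whose budget fits 3 args -> preload 3.
  return pickPreloadCount(std::nullopt, 4, /*DefaultEnabled=*/true, 3) == 3
             ? 0
             : 1;
}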
3 changes: 3 additions & 0 deletions llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -1357,6 +1357,9 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
  // \returns true if preloading kernel arguments is supported.
  bool hasKernargPreload() const { return KernargPreload; }

+  // \returns true if target enables preloading kernel arguments by default.
+  bool defaultEnabledKernargPreload() const { return hasGFX940Insts(); }

  // \returns true if the target has split barriers feature
  bool hasSplitBarriers() const { return getGeneration() >= GFX12; }

@@ -1,5 +1,5 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
-; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -passes='amdgpu-attributor,function(amdgpu-lower-kernel-arguments)' -S < %s | FileCheck -check-prefix=NO-PRELOAD %s
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -passes='amdgpu-attributor,function(amdgpu-lower-kernel-arguments)' -amdgpu-kernarg-preload-count=0 -S < %s | FileCheck -check-prefix=NO-PRELOAD %s
; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -passes='amdgpu-attributor,function(amdgpu-lower-kernel-arguments)' -amdgpu-kernarg-preload-count=16 -S < %s | FileCheck -check-prefix=PRELOAD %s

define amdgpu_kernel void @preload_block_count_x(ptr addrspace(1) %out) {
128 changes: 64 additions & 64 deletions llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll
@@ -1,8 +1,8 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --function-signature
-; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -S < %s | FileCheck -check-prefix=NO-PRELOAD %s
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=0 -S < %s | FileCheck -check-prefix=NO-PRELOAD %s
; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=1 -S < %s | FileCheck -check-prefix=PRELOAD-1 %s
; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=3 -S < %s | FileCheck -check-prefix=PRELOAD-3 %s
-; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=8 -S < %s | FileCheck -check-prefix=PRELOAD-8 %s
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -S < %s | FileCheck -check-prefix=PRELOAD-DEFAULT %s

define amdgpu_kernel void @test_preload_IR_lowering_kernel_2(ptr addrspace(1) %in, ptr addrspace(1) %out) #0 {
; NO-PRELOAD-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_2
@@ -32,12 +32,12 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_2(ptr addrspace(1) %i
; PRELOAD-3-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-3-NEXT: ret void
;
; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_2
; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
; PRELOAD-8-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_2_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
; PRELOAD-8-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
; PRELOAD-8-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-8-NEXT: ret void
; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_2
; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
; PRELOAD-DEFAULT-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_2_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
; PRELOAD-DEFAULT-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-DEFAULT-NEXT: ret void
;
%load = load i32, ptr addrspace(1) %in
store i32 %load, ptr addrspace(1) %out
@@ -88,14 +88,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4(ptr addrspace(1) %i
; PRELOAD-3-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1_LOAD]], align 4
; PRELOAD-3-NEXT: ret void
;
; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4
; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
; PRELOAD-8-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
; PRELOAD-8-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
; PRELOAD-8-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
; PRELOAD-8-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-8-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
; PRELOAD-8-NEXT: ret void
; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4
; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
; PRELOAD-DEFAULT-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
; PRELOAD-DEFAULT-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
; PRELOAD-DEFAULT-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
; PRELOAD-DEFAULT-NEXT: ret void
;
%load = load i32, ptr addrspace(1) %in
%load1 = load i32, ptr addrspace(1) %in1
@@ -184,20 +184,20 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_8(ptr addrspace(1) %i
; PRELOAD-3-NEXT: store i32 [[LOAD3]], ptr addrspace(1) [[OUT3_LOAD]], align 4
; PRELOAD-3-NEXT: ret void
;
; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_8
; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[IN2:%.*]], ptr addrspace(1) inreg [[IN3:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]], ptr addrspace(1) inreg [[OUT2:%.*]], ptr addrspace(1) inreg [[OUT3:%.*]]) #[[ATTR0]] {
; PRELOAD-8-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
; PRELOAD-8-NEXT: [[OUT3_KERNARG_OFFSET:%.*]] = getelementptr inbounds i8, ptr addrspace(4) [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT]], i64 56
; PRELOAD-8-NEXT: [[OUT3_LOAD:%.*]] = load ptr addrspace(1), ptr addrspace(4) [[OUT3_KERNARG_OFFSET]], align 8, !invariant.load [[META0:![0-9]+]]
; PRELOAD-8-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
; PRELOAD-8-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
; PRELOAD-8-NEXT: [[LOAD2:%.*]] = load i32, ptr addrspace(1) [[IN2]], align 4
; PRELOAD-8-NEXT: [[LOAD3:%.*]] = load i32, ptr addrspace(1) [[IN3]], align 4
; PRELOAD-8-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-8-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
; PRELOAD-8-NEXT: store i32 [[LOAD2]], ptr addrspace(1) [[OUT2]], align 4
; PRELOAD-8-NEXT: store i32 [[LOAD3]], ptr addrspace(1) [[OUT3_LOAD]], align 4
; PRELOAD-8-NEXT: ret void
; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_8
; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[IN2:%.*]], ptr addrspace(1) inreg [[IN3:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]], ptr addrspace(1) inreg [[OUT2:%.*]], ptr addrspace(1) inreg [[OUT3:%.*]]) #[[ATTR0]] {
; PRELOAD-DEFAULT-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
; PRELOAD-DEFAULT-NEXT: [[OUT3_KERNARG_OFFSET:%.*]] = getelementptr inbounds i8, ptr addrspace(4) [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT]], i64 56
; PRELOAD-DEFAULT-NEXT: [[OUT3_LOAD:%.*]] = load ptr addrspace(1), ptr addrspace(4) [[OUT3_KERNARG_OFFSET]], align 8, !invariant.load [[META0:![0-9]+]]
; PRELOAD-DEFAULT-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
; PRELOAD-DEFAULT-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
; PRELOAD-DEFAULT-NEXT: [[LOAD2:%.*]] = load i32, ptr addrspace(1) [[IN2]], align 4
; PRELOAD-DEFAULT-NEXT: [[LOAD3:%.*]] = load i32, ptr addrspace(1) [[IN3]], align 4
; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD2]], ptr addrspace(1) [[OUT2]], align 4
; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD3]], ptr addrspace(1) [[OUT3_LOAD]], align 4
; PRELOAD-DEFAULT-NEXT: ret void
;
%load = load i32, ptr addrspace(1) %in
%load1 = load i32, ptr addrspace(1) %in1
@@ -254,14 +254,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_inreg_offset(ptr ad
; PRELOAD-3-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
; PRELOAD-3-NEXT: ret void
;
; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset
; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
; PRELOAD-8-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
; PRELOAD-8-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
; PRELOAD-8-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
; PRELOAD-8-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-8-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
; PRELOAD-8-NEXT: ret void
; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset
; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
; PRELOAD-DEFAULT-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
; PRELOAD-DEFAULT-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
; PRELOAD-DEFAULT-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
; PRELOAD-DEFAULT-NEXT: ret void
;
%load = load i32, ptr addrspace(1) %in
%load1 = load i32, ptr addrspace(1) %in1
@@ -312,14 +312,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_inreg_offset_two_se
; PRELOAD-3-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
; PRELOAD-3-NEXT: ret void
;
; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset_two_sequence
; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
; PRELOAD-8-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_TWO_SEQUENCE_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
; PRELOAD-8-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
; PRELOAD-8-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
; PRELOAD-8-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-8-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
; PRELOAD-8-NEXT: ret void
; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset_two_sequence
; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
; PRELOAD-DEFAULT-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_TWO_SEQUENCE_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
; PRELOAD-DEFAULT-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
; PRELOAD-DEFAULT-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
; PRELOAD-DEFAULT-NEXT: ret void
;
%load = load i32, ptr addrspace(1) %in
%load1 = load i32, ptr addrspace(1) %in1
@@ -385,16 +385,16 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_misaligned(i16 %arg
; PRELOAD-3-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1_LOAD]], align 4
; PRELOAD-3-NEXT: ret void
;
; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_misaligned
; PRELOAD-8-SAME: (i16 inreg [[ARG0:%.*]], ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
; PRELOAD-8-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_MISALIGNED_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(40) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
; PRELOAD-8-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
; PRELOAD-8-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
; PRELOAD-8-NEXT: [[EXT:%.*]] = zext i16 [[ARG0]] to i32
; PRELOAD-8-NEXT: [[ADD:%.*]] = add i32 [[LOAD]], [[EXT]]
; PRELOAD-8-NEXT: store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-8-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
; PRELOAD-8-NEXT: ret void
; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_misaligned
; PRELOAD-DEFAULT-SAME: (i16 inreg [[ARG0:%.*]], ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
; PRELOAD-DEFAULT-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_MISALIGNED_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(40) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
; PRELOAD-DEFAULT-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
; PRELOAD-DEFAULT-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
; PRELOAD-DEFAULT-NEXT: [[EXT:%.*]] = zext i16 [[ARG0]] to i32
; PRELOAD-DEFAULT-NEXT: [[ADD:%.*]] = add i32 [[LOAD]], [[EXT]]
; PRELOAD-DEFAULT-NEXT: store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
; PRELOAD-DEFAULT-NEXT: ret void
;
%load = load i32, ptr addrspace(1) %in
%load1 = load i32, ptr addrspace(1) %in1
@@ -450,14 +450,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_i16_i16(i16 %arg0,
; PRELOAD-3-NEXT: store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-3-NEXT: ret void
;
; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_i16_i16
; PRELOAD-8-SAME: (i16 inreg [[ARG0:%.*]], i16 inreg [[ARG1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0]] {
; PRELOAD-8-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_I16_I16_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
; PRELOAD-8-NEXT: [[EXT:%.*]] = zext i16 [[ARG0]] to i32
; PRELOAD-8-NEXT: [[EXT1:%.*]] = zext i16 [[ARG1]] to i32
; PRELOAD-8-NEXT: [[ADD:%.*]] = add i32 [[EXT]], [[EXT1]]
; PRELOAD-8-NEXT: store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-8-NEXT: ret void
; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_i16_i16
; PRELOAD-DEFAULT-SAME: (i16 inreg [[ARG0:%.*]], i16 inreg [[ARG1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0]] {
; PRELOAD-DEFAULT-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_I16_I16_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
; PRELOAD-DEFAULT-NEXT: [[EXT:%.*]] = zext i16 [[ARG0]] to i32
; PRELOAD-DEFAULT-NEXT: [[EXT1:%.*]] = zext i16 [[ARG1]] to i32
; PRELOAD-DEFAULT-NEXT: [[ADD:%.*]] = add i32 [[EXT]], [[EXT1]]
; PRELOAD-DEFAULT-NEXT: store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-DEFAULT-NEXT: ret void
;
%ext = zext i16 %arg0 to i32
%ext1 = zext i16 %arg1 to i32