[AMDGPU] Enable kernarg preloading by default on gfx940 #110691
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Austin Kerbow (kerbowa)

Changes

The easiest way to enable this by default without requiring a bunch of test updates is to do it in the attributor, but ensure that the old CL option works as before and can override the default behavior.

Note that the old restriction capping preloaded arguments at the max number of user SGPRs for the GPU has been removed, since it didn't make much sense: it is possible to have sub-dword arguments that add up to more than the number of user SGPRs, and later passes that are concerned with actually allocating preload SGPRs always check the actual number of registers that are available.

In this patch we add inreg by default to the maximum number of arguments that may take up all available user SGPRs. The actual number of arguments that are preloaded will always be the same or less.

Full diff: https://github.com/llvm/llvm-project/pull/110691.diff

3 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
index 687a7339da379d..611a5ff5fa4574 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
@@ -1014,12 +1014,49 @@ struct AAAMDGPUNoAGPR
const char AAAMDGPUNoAGPR::ID = 0;
+static unsigned getMaxNumPreloadArgs(const Function &F, const DataLayout &DL,
+ const TargetMachine &TM) {
+ const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
+ unsigned Offset = 0;
+ unsigned ArgsToPreload = 0;
+ for (const auto &Arg : F.args()) {
+ if (Arg.hasByRefAttr())
+ break;
+
+ Type *Ty = Arg.getType();
+ Align ArgAlign = DL.getABITypeAlign(Ty);
+ auto Size = DL.getTypeAllocSize(Ty);
+ Offset = alignTo(Offset, ArgAlign);
+ if (((Offset + Size) / 4) > ST.getMaxNumUserSGPRs())
+ break;
+
+ Offset += Size;
+ ArgsToPreload++;
+ }
+
+ return ArgsToPreload;
+}
+
static void addPreloadKernArgHint(Function &F, TargetMachine &TM) {
const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
- for (unsigned I = 0;
- I < F.arg_size() &&
- I < std::min(KernargPreloadCount.getValue(), ST.getMaxNumUserSGPRs());
- ++I) {
+ if (!ST.hasKernargPreload())
+ return;
+
+ // Enable kernarg preloading by default on GFX940+.
+ size_t PreloadCount;
+ if (KernargPreloadCount.getNumOccurrences() > 0) {
+    // Override default behavior if CL option is present.
+ PreloadCount = std::min<size_t>(KernargPreloadCount, F.arg_size());
+ } else {
+ // Defaults with no CL option.
+ if (ST.hasGFX940Insts())
+ PreloadCount =
+ getMaxNumPreloadArgs(F, F.getParent()->getDataLayout(), TM);
+ else
+ PreloadCount = 0;
+ }
+
+ for (unsigned I = 0; I < PreloadCount; ++I) {
Argument &Arg = *F.getArg(I);
// Check for incompatible attributes.
if (Arg.hasByRefAttr() || Arg.hasNestAttr())
diff --git a/llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll b/llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll
index ab0fb7584d50ce..3a5e72b0408bf7 100644
--- a/llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll
+++ b/llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll
@@ -1,8 +1,8 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --function-signature
-; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -S < %s | FileCheck -check-prefix=NO-PRELOAD %s
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=0 -S < %s | FileCheck -check-prefix=NO-PRELOAD %s
; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=1 -S < %s | FileCheck -check-prefix=PRELOAD-1 %s
; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=3 -S < %s | FileCheck -check-prefix=PRELOAD-3 %s
-; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=8 -S < %s | FileCheck -check-prefix=PRELOAD-8 %s
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -S < %s | FileCheck -check-prefix=PRELOAD-DEFAULT %s
define amdgpu_kernel void @test_preload_IR_lowering_kernel_2(ptr addrspace(1) %in, ptr addrspace(1) %out) #0 {
; NO-PRELOAD-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_2
@@ -32,12 +32,12 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_2(ptr addrspace(1) %i
; PRELOAD-3-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-3-NEXT: ret void
;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_2
-; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
-; PRELOAD-8-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_2_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT: ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_2
+; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
+; PRELOAD-DEFAULT-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_2_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT: ret void
;
%load = load i32, ptr addrspace(1) %in
store i32 %load, ptr addrspace(1) %out
@@ -88,14 +88,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4(ptr addrspace(1) %i
; PRELOAD-3-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1_LOAD]], align 4
; PRELOAD-3-NEXT: ret void
;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4
-; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
-; PRELOAD-8-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
-; PRELOAD-8-NEXT: ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4
+; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
+; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
+; PRELOAD-DEFAULT-NEXT: ret void
;
%load = load i32, ptr addrspace(1) %in
%load1 = load i32, ptr addrspace(1) %in1
@@ -184,20 +184,20 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_8(ptr addrspace(1) %i
; PRELOAD-3-NEXT: store i32 [[LOAD3]], ptr addrspace(1) [[OUT3_LOAD]], align 4
; PRELOAD-3-NEXT: ret void
;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_8
-; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[IN2:%.*]], ptr addrspace(1) inreg [[IN3:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]], ptr addrspace(1) inreg [[OUT2:%.*]], ptr addrspace(1) inreg [[OUT3:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT: [[OUT3_KERNARG_OFFSET:%.*]] = getelementptr inbounds i8, ptr addrspace(4) [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT]], i64 56
-; PRELOAD-8-NEXT: [[OUT3_LOAD:%.*]] = load ptr addrspace(1), ptr addrspace(4) [[OUT3_KERNARG_OFFSET]], align 8, !invariant.load [[META0:![0-9]+]]
-; PRELOAD-8-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
-; PRELOAD-8-NEXT: [[LOAD2:%.*]] = load i32, ptr addrspace(1) [[IN2]], align 4
-; PRELOAD-8-NEXT: [[LOAD3:%.*]] = load i32, ptr addrspace(1) [[IN3]], align 4
-; PRELOAD-8-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
-; PRELOAD-8-NEXT: store i32 [[LOAD2]], ptr addrspace(1) [[OUT2]], align 4
-; PRELOAD-8-NEXT: store i32 [[LOAD3]], ptr addrspace(1) [[OUT3_LOAD]], align 4
-; PRELOAD-8-NEXT: ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_8
+; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[IN2:%.*]], ptr addrspace(1) inreg [[IN3:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]], ptr addrspace(1) inreg [[OUT2:%.*]], ptr addrspace(1) inreg [[OUT3:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT: [[OUT3_KERNARG_OFFSET:%.*]] = getelementptr inbounds i8, ptr addrspace(4) [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT]], i64 56
+; PRELOAD-DEFAULT-NEXT: [[OUT3_LOAD:%.*]] = load ptr addrspace(1), ptr addrspace(4) [[OUT3_KERNARG_OFFSET]], align 8, !invariant.load [[META0:![0-9]+]]
+; PRELOAD-DEFAULT-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
+; PRELOAD-DEFAULT-NEXT: [[LOAD2:%.*]] = load i32, ptr addrspace(1) [[IN2]], align 4
+; PRELOAD-DEFAULT-NEXT: [[LOAD3:%.*]] = load i32, ptr addrspace(1) [[IN3]], align 4
+; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
+; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD2]], ptr addrspace(1) [[OUT2]], align 4
+; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD3]], ptr addrspace(1) [[OUT3_LOAD]], align 4
+; PRELOAD-DEFAULT-NEXT: ret void
;
%load = load i32, ptr addrspace(1) %in
%load1 = load i32, ptr addrspace(1) %in1
@@ -254,14 +254,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_inreg_offset(ptr ad
; PRELOAD-3-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
; PRELOAD-3-NEXT: ret void
;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset
-; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
-; PRELOAD-8-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
-; PRELOAD-8-NEXT: ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset
+; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
+; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
+; PRELOAD-DEFAULT-NEXT: ret void
;
%load = load i32, ptr addrspace(1) %in
%load1 = load i32, ptr addrspace(1) %in1
@@ -312,14 +312,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_inreg_offset_two_se
; PRELOAD-3-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
; PRELOAD-3-NEXT: ret void
;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset_two_sequence
-; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_TWO_SEQUENCE_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
-; PRELOAD-8-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
-; PRELOAD-8-NEXT: ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset_two_sequence
+; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_TWO_SEQUENCE_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
+; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
+; PRELOAD-DEFAULT-NEXT: ret void
;
%load = load i32, ptr addrspace(1) %in
%load1 = load i32, ptr addrspace(1) %in1
@@ -385,16 +385,16 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_misaligned(i16 %arg
; PRELOAD-3-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1_LOAD]], align 4
; PRELOAD-3-NEXT: ret void
;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_misaligned
-; PRELOAD-8-SAME: (i16 inreg [[ARG0:%.*]], ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_MISALIGNED_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(40) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
-; PRELOAD-8-NEXT: [[EXT:%.*]] = zext i16 [[ARG0]] to i32
-; PRELOAD-8-NEXT: [[ADD:%.*]] = add i32 [[LOAD]], [[EXT]]
-; PRELOAD-8-NEXT: store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
-; PRELOAD-8-NEXT: ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_misaligned
+; PRELOAD-DEFAULT-SAME: (i16 inreg [[ARG0:%.*]], ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_MISALIGNED_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(40) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT: [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT: [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
+; PRELOAD-DEFAULT-NEXT: [[EXT:%.*]] = zext i16 [[ARG0]] to i32
+; PRELOAD-DEFAULT-NEXT: [[ADD:%.*]] = add i32 [[LOAD]], [[EXT]]
+; PRELOAD-DEFAULT-NEXT: store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT: store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
+; PRELOAD-DEFAULT-NEXT: ret void
;
%load = load i32, ptr addrspace(1) %in
%load1 = load i32, ptr addrspace(1) %in1
@@ -450,14 +450,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_i16_i16(i16 %arg0,
; PRELOAD-3-NEXT: store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
; PRELOAD-3-NEXT: ret void
;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_i16_i16
-; PRELOAD-8-SAME: (i16 inreg [[ARG0:%.*]], i16 inreg [[ARG1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_I16_I16_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT: [[EXT:%.*]] = zext i16 [[ARG0]] to i32
-; PRELOAD-8-NEXT: [[EXT1:%.*]] = zext i16 [[ARG1]] to i32
-; PRELOAD-8-NEXT: [[ADD:%.*]] = add i32 [[EXT]], [[EXT1]]
-; PRELOAD-8-NEXT: store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT: ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_i16_i16
+; PRELOAD-DEFAULT-SAME: (i16 inreg [[ARG0:%.*]], i16 inreg [[ARG1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT: [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_I16_I16_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT: [[EXT:%.*]] = zext i16 [[ARG0]] to i32
+; PRELOAD-DEFAULT-NEXT: [[EXT1:%.*]] = zext i16 [[ARG1]] to i32
+; PRELOAD-DEFAULT-NEXT: [[ADD:%.*]] = add i32 [[EXT]], [[EXT1]]
+; PRELOAD-DEFAULT-NEXT: store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT: ret void
;
%ext = zext i16 %arg0 to i32
%ext1 = zext i16 %arg1 to i32
diff --git a/llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll b/llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll
index 20edbd6c0d0fa6..bb08aa18cd8dea 100644
--- a/llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll
+++ b/llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll
@@ -95,7 +95,7 @@ define amdgpu_kernel void @test_preload_hint_kernel_18(i32 %0, i64 %1, <2 x floa
; PRELOAD-16-NEXT: ret void
;
; PRELOAD-20-LABEL: define {{[^@]+}}@test_preload_hint_kernel_18
-; PRELOAD-20-SAME: (i32 inreg [[TMP0:%.*]], i64 inreg [[TMP1:%.*]], <2 x float> inreg [[TMP2:%.*]], ptr inreg [[TMP3:%.*]], i32 inreg [[TMP4:%.*]], i32 inreg [[TMP5:%.*]], i32 inreg [[TMP6:%.*]], i32 inreg [[TMP7:%.*]], i32 inreg [[TMP8:%.*]], i32 inreg [[TMP9:%.*]], i32 inreg [[TMP10:%.*]], i32 inreg [[TMP11:%.*]], i32 inreg [[TMP12:%.*]], i32 inreg [[TMP13:%.*]], i32 inreg [[TMP14:%.*]], i32 inreg [[TMP15:%.*]], i32 [[TMP16:%.*]], i32 [[TMP17:%.*]]) #[[ATTR0]] {
+; PRELOAD-20-SAME: (i32 inreg [[TMP0:%.*]], i64 inreg [[TMP1:%.*]], <2 x float> inreg [[TMP2:%.*]], ptr inreg [[TMP3:%.*]], i32 inreg [[TMP4:%.*]], i32 inreg [[TMP5:%.*]], i32 inreg [[TMP6:%.*]], i32 inreg [[TMP7:%.*]], i32 inreg [[TMP8:%.*]], i32 inreg [[TMP9:%.*]], i32 inreg [[TMP10:%.*]], i32 inreg [[TMP11:%.*]], i32 inreg [[TMP12:%.*]], i32 inreg [[TMP13:%.*]], i32 inreg [[TMP14:%.*]], i32 inreg [[TMP15:%.*]], i32 inreg [[TMP16:%.*]], i32 inreg [[TMP17:%.*]]) #[[ATTR0]] {
; PRELOAD-20-NEXT: ret void
;
ret void
Force-pushed from cb3a257 to 564d80b.
    Type *Ty = Arg.getType();
    Align ArgAlign = DL.getABITypeAlign(Ty);
    auto Size = DL.getTypeAllocSize(Ty);
The allocation size doesn't directly map to number of registers. Really this should look at what ComputeValueVTs does to get the number of registers.
I think it would be better to be more precise (and maybe even make the inreg a hard requirement to respect)
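For reference, a register-accurate count could be built on ComputeValueVTs roughly as in the sketch below; the helper name is made up here, and wiring it into the attributor is not shown.

#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/Analysis.h"
#include "llvm/CodeGen/TargetLowering.h"

// Hypothetical helper: count the registers an argument type legalizes to,
// instead of dividing its alloc size by 4.
static unsigned getNumRegsForType(const TargetLowering &TLI, LLVMContext &Ctx,
                                  const DataLayout &DL, Type *Ty) {
  SmallVector<EVT, 4> ValueVTs;
  // Decompose aggregates and vectors into the EVTs they lower to.
  ComputeValueVTs(TLI, DL, Ty, ValueVTs);
  unsigned NumRegs = 0;
  for (EVT VT : ValueVTs)
    NumRegs += TLI.getNumRegisters(Ctx, VT);
  return NumRegs;
}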
I think the alloc size is correct with respect to the number of user SGPRs that are initially allocated for preloading, since it maps directly onto how the data looks in the kernarg segment, and the number of registers used for preloading is derived from that.
I think it would be better to be more precise (and maybe even make the inreg a hard requirement to respect)
I sort of agree, but we have to decide what to do when frontends like Triton just add inreg to every argument. Should we remove it in cases where we cannot preload the argument? Print a warning if we cannot preload?
I'm leaning towards the first option, where we remove inreg from arguments that won't actually be preloaded somewhere like AMDGPULowerKernelArguments, after all the attributes are finalized, etc.
It's only correct for trivially legal types. It will not be for anything exotic or aggregates. You either need to use ComputeValueVTs or only accept trivial types that map to N registers
What we are calculating is the size in memory, i.e. from the start offset to the ending explicit offset. Isn't this method correct for anything that is not byref?
Do you think we should also consider the attributes that are inferred by the AMDGPUAttributor itself when marking things for preloading here, i.e. add inreg at the very end?
Ideally kernel arguments would only use byref (except for these inreg arguments). Non-byref arguments are really byref, and will be lowered to byref.
If you're making the values inreg, they should be treated more like ordinary calling convention register arguments. It's probably simplest to just filter out types that aren't trivially register types. That's the case for all implicit kernel arguments anyway.
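A sketch of that filter, assuming a predicate along these lines (the name and exact checks are illustrative, not from the patch):

// Accept only scalars and pointers whose bit width equals their alloc size,
// so each argument maps onto whole registers; aggregates, i1, and exotic
// vectors fall through to the normal kernarg-segment load path.
static bool isTriviallyPreloadableType(Type *Ty, const DataLayout &DL) {
  if (!Ty->isIntegerTy() && !Ty->isFloatingPointTy() && !Ty->isPointerTy())
    return false;
  return DL.getTypeSizeInBits(Ty) == DL.getTypeAllocSizeInBits(Ty);
}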
    // Defaults with no CL option.
    if (ST.hasGFX940Insts())
Should put this into some other kind of predicate, rather than directly checking the subtarget. Make getMaxNumPreloadArgs return 0 if the target doesn't support it?
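That refactor could look roughly like this; hasKernargPreloadByDefault() is a hypothetical predicate name used only for illustration:

static unsigned getMaxNumPreloadArgs(const Function &F, const DataLayout &DL,
                                     const TargetMachine &TM) {
  const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
  // Gate the default here so callers never check hasGFX940Insts() directly.
  if (!ST.hasKernargPreloadByDefault())
    return 0;

  unsigned Offset = 0;
  unsigned ArgsToPreload = 0;
  for (const Argument &Arg : F.args()) {
    if (Arg.hasByRefAttr())
      break;
    Type *Ty = Arg.getType();
    // Walk the kernarg segment layout, same as the patch does.
    Offset = alignTo(Offset, DL.getABITypeAlign(Ty));
    auto Size = DL.getTypeAllocSize(Ty);
    if ((Offset + Size) / 4 > ST.getMaxNumUserSGPRs())
      break;
    Offset += Size;
    ++ArgsToPreload;
  }
  return ArgsToPreload;
}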
  // Enable kernarg preloading by default on GFX940+.
  size_t PreloadCount;
  if (KernargPreloadCount.getNumOccurrences() > 0) {
Should move this to be a pass parameter?
What's the benefit of having the logic outside of the pass and using a pass parameter?
I want to keep it so the flag works as before and users can override the defaults. I would still need the cl opt somewhere.
cl::opts are evil and should mostly be purged. They're only for debugging and shouldn't be given to users as a configuration mechanism. I don't really see why users would want direct control over something so specific either.
If we removed the cl option we would have to give users another way to enable it on gfx90a, but we don't want to enable it by default there.
If it really needs to be a user-exposed flag, there should be an attribute for it with a corresponding clang flag to set it.
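For illustration, the attributor side of such an attribute might read it like this; the attribute name "amdgpu-kernarg-preload-count" is hypothetical, not an existing attribute:

// A clang flag would attach "amdgpu-kernarg-preload-count"="N" to kernels
// that opt in; the attributor would then honor it instead of a cl::opt.
size_t PreloadCount = 0;
if (F.hasFnAttribute("amdgpu-kernarg-preload-count")) {
  unsigned Requested = 0;
  // getAsInteger() returns true on parse failure.
  if (!F.getFnAttribute("amdgpu-kernarg-preload-count")
           .getValueAsString()
           .getAsInteger(10, Requested))
    PreloadCount = std::min<size_t>(Requested, F.arg_size());
}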
I really see where you are coming from, and I agree with you in most circumstances, but this CL option has been used for this purpose since downstream prototypes over a year ago. Now the flag will only be used for a niche case, since some libraries are already using it on gfx90a. I'd like to keep it because I don't actually want it to be a user-facing option. This shouldn't be exposed to the user; it is really only a "user beware" type of hidden flag for downstream libraries that are already using it.
    PreloadCount = 0;
  }

  for (unsigned I = 0; I < PreloadCount; ++I) {
I's type is inconsistent with PreloadCount; make them both size_t or unsigned.
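The mechanical fix is just to match the induction variable to PreloadCount's type, e.g.:

for (size_t I = 0; I < PreloadCount; ++I) { // was: unsigned I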
GCNTargetMachine
Force-pushed from fe82ef3 to 9818586.
arsenm left a comment:
Should add some aggregate tests, and just skip them if you only want to handle the simple register cases.
  unsigned Offset = 0;
  unsigned ArgsToPreload = 0;
  for (const auto &Arg : F.args()) {
    if (Arg.hasByRefAttr())
As a follow-up, we really should handle the byref case in the future.
Force-pushed from 9818586 to 45bd2a4.
Should include rationale for why it was closed.