Conversation

@kerbowa (Member) commented Oct 1, 2024

The easiest way to enable this by default without requiring a bunch of test updates is to do it in the attributor, while ensuring that the old CL option works as before and can override the default behavior.

Note that the old restriction to cap preloaded arguments at the max number of user SGPRs for the GPU has been removed since it didn't make much sense. It is possible to have sub-dword arguments that add up to more than the number of user SGPRs, and later passes that are concerned with actually allocating preload SGPRs will always check the actual number of registers that are available.

In this patch we add inreg by default to the maximum number of leading arguments that could take up all available user SGPRs. The actual number of arguments that are preloaded will always be the same or fewer.

@llvmbot (Member) commented Oct 1, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Austin Kerbow (kerbowa)


Full diff: https://github.com/llvm/llvm-project/pull/110691.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp (+41-4)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll (+64-64)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll (+1-1)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
index 687a7339da379d..611a5ff5fa4574 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
@@ -1014,12 +1014,49 @@ struct AAAMDGPUNoAGPR
 
 const char AAAMDGPUNoAGPR::ID = 0;
 
+static unsigned getMaxNumPreloadArgs(const Function &F, const DataLayout &DL,
+                                     const TargetMachine &TM) {
+  const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
+  unsigned Offset = 0;
+  unsigned ArgsToPreload = 0;
+  for (const auto &Arg : F.args()) {
+    if (Arg.hasByRefAttr())
+      break;
+
+    Type *Ty = Arg.getType();
+    Align ArgAlign = DL.getABITypeAlign(Ty);
+    auto Size = DL.getTypeAllocSize(Ty);
+    Offset = alignTo(Offset, ArgAlign);
+    if (((Offset + Size) / 4) > ST.getMaxNumUserSGPRs())
+      break;
+
+    Offset += Size;
+    ArgsToPreload++;
+  }
+
+  return ArgsToPreload;
+}
+
 static void addPreloadKernArgHint(Function &F, TargetMachine &TM) {
   const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
-  for (unsigned I = 0;
-       I < F.arg_size() &&
-       I < std::min(KernargPreloadCount.getValue(), ST.getMaxNumUserSGPRs());
-       ++I) {
+  if (!ST.hasKernargPreload())
+    return;
+
+  // Enable kernarg preloading by default on GFX940+.
+  size_t PreloadCount;
+  if (KernargPreloadCount.getNumOccurrences() > 0) {
+    // Override default behavior if CL option is present.
+    PreloadCount = std::min<size_t>(KernargPreloadCount, F.arg_size());
+  } else {
+    // Defaults with no CL option.
+    if (ST.hasGFX940Insts())
+      PreloadCount =
+          getMaxNumPreloadArgs(F, F.getParent()->getDataLayout(), TM);
+    else
+      PreloadCount = 0;
+  }
+
+  for (unsigned I = 0; I < PreloadCount; ++I) {
     Argument &Arg = *F.getArg(I);
     // Check for incompatible attributes.
     if (Arg.hasByRefAttr() || Arg.hasNestAttr())
diff --git a/llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll b/llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll
index ab0fb7584d50ce..3a5e72b0408bf7 100644
--- a/llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll
+++ b/llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --function-signature
-; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -S < %s | FileCheck -check-prefix=NO-PRELOAD %s
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=0 -S < %s | FileCheck -check-prefix=NO-PRELOAD %s
 ; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=1 -S < %s | FileCheck -check-prefix=PRELOAD-1 %s
 ; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=3 -S < %s | FileCheck -check-prefix=PRELOAD-3 %s
-; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=8 -S < %s | FileCheck -check-prefix=PRELOAD-8 %s
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -S < %s | FileCheck -check-prefix=PRELOAD-DEFAULT %s
 
 define amdgpu_kernel void @test_preload_IR_lowering_kernel_2(ptr addrspace(1) %in, ptr addrspace(1) %out) #0 {
 ; NO-PRELOAD-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_2
@@ -32,12 +32,12 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_2(ptr addrspace(1) %i
 ; PRELOAD-3-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
 ; PRELOAD-3-NEXT:    ret void
 ;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_2
-; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
-; PRELOAD-8-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_2_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT:    ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_2
+; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
+; PRELOAD-DEFAULT-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_2_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT:    ret void
 ;
   %load = load i32, ptr addrspace(1) %in
   store i32 %load, ptr addrspace(1) %out
@@ -88,14 +88,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4(ptr addrspace(1) %i
 ; PRELOAD-3-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1_LOAD]], align 4
 ; PRELOAD-3-NEXT:    ret void
 ;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4
-; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
-; PRELOAD-8-NEXT:    ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4
+; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
+; PRELOAD-DEFAULT-NEXT:    ret void
 ;
   %load = load i32, ptr addrspace(1) %in
   %load1 = load i32, ptr addrspace(1) %in1
@@ -184,20 +184,20 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_8(ptr addrspace(1) %i
 ; PRELOAD-3-NEXT:    store i32 [[LOAD3]], ptr addrspace(1) [[OUT3_LOAD]], align 4
 ; PRELOAD-3-NEXT:    ret void
 ;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_8
-; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[IN2:%.*]], ptr addrspace(1) inreg [[IN3:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]], ptr addrspace(1) inreg [[OUT2:%.*]], ptr addrspace(1) inreg [[OUT3:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT:    [[OUT3_KERNARG_OFFSET:%.*]] = getelementptr inbounds i8, ptr addrspace(4) [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT]], i64 56
-; PRELOAD-8-NEXT:    [[OUT3_LOAD:%.*]] = load ptr addrspace(1), ptr addrspace(4) [[OUT3_KERNARG_OFFSET]], align 8, !invariant.load [[META0:![0-9]+]]
-; PRELOAD-8-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
-; PRELOAD-8-NEXT:    [[LOAD2:%.*]] = load i32, ptr addrspace(1) [[IN2]], align 4
-; PRELOAD-8-NEXT:    [[LOAD3:%.*]] = load i32, ptr addrspace(1) [[IN3]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD2]], ptr addrspace(1) [[OUT2]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD3]], ptr addrspace(1) [[OUT3_LOAD]], align 4
-; PRELOAD-8-NEXT:    ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_8
+; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[IN2:%.*]], ptr addrspace(1) inreg [[IN3:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]], ptr addrspace(1) inreg [[OUT2:%.*]], ptr addrspace(1) inreg [[OUT3:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT:    [[OUT3_KERNARG_OFFSET:%.*]] = getelementptr inbounds i8, ptr addrspace(4) [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT]], i64 56
+; PRELOAD-DEFAULT-NEXT:    [[OUT3_LOAD:%.*]] = load ptr addrspace(1), ptr addrspace(4) [[OUT3_KERNARG_OFFSET]], align 8, !invariant.load [[META0:![0-9]+]]
+; PRELOAD-DEFAULT-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[LOAD2:%.*]] = load i32, ptr addrspace(1) [[IN2]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[LOAD3:%.*]] = load i32, ptr addrspace(1) [[IN3]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD2]], ptr addrspace(1) [[OUT2]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD3]], ptr addrspace(1) [[OUT3_LOAD]], align 4
+; PRELOAD-DEFAULT-NEXT:    ret void
 ;
   %load = load i32, ptr addrspace(1) %in
   %load1 = load i32, ptr addrspace(1) %in1
@@ -254,14 +254,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_inreg_offset(ptr ad
 ; PRELOAD-3-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
 ; PRELOAD-3-NEXT:    ret void
 ;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset
-; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
-; PRELOAD-8-NEXT:    ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset
+; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
+; PRELOAD-DEFAULT-NEXT:    ret void
 ;
   %load = load i32, ptr addrspace(1) %in
   %load1 = load i32, ptr addrspace(1) %in1
@@ -312,14 +312,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_inreg_offset_two_se
 ; PRELOAD-3-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
 ; PRELOAD-3-NEXT:    ret void
 ;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset_two_sequence
-; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_TWO_SEQUENCE_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
-; PRELOAD-8-NEXT:    ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset_two_sequence
+; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_TWO_SEQUENCE_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
+; PRELOAD-DEFAULT-NEXT:    ret void
 ;
   %load = load i32, ptr addrspace(1) %in
   %load1 = load i32, ptr addrspace(1) %in1
@@ -385,16 +385,16 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_misaligned(i16 %arg
 ; PRELOAD-3-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1_LOAD]], align 4
 ; PRELOAD-3-NEXT:    ret void
 ;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_misaligned
-; PRELOAD-8-SAME: (i16 inreg [[ARG0:%.*]], ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_MISALIGNED_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(40) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
-; PRELOAD-8-NEXT:    [[EXT:%.*]] = zext i16 [[ARG0]] to i32
-; PRELOAD-8-NEXT:    [[ADD:%.*]] = add i32 [[LOAD]], [[EXT]]
-; PRELOAD-8-NEXT:    store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
-; PRELOAD-8-NEXT:    ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_misaligned
+; PRELOAD-DEFAULT-SAME: (i16 inreg [[ARG0:%.*]], ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_MISALIGNED_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(40) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[EXT:%.*]] = zext i16 [[ARG0]] to i32
+; PRELOAD-DEFAULT-NEXT:    [[ADD:%.*]] = add i32 [[LOAD]], [[EXT]]
+; PRELOAD-DEFAULT-NEXT:    store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
+; PRELOAD-DEFAULT-NEXT:    ret void
 ;
   %load = load i32, ptr addrspace(1) %in
   %load1 = load i32, ptr addrspace(1) %in1
@@ -450,14 +450,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_i16_i16(i16 %arg0,
 ; PRELOAD-3-NEXT:    store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
 ; PRELOAD-3-NEXT:    ret void
 ;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_i16_i16
-; PRELOAD-8-SAME: (i16 inreg [[ARG0:%.*]], i16 inreg [[ARG1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_I16_I16_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT:    [[EXT:%.*]] = zext i16 [[ARG0]] to i32
-; PRELOAD-8-NEXT:    [[EXT1:%.*]] = zext i16 [[ARG1]] to i32
-; PRELOAD-8-NEXT:    [[ADD:%.*]] = add i32 [[EXT]], [[EXT1]]
-; PRELOAD-8-NEXT:    store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT:    ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_i16_i16
+; PRELOAD-DEFAULT-SAME: (i16 inreg [[ARG0:%.*]], i16 inreg [[ARG1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_I16_I16_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT:    [[EXT:%.*]] = zext i16 [[ARG0]] to i32
+; PRELOAD-DEFAULT-NEXT:    [[EXT1:%.*]] = zext i16 [[ARG1]] to i32
+; PRELOAD-DEFAULT-NEXT:    [[ADD:%.*]] = add i32 [[EXT]], [[EXT1]]
+; PRELOAD-DEFAULT-NEXT:    store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT:    ret void
 ;
   %ext = zext i16 %arg0 to i32
   %ext1 = zext i16 %arg1 to i32
diff --git a/llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll b/llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll
index 20edbd6c0d0fa6..bb08aa18cd8dea 100644
--- a/llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll
+++ b/llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll
@@ -95,7 +95,7 @@ define amdgpu_kernel void @test_preload_hint_kernel_18(i32 %0, i64 %1, <2 x floa
 ; PRELOAD-16-NEXT:    ret void
 ;
 ; PRELOAD-20-LABEL: define {{[^@]+}}@test_preload_hint_kernel_18
-; PRELOAD-20-SAME: (i32 inreg [[TMP0:%.*]], i64 inreg [[TMP1:%.*]], <2 x float> inreg [[TMP2:%.*]], ptr inreg [[TMP3:%.*]], i32 inreg [[TMP4:%.*]], i32 inreg [[TMP5:%.*]], i32 inreg [[TMP6:%.*]], i32 inreg [[TMP7:%.*]], i32 inreg [[TMP8:%.*]], i32 inreg [[TMP9:%.*]], i32 inreg [[TMP10:%.*]], i32 inreg [[TMP11:%.*]], i32 inreg [[TMP12:%.*]], i32 inreg [[TMP13:%.*]], i32 inreg [[TMP14:%.*]], i32 inreg [[TMP15:%.*]], i32 [[TMP16:%.*]], i32 [[TMP17:%.*]]) #[[ATTR0]] {
+; PRELOAD-20-SAME: (i32 inreg [[TMP0:%.*]], i64 inreg [[TMP1:%.*]], <2 x float> inreg [[TMP2:%.*]], ptr inreg [[TMP3:%.*]], i32 inreg [[TMP4:%.*]], i32 inreg [[TMP5:%.*]], i32 inreg [[TMP6:%.*]], i32 inreg [[TMP7:%.*]], i32 inreg [[TMP8:%.*]], i32 inreg [[TMP9:%.*]], i32 inreg [[TMP10:%.*]], i32 inreg [[TMP11:%.*]], i32 inreg [[TMP12:%.*]], i32 inreg [[TMP13:%.*]], i32 inreg [[TMP14:%.*]], i32 inreg [[TMP15:%.*]], i32 inreg [[TMP16:%.*]], i32 inreg [[TMP17:%.*]]) #[[ATTR0]] {
 ; PRELOAD-20-NEXT:    ret void
 ;
   ret void

@kerbowa force-pushed the enable-preload-by-default branch from cb3a257 to 564d80b on October 1, 2024 15:42

Type *Ty = Arg.getType();
Align ArgAlign = DL.getABITypeAlign(Ty);
auto Size = DL.getTypeAllocSize(Ty);
Contributor:

The allocation size doesn't directly map to the number of registers. Really this should look at what ComputeValueVTs does to get the number of registers.

I think it would be better to be more precise (and maybe even make the inreg a hard requirement to respect).
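
A rough sketch of what ComputeValueVTs-based counting might look like (the helper name and its placement are assumptions, not part of the posted patch; it relies on llvm/CodeGen/Analysis.h for ComputeValueVTs):

static unsigned getNumRegsForArg(const Function &F, const Argument &Arg,
                                 const GCNSubtarget &ST, const DataLayout &DL) {
  // Legalize the argument type into its constituent value types, then ask the
  // target how many registers each of those occupies under this calling
  // convention, instead of dividing the in-memory alloc size by 4.
  const SITargetLowering &TLI = *ST.getTargetLowering();
  SmallVector<EVT, 4> ValueVTs;
  ComputeValueVTs(TLI, DL, Arg.getType(), ValueVTs);
  unsigned NumRegs = 0;
  for (EVT VT : ValueVTs)
    NumRegs += TLI.getNumRegistersForCallingConv(F.getContext(),
                                                 F.getCallingConv(), VT);
  return NumRegs;
}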

Member Author:

I think the alloc size is correct with respect to the number of user SGPRs that are initially allocated for preloading, since it maps directly onto how the data looks in the kernarg segment, and the number of registers used for preloading is derived from that.

> I think it would be better to be more precise (and maybe even make the inreg a hard requirement to respect)

I sort of agree, but we have to decide what to do when frontends like Triton just add inreg to every argument. Should we remove it in cases where we cannot preload the argument? Print a warning if we cannot preload?

I'm leaning towards the first option, where we remove inreg from arguments that won't actually be preloaded, somewhere like AMDGPULowerKernelArguments after all the attributes are finalized, etc.
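
To make that kernarg-segment mapping concrete, here is a tiny standalone illustration of the same alignTo arithmetic for a hypothetical (i16, i16, ptr addrspace(1)) kernel; the hard-coded sizes and alignments stand in for DL.getTypeAllocSize and DL.getABITypeAlign:

#include <cstdio>

// Same rounding the patch gets from llvm::alignTo.
static unsigned alignTo(unsigned Offset, unsigned Align) {
  return (Offset + Align - 1) / Align * Align;
}

int main() {
  struct { const char *Name; unsigned Size, Align; } Args[] = {
      {"i16 %a", 2, 2}, {"i16 %b", 2, 2}, {"ptr %out", 8, 8}};
  unsigned Offset = 0;
  for (const auto &A : Args) {
    Offset = alignTo(Offset, A.Align);
    std::printf("%-9s at kernarg offset %u\n", A.Name, Offset); // 0, 2, 8
    Offset += A.Size;
  }
  // 16 bytes of kernarg data span 4 dwords, i.e. 4 user SGPRs.
  std::printf("segment size %u bytes -> %u SGPRs\n", Offset, (Offset + 3) / 4);
  return 0;
}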

Contributor:

It's only correct for trivially legal types; it will not be for anything exotic or aggregates. You either need to use ComputeValueVTs or only accept trivial types that map to N registers.

Member Author:

What we are calculating is the size in memory, i.e. the start offset to the ending explicit offset. Isn't this method correct for anything that is not byref?

Do you think we should also consider the attributes that are inferred by the AMDGPUAttributor itself when marking things for preloading here? I.e. add inreg at the very end?

Contributor:

Ideally kernel arguments would only use byref (except for these inreg arguments). Non-byref arguments are really byref, and will be lowered to byref.

If you're making the values inreg, they should be treated more like ordinary calling convention register arguments. It's probably simplest to just filter out types that aren't trivially register types. That's the case for all implicit kernel arguments anyway.
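
A sketch of that filtering (the helper name is made up, and the exact notion of "trivially register-sized" here is an assumption):

static bool isTriviallyRegisterType(Type *Ty) {
  // Scalars, pointers, and fixed-length vectors of them legalize to whole
  // registers; aggregates and scalable vectors are rejected.
  if (auto *VT = dyn_cast<FixedVectorType>(Ty))
    Ty = VT->getElementType();
  return Ty->isIntegerTy() || Ty->isFloatingPointTy() || Ty->isPointerTy();
}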

Comment on lines 1051 to 1052
// Defaults with no CL option.
if (ST.hasGFX940Insts())
Contributor:

Should put this into some kind of other predicate, rather than directly checking the subtarget. Make getMaxNumPreloadArgs return a preload count of 0 if the target doesn't support it?
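
That folding could look like the sketch below, reusing the patch's scanning loop. Which predicate gates the default (a new "preloads kernargs by default" hook vs. hasGFX940Insts, which the patch checks at the call site) is the open question, and the explicit CL-override path would bypass this helper:

static unsigned getMaxNumPreloadArgs(const Function &F, const DataLayout &DL,
                                     const TargetMachine &TM) {
  const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
  // Target query folded into the helper instead of the caller.
  if (!ST.hasKernargPreload() || !ST.hasGFX940Insts())
    return 0;

  unsigned Offset = 0;
  unsigned ArgsToPreload = 0;
  for (const Argument &Arg : F.args()) {
    if (Arg.hasByRefAttr())
      break;
    Type *Ty = Arg.getType();
    Offset = alignTo(Offset, DL.getABITypeAlign(Ty));
    auto Size = DL.getTypeAllocSize(Ty);
    if ((Offset + Size) / 4 > ST.getMaxNumUserSGPRs())
      break;
    Offset += Size;
    ++ArgsToPreload;
  }
  return ArgsToPreload;
}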


// Enable kernarg preloading by default on GFX940+.
size_t PreloadCount;
if (KernargPreloadCount.getNumOccurrences() > 0) {
Contributor:

Should move this to be a pass parameter?

Member Author:

What's the benefit of having the logic outside of the pass and using a pass parameter?

I want to keep it so the flag works as before and users can override the defaults. I would still need the cl opt somewhere.

Contributor:

cl::opts are evil and should mostly be purged. They're only for debugging and shouldn't be given to users as a configuration mechanism. I don't really see why users would want direct control over something so specific either.

Member Author:

If we removed the cl option we would have to give users another way to enable it on gfx90a, but we don't want to enable it by default there.

Contributor:

If it really needs to be a user-exposed flag, there should be an attribute for it with a corresponding clang flag to set it.
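
A sketch of that attribute-driven alternative; the attribute name "amdgpu-kernarg-preload-count" is hypothetical, not an attribute that exists in LLVM today, and a clang flag would be responsible for attaching it to kernels:

  // Inside addPreloadKernArgHint: let an explicit function attribute, set by
  // a frontend flag, request preloading instead of a backend cl::opt.
  size_t PreloadCount = 0;
  Attribute A = F.getFnAttribute("amdgpu-kernarg-preload-count"); // hypothetical
  if (A.isValid())
    A.getValueAsString().getAsInteger(10, PreloadCount);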

Member Author:

I really see where you are coming from, and I agree with you in most circumstances, but this CL option has been used for this purpose since downstream prototypes over a year ago. Now the flag will only be used for a niche case, since some libraries are already using it on gfx90a. I'd like to keep it because I don't actually want it to be a user-facing option. This shouldn't be exposed to the user; it is really only a "user beware" type hidden flag for downstream libraries that are already using it.

PreloadCount = 0;
}

for (unsigned I = 0; I < PreloadCount; ++I) {
Contributor:

The type of I is inconsistent with PreloadCount; make them both size_t or unsigned.
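
For instance, matching PreloadCount's size_t (a one-line change to the loop header quoted above):

for (size_t I = 0; I < PreloadCount; ++I) {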

Contributor:

GCNTargetMachine

@kerbowa force-pushed the enable-preload-by-default branch 2 times, most recently from fe82ef3 to 9818586 on October 7, 2024 20:14
@arsenm (Contributor) left a comment:

Should add some aggregate tests, and just skip them if you only want to handle the simple register cases.

unsigned Offset = 0;
unsigned ArgsToPreload = 0;
for (const auto &Arg : F.args()) {
if (Arg.hasByRefAttr())
Contributor:

As a follow-up, we really should handle the byref case in the future.

@kerbowa force-pushed the enable-preload-by-default branch from 9818586 to 45bd2a4 on November 19, 2024 07:32
@kerbowa closed this Dec 31, 2024
@arsenm (Contributor) commented Jan 6, 2025

Should include rationale for why it was closed.
