Conversation

@kerbowa (Member) commented Oct 1, 2024

The easiest way to enable this by default without requiring a bunch of test updates is to do it in the attributor, while ensuring that the old CL option works as before and can override the default behavior.

Note that the old restriction to cap preloaded arguments at the max number of user SGPRs for the GPU has been removed since it didn't make much sense. It is possible to have sub-dword arguments that add up to more than the number of user SGPRs, and later passes that are concerned with actually allocating preload SGPRs will always check the actual number of registers that are available.

In this patch we add inreg by default to the maximum number of leading arguments that could take up all available user SGPRs. The actual number of arguments that are preloaded will always be the same or fewer.

@llvmbot (Member) commented Oct 1, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Austin Kerbow (kerbowa)


Full diff: https://github.com/llvm/llvm-project/pull/110691.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp (+41-4)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll (+64-64)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll (+1-1)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
index 687a7339da379d..611a5ff5fa4574 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
@@ -1014,12 +1014,49 @@ struct AAAMDGPUNoAGPR
 
 const char AAAMDGPUNoAGPR::ID = 0;
 
+static unsigned getMaxNumPreloadArgs(const Function &F, const DataLayout &DL,
+                                     const TargetMachine &TM) {
+  const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
+  unsigned Offset = 0;
+  unsigned ArgsToPreload = 0;
+  for (const auto &Arg : F.args()) {
+    if (Arg.hasByRefAttr())
+      break;
+
+    Type *Ty = Arg.getType();
+    Align ArgAlign = DL.getABITypeAlign(Ty);
+    auto Size = DL.getTypeAllocSize(Ty);
+    Offset = alignTo(Offset, ArgAlign);
+    if (((Offset + Size) / 4) > ST.getMaxNumUserSGPRs())
+      break;
+
+    Offset += Size;
+    ArgsToPreload++;
+  }
+
+  return ArgsToPreload;
+}
+
 static void addPreloadKernArgHint(Function &F, TargetMachine &TM) {
   const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
-  for (unsigned I = 0;
-       I < F.arg_size() &&
-       I < std::min(KernargPreloadCount.getValue(), ST.getMaxNumUserSGPRs());
-       ++I) {
+  if (!ST.hasKernargPreload())
+    return;
+
+  // Enable kernarg preloading by default on GFX940+.
+  size_t PreloadCount;
+  if (KernargPreloadCount.getNumOccurrences() > 0) {
+    // Override default behavior if CL option is present.
+    PreloadCount = std::min<size_t>(KernargPreloadCount, F.arg_size());
+  } else {
+    // Defaults with no CL option.
+    if (ST.hasGFX940Insts())
+      PreloadCount =
+          getMaxNumPreloadArgs(F, F.getParent()->getDataLayout(), TM);
+    else
+      PreloadCount = 0;
+  }
+
+  for (unsigned I = 0; I < PreloadCount; ++I) {
     Argument &Arg = *F.getArg(I);
     // Check for incompatible attributes.
     if (Arg.hasByRefAttr() || Arg.hasNestAttr())
diff --git a/llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll b/llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll
index ab0fb7584d50ce..3a5e72b0408bf7 100644
--- a/llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll
+++ b/llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --function-signature
-; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -S < %s | FileCheck -check-prefix=NO-PRELOAD %s
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=0 -S < %s | FileCheck -check-prefix=NO-PRELOAD %s
 ; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=1 -S < %s | FileCheck -check-prefix=PRELOAD-1 %s
 ; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=3 -S < %s | FileCheck -check-prefix=PRELOAD-3 %s
-; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -amdgpu-kernarg-preload-count=8 -S < %s | FileCheck -check-prefix=PRELOAD-8 %s
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -amdgpu-attributor -amdgpu-lower-kernel-arguments -S < %s | FileCheck -check-prefix=PRELOAD-DEFAULT %s
 
 define amdgpu_kernel void @test_preload_IR_lowering_kernel_2(ptr addrspace(1) %in, ptr addrspace(1) %out) #0 {
 ; NO-PRELOAD-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_2
@@ -32,12 +32,12 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_2(ptr addrspace(1) %i
 ; PRELOAD-3-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
 ; PRELOAD-3-NEXT:    ret void
 ;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_2
-; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
-; PRELOAD-8-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_2_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT:    ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_2
+; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
+; PRELOAD-DEFAULT-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_2_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT:    ret void
 ;
   %load = load i32, ptr addrspace(1) %in
   store i32 %load, ptr addrspace(1) %out
@@ -88,14 +88,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4(ptr addrspace(1) %i
 ; PRELOAD-3-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1_LOAD]], align 4
 ; PRELOAD-3-NEXT:    ret void
 ;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4
-; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
-; PRELOAD-8-NEXT:    ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4
+; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
+; PRELOAD-DEFAULT-NEXT:    ret void
 ;
   %load = load i32, ptr addrspace(1) %in
   %load1 = load i32, ptr addrspace(1) %in1
@@ -184,20 +184,20 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_8(ptr addrspace(1) %i
 ; PRELOAD-3-NEXT:    store i32 [[LOAD3]], ptr addrspace(1) [[OUT3_LOAD]], align 4
 ; PRELOAD-3-NEXT:    ret void
 ;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_8
-; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[IN2:%.*]], ptr addrspace(1) inreg [[IN3:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]], ptr addrspace(1) inreg [[OUT2:%.*]], ptr addrspace(1) inreg [[OUT3:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT:    [[OUT3_KERNARG_OFFSET:%.*]] = getelementptr inbounds i8, ptr addrspace(4) [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT]], i64 56
-; PRELOAD-8-NEXT:    [[OUT3_LOAD:%.*]] = load ptr addrspace(1), ptr addrspace(4) [[OUT3_KERNARG_OFFSET]], align 8, !invariant.load [[META0:![0-9]+]]
-; PRELOAD-8-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
-; PRELOAD-8-NEXT:    [[LOAD2:%.*]] = load i32, ptr addrspace(1) [[IN2]], align 4
-; PRELOAD-8-NEXT:    [[LOAD3:%.*]] = load i32, ptr addrspace(1) [[IN3]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD2]], ptr addrspace(1) [[OUT2]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD3]], ptr addrspace(1) [[OUT3_LOAD]], align 4
-; PRELOAD-8-NEXT:    ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_8
+; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[IN2:%.*]], ptr addrspace(1) inreg [[IN3:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]], ptr addrspace(1) inreg [[OUT2:%.*]], ptr addrspace(1) inreg [[OUT3:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT:    [[OUT3_KERNARG_OFFSET:%.*]] = getelementptr inbounds i8, ptr addrspace(4) [[TEST_PRELOAD_IR_LOWERING_KERNEL_8_KERNARG_SEGMENT]], i64 56
+; PRELOAD-DEFAULT-NEXT:    [[OUT3_LOAD:%.*]] = load ptr addrspace(1), ptr addrspace(4) [[OUT3_KERNARG_OFFSET]], align 8, !invariant.load [[META0:![0-9]+]]
+; PRELOAD-DEFAULT-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[LOAD2:%.*]] = load i32, ptr addrspace(1) [[IN2]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[LOAD3:%.*]] = load i32, ptr addrspace(1) [[IN3]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD2]], ptr addrspace(1) [[OUT2]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD3]], ptr addrspace(1) [[OUT3_LOAD]], align 4
+; PRELOAD-DEFAULT-NEXT:    ret void
 ;
   %load = load i32, ptr addrspace(1) %in
   %load1 = load i32, ptr addrspace(1) %in1
@@ -254,14 +254,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_inreg_offset(ptr ad
 ; PRELOAD-3-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
 ; PRELOAD-3-NEXT:    ret void
 ;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset
-; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
-; PRELOAD-8-NEXT:    ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset
+; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
+; PRELOAD-DEFAULT-NEXT:    ret void
 ;
   %load = load i32, ptr addrspace(1) %in
   %load1 = load i32, ptr addrspace(1) %in1
@@ -312,14 +312,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_inreg_offset_two_se
 ; PRELOAD-3-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
 ; PRELOAD-3-NEXT:    ret void
 ;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset_two_sequence
-; PRELOAD-8-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_TWO_SEQUENCE_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
-; PRELOAD-8-NEXT:    ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_inreg_offset_two_sequence
+; PRELOAD-DEFAULT-SAME: (ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_INREG_OFFSET_TWO_SEQUENCE_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(32) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
+; PRELOAD-DEFAULT-NEXT:    ret void
 ;
   %load = load i32, ptr addrspace(1) %in
   %load1 = load i32, ptr addrspace(1) %in1
@@ -385,16 +385,16 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_misaligned(i16 %arg
 ; PRELOAD-3-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1_LOAD]], align 4
 ; PRELOAD-3-NEXT:    ret void
 ;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_misaligned
-; PRELOAD-8-SAME: (i16 inreg [[ARG0:%.*]], ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_MISALIGNED_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(40) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
-; PRELOAD-8-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
-; PRELOAD-8-NEXT:    [[EXT:%.*]] = zext i16 [[ARG0]] to i32
-; PRELOAD-8-NEXT:    [[ADD:%.*]] = add i32 [[LOAD]], [[EXT]]
-; PRELOAD-8-NEXT:    store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
-; PRELOAD-8-NEXT:    ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_misaligned
+; PRELOAD-DEFAULT-SAME: (i16 inreg [[ARG0:%.*]], ptr addrspace(1) inreg [[IN:%.*]], ptr addrspace(1) inreg [[IN1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]], ptr addrspace(1) inreg [[OUT1:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_MISALIGNED_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(40) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT:    [[LOAD:%.*]] = load i32, ptr addrspace(1) [[IN]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[LOAD1:%.*]] = load i32, ptr addrspace(1) [[IN1]], align 4
+; PRELOAD-DEFAULT-NEXT:    [[EXT:%.*]] = zext i16 [[ARG0]] to i32
+; PRELOAD-DEFAULT-NEXT:    [[ADD:%.*]] = add i32 [[LOAD]], [[EXT]]
+; PRELOAD-DEFAULT-NEXT:    store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT:    store i32 [[LOAD1]], ptr addrspace(1) [[OUT1]], align 4
+; PRELOAD-DEFAULT-NEXT:    ret void
 ;
   %load = load i32, ptr addrspace(1) %in
   %load1 = load i32, ptr addrspace(1) %in1
@@ -450,14 +450,14 @@ define amdgpu_kernel void @test_preload_IR_lowering_kernel_4_i16_i16(i16 %arg0,
 ; PRELOAD-3-NEXT:    store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
 ; PRELOAD-3-NEXT:    ret void
 ;
-; PRELOAD-8-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_i16_i16
-; PRELOAD-8-SAME: (i16 inreg [[ARG0:%.*]], i16 inreg [[ARG1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0]] {
-; PRELOAD-8-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_I16_I16_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
-; PRELOAD-8-NEXT:    [[EXT:%.*]] = zext i16 [[ARG0]] to i32
-; PRELOAD-8-NEXT:    [[EXT1:%.*]] = zext i16 [[ARG1]] to i32
-; PRELOAD-8-NEXT:    [[ADD:%.*]] = add i32 [[EXT]], [[EXT1]]
-; PRELOAD-8-NEXT:    store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
-; PRELOAD-8-NEXT:    ret void
+; PRELOAD-DEFAULT-LABEL: define {{[^@]+}}@test_preload_IR_lowering_kernel_4_i16_i16
+; PRELOAD-DEFAULT-SAME: (i16 inreg [[ARG0:%.*]], i16 inreg [[ARG1:%.*]], ptr addrspace(1) inreg [[OUT:%.*]]) #[[ATTR0]] {
+; PRELOAD-DEFAULT-NEXT:    [[TEST_PRELOAD_IR_LOWERING_KERNEL_4_I16_I16_KERNARG_SEGMENT:%.*]] = call nonnull align 16 dereferenceable(16) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+; PRELOAD-DEFAULT-NEXT:    [[EXT:%.*]] = zext i16 [[ARG0]] to i32
+; PRELOAD-DEFAULT-NEXT:    [[EXT1:%.*]] = zext i16 [[ARG1]] to i32
+; PRELOAD-DEFAULT-NEXT:    [[ADD:%.*]] = add i32 [[EXT]], [[EXT1]]
+; PRELOAD-DEFAULT-NEXT:    store i32 [[ADD]], ptr addrspace(1) [[OUT]], align 4
+; PRELOAD-DEFAULT-NEXT:    ret void
 ;
   %ext = zext i16 %arg0 to i32
   %ext1 = zext i16 %arg1 to i32
diff --git a/llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll b/llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll
index 20edbd6c0d0fa6..bb08aa18cd8dea 100644
--- a/llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll
+++ b/llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll
@@ -95,7 +95,7 @@ define amdgpu_kernel void @test_preload_hint_kernel_18(i32 %0, i64 %1, <2 x floa
 ; PRELOAD-16-NEXT:    ret void
 ;
 ; PRELOAD-20-LABEL: define {{[^@]+}}@test_preload_hint_kernel_18
-; PRELOAD-20-SAME: (i32 inreg [[TMP0:%.*]], i64 inreg [[TMP1:%.*]], <2 x float> inreg [[TMP2:%.*]], ptr inreg [[TMP3:%.*]], i32 inreg [[TMP4:%.*]], i32 inreg [[TMP5:%.*]], i32 inreg [[TMP6:%.*]], i32 inreg [[TMP7:%.*]], i32 inreg [[TMP8:%.*]], i32 inreg [[TMP9:%.*]], i32 inreg [[TMP10:%.*]], i32 inreg [[TMP11:%.*]], i32 inreg [[TMP12:%.*]], i32 inreg [[TMP13:%.*]], i32 inreg [[TMP14:%.*]], i32 inreg [[TMP15:%.*]], i32 [[TMP16:%.*]], i32 [[TMP17:%.*]]) #[[ATTR0]] {
+; PRELOAD-20-SAME: (i32 inreg [[TMP0:%.*]], i64 inreg [[TMP1:%.*]], <2 x float> inreg [[TMP2:%.*]], ptr inreg [[TMP3:%.*]], i32 inreg [[TMP4:%.*]], i32 inreg [[TMP5:%.*]], i32 inreg [[TMP6:%.*]], i32 inreg [[TMP7:%.*]], i32 inreg [[TMP8:%.*]], i32 inreg [[TMP9:%.*]], i32 inreg [[TMP10:%.*]], i32 inreg [[TMP11:%.*]], i32 inreg [[TMP12:%.*]], i32 inreg [[TMP13:%.*]], i32 inreg [[TMP14:%.*]], i32 inreg [[TMP15:%.*]], i32 inreg [[TMP16:%.*]], i32 inreg [[TMP17:%.*]]) #[[ATTR0]] {
 ; PRELOAD-20-NEXT:    ret void
 ;
   ret void

@kerbowa force-pushed the enable-preload-by-default branch from cb3a257 to 564d80b on October 1, 2024 15:42

Type *Ty = Arg.getType();
Align ArgAlign = DL.getABITypeAlign(Ty);
auto Size = DL.getTypeAllocSize(Ty);
Contributor:

The allocation size doesn't directly map to the number of registers. Really this should look at what ComputeValueVTs does to get the number of registers.

I think it would be better to be more precise (and maybe even make the inreg a hard requirement to respect).
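
A rough sketch of what ComputeValueVTs-based counting might look like (the helper name and its placement are assumptions, not part of the posted patch; it relies on llvm/CodeGen/Analysis.h for ComputeValueVTs):

static unsigned getNumRegsForArg(const Function &F, const Argument &Arg,
                                 const GCNSubtarget &ST, const DataLayout &DL) {
  // Legalize the argument type into its constituent value types, then ask the
  // target how many registers each of those occupies under this calling
  // convention, instead of dividing the in-memory alloc size by 4.
  const SITargetLowering &TLI = *ST.getTargetLowering();
  SmallVector<EVT, 4> ValueVTs;
  ComputeValueVTs(TLI, DL, Arg.getType(), ValueVTs);
  unsigned NumRegs = 0;
  for (EVT VT : ValueVTs)
    NumRegs += TLI.getNumRegistersForCallingConv(F.getContext(),
                                                 F.getCallingConv(), VT);
  return NumRegs;
}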

Member Author:

I think the alloc size is correct with respect to the number of user SGPRs that are initially allocated for preloading, since it maps directly onto how the data looks in the kernarg segment, and the number of registers used for preloading is derived from that.

> I think it would be better to be more precise (and maybe even make the inreg a hard requirement to respect)

I sort of agree, but we have to decide what to do when frontends like Triton just add inreg to every argument. Should we remove it in cases where we cannot preload the argument? Print a warning if we cannot preload?

I'm leaning towards the first option, where we remove inreg from arguments that won't actually be preloaded, somewhere like AMDGPULowerKernelArguments after all the attributes are finalized, etc.
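
To make that kernarg-segment mapping concrete, here is a tiny standalone illustration of the same alignTo arithmetic for a hypothetical (i16, i16, ptr addrspace(1)) kernel; the hard-coded sizes and alignments stand in for DL.getTypeAllocSize and DL.getABITypeAlign:

#include <cstdio>

// Same rounding the patch gets from llvm::alignTo.
static unsigned alignTo(unsigned Offset, unsigned Align) {
  return (Offset + Align - 1) / Align * Align;
}

int main() {
  struct { const char *Name; unsigned Size, Align; } Args[] = {
      {"i16 %a", 2, 2}, {"i16 %b", 2, 2}, {"ptr %out", 8, 8}};
  unsigned Offset = 0;
  for (const auto &A : Args) {
    Offset = alignTo(Offset, A.Align);
    std::printf("%-9s at kernarg offset %u\n", A.Name, Offset); // 0, 2, 8
    Offset += A.Size;
  }
  // 16 bytes of kernarg data span 4 dwords, i.e. 4 user SGPRs.
  std::printf("segment size %u bytes -> %u SGPRs\n", Offset, (Offset + 3) / 4);
  return 0;
}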

Contributor:

It's only correct for trivially legal types; it will not be for anything exotic or aggregates. You either need to use ComputeValueVTs or only accept trivial types that map to N registers.

Member Author:

What we are calculating is the size in memory, i.e. the start offset to the ending explicit offset. Isn't this method correct for anything that is not byref?

Do you think we should also consider the attributes that are inferred by the AMDGPUAttributor itself when marking things for preloading here? I.e. add inreg at the very end?

Contributor:

Ideally kernel arguments would only use byref (except for these inreg arguments). Non-byref arguments are really byref, and will be lowered to byref.

If you're making the values inreg, they should be treated more like ordinary calling convention register arguments. It's probably simplest to just filter out types that aren't trivially register types. That's the case for all implicit kernel arguments anyway.
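
A sketch of that filtering (the helper name is made up, and the exact notion of "trivially register-sized" here is an assumption):

static bool isTriviallyRegisterType(Type *Ty) {
  // Scalars, pointers, and fixed-length vectors of them legalize to whole
  // registers; aggregates and scalable vectors are rejected.
  if (auto *VT = dyn_cast<FixedVectorType>(Ty))
    Ty = VT->getElementType();
  return Ty->isIntegerTy() || Ty->isFloatingPointTy() || Ty->isPointerTy();
}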

Comment on lines 1051 to 1052
// Defaults with no CL option.
if (ST.hasGFX940Insts())
Contributor:

Should put this into some kind of other predicate, rather than directly checking the subtarget. Make getMaxNumPreloadArgs return a preload count of 0 if the target doesn't support it?
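
That folding could look like the sketch below, reusing the patch's scanning loop. Which predicate gates the default (a new "preloads kernargs by default" hook vs. hasGFX940Insts, which the patch checks at the call site) is the open question, and the explicit CL-override path would bypass this helper:

static unsigned getMaxNumPreloadArgs(const Function &F, const DataLayout &DL,
                                     const TargetMachine &TM) {
  const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
  // Target query folded into the helper instead of the caller.
  if (!ST.hasKernargPreload() || !ST.hasGFX940Insts())
    return 0;

  unsigned Offset = 0;
  unsigned ArgsToPreload = 0;
  for (const Argument &Arg : F.args()) {
    if (Arg.hasByRefAttr())
      break;
    Type *Ty = Arg.getType();
    Offset = alignTo(Offset, DL.getABITypeAlign(Ty));
    auto Size = DL.getTypeAllocSize(Ty);
    if ((Offset + Size) / 4 > ST.getMaxNumUserSGPRs())
      break;
    Offset += Size;
    ++ArgsToPreload;
  }
  return ArgsToPreload;
}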


// Enable kernarg preloading by default on GFX940+.
size_t PreloadCount;
if (KernargPreloadCount.getNumOccurrences() > 0) {
Contributor:

Should move this to be a pass parameter?

Member Author:

What's the benefit of having the logic outside of the pass and using a pass parameter?

I want to keep it so the flag works as before and users can override the defaults. I would still need the cl opt somewhere.

Contributor:

cl::opts are evil and should mostly be purged. They're only for debugging and shouldn't be given to users as a configuration mechanism. I don't really see why users would want direct control over something so specific either.

Member Author:

If we removed the cl option we would have to give users another way to enable it on gfx90a, but we don't want to enable it by default there.

Contributor:

If it really needs to be a user-exposed flag, there should be an attribute for it with a corresponding clang flag to set it.
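
A sketch of that attribute-driven alternative; the attribute name "amdgpu-kernarg-preload-count" is hypothetical, not an attribute that exists in LLVM today, and a clang flag would be responsible for attaching it to kernels:

  // Inside addPreloadKernArgHint: let an explicit function attribute, set by
  // a frontend flag, request preloading instead of a backend cl::opt.
  size_t PreloadCount = 0;
  Attribute A = F.getFnAttribute("amdgpu-kernarg-preload-count"); // hypothetical
  if (A.isValid())
    A.getValueAsString().getAsInteger(10, PreloadCount);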

Member Author:

I really see where you are coming from, and I agree with you in most circumstances, but this CL option has been used for this purpose since downstream prototypes over a year ago. Now the flag will only be used for a niche case, since some libraries are already using it on gfx90a. I'd like to keep it because I don't actually want it to be a user-facing option. This shouldn't be exposed to the user; it is really only a "user beware" type hidden flag for downstream libraries that are already using it.

PreloadCount = 0;
}

for (unsigned I = 0; I < PreloadCount; ++I) {
Contributor:

The type of I is inconsistent with PreloadCount; make them both size_t or unsigned.
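
For instance, matching PreloadCount's size_t (a one-line change to the loop header quoted above):

for (size_t I = 0; I < PreloadCount; ++I) {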

Contributor:

GCNTargetMachine

@kerbowa force-pushed the enable-preload-by-default branch 2 times, most recently from fe82ef3 to 9818586 on October 7, 2024 20:14
@arsenm (Contributor) left a comment:

Should add some aggregate tests, and just skip them if you only want to handle the simple register cases.

unsigned Offset = 0;
unsigned ArgsToPreload = 0;
for (const auto &Arg : F.args()) {
if (Arg.hasByRefAttr())
Contributor:

As a follow-up, we really should handle the byref case in the future.

@kerbowa force-pushed the enable-preload-by-default branch from 9818586 to 45bd2a4 on November 19, 2024 07:32
@kerbowa closed this Dec 31, 2024
@arsenm (Contributor) commented Jan 6, 2025

Should include rationale for why it was closed.
