Conversation

@hjagasiaAMD
Contributor

Align small loops aggressively to 32 bytes and larger loops to 16 bytes

Align small loops aggressively to 32 bytes and larger loops to 16 bytes
@github-actions

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" it with a comment that says "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Aug 11, 2025

@llvm/pr-subscribers-llvm-globalisel

Author: None (hjagasiaAMD)

Changes

Align small loops aggressively to 32 bytes and larger loops to 16 bytes


Patch is 544.70 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/153065.diff

118 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.td (+14-4)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h (+3)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+40-21)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+40-14)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+10-18)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-nand.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/atomics-system-scope.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation-gfx10-branch-offset-bug.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation-gfx1250.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll (+8-67)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll (+8-41)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll (+8-41)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/bug-sdag-emitcopyfromreg.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-smem.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/codegen-prepare-addrspacecast-non-null.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/combine-add-zext-xor.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/copy-to-reg-frameindex.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dag-divergence-atomic.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/div_i128.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/exec-mask-opt-cannot-create-empty-or-backward-segment.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-phi-block-end-iterator-debugloc.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-phi-regression-issue130646-issue130119.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll (+8-100)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-saddr-load.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system_noprivate.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fold-fabs.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll (+8-112)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/global-load-saddr-to-vaddr.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics-min-max-system.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-load.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/idiv-licm.ll (+16)
  • (modified) llvm/test/CodeGen/AMDGPU/iglp-no-clobber.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-waitcnts-crash.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_waitcnt_for_precise_memory.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/issue130120-eliminate-frame-index.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/issue139317-bad-opsel-reg-sequence-fold.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.pops.exiting.wave.id.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wqm.demote.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-live-out-copy-undef-subrange.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-on-function-argument.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-prefetch-data.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-prefetch.ll (+19-4)
  • (modified) llvm/test/CodeGen/AMDGPU/loop_header_nopred.mir (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/mdt-preserving-crash.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll (+3-21)
  • (modified) llvm/test/CodeGen/AMDGPU/memmove-var-size.ll (+1-48)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-loop.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/mul24-pass-ordering.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/noclobber-barrier.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/rem_i128.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/sdwa-peephole.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/select-undef.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/sgpr-to-vreg1-copy.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/should-not-hoist-set-inactive.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/structurize-hoist.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/swdev380865.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/transform-block-with-return-to-epilog.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/uniform-select.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/v_swap_b16.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-liverange.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/wave32.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/while-break.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/wqm.ll (+7)
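The heuristic named in the title can be sketched as a small size-driven decision: loops at or below a small-size cutoff get the full 32-byte preferred alignment, larger loops get a relaxed 16-byte alignment. The following Python model is purely illustrative; the function name and the exact 32-byte cutoff are assumptions for exposition, not the PR's C++ code.

```python
# Illustrative model of the loop-alignment heuristic from the PR title:
# small loops are aligned aggressively to 32 bytes, larger loops fall
# back to 16 bytes (half the preferred alignment).

PREF_LOOP_ALIGN = 32  # bytes; corresponds to FeaturePrefLoopAlign32B (log2 = 5)

def pref_loop_alignment(loop_size_bytes: int) -> int:
    """Return the preferred alignment (bytes) for a loop of the given size."""
    if loop_size_bytes <= 32:
        return PREF_LOOP_ALIGN           # align small loops aggressively
    return max(PREF_LOOP_ALIGN // 2, 1)  # larger loops: halved alignment

print(pref_loop_alignment(24))   # a small loop
print(pref_loop_alignment(200))  # a large loop
```

The halving mirrors the `PrefAlign.value() >> 1` fallback that appears in the SIISelLowering.cpp hunk below for loops that exceed the size thresholds.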
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index f26639847be75..1d2c07b4deea9 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -1431,6 +1431,14 @@ def FeatureDisable : SubtargetFeature<"",
   "Dummy feature to disable assembler instructions"
 >;
 
+// GFX9 and higher targets have a 16-dword instruction buffer and a per-SQ
+// instruction store which can supply 4 dwords to each of the 2 waves per
+// cycle. Prefer 32-byte loop alignment (larger loops fall back to 16 bytes).
+def FeaturePrefLoopAlign32B : SubtargetFeature<"loop-align",
+  "PrefLoopAlignmentLog2",
+  "5",
+  "Prefer 32-byte alignment for loops">;
+
 //===----------------------------------------------------------------------===//
 
 class GCNSubtargetFeatureGeneration <string Value,
@@ -1495,7 +1503,8 @@ def FeatureGFX9 : GCNSubtargetFeatureGeneration<"GFX9",
    FeatureA16, FeatureSMemTimeInst, FeatureFastDenormalF32, FeatureSupportsXNACK,
    FeatureUnalignedBufferAccess, FeatureUnalignedScratchAccess,
    FeatureUnalignedDSAccess, FeatureNegativeScratchOffsetBug, FeatureGWS,
-   FeatureDefaultComponentZero,FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad
+   FeatureDefaultComponentZero,FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad,
+   FeaturePrefLoopAlign32B
   ]
 >;
 
@@ -1519,7 +1528,7 @@ def FeatureGFX10 : GCNSubtargetFeatureGeneration<"GFX10",
    FeatureDefaultComponentZero, FeatureMaxHardClauseLength63,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF64GlobalInsts,
    FeatureAtomicFMinFMaxF32FlatInsts, FeatureAtomicFMinFMaxF64FlatInsts,
-   FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad
+   FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad, FeaturePrefLoopAlign32B
   ]
 >;
 
@@ -1542,7 +1551,7 @@ def FeatureGFX11 : GCNSubtargetFeatureGeneration<"GFX11",
    FeatureUnalignedDSAccess, FeatureGDS, FeatureGWS,
    FeatureDefaultComponentZero, FeatureMaxHardClauseLength32,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF32FlatInsts,
-   FeatureVmemWriteVgprInOrder
+   FeatureVmemWriteVgprInOrder, FeaturePrefLoopAlign32B
   ]
 >;
 
@@ -1566,7 +1575,8 @@ def FeatureGFX12 : GCNSubtargetFeatureGeneration<"GFX12",
    FeatureDefaultComponentBroadcast, FeatureMaxHardClauseLength32,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF32FlatInsts,
    FeatureIEEEMinimumMaximumInsts, FeatureMinimum3Maximum3F32,
-   FeatureMinimum3Maximum3F16, FeatureAgentScopeFineGrainedRemoteMemoryAtomics
+   FeatureMinimum3Maximum3F16, FeatureAgentScopeFineGrainedRemoteMemoryAtomics,
+   FeaturePrefLoopAlign32B
   ]
 >;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 64e68ab7d753c..5124aa04550d3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -628,6 +628,7 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
   setMaxAtomicSizeInBitsSupported(64);
   setMaxDivRemBitWidthSupported(64);
   setMaxLargeFPConvertBitWidthSupported(64);
+  setPrefLoopAlignment(Align(1ULL << Subtarget->getPrefLoopAlignment()));
 }
 
 bool AMDGPUTargetLowering::mayIgnoreSignedZero(SDValue Op) const {
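The hunk above stores the preference as a log2 value (`PrefLoopAlignmentLog2`) and expands it to a byte alignment with a shift, as in `setPrefLoopAlignment(Align(1ULL << Subtarget->getPrefLoopAlignment()))`. A quick model of that encoding:

```python
# The subtarget keeps the preferred loop alignment in log2 form and the
# lowering expands it with a shift (1 << log2), matching the
# setPrefLoopAlignment(Align(1ULL << ...)) call in the hunk above.

def align_from_log2(log2_value: int) -> int:
    return 1 << log2_value

# FeaturePrefLoopAlign32B sets the log2 value to 5:
print(align_from_log2(5))  # 32-byte alignment
print(align_from_log2(0))  # default of 1 byte, i.e. no extra alignment
```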
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
index 6878744496cfe..fff803b40c41e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
@@ -80,6 +80,7 @@ class AMDGPUSubtarget {
   unsigned LocalMemorySize = 0;
   unsigned AddressableLocalMemorySize = 0;
   char WavefrontSizeLog2 = 0;
+  unsigned PrefLoopAlignmentLog2 = 0;
 
 public:
   AMDGPUSubtarget(Triple TT);
@@ -377,6 +378,8 @@ class AMDGPUSubtarget {
   uint64_t getExplicitKernArgSize(const Function &F, Align &MaxAlign) const;
   unsigned getKernArgSegmentSize(const Function &F, Align &MaxAlign) const;
 
+  unsigned getPrefLoopAlignment() const { return PrefLoopAlignmentLog2; }
+
   /// \returns Corresponding DWARF register number mapping flavour for the
   /// \p WavefrontSize.
   AMDGPUDwarfFlavour getAMDGPUDwarfFlavour() const;
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index e866bd47e267d..bb49b8e254bbc 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -53,10 +53,14 @@ using namespace llvm::SDPatternMatch;
 
 STATISTIC(NumTailCalls, "Number of tail calls");
 
+static cl::opt<bool> DisableLoopAlignment("amdgpu-disable-loop-alignment",
+                                          cl::desc("Do not align loops"),
+                                          cl::init(false));
+
 static cl::opt<bool>
-    DisableLoopAlignment("amdgpu-disable-loop-alignment",
-                         cl::desc("Do not align and prefetch loops"),
-                         cl::init(false));
+    DisableLoopAlignmentPrefetch("amdgpu-disable-loop-alignment-prefetch",
+                                 cl::desc("Do not align and prefetch loops"),
+                                 cl::init(false));
 
 static cl::opt<bool> UseDivergentRegisterIndexing(
     "amdgpu-use-divergent-register-indexing", cl::Hidden,
@@ -17434,25 +17438,9 @@ Align SITargetLowering::computeKnownAlignForTargetInstr(
 Align SITargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
   const Align PrefAlign = TargetLowering::getPrefLoopAlignment(ML);
   const Align CacheLineAlign = Align(64);
-
-  // Pre-GFX10 target did not benefit from loop alignment
-  if (!ML || DisableLoopAlignment || !getSubtarget()->hasInstPrefetch() ||
-      getSubtarget()->hasInstFwdPrefetchBug())
-    return PrefAlign;
-
-  // On GFX10 I$ is 4 x 64 bytes cache lines.
-  // By default prefetcher keeps one cache line behind and reads two ahead.
-  // We can modify it with S_INST_PREFETCH for larger loops to have two lines
-  // behind and one ahead.
-  // Therefor we can benefit from aligning loop headers if loop fits 192 bytes.
-  // If loop fits 64 bytes it always spans no more than two cache lines and
-  // does not need an alignment.
-  // Else if loop is less or equal 128 bytes we do not need to modify prefetch,
-  // Else if loop is less or equal 192 bytes we need two lines behind.
-
   const SIInstrInfo *TII = getSubtarget()->getInstrInfo();
   const MachineBasicBlock *Header = ML->getHeader();
-  if (Header->getAlignment() != PrefAlign)
+  if (DisableLoopAlignment || Header->getAlignment() > PrefAlign)
     return Header->getAlignment(); // Already processed.
 
   unsigned LoopSize = 0;
@@ -17465,10 +17453,41 @@ Align SITargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
     for (const MachineInstr &MI : *MBB) {
       LoopSize += TII->getInstSizeInBytes(MI);
       if (LoopSize > 192)
-        return PrefAlign;
+        break;
     }
   }
 
+  // Pre-GFX10 targets did not benefit from loop alignment driven by prefetch
+  // considerations.
+  if (!ML || DisableLoopAlignmentPrefetch ||
+      !getSubtarget()->hasInstPrefetch() ||
+      getSubtarget()->hasInstFwdPrefetchBug()) {
+    // Align loops of 32 bytes or less aggressively
+    if (LoopSize <= 32)
+      return PrefAlign;
+    // Align larger loops less aggressively
+    if (!ML->isInnermost())
+      return Header->getAlignment();
+    return (PrefAlign.value() > 1) ? Align(PrefAlign.value() >> 1) : PrefAlign;
+  }
+
+  // On GFX10 I$ is 4 x 64 bytes cache lines.
+  // By default prefetcher keeps one cache line behind and reads two ahead.
+  // We can modify it with S_INST_PREFETCH for larger loops to have two lines
+  // behind and one ahead.
+  // Therefore we can benefit from aligning loop headers if the loop fits in
+  // 192 bytes. If the loop fits in 64 bytes it always spans no more than two
+  // cache lines and does not need alignment. Else if the loop is at most
+  // 128 bytes we do not need to modify prefetch; else if the loop is at most
+  // 192 bytes we need two lines behind.
+
+  // Align larger loops less aggressively
+  if (LoopSize > 192) {
+    if (!ML->isInnermost())
+      return Header->getAlignment();
+    return (PrefAlign.value() > 1) ? Align(PrefAlign.value() >> 1) : PrefAlign;
+  }
+
   if (LoopSize <= 64)
     return PrefAlign;
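The GFX10 tiers described in the comment above (I$ of 4 x 64-byte lines; prefetcher adjustable via S_INST_PREFETCH) can be modeled as a lookup by loop footprint. This is a simplified sketch, not the C++ implementation: it ignores the pre-GFX10 fallback path and the header-alignment bookkeeping, and the tuple shape is an assumption for illustration.

```python
# Model of the GFX10 loop-alignment tiers: a loop's byte footprint picks
# between "preferred alignment is enough", "cache-line align, default
# prefetch", "cache-line align plus S_INST_PREFETCH change", and the
# relaxed fallback for loops over 192 bytes.

CACHE_LINE = 64  # bytes per I$ line on GFX10

def gfx10_loop_alignment_plan(loop_size: int, pref_align: int = 32):
    """Return (alignment_bytes, needs_prefetch_change) for a loop."""
    if loop_size <= CACHE_LINE:
        # Spans at most two cache lines; preferred alignment suffices.
        return pref_align, False
    if loop_size <= 2 * CACHE_LINE:
        # Up to 128 bytes: align to a cache line, default prefetch is fine.
        return CACHE_LINE, False
    if loop_size <= 3 * CACHE_LINE:
        # Up to 192 bytes: align and keep two lines behind (S_INST_PREFETCH).
        return CACHE_LINE, True
    # Larger than 192 bytes: relaxed (halved) alignment, no prefetch change.
    return max(pref_align // 2, 1), False

print(gfx10_loop_alignment_plan(48))
print(gfx10_loop_alignment_plan(150))
print(gfx10_loop_alignment_plan(400))
```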
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
index 666523c88860c..37590f5a189ea 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
@@ -1,10 +1,10 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=tonga < %s | FileCheck -check-prefix=GFX8 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=hawaii < %s | FileCheck -check-prefix=GFX7 %s
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
index 351502816ae6e..f77ff18203122 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
@@ -1,10 +1,10 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=tonga < %s | FileCheck -check-prefix=GFX8 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=hawaii < %s | FileCheck -check-prefix=GFX7 %s
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
index ff26ea21390e2..4666e7f26a297 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 ; Divergent phis that don't require lowering using lane mask merging
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll
index a8a75cd2ffaa8..d36725eb8dee9 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
 
 ; This file contains various tests that have divergent i1s used outside of
 ; the loop. These are lane masks is sgpr and need to have correct value in
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
index fd08ab88990ed..a3197d9239bcd 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 ; Simples case, if - then, that requires lane mask merging,
 ; %phi lane mask will hold %val_A at %A. Lanes that are active in %B
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll
index d13d6a19d332a..e27ee8b94416d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 define void @temporal_divergent_i1_phi(float %val, ptr %addr) {
 ; GFX10-LABEL: temporal_divergent_i1_phi:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll
index d4e5487828c48..bf5c3cfab4d03 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 define void @temporal_divergent_i32(float %val, ptr %addr) {
 ; GFX10-LABEL: temporal_divergent_i32:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll
index 6148bc2d5ae6e..12f10c78b603d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s | FileCheck %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s | FileCheck %s
 
 ; Make sure the branch targets are correct after lowering llvm.amdgcn.if
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
index 8a53c862371cf..36ffb47ef7ae5 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
 
 define amdgpu_ps void @static_exact(float %arg0, float %arg1) {
 ; SI-LABEL: static_exact:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
index 5240bf4f3a1d7..aea5eec52d076 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
 
 ; if instruction is uniform and there is available instruction, select SALU instruction
 define amdgpu_ps void @uniform_in_vgpr(float inreg %a, i32 inreg %b, ptr addrspace(1) %ptr) {
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
index 9c2fabce4bcde..c5cfb9eae7051 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2
-; RUN: llc -global-isel -mtriple=amdgcn -m...
[truncated]

@llvmbot
Member

llvmbot commented Aug 11, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: None (hjagasiaAMD)

  • (modified) llvm/test/CodeGen/AMDGPU/combine-add-zext-xor.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/copy-to-reg-frameindex.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dag-divergence-atomic.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/div_i128.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/exec-mask-opt-cannot-create-empty-or-backward-segment.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-phi-block-end-iterator-debugloc.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-phi-regression-issue130646-issue130119.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll (+8-100)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-saddr-load.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system_noprivate.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fold-fabs.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll (+8-112)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/global-load-saddr-to-vaddr.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics-min-max-system.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-load.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/idiv-licm.ll (+16)
  • (modified) llvm/test/CodeGen/AMDGPU/iglp-no-clobber.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-waitcnts-crash.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_waitcnt_for_precise_memory.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/issue130120-eliminate-frame-index.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/issue139317-bad-opsel-reg-sequence-fold.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.pops.exiting.wave.id.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wqm.demote.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-live-out-copy-undef-subrange.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-on-function-argument.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-prefetch-data.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-prefetch.ll (+19-4)
  • (modified) llvm/test/CodeGen/AMDGPU/loop_header_nopred.mir (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/mdt-preserving-crash.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll (+3-21)
  • (modified) llvm/test/CodeGen/AMDGPU/memmove-var-size.ll (+1-48)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-loop.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/mul24-pass-ordering.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/noclobber-barrier.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/rem_i128.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/sdwa-peephole.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/select-undef.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/sgpr-to-vreg1-copy.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/should-not-hoist-set-inactive.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/structurize-hoist.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/swdev380865.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/transform-block-with-return-to-epilog.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/uniform-select.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/v_swap_b16.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-liverange.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/wave32.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/while-break.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/wqm.ll (+7)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index f26639847be75..1d2c07b4deea9 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -1431,6 +1431,14 @@ def FeatureDisable : SubtargetFeature<"",
   "Dummy feature to disable assembler instructions"
 >;
 
+// GFX-9 & higher targets have a 16-dword Instruction Buffer and per-SQ
+// instruction store which can supply 4 dwords to each of the 2 waves per
+// cycle. Change default alignment to 4 dwords or 16 bytes.
+def FeaturePrefLoopAlign32B : SubtargetFeature<"loop-align",
+  "PrefLoopAlignmentLog2",
+  "5",
+  "Prefer 32-byte alignment for loops">;
+
 //===----------------------------------------------------------------------===//
 
 class GCNSubtargetFeatureGeneration <string Value,
@@ -1495,7 +1503,8 @@ def FeatureGFX9 : GCNSubtargetFeatureGeneration<"GFX9",
    FeatureA16, FeatureSMemTimeInst, FeatureFastDenormalF32, FeatureSupportsXNACK,
    FeatureUnalignedBufferAccess, FeatureUnalignedScratchAccess,
    FeatureUnalignedDSAccess, FeatureNegativeScratchOffsetBug, FeatureGWS,
-   FeatureDefaultComponentZero,FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad
+   FeatureDefaultComponentZero,FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad,
+   FeaturePrefLoopAlign32B
   ]
 >;
 
@@ -1519,7 +1528,7 @@ def FeatureGFX10 : GCNSubtargetFeatureGeneration<"GFX10",
    FeatureDefaultComponentZero, FeatureMaxHardClauseLength63,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF64GlobalInsts,
    FeatureAtomicFMinFMaxF32FlatInsts, FeatureAtomicFMinFMaxF64FlatInsts,
-   FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad
+   FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad, FeaturePrefLoopAlign32B
   ]
 >;
 
@@ -1542,7 +1551,7 @@ def FeatureGFX11 : GCNSubtargetFeatureGeneration<"GFX11",
    FeatureUnalignedDSAccess, FeatureGDS, FeatureGWS,
    FeatureDefaultComponentZero, FeatureMaxHardClauseLength32,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF32FlatInsts,
-   FeatureVmemWriteVgprInOrder
+   FeatureVmemWriteVgprInOrder, FeaturePrefLoopAlign32B
   ]
 >;
 
@@ -1566,7 +1575,8 @@ def FeatureGFX12 : GCNSubtargetFeatureGeneration<"GFX12",
    FeatureDefaultComponentBroadcast, FeatureMaxHardClauseLength32,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF32FlatInsts,
    FeatureIEEEMinimumMaximumInsts, FeatureMinimum3Maximum3F32,
-   FeatureMinimum3Maximum3F16, FeatureAgentScopeFineGrainedRemoteMemoryAtomics
+   FeatureMinimum3Maximum3F16, FeatureAgentScopeFineGrainedRemoteMemoryAtomics,
+   FeaturePrefLoopAlign32B
   ]
 >;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 64e68ab7d753c..5124aa04550d3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -628,6 +628,7 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
   setMaxAtomicSizeInBitsSupported(64);
   setMaxDivRemBitWidthSupported(64);
   setMaxLargeFPConvertBitWidthSupported(64);
+  setPrefLoopAlignment(Align(1ULL << Subtarget->getPrefLoopAlignment()));
 }
 
 bool AMDGPUTargetLowering::mayIgnoreSignedZero(SDValue Op) const {
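
For reference, the feature stores the alignment as a log2 value, and the `setPrefLoopAlignment(Align(1ULL << ...))` call above converts it to a byte alignment with a shift. A minimal standalone sketch of that conversion (the free function and its name are illustrative, not part of the LLVM API):

```cpp
#include <cassert>
#include <cstdint>

// Convert a log2 alignment (as stored in PrefLoopAlignmentLog2) to bytes,
// mirroring the Align(1ULL << Subtarget->getPrefLoopAlignment()) expression.
uint64_t alignBytesFromLog2(unsigned Log2) { return 1ULL << Log2; }
```

With FeaturePrefLoopAlign32B setting the value to 5, this yields a 32-byte preferred alignment; targets without the feature keep the default of 0, i.e. 1-byte alignment.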
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
index 6878744496cfe..fff803b40c41e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
@@ -80,6 +80,7 @@ class AMDGPUSubtarget {
   unsigned LocalMemorySize = 0;
   unsigned AddressableLocalMemorySize = 0;
   char WavefrontSizeLog2 = 0;
+  unsigned PrefLoopAlignmentLog2 = 0;
 
 public:
   AMDGPUSubtarget(Triple TT);
@@ -377,6 +378,8 @@ class AMDGPUSubtarget {
   uint64_t getExplicitKernArgSize(const Function &F, Align &MaxAlign) const;
   unsigned getKernArgSegmentSize(const Function &F, Align &MaxAlign) const;
 
+  unsigned getPrefLoopAlignment() const { return PrefLoopAlignmentLog2; }
+
   /// \returns Corresponding DWARF register number mapping flavour for the
   /// \p WavefrontSize.
   AMDGPUDwarfFlavour getAMDGPUDwarfFlavour() const;
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index e866bd47e267d..bb49b8e254bbc 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -53,10 +53,14 @@ using namespace llvm::SDPatternMatch;
 
 STATISTIC(NumTailCalls, "Number of tail calls");
 
+static cl::opt<bool> DisableLoopAlignment("amdgpu-disable-loop-alignment",
+                                          cl::desc("Do not align loops"),
+                                          cl::init(false));
+
 static cl::opt<bool>
-    DisableLoopAlignment("amdgpu-disable-loop-alignment",
-                         cl::desc("Do not align and prefetch loops"),
-                         cl::init(false));
+    DisableLoopAlignmentPrefetch("amdgpu-disable-loop-alignment-prefetch",
+                                 cl::desc("Do not align and prefetch loops"),
+                                 cl::init(false));
 
 static cl::opt<bool> UseDivergentRegisterIndexing(
     "amdgpu-use-divergent-register-indexing", cl::Hidden,
@@ -17434,25 +17438,9 @@ Align SITargetLowering::computeKnownAlignForTargetInstr(
 Align SITargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
   const Align PrefAlign = TargetLowering::getPrefLoopAlignment(ML);
   const Align CacheLineAlign = Align(64);
-
-  // Pre-GFX10 target did not benefit from loop alignment
-  if (!ML || DisableLoopAlignment || !getSubtarget()->hasInstPrefetch() ||
-      getSubtarget()->hasInstFwdPrefetchBug())
-    return PrefAlign;
-
-  // On GFX10 I$ is 4 x 64 bytes cache lines.
-  // By default prefetcher keeps one cache line behind and reads two ahead.
-  // We can modify it with S_INST_PREFETCH for larger loops to have two lines
-  // behind and one ahead.
-  // Therefor we can benefit from aligning loop headers if loop fits 192 bytes.
-  // If loop fits 64 bytes it always spans no more than two cache lines and
-  // does not need an alignment.
-  // Else if loop is less or equal 128 bytes we do not need to modify prefetch,
-  // Else if loop is less or equal 192 bytes we need two lines behind.
-
   const SIInstrInfo *TII = getSubtarget()->getInstrInfo();
   const MachineBasicBlock *Header = ML->getHeader();
-  if (Header->getAlignment() != PrefAlign)
+  if (DisableLoopAlignment || Header->getAlignment() > PrefAlign)
     return Header->getAlignment(); // Already processed.
 
   unsigned LoopSize = 0;
@@ -17465,10 +17453,41 @@ Align SITargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
     for (const MachineInstr &MI : *MBB) {
       LoopSize += TII->getInstSizeInBytes(MI);
       if (LoopSize > 192)
-        return PrefAlign;
+        break;
     }
   }
 
+  // Pre-GFX10 targets did not benefit from loop alignment driven by prefetch
+  // considerations.
+  if (!ML || DisableLoopAlignmentPrefetch ||
+      !getSubtarget()->hasInstPrefetch() ||
+      getSubtarget()->hasInstFwdPrefetchBug()) {
+    // Align loops of at most 32 bytes aggressively
+    if (LoopSize <= 32)
+      return PrefAlign;
+    // Align larger loops less aggressively
+    if (!ML->isInnermost())
+      return Header->getAlignment();
+    return (PrefAlign.value() > 1) ? Align(PrefAlign.value() >> 1) : PrefAlign;
+  }
+
+  // On GFX10 the I$ consists of 4 x 64-byte cache lines.
+  // By default the prefetcher keeps one cache line behind and reads two
+  // ahead. We can modify it with S_INST_PREFETCH so that larger loops have
+  // two lines behind and one ahead.
+  // Therefore we can benefit from aligning loop headers if the loop fits in
+  // 192 bytes. If the loop fits in 64 bytes it always spans no more than two
+  // cache lines and does not need an alignment.
+  // Otherwise, if the loop is at most 128 bytes we do not need to modify the
+  // prefetch, and if it is at most 192 bytes we need two lines behind.
+
+  // Align larger loops less aggressively
+  if (LoopSize > 192) {
+    if (!ML->isInnermost())
+      return Header->getAlignment();
+    return (PrefAlign.value() > 1) ? Align(PrefAlign.value() >> 1) : PrefAlign;
+  }
+
   if (LoopSize <= 64)
     return PrefAlign;
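
Taken together, the non-prefetch path of the new logic can be summarized by the following standalone model. This is a sketch, not the actual LLVM code: the function name and the plain-unsigned parameters are illustrative stand-ins for the `Align`/`MachineLoop` machinery.

```cpp
#include <cassert>

// Illustrative model of the patch's alignment policy on targets without the
// instruction prefetcher: loops of at most 32 bytes get the full preferred
// alignment, larger innermost loops get half of it, and larger non-innermost
// loops keep whatever alignment they already have.
unsigned chooseLoopAlign(unsigned LoopSizeBytes, bool IsInnermost,
                         unsigned PrefAlignBytes, unsigned CurAlignBytes) {
  if (LoopSizeBytes <= 32)
    return PrefAlignBytes; // small loops: aggressive (e.g. 32-byte) alignment
  if (!IsInnermost)
    return CurAlignBytes; // leave outer loops alone
  // larger innermost loops: halve the preferred alignment, e.g. 32 -> 16
  return PrefAlignBytes > 1 ? PrefAlignBytes / 2 : PrefAlignBytes;
}
```

Under FeaturePrefLoopAlign32B (preferred alignment 32 bytes) this gives 32-byte alignment for small loops and 16-byte alignment for larger innermost loops, matching the PR description.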
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
index 666523c88860c..37590f5a189ea 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
@@ -1,10 +1,10 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=tonga < %s | FileCheck -check-prefix=GFX8 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=hawaii < %s | FileCheck -check-prefix=GFX7 %s
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
index 351502816ae6e..f77ff18203122 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
@@ -1,10 +1,10 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=tonga < %s | FileCheck -check-prefix=GFX8 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=hawaii < %s | FileCheck -check-prefix=GFX7 %s
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
index ff26ea21390e2..4666e7f26a297 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 ; Divergent phis that don't require lowering using lane mask merging
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll
index a8a75cd2ffaa8..d36725eb8dee9 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
 
 ; This file contains various tests that have divergent i1s used outside of
 ; the loop. These are lane masks is sgpr and need to have correct value in
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
index fd08ab88990ed..a3197d9239bcd 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 ; Simples case, if - then, that requires lane mask merging,
 ; %phi lane mask will hold %val_A at %A. Lanes that are active in %B
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll
index d13d6a19d332a..e27ee8b94416d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 define void @temporal_divergent_i1_phi(float %val, ptr %addr) {
 ; GFX10-LABEL: temporal_divergent_i1_phi:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll
index d4e5487828c48..bf5c3cfab4d03 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 define void @temporal_divergent_i32(float %val, ptr %addr) {
 ; GFX10-LABEL: temporal_divergent_i32:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll
index 6148bc2d5ae6e..12f10c78b603d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s | FileCheck %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s | FileCheck %s
 
 ; Make sure the branch targets are correct after lowering llvm.amdgcn.if
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
index 8a53c862371cf..36ffb47ef7ae5 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
 
 define amdgpu_ps void @static_exact(float %arg0, float %arg1) {
 ; SI-LABEL: static_exact:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
index 5240bf4f3a1d7..aea5eec52d076 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
 
 ; if instruction is uniform and there is available instruction, select SALU instruction
 define amdgpu_ps void @uniform_in_vgpr(float inreg %a, i32 inreg %b, ptr addrspace(1) %ptr) {
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
index 9c2fabce4bcde..c5cfb9eae7051 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2
-; RUN: llc -global-isel -mtriple=amdgcn -m...
[truncated]

@ronlieb ronlieb requested review from arsenm, ronlieb and shiltian August 11, 2025 21:00
DisableLoopAlignment("amdgpu-disable-loop-alignment",
cl::desc("Do not align and prefetch loops"),
cl::init(false));
DisableLoopAlignmentPrefetch("amdgpu-disable-loop-alignment-prefetch",
Contributor:

Not sure if that's a good idea to change existing options, as it might have already been used by someone else.

Contributor Author:

will change that, thanks

Contributor:

Nobody should be using backend defined cl::opts

Contributor:

It is not used by "anyone" in the LLVM repo but who knows whether there would be downstream users using it directly. If there is no one even in the downstream, I'd just remove it.

uint64_t getExplicitKernArgSize(const Function &F, Align &MaxAlign) const;
unsigned getKernArgSegmentSize(const Function &F, Align &MaxAlign) const;

unsigned getPrefLoopAlignment() const { return PrefLoopAlignmentLog2; }
Contributor:

Do we really need a target feature for this? Can we just have the getter return the correct value based on gfx version directly below?

Contributor Author:

prefer target feature over a check for every gfx version esp as new versions show up

Contributor:

I'm more wondering if this is a universal property that should have always been done. The prior comment says there's no benefit pre-gfx10, so why is this now needed for gfx9?

Contributor Author:

Didn't see sufficient documentation to confirm this would benefit older targets, but another amdgpu compiler is doing this for all targets. So will make it universal.

Contributor:

The prior comment says there's no benefit pre-gfx10, so why is this now needed for gfx9?

Yes, same doubt.

// GFX-9 & higher targets have a 16-dword Instruction Buffer and per-SQ
// instruction store which can supply 4 dwords to each of the 2 waves per
// cycle. Change default alignment to 4 dwords or 16 bytes.
def FeaturePrefLoopAlign32B : SubtargetFeature<"loop-align",
Contributor:

Include the value in the feature name

"Dummy feature to disable assembler instructions"
>;

// GFX-9 & higher targets have a 16-dword Instruction Buffer and per-SQ
Contributor

Suggested change
// GFX-9 & higher targets have a 16-dword Instruction Buffer and per-SQ
// GFX9 & higher targets have a 16-dword Instruction Buffer and per-SQ

uint64_t getExplicitKernArgSize(const Function &F, Align &MaxAlign) const;
unsigned getKernArgSegmentSize(const Function &F, Align &MaxAlign) const;

unsigned getPrefLoopAlignment() const { return PrefLoopAlignmentLog2; }
Contributor

I'm more wondering if this is a universal property that should have always been done. The prior comment says there's no benefit pre-gfx10, so why is this now needed for gfx9?

DisableLoopAlignment("amdgpu-disable-loop-alignment",
cl::desc("Do not align and prefetch loops"),
cl::init(false));
DisableLoopAlignmentPrefetch("amdgpu-disable-loop-alignment-prefetch",
Contributor

Nobody should be using backend defined cl::opts

Comment on lines +56 to 63
static cl::opt<bool>
DisableAllLoopAlignment("amdgpu-disable-all-loop-alignment",
cl::desc("Do not align loops"), cl::init(false));

static cl::opt<bool>
DisableLoopAlignment("amdgpu-disable-loop-alignment",
cl::desc("Do not align and prefetch loops"),
cl::desc("Do not align loops for prefetch"),
cl::init(false));
Contributor

Can we not add new flags? There is a cost to them and we should be removing many of the ones we already have. Clang already has an explicit metadata driven flag to align or disable alignment of loops.
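The metadata-driven mechanism referred to here is, as I understand it, per-loop `!llvm.loop` metadata that the frontend can emit instead of the backend growing new cl::opts. A rough sketch (the exact metadata string is my recollection, not verified against this tree):

```llvm
; Hypothetical fragment: a per-loop alignment request carried on the loop
; backedge via !llvm.loop metadata, rather than a global backend flag.
br i1 %cond, label %loop, label %exit, !llvm.loop !0

!0 = distinct !{!0, !1}
!1 = !{!"llvm.loop.align", i32 32}
```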

// Align larger loops less aggressively
if (!ML->isInnermost())
return Header->getAlignment();
return (PrefAlign.value() > 1) ? Align(PrefAlign.value() >> 1) : PrefAlign;
Contributor

Can you rewrite this to use Align more directly, this is unnecessarily confusing

@@ -1,5 +1,5 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
; RUN: llc -amdgpu-disable-all-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
Contributor

Definitely should not be bulk adding this flag to tests

Contributor Author

The following tests have been changed and don't use the flag; this should be sufficient testing for this change.

llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
llvm/test/CodeGen/AMDGPU/idiv-licm.ll
llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll
llvm/test/CodeGen/AMDGPU/loop-prefetch.ll
llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll
llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll
llvm/test/CodeGen/AMDGPU/uniform-select.ll
llvm/test/CodeGen/AMDGPU/wqm.ll

I will remove the backend cl::opts and use -align-loops=1 for the remaining 110 tests. I think it's overkill to modify 110 additional tests for alignment.
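For reference, using the generic flag would make a RUN line look roughly like this (a sketch; the triple and CHECK prefix are illustrative):

```llvm
; Sketch of a RUN line using llc's generic flag instead of a backend cl::opt.
; -align-loops=1 overrides the target's preferred loop alignment with 1
; (effectively no extra alignment), so existing CHECK lines stay stable.
; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -align-loops=1 < %s | FileCheck %s
```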

@hjagasiaAMD
Contributor Author

Closing this PR; will open a new one with some performance results.

@hjagasiaAMD hjagasiaAMD deleted the amdgpu-loopalign branch August 26, 2025 01:33