Conversation

@hjagasiaAMD
Contributor

Align small loops aggressively to 32 bytes and larger loops to 16 bytes

Align small loops aggressively to 32 bytes and larger loops to 16 bytes
@github-actions

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" it with a comment that says "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Aug 11, 2025

@llvm/pr-subscribers-llvm-globalisel

Author: None (hjagasiaAMD)

Changes

Align small loops aggressively to 32 bytes and larger loops to 16 bytes


Patch is 544.70 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/153065.diff

118 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.td (+14-4)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h (+3)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+40-21)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+40-14)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+10-18)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-nand.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/atomics-system-scope.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation-gfx10-branch-offset-bug.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation-gfx1250.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll (+8-67)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll (+8-41)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll (+8-41)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/bug-sdag-emitcopyfromreg.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-smem.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/codegen-prepare-addrspacecast-non-null.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/combine-add-zext-xor.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/copy-to-reg-frameindex.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dag-divergence-atomic.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/div_i128.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/exec-mask-opt-cannot-create-empty-or-backward-segment.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-phi-block-end-iterator-debugloc.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-phi-regression-issue130646-issue130119.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll (+8-100)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-saddr-load.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system_noprivate.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fold-fabs.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll (+8-112)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/global-load-saddr-to-vaddr.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics-min-max-system.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-load.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/idiv-licm.ll (+16)
  • (modified) llvm/test/CodeGen/AMDGPU/iglp-no-clobber.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-waitcnts-crash.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_waitcnt_for_precise_memory.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/issue130120-eliminate-frame-index.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/issue139317-bad-opsel-reg-sequence-fold.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.pops.exiting.wave.id.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wqm.demote.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-live-out-copy-undef-subrange.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-on-function-argument.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-prefetch-data.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-prefetch.ll (+19-4)
  • (modified) llvm/test/CodeGen/AMDGPU/loop_header_nopred.mir (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/mdt-preserving-crash.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll (+3-21)
  • (modified) llvm/test/CodeGen/AMDGPU/memmove-var-size.ll (+1-48)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-loop.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/mul24-pass-ordering.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/noclobber-barrier.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/rem_i128.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/sdwa-peephole.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/select-undef.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/sgpr-to-vreg1-copy.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/should-not-hoist-set-inactive.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/structurize-hoist.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/swdev380865.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/transform-block-with-return-to-epilog.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/uniform-select.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/v_swap_b16.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-liverange.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/wave32.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/while-break.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/wqm.ll (+7)
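The heuristic named in the title can be sketched as a small size-driven decision: loops at or below a small-size cutoff get the full 32-byte preferred alignment, larger loops get a relaxed 16-byte alignment. The following Python model is purely illustrative; the function name and the exact 32-byte cutoff are assumptions for exposition, not the PR's C++ code.

```python
# Illustrative model of the loop-alignment heuristic from the PR title:
# small loops are aligned aggressively to 32 bytes, larger loops fall
# back to 16 bytes (half the preferred alignment).

PREF_LOOP_ALIGN = 32  # bytes; corresponds to FeaturePrefLoopAlign32B (log2 = 5)

def pref_loop_alignment(loop_size_bytes: int) -> int:
    """Return the preferred alignment (bytes) for a loop of the given size."""
    if loop_size_bytes <= 32:
        return PREF_LOOP_ALIGN           # align small loops aggressively
    return max(PREF_LOOP_ALIGN // 2, 1)  # larger loops: halved alignment

print(pref_loop_alignment(24))   # a small loop
print(pref_loop_alignment(200))  # a large loop
```

The halving mirrors the `PrefAlign.value() >> 1` fallback that appears in the SIISelLowering.cpp hunk below for loops that exceed the size thresholds.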
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index f26639847be75..1d2c07b4deea9 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -1431,6 +1431,14 @@ def FeatureDisable : SubtargetFeature<"",
   "Dummy feature to disable assembler instructions"
 >;
 
+// GFX9 and higher targets have a 16-dword instruction buffer and a per-SQ
+// instruction store which can supply 4 dwords to each of the 2 waves per
+// cycle. Prefer 32-byte loop alignment (larger loops fall back to 16 bytes).
+def FeaturePrefLoopAlign32B : SubtargetFeature<"loop-align",
+  "PrefLoopAlignmentLog2",
+  "5",
+  "Prefer 32-byte alignment for loops">;
+
 //===----------------------------------------------------------------------===//
 
 class GCNSubtargetFeatureGeneration <string Value,
@@ -1495,7 +1503,8 @@ def FeatureGFX9 : GCNSubtargetFeatureGeneration<"GFX9",
    FeatureA16, FeatureSMemTimeInst, FeatureFastDenormalF32, FeatureSupportsXNACK,
    FeatureUnalignedBufferAccess, FeatureUnalignedScratchAccess,
    FeatureUnalignedDSAccess, FeatureNegativeScratchOffsetBug, FeatureGWS,
-   FeatureDefaultComponentZero,FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad
+   FeatureDefaultComponentZero,FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad,
+   FeaturePrefLoopAlign32B
   ]
 >;
 
@@ -1519,7 +1528,7 @@ def FeatureGFX10 : GCNSubtargetFeatureGeneration<"GFX10",
    FeatureDefaultComponentZero, FeatureMaxHardClauseLength63,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF64GlobalInsts,
    FeatureAtomicFMinFMaxF32FlatInsts, FeatureAtomicFMinFMaxF64FlatInsts,
-   FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad
+   FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad, FeaturePrefLoopAlign32B
   ]
 >;
 
@@ -1542,7 +1551,7 @@ def FeatureGFX11 : GCNSubtargetFeatureGeneration<"GFX11",
    FeatureUnalignedDSAccess, FeatureGDS, FeatureGWS,
    FeatureDefaultComponentZero, FeatureMaxHardClauseLength32,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF32FlatInsts,
-   FeatureVmemWriteVgprInOrder
+   FeatureVmemWriteVgprInOrder, FeaturePrefLoopAlign32B
   ]
 >;
 
@@ -1566,7 +1575,8 @@ def FeatureGFX12 : GCNSubtargetFeatureGeneration<"GFX12",
    FeatureDefaultComponentBroadcast, FeatureMaxHardClauseLength32,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF32FlatInsts,
    FeatureIEEEMinimumMaximumInsts, FeatureMinimum3Maximum3F32,
-   FeatureMinimum3Maximum3F16, FeatureAgentScopeFineGrainedRemoteMemoryAtomics
+   FeatureMinimum3Maximum3F16, FeatureAgentScopeFineGrainedRemoteMemoryAtomics,
+   FeaturePrefLoopAlign32B
   ]
 >;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 64e68ab7d753c..5124aa04550d3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -628,6 +628,7 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
   setMaxAtomicSizeInBitsSupported(64);
   setMaxDivRemBitWidthSupported(64);
   setMaxLargeFPConvertBitWidthSupported(64);
+  setPrefLoopAlignment(Align(1ULL << Subtarget->getPrefLoopAlignment()));
 }
 
 bool AMDGPUTargetLowering::mayIgnoreSignedZero(SDValue Op) const {
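The hunk above stores the preference as a log2 value (`PrefLoopAlignmentLog2`) and expands it to a byte alignment with a shift, as in `setPrefLoopAlignment(Align(1ULL << Subtarget->getPrefLoopAlignment()))`. A quick model of that encoding:

```python
# The subtarget keeps the preferred loop alignment in log2 form and the
# lowering expands it with a shift (1 << log2), matching the
# setPrefLoopAlignment(Align(1ULL << ...)) call in the hunk above.

def align_from_log2(log2_value: int) -> int:
    return 1 << log2_value

# FeaturePrefLoopAlign32B sets the log2 value to 5:
print(align_from_log2(5))  # 32-byte alignment
print(align_from_log2(0))  # default of 1 byte, i.e. no extra alignment
```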
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
index 6878744496cfe..fff803b40c41e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
@@ -80,6 +80,7 @@ class AMDGPUSubtarget {
   unsigned LocalMemorySize = 0;
   unsigned AddressableLocalMemorySize = 0;
   char WavefrontSizeLog2 = 0;
+  unsigned PrefLoopAlignmentLog2 = 0;
 
 public:
   AMDGPUSubtarget(Triple TT);
@@ -377,6 +378,8 @@ class AMDGPUSubtarget {
   uint64_t getExplicitKernArgSize(const Function &F, Align &MaxAlign) const;
   unsigned getKernArgSegmentSize(const Function &F, Align &MaxAlign) const;
 
+  unsigned getPrefLoopAlignment() const { return PrefLoopAlignmentLog2; }
+
   /// \returns Corresponding DWARF register number mapping flavour for the
   /// \p WavefrontSize.
   AMDGPUDwarfFlavour getAMDGPUDwarfFlavour() const;
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index e866bd47e267d..bb49b8e254bbc 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -53,10 +53,14 @@ using namespace llvm::SDPatternMatch;
 
 STATISTIC(NumTailCalls, "Number of tail calls");
 
+static cl::opt<bool> DisableLoopAlignment("amdgpu-disable-loop-alignment",
+                                          cl::desc("Do not align loops"),
+                                          cl::init(false));
+
 static cl::opt<bool>
-    DisableLoopAlignment("amdgpu-disable-loop-alignment",
-                         cl::desc("Do not align and prefetch loops"),
-                         cl::init(false));
+    DisableLoopAlignmentPrefetch("amdgpu-disable-loop-alignment-prefetch",
+                                 cl::desc("Do not align and prefetch loops"),
+                                 cl::init(false));
 
 static cl::opt<bool> UseDivergentRegisterIndexing(
     "amdgpu-use-divergent-register-indexing", cl::Hidden,
@@ -17434,25 +17438,9 @@ Align SITargetLowering::computeKnownAlignForTargetInstr(
 Align SITargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
   const Align PrefAlign = TargetLowering::getPrefLoopAlignment(ML);
   const Align CacheLineAlign = Align(64);
-
-  // Pre-GFX10 target did not benefit from loop alignment
-  if (!ML || DisableLoopAlignment || !getSubtarget()->hasInstPrefetch() ||
-      getSubtarget()->hasInstFwdPrefetchBug())
-    return PrefAlign;
-
-  // On GFX10 I$ is 4 x 64 bytes cache lines.
-  // By default prefetcher keeps one cache line behind and reads two ahead.
-  // We can modify it with S_INST_PREFETCH for larger loops to have two lines
-  // behind and one ahead.
-  // Therefor we can benefit from aligning loop headers if loop fits 192 bytes.
-  // If loop fits 64 bytes it always spans no more than two cache lines and
-  // does not need an alignment.
-  // Else if loop is less or equal 128 bytes we do not need to modify prefetch,
-  // Else if loop is less or equal 192 bytes we need two lines behind.
-
   const SIInstrInfo *TII = getSubtarget()->getInstrInfo();
   const MachineBasicBlock *Header = ML->getHeader();
-  if (Header->getAlignment() != PrefAlign)
+  if (DisableLoopAlignment || Header->getAlignment() > PrefAlign)
     return Header->getAlignment(); // Already processed.
 
   unsigned LoopSize = 0;
@@ -17465,10 +17453,41 @@ Align SITargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
     for (const MachineInstr &MI : *MBB) {
       LoopSize += TII->getInstSizeInBytes(MI);
       if (LoopSize > 192)
-        return PrefAlign;
+        break;
     }
   }
 
+  // Pre-GFX10 targets did not benefit from loop alignment driven by prefetch
+  // considerations.
+  if (!ML || DisableLoopAlignmentPrefetch ||
+      !getSubtarget()->hasInstPrefetch() ||
+      getSubtarget()->hasInstFwdPrefetchBug()) {
+    // Align loops of 32 bytes or less aggressively
+    if (LoopSize <= 32)
+      return PrefAlign;
+    // Align larger loops less aggressively
+    if (!ML->isInnermost())
+      return Header->getAlignment();
+    return (PrefAlign.value() > 1) ? Align(PrefAlign.value() >> 1) : PrefAlign;
+  }
+
+  // On GFX10 I$ is 4 x 64 bytes cache lines.
+  // By default prefetcher keeps one cache line behind and reads two ahead.
+  // We can modify it with S_INST_PREFETCH for larger loops to have two lines
+  // behind and one ahead.
+  // Therefore we can benefit from aligning loop headers if the loop fits in
+  // 192 bytes. If the loop fits in 64 bytes it always spans no more than two
+  // cache lines and does not need alignment. Else if the loop is at most
+  // 128 bytes we do not need to modify prefetch; else if the loop is at most
+  // 192 bytes we need two lines behind.
+
+  // Align larger loops less aggressively
+  if (LoopSize > 192) {
+    if (!ML->isInnermost())
+      return Header->getAlignment();
+    return (PrefAlign.value() > 1) ? Align(PrefAlign.value() >> 1) : PrefAlign;
+  }
+
   if (LoopSize <= 64)
     return PrefAlign;
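The GFX10 tiers described in the comment above (I$ of 4 x 64-byte lines; prefetcher adjustable via S_INST_PREFETCH) can be modeled as a lookup by loop footprint. This is a simplified sketch, not the C++ implementation: it ignores the pre-GFX10 fallback path and the header-alignment bookkeeping, and the tuple shape is an assumption for illustration.

```python
# Model of the GFX10 loop-alignment tiers: a loop's byte footprint picks
# between "preferred alignment is enough", "cache-line align, default
# prefetch", "cache-line align plus S_INST_PREFETCH change", and the
# relaxed fallback for loops over 192 bytes.

CACHE_LINE = 64  # bytes per I$ line on GFX10

def gfx10_loop_alignment_plan(loop_size: int, pref_align: int = 32):
    """Return (alignment_bytes, needs_prefetch_change) for a loop."""
    if loop_size <= CACHE_LINE:
        # Spans at most two cache lines; preferred alignment suffices.
        return pref_align, False
    if loop_size <= 2 * CACHE_LINE:
        # Up to 128 bytes: align to a cache line, default prefetch is fine.
        return CACHE_LINE, False
    if loop_size <= 3 * CACHE_LINE:
        # Up to 192 bytes: align and keep two lines behind (S_INST_PREFETCH).
        return CACHE_LINE, True
    # Larger than 192 bytes: relaxed (halved) alignment, no prefetch change.
    return max(pref_align // 2, 1), False

print(gfx10_loop_alignment_plan(48))
print(gfx10_loop_alignment_plan(150))
print(gfx10_loop_alignment_plan(400))
```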
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
index 666523c88860c..37590f5a189ea 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
@@ -1,10 +1,10 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=tonga < %s | FileCheck -check-prefix=GFX8 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=hawaii < %s | FileCheck -check-prefix=GFX7 %s
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
index 351502816ae6e..f77ff18203122 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
@@ -1,10 +1,10 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=tonga < %s | FileCheck -check-prefix=GFX8 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=hawaii < %s | FileCheck -check-prefix=GFX7 %s
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
index ff26ea21390e2..4666e7f26a297 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 ; Divergent phis that don't require lowering using lane mask merging
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll
index a8a75cd2ffaa8..d36725eb8dee9 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
 
 ; This file contains various tests that have divergent i1s used outside of
 ; the loop. These are lane masks is sgpr and need to have correct value in
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
index fd08ab88990ed..a3197d9239bcd 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 ; Simples case, if - then, that requires lane mask merging,
 ; %phi lane mask will hold %val_A at %A. Lanes that are active in %B
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll
index d13d6a19d332a..e27ee8b94416d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 define void @temporal_divergent_i1_phi(float %val, ptr %addr) {
 ; GFX10-LABEL: temporal_divergent_i1_phi:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll
index d4e5487828c48..bf5c3cfab4d03 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 define void @temporal_divergent_i32(float %val, ptr %addr) {
 ; GFX10-LABEL: temporal_divergent_i32:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll
index 6148bc2d5ae6e..12f10c78b603d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s | FileCheck %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s | FileCheck %s
 
 ; Make sure the branch targets are correct after lowering llvm.amdgcn.if
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
index 8a53c862371cf..36ffb47ef7ae5 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
 
 define amdgpu_ps void @static_exact(float %arg0, float %arg1) {
 ; SI-LABEL: static_exact:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
index 5240bf4f3a1d7..aea5eec52d076 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
 
 ; if instruction is uniform and there is available instruction, select SALU instruction
 define amdgpu_ps void @uniform_in_vgpr(float inreg %a, i32 inreg %b, ptr addrspace(1) %ptr) {
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
index 9c2fabce4bcde..c5cfb9eae7051 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2
-; RUN: llc -global-isel -mtriple=amdgcn -m...
[truncated]

@llvmbot
Member

llvmbot commented Aug 11, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: None (hjagasiaAMD)

  • (modified) llvm/test/CodeGen/AMDGPU/combine-add-zext-xor.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/copy-to-reg-frameindex.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dag-divergence-atomic.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/div_i128.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/exec-mask-opt-cannot-create-empty-or-backward-segment.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-phi-block-end-iterator-debugloc.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-phi-regression-issue130646-issue130119.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll (+8-100)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-saddr-load.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system_noprivate.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fold-fabs.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll (+8-112)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll (+8-76)
  • (modified) llvm/test/CodeGen/AMDGPU/global-load-saddr-to-vaddr.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics-min-max-system.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-load.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/idiv-licm.ll (+16)
  • (modified) llvm/test/CodeGen/AMDGPU/iglp-no-clobber.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-waitcnts-crash.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_waitcnt_for_precise_memory.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/issue130120-eliminate-frame-index.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/issue139317-bad-opsel-reg-sequence-fold.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.pops.exiting.wave.id.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wqm.demote.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll (+8-44)
  • (modified) llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-live-out-copy-undef-subrange.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-on-function-argument.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-prefetch-data.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-prefetch.ll (+19-4)
  • (modified) llvm/test/CodeGen/AMDGPU/loop_header_nopred.mir (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/mdt-preserving-crash.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll (+3-21)
  • (modified) llvm/test/CodeGen/AMDGPU/memmove-var-size.ll (+1-48)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-loop.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/mul24-pass-ordering.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/noclobber-barrier.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/rem_i128.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/sdwa-peephole.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/select-undef.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/sgpr-to-vreg1-copy.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/should-not-hoist-set-inactive.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/structurize-hoist.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/swdev380865.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/transform-block-with-return-to-epilog.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/uniform-select.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/v_swap_b16.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-liverange.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/wave32.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/while-break.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/wqm.ll (+7)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index f26639847be75..1d2c07b4deea9 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -1431,6 +1431,14 @@ def FeatureDisable : SubtargetFeature<"",
   "Dummy feature to disable assembler instructions"
 >;
 
+// GFX-9 & higher targets have a 16-dword Instruction Buffer and per-SQ
+// instruction store which can supply 4 dwords to each of the 2 waves per
+// cycle. Change default alignment to 4 dwords or 16 bytes.
+def FeaturePrefLoopAlign32B : SubtargetFeature<"loop-align",
+  "PrefLoopAlignmentLog2",
+  "5",
+  "Prefer 32-byte alignment for loops">;
+
 //===----------------------------------------------------------------------===//
 
 class GCNSubtargetFeatureGeneration <string Value,
@@ -1495,7 +1503,8 @@ def FeatureGFX9 : GCNSubtargetFeatureGeneration<"GFX9",
    FeatureA16, FeatureSMemTimeInst, FeatureFastDenormalF32, FeatureSupportsXNACK,
    FeatureUnalignedBufferAccess, FeatureUnalignedScratchAccess,
    FeatureUnalignedDSAccess, FeatureNegativeScratchOffsetBug, FeatureGWS,
-   FeatureDefaultComponentZero,FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad
+   FeatureDefaultComponentZero,FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad,
+   FeaturePrefLoopAlign32B
   ]
 >;
 
@@ -1519,7 +1528,7 @@ def FeatureGFX10 : GCNSubtargetFeatureGeneration<"GFX10",
    FeatureDefaultComponentZero, FeatureMaxHardClauseLength63,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF64GlobalInsts,
    FeatureAtomicFMinFMaxF32FlatInsts, FeatureAtomicFMinFMaxF64FlatInsts,
-   FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad
+   FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad, FeaturePrefLoopAlign32B
   ]
 >;
 
@@ -1542,7 +1551,7 @@ def FeatureGFX11 : GCNSubtargetFeatureGeneration<"GFX11",
    FeatureUnalignedDSAccess, FeatureGDS, FeatureGWS,
    FeatureDefaultComponentZero, FeatureMaxHardClauseLength32,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF32FlatInsts,
-   FeatureVmemWriteVgprInOrder
+   FeatureVmemWriteVgprInOrder, FeaturePrefLoopAlign32B
   ]
 >;
 
@@ -1566,7 +1575,8 @@ def FeatureGFX12 : GCNSubtargetFeatureGeneration<"GFX12",
    FeatureDefaultComponentBroadcast, FeatureMaxHardClauseLength32,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF32FlatInsts,
    FeatureIEEEMinimumMaximumInsts, FeatureMinimum3Maximum3F32,
-   FeatureMinimum3Maximum3F16, FeatureAgentScopeFineGrainedRemoteMemoryAtomics
+   FeatureMinimum3Maximum3F16, FeatureAgentScopeFineGrainedRemoteMemoryAtomics,
+   FeaturePrefLoopAlign32B
   ]
 >;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 64e68ab7d753c..5124aa04550d3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -628,6 +628,7 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
   setMaxAtomicSizeInBitsSupported(64);
   setMaxDivRemBitWidthSupported(64);
   setMaxLargeFPConvertBitWidthSupported(64);
+  setPrefLoopAlignment(Align(1ULL << Subtarget->getPrefLoopAlignment()));
 }
 
 bool AMDGPUTargetLowering::mayIgnoreSignedZero(SDValue Op) const {
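
For reference, the feature stores the alignment as a log2 value, and the `setPrefLoopAlignment(Align(1ULL << ...))` call above converts it to a byte alignment with a shift. A minimal standalone sketch of that conversion (the free function and its name are illustrative, not part of the LLVM API):

```cpp
#include <cassert>
#include <cstdint>

// Convert a log2 alignment (as stored in PrefLoopAlignmentLog2) to bytes,
// mirroring the Align(1ULL << Subtarget->getPrefLoopAlignment()) expression.
uint64_t alignBytesFromLog2(unsigned Log2) { return 1ULL << Log2; }
```

With FeaturePrefLoopAlign32B setting the value to 5, this yields a 32-byte preferred alignment; targets without the feature keep the default of 0, i.e. 1-byte alignment.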
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
index 6878744496cfe..fff803b40c41e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
@@ -80,6 +80,7 @@ class AMDGPUSubtarget {
   unsigned LocalMemorySize = 0;
   unsigned AddressableLocalMemorySize = 0;
   char WavefrontSizeLog2 = 0;
+  unsigned PrefLoopAlignmentLog2 = 0;
 
 public:
   AMDGPUSubtarget(Triple TT);
@@ -377,6 +378,8 @@ class AMDGPUSubtarget {
   uint64_t getExplicitKernArgSize(const Function &F, Align &MaxAlign) const;
   unsigned getKernArgSegmentSize(const Function &F, Align &MaxAlign) const;
 
+  unsigned getPrefLoopAlignment() const { return PrefLoopAlignmentLog2; }
+
   /// \returns Corresponding DWARF register number mapping flavour for the
   /// \p WavefrontSize.
   AMDGPUDwarfFlavour getAMDGPUDwarfFlavour() const;
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index e866bd47e267d..bb49b8e254bbc 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -53,10 +53,14 @@ using namespace llvm::SDPatternMatch;
 
 STATISTIC(NumTailCalls, "Number of tail calls");
 
+static cl::opt<bool> DisableLoopAlignment("amdgpu-disable-loop-alignment",
+                                          cl::desc("Do not align loops"),
+                                          cl::init(false));
+
 static cl::opt<bool>
-    DisableLoopAlignment("amdgpu-disable-loop-alignment",
-                         cl::desc("Do not align and prefetch loops"),
-                         cl::init(false));
+    DisableLoopAlignmentPrefetch("amdgpu-disable-loop-alignment-prefetch",
+                                 cl::desc("Do not align and prefetch loops"),
+                                 cl::init(false));
 
 static cl::opt<bool> UseDivergentRegisterIndexing(
     "amdgpu-use-divergent-register-indexing", cl::Hidden,
@@ -17434,25 +17438,9 @@ Align SITargetLowering::computeKnownAlignForTargetInstr(
 Align SITargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
   const Align PrefAlign = TargetLowering::getPrefLoopAlignment(ML);
   const Align CacheLineAlign = Align(64);
-
-  // Pre-GFX10 target did not benefit from loop alignment
-  if (!ML || DisableLoopAlignment || !getSubtarget()->hasInstPrefetch() ||
-      getSubtarget()->hasInstFwdPrefetchBug())
-    return PrefAlign;
-
-  // On GFX10 I$ is 4 x 64 bytes cache lines.
-  // By default prefetcher keeps one cache line behind and reads two ahead.
-  // We can modify it with S_INST_PREFETCH for larger loops to have two lines
-  // behind and one ahead.
-  // Therefor we can benefit from aligning loop headers if loop fits 192 bytes.
-  // If loop fits 64 bytes it always spans no more than two cache lines and
-  // does not need an alignment.
-  // Else if loop is less or equal 128 bytes we do not need to modify prefetch,
-  // Else if loop is less or equal 192 bytes we need two lines behind.
-
   const SIInstrInfo *TII = getSubtarget()->getInstrInfo();
   const MachineBasicBlock *Header = ML->getHeader();
-  if (Header->getAlignment() != PrefAlign)
+  if (DisableLoopAlignment || Header->getAlignment() > PrefAlign)
     return Header->getAlignment(); // Already processed.
 
   unsigned LoopSize = 0;
@@ -17465,10 +17453,41 @@ Align SITargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
     for (const MachineInstr &MI : *MBB) {
       LoopSize += TII->getInstSizeInBytes(MI);
       if (LoopSize > 192)
-        return PrefAlign;
+        break;
     }
   }
 
+  // Pre-GFX10 targets did not benefit from loop alignment driven by prefetch
+  // considerations.
+  if (!ML || DisableLoopAlignmentPrefetch ||
+      !getSubtarget()->hasInstPrefetch() ||
+      getSubtarget()->hasInstFwdPrefetchBug()) {
+    // Align loops of at most 32 bytes aggressively
+    if (LoopSize <= 32)
+      return PrefAlign;
+    // Align larger loops less aggressively
+    if (!ML->isInnermost())
+      return Header->getAlignment();
+    return (PrefAlign.value() > 1) ? Align(PrefAlign.value() >> 1) : PrefAlign;
+  }
+
+  // On GFX10 the I$ consists of 4 x 64-byte cache lines.
+  // By default the prefetcher keeps one cache line behind and reads two
+  // ahead. We can modify it with S_INST_PREFETCH so that larger loops have
+  // two lines behind and one ahead.
+  // Therefore we can benefit from aligning loop headers if the loop fits in
+  // 192 bytes. If the loop fits in 64 bytes it always spans no more than two
+  // cache lines and does not need an alignment.
+  // Otherwise, if the loop is at most 128 bytes we do not need to modify the
+  // prefetch, and if it is at most 192 bytes we need two lines behind.
+
+  // Align larger loops less aggressively
+  if (LoopSize > 192) {
+    if (!ML->isInnermost())
+      return Header->getAlignment();
+    return (PrefAlign.value() > 1) ? Align(PrefAlign.value() >> 1) : PrefAlign;
+  }
+
   if (LoopSize <= 64)
     return PrefAlign;
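
Taken together, the non-prefetch path of the new logic can be summarized by the following standalone model. This is a sketch, not the actual LLVM code: the function name and the plain-unsigned parameters are illustrative stand-ins for the `Align`/`MachineLoop` machinery.

```cpp
#include <cassert>

// Illustrative model of the patch's alignment policy on targets without the
// instruction prefetcher: loops of at most 32 bytes get the full preferred
// alignment, larger innermost loops get half of it, and larger non-innermost
// loops keep whatever alignment they already have.
unsigned chooseLoopAlign(unsigned LoopSizeBytes, bool IsInnermost,
                         unsigned PrefAlignBytes, unsigned CurAlignBytes) {
  if (LoopSizeBytes <= 32)
    return PrefAlignBytes; // small loops: aggressive (e.g. 32-byte) alignment
  if (!IsInnermost)
    return CurAlignBytes; // leave outer loops alone
  // larger innermost loops: halve the preferred alignment, e.g. 32 -> 16
  return PrefAlignBytes > 1 ? PrefAlignBytes / 2 : PrefAlignBytes;
}
```

Under FeaturePrefLoopAlign32B (preferred alignment 32 bytes) this gives 32-byte alignment for small loops and 16-byte alignment for larger innermost loops, matching the PR description.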
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
index 666523c88860c..37590f5a189ea 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
@@ -1,10 +1,10 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=tonga < %s | FileCheck -check-prefix=GFX8 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=hawaii < %s | FileCheck -check-prefix=GFX7 %s
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
index 351502816ae6e..f77ff18203122 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
@@ -1,10 +1,10 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < %s | FileCheck -check-prefix=GFX12 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx942 < %s | FileCheck -check-prefix=GFX942 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx90a < %s | FileCheck -check-prefix=GFX90A %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx908 < %s | FileCheck -check-prefix=GFX908 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=tonga < %s | FileCheck -check-prefix=GFX8 %s
 ; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=hawaii < %s | FileCheck -check-prefix=GFX7 %s
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
index ff26ea21390e2..4666e7f26a297 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 ; Divergent phis that don't require lowering using lane mask merging
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll
index a8a75cd2ffaa8..d36725eb8dee9 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
 
 ; This file contains various tests that have divergent i1s used outside of
 ; the loop. These are lane masks is sgpr and need to have correct value in
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
index fd08ab88990ed..a3197d9239bcd 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 ; Simples case, if - then, that requires lane mask merging,
 ; %phi lane mask will hold %val_A at %A. Lanes that are active in %B
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll
index d13d6a19d332a..e27ee8b94416d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 define void @temporal_divergent_i1_phi(float %val, ptr %addr) {
 ; GFX10-LABEL: temporal_divergent_i1_phi:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll
index d4e5487828c48..bf5c3cfab4d03 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 define void @temporal_divergent_i32(float %val, ptr %addr) {
 ; GFX10-LABEL: temporal_divergent_i32:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll
index 6148bc2d5ae6e..12f10c78b603d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s | FileCheck %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s | FileCheck %s
 
 ; Make sure the branch targets are correct after lowering llvm.amdgcn.if
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
index 8a53c862371cf..36ffb47ef7ae5 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
 
 define amdgpu_ps void @static_exact(float %arg0, float %arg1) {
 ; SI-LABEL: static_exact:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
index 5240bf4f3a1d7..aea5eec52d076 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
+; RUN: llc -amdgpu-disable-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
 
 ; if instruction is uniform and there is available instruction, select SALU instruction
 define amdgpu_ps void @uniform_in_vgpr(float inreg %a, i32 inreg %b, ptr addrspace(1) %ptr) {
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
index 9c2fabce4bcde..c5cfb9eae7051 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2
-; RUN: llc -global-isel -mtriple=amdgcn -m...
[truncated]

@ronlieb ronlieb requested review from arsenm, ronlieb and shiltian August 11, 2025 21:00
DisableLoopAlignment("amdgpu-disable-loop-alignment",
cl::desc("Do not align and prefetch loops"),
cl::init(false));
DisableLoopAlignmentPrefetch("amdgpu-disable-loop-alignment-prefetch",
Contributor:

Not sure if that's a good idea to change existing options, as it might have already been used by someone else.

Contributor Author:

will change that, thanks

Contributor:

Nobody should be using backend defined cl::opts

Contributor:

It is not used by "anyone" in the LLVM repo but who knows whether there would be downstream users using it directly. If there is no one even in the downstream, I'd just remove it.

uint64_t getExplicitKernArgSize(const Function &F, Align &MaxAlign) const;
unsigned getKernArgSegmentSize(const Function &F, Align &MaxAlign) const;

unsigned getPrefLoopAlignment() const { return PrefLoopAlignmentLog2; }
Contributor:

Do we really need a target feature for this? Can we just have the getter return the correct value based on gfx version directly below?

Contributor Author:

prefer target feature over a check for every gfx version esp as new versions show up

Contributor:

I'm more wondering if this is a universal property that should have always been done. The prior comment says there's no benefit pre-gfx10, so why is this now needed for gfx9?

Contributor Author:

Didn't see sufficient documentation to confirm this would benefit older targets, but another amdgpu compiler is doing this for all targets. So will make it universal.

Contributor:

The prior comment says there's no benefit pre-gfx10, so why is this now needed for gfx9?

Yes, same doubt.

// GFX-9 & higher targets have a 16-dword Instruction Buffer and per-SQ
// instruction store which can supply 4 dwords to each of the 2 waves per
// cycle. Change default alignment to 4 dwords or 16 bytes.
def FeaturePrefLoopAlign32B : SubtargetFeature<"loop-align",
Contributor:

Include the value in the feature name

"Dummy feature to disable assembler instructions"
>;

// GFX-9 & higher targets have a 16-dword Instruction Buffer and per-SQ
Contributor

Suggested change
// GFX-9 & higher targets have a 16-dword Instruction Buffer and per-SQ
// GFX9 & higher targets have a 16-dword Instruction Buffer and per-SQ

uint64_t getExplicitKernArgSize(const Function &F, Align &MaxAlign) const;
unsigned getKernArgSegmentSize(const Function &F, Align &MaxAlign) const;

unsigned getPrefLoopAlignment() const { return PrefLoopAlignmentLog2; }
Contributor

I'm more wondering if this is a universal property that should have always been done. The prior comment says there's no benefit pre-gfx10, so why is this now needed for gfx9?

DisableLoopAlignment("amdgpu-disable-loop-alignment",
cl::desc("Do not align and prefetch loops"),
cl::init(false));
DisableLoopAlignmentPrefetch("amdgpu-disable-loop-alignment-prefetch",
Contributor

Nobody should be using backend defined cl::opts

Comment on lines +56 to 63
static cl::opt<bool>
DisableAllLoopAlignment("amdgpu-disable-all-loop-alignment",
cl::desc("Do not align loops"), cl::init(false));

static cl::opt<bool>
DisableLoopAlignment("amdgpu-disable-loop-alignment",
cl::desc("Do not align and prefetch loops"),
cl::desc("Do not align loops for prefetch"),
cl::init(false));
Contributor

Can we not add new flags? There is a cost to them and we should be removing many of the ones we already have. Clang already has an explicit metadata driven flag to align or disable alignment of loops.
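The metadata-driven mechanism referred to here is, as I understand it, per-loop `!llvm.loop` metadata that the frontend can emit instead of the backend growing new cl::opts. A rough sketch (the exact metadata string is my recollection, not verified against this tree):

```llvm
; Hypothetical fragment: a per-loop alignment request carried on the loop
; backedge via !llvm.loop metadata, rather than a global backend flag.
br i1 %cond, label %loop, label %exit, !llvm.loop !0

!0 = distinct !{!0, !1}
!1 = !{!"llvm.loop.align", i32 32}
```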

// Align larger loops less aggressively
if (!ML->isInnermost())
return Header->getAlignment();
return (PrefAlign.value() > 1) ? Align(PrefAlign.value() >> 1) : PrefAlign;
Contributor

Can you rewrite this to use Align more directly, this is unnecessarily confusing

@@ -1,5 +1,5 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
; RUN: llc -amdgpu-disable-all-loop-alignment=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
Contributor

Definitely should not be bulk adding this flag to tests

Contributor Author

The following tests have been changed and don't use the flag; this should be sufficient testing for this change.

llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
llvm/test/CodeGen/AMDGPU/idiv-licm.ll
llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll
llvm/test/CodeGen/AMDGPU/loop-prefetch.ll
llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll
llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll
llvm/test/CodeGen/AMDGPU/uniform-select.ll
llvm/test/CodeGen/AMDGPU/wqm.ll

I will remove the backend cl::opts and use -align-loops=1 for the remaining 110 tests. I think it's overkill to modify 110 additional tests for alignment.
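For reference, using the generic flag would make a RUN line look roughly like this (a sketch; the triple and CHECK prefix are illustrative):

```llvm
; Sketch of a RUN line using llc's generic flag instead of a backend cl::opt.
; -align-loops=1 overrides the target's preferred loop alignment with 1
; (effectively no extra alignment), so existing CHECK lines stay stable.
; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -align-loops=1 < %s | FileCheck %s
```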

@hjagasiaAMD
Contributor Author

Closing this PR; will open a new one with some performance results.

@hjagasiaAMD hjagasiaAMD deleted the amdgpu-loopalign branch August 26, 2025 01:33