Conversation

JanekvO (Contributor) commented Oct 6, 2025

Stop generating misaligned operands (misaligned relative to the destination register's alignment).

llvmbot (Member) commented Oct 6, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Janek van Oirschot (JanekvO)

Changes

Stop generating misaligned operands (misaligned relative to the destination register's alignment).


Full diff: https://github.com/llvm/llvm-project/pull/162088.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp (+27-6)
  • (modified) llvm/test/CodeGen/AMDGPU/merge-flat-with-global-load-store.mir (+2-2)
diff --git a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
index f0d1117664983..1185fbf325932 100644
--- a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
@@ -1363,6 +1363,15 @@ SILoadStoreOptimizer::checkAndPrepareMerge(CombineInfo &CI,
   return Where;
 }
 
+static bool isCompatibleAlignSubReg(const TargetRegisterClass *RC, unsigned Idx,
+                                    const GCNSubtarget *STM,
+                                    const SIRegisterInfo *TRI) {
+  if (!STM->needsAlignedVGPRs() || !TRI->isVGPRClass(RC) ||
+      !TRI->isProperlyAlignedRC(*RC) || AMDGPU::getRegBitWidth(*RC) == 32)
+    return true;
+  return !(TRI->getChannelFromSubReg(Idx) & 0x1);
+}
+
 // Copy the merged load result from DestReg to the original dest regs of CI and
 // Paired.
 void SILoadStoreOptimizer::copyToDestRegs(
@@ -1411,12 +1420,24 @@ SILoadStoreOptimizer::copyFromSrcRegs(CombineInfo &CI, CombineInfo &Paired,
   const auto *Src0 = TII->getNamedOperand(*CI.I, OpName);
   const auto *Src1 = TII->getNamedOperand(*Paired.I, OpName);
 
-  BuildMI(*MBB, InsertBefore, DL, TII->get(AMDGPU::REG_SEQUENCE), SrcReg)
-      .add(*Src0)
-      .addImm(SubRegIdx0)
-      .add(*Src1)
-      .addImm(SubRegIdx1);
-
+  // Make sure the generated REG_SEQUENCE has sensibly aligned registers.
+  if (isCompatibleAlignSubReg(Paired.DataRC, SubRegIdx1, STM, TRI)) {
+    BuildMI(*MBB, InsertBefore, DL, TII->get(AMDGPU::REG_SEQUENCE), SrcReg)
+        .add(*Src0)
+        .addImm(SubRegIdx0)
+        .add(*Src1)
+        .addImm(SubRegIdx1);
+  } else {
+    auto BMI =
+        BuildMI(*MBB, InsertBefore, DL, TII->get(AMDGPU::REG_SEQUENCE), SrcReg)
+            .add(*Src0)
+            .addImm(SubRegIdx0);
+    for (unsigned i = 0; i < Paired.Width; ++i) {
+      unsigned PreviousChannel = TRI->getChannelFromSubReg(SubRegIdx0);
+      BMI.addReg(Src1->getReg(), /*flags=*/0, TRI->getSubRegFromChannel(i))
+          .addImm(TRI->getSubRegFromChannel(PreviousChannel + i + 1));
+    }
+  }
   return SrcReg;
 }
 
diff --git a/llvm/test/CodeGen/AMDGPU/merge-flat-with-global-load-store.mir b/llvm/test/CodeGen/AMDGPU/merge-flat-with-global-load-store.mir
index a67cf22bdd1ce..b45b69010141e 100644
--- a/llvm/test/CodeGen/AMDGPU/merge-flat-with-global-load-store.mir
+++ b/llvm/test/CodeGen/AMDGPU/merge-flat-with-global-load-store.mir
@@ -213,7 +213,7 @@ body:             |
     ; GCN: [[DEF:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
     ; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
     ; GCN-NEXT: [[DEF2:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
-    ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_96_align2 = REG_SEQUENCE [[DEF1]], %subreg.sub0, [[DEF2]], %subreg.sub1_sub2
+    ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_96_align2 = REG_SEQUENCE [[DEF1]], %subreg.sub0, [[DEF2]].sub0, %subreg.sub1, [[DEF2]].sub1, %subreg.sub2
     ; GCN-NEXT: FLAT_STORE_DWORDX3 [[DEF]], killed [[REG_SEQUENCE]], 0, 0, implicit $exec, implicit $flat_scr :: (store (s96) into `ptr poison`, align 4)
     %0:vreg_64_align2 = IMPLICIT_DEF
     %1:vgpr_32 = IMPLICIT_DEF
@@ -230,7 +230,7 @@ body:             |
     ; GCN: [[DEF:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
     ; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
     ; GCN-NEXT: [[DEF2:%[0-9]+]]:vreg_96_align2 = IMPLICIT_DEF
-    ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_128_align2 = REG_SEQUENCE [[DEF1]], %subreg.sub0, [[DEF2]], %subreg.sub1_sub2_sub3
+    ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_128_align2 = REG_SEQUENCE [[DEF1]], %subreg.sub0, [[DEF2]].sub0, %subreg.sub1, [[DEF2]].sub1, %subreg.sub2, [[DEF2]].sub2, %subreg.sub3
     ; GCN-NEXT: FLAT_STORE_DWORDX4 [[DEF]], killed [[REG_SEQUENCE]], 0, 0, implicit $exec, implicit $flat_scr :: (store (s128) into `ptr poison`, align 4)
     %0:vreg_64_align2 = IMPLICIT_DEF
     %1:vgpr_32 = IMPLICIT_DEF
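
For context on the constraint the new helper enforces: on subtargets where needsAlignedVGPRs() returns true (e.g., gfx90a), VGPR tuples of 64 bits or more must start at an even-numbered register, so a REG_SEQUENCE operand whose sub-register index starts at an odd channel (such as sub1_sub2) cannot be allocated. Below is a minimal standalone sketch of that rule; the helper name and the annotated rewrite are illustrative, not part of the patch:

// An align2 tuple operand is only usable if its starting channel is even:
//   REG_SEQUENCE %a, %subreg.sub0, %b, %subreg.sub1_sub2    ; starts at 1: bad
//   REG_SEQUENCE %a, %subreg.sub0, %b.sub0, %subreg.sub1,
//                %b.sub1, %subreg.sub2                      ; 32-bit parts: ok
static bool startsAtEvenChannel(unsigned StartChannel) {
  return (StartChannel & 0x1) == 0;
}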

Comment on lines 487 to 489
; GCN-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY [[DEF1]].sub1
; GCN-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[DEF1]].sub0
; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64_align2 = REG_SEQUENCE [[COPY]], %subreg.sub1, [[COPY1]], %subreg.sub0
Contributor:

This is a regression; you do not need these new copies.

Comment on lines 1429 to 1431
if (CompatRC0 != CI.DataRC)
BuildMI(*MBB, InsertBefore, DL, TII->get(TargetOpcode::COPY), Src0Reg)
.add(*Src0);
Contributor:

You shouldn't need new copies, and this is checking against the wrong register class? DataRC should already be a subregister class of SuperRC. Also, you should prefer to try constrainRegClass before falling back to a copy, but I don't think that will happen in any real case.
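
As a reference for the pattern being suggested here, a minimal sketch of constrain-before-copy, assuming MRI, the insertion point, DebugLoc, and TII as in the surrounding function (the helper name is hypothetical):

// Try to constrain the existing vreg in place; only fall back to a COPY if
// constrainRegClass fails (it returns null when Reg cannot satisfy NewRC).
static Register ensureRegClass(MachineRegisterInfo &MRI, Register Reg,
                               const TargetRegisterClass *NewRC,
                               MachineBasicBlock &MBB,
                               MachineBasicBlock::iterator InsertBefore,
                               const DebugLoc &DL, const SIInstrInfo *TII) {
  if (MRI.constrainRegClass(Reg, NewRC))
    return Reg; // Constrained in place; no copy needed.
  Register Tmp = MRI.createVirtualRegister(NewRC);
  BuildMI(MBB, InsertBefore, DL, TII->get(TargetOpcode::COPY), Tmp)
      .addReg(Reg);
  return Tmp;
}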

JanekvO (Author):
> this is checking against the wrong register class?

Yeah, it needs a better compatibility check.

> DataRC should already be a subregister class of SuperRC

I assumed so, but that might not be the case if the operand register has a subreg index.

> try constrainRegClass before falling back to copy

From trying it out, it seems to always keep the aligned variant (i.e., constrainRegClass(%1:vreg_64_align2, VReg_64) will continue using vreg_64_align2 instead of forcing VReg_64), so it may require a COPY. I might just try to construct the REG_SEQUENCE from vgpr_32 parts only, which is probably easier for further optimizations afterwards. (Do let me know if I'm missing low-hanging fruit on this end, though.)
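
An aside on why constrainRegClass behaves that way: it narrows a register's class to the common subclass of its current class and the requested one, and since the align2 class is a subclass of the unaligned class, the common subclass is the align2 class again. A sketch, assuming MRI and TRI are in scope:

// constrainRegClass never relaxes a class; it computes the common subclass.
const TargetRegisterClass *CurRC = MRI.getRegClass(Reg); // VReg_64_Align2
const TargetRegisterClass *Common =
    TRI->getCommonSubClass(CurRC, &AMDGPU::VReg_64RegClass);
// Common is VReg_64_Align2 again, so the align2 constraint is kept.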

Contributor:
> I assumed so but might not be the case if the operand register has subreg idx

Mostly no, but you should probably still use getMatchingSuperRegClass to find the common case. I still think this check is just wrong; none of the copies in the tests should be necessary.

Also, you need to add a new test case where you do need to copy.
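
A sketch of the suggested check, assuming the merged class SuperRC, the operand class DataRC, and the sub-register index SubRegIdx from the surrounding code:

// getMatchingSuperRegClass returns the subclass of SuperRC whose SubRegIdx
// sub-register lies in DataRC; a non-null result means the operand can be
// used as-is, with no copy.
if (TRI->getMatchingSuperRegClass(SuperRC, DataRC, SubRegIdx)) {
  // Compatible: feed the source operand directly into the REG_SEQUENCE.
}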

Contributor:
Actually, for this I think you should always be able to constrain; no copies required.

JanekvO (Author):
> Actually for this I think you should always be able to constrain, no copies required

Should I do something direct like setRegClass for this? Trying to constrain an align2 class to its non-align2 equivalent (e.g., constrainRegClass(%1:vreg_64_align2, VReg_64)) will always fall back on the align2 variant, which is exactly what I don't want when generating the REG_SEQUENCE operands here.

Contributor:
You shouldn't be trying to go in that direction. You cannot simply relax register classes without accounting for all the uses of the register, and transforms should generally work in a constraint-increasing direction to enable folds.
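
To make the contrast concrete, a sketch of the two directions (Reg and NewRC are hypothetical): setRegClass overwrites the class unconditionally and is only legal if every existing use and def of the register still satisfies the new class, while constrainRegClass only ever narrows and fails otherwise:

MRI.setRegClass(Reg, NewRC);   // Unchecked overwrite; the caller must prove
                               // all uses and defs still satisfy NewRC.
if (!MRI.constrainRegClass(Reg, NewRC)) {
  // Checked narrowing failed: NewRC shares no common subclass with the
  // current class, so relaxing through this path is impossible by design.
}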

arsenm (Contributor) left a comment:
Can you add the new test that produced the unaligned use? I'm not really sure how you ended up with this problem in the first place.

JanekvO (Author) commented Oct 10, 2025

> Can you add the new test that produced the unaligned use? I'm not really sure how you ended up with this problem in the first place.

It'll be similar to some of the tests I'm trying to add in #160547 (similarly, I won't be able to reproduce it end-to-end, as I reverted #154115 for now while I try to fix these VGPR alignment issues).

Edit: I misread an error from a different test (buffer-merge.mir); however, my statement above still holds: my combine PR did end up with a misaligned VGPR that won't work as a REG_SEQUENCE operand for align2 classes.

arsenm (Contributor) commented Oct 11, 2025

> Can you add the new test that produced the unaligned use? I'm not really sure how you ended up with this problem in the first place.

> similarly, won't be able to reproduce end-to-end

For the purposes of this PR, you don't need to. You should still add here the pre-merge MIR that induced the unaligned use afterwards.
