Conversation

JanekvO (Contributor) commented Oct 6, 2025

Stop generating misaligned operands (misaligned relative to the destination register's alignment).

llvmbot (Member) commented Oct 6, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Janek van Oirschot (JanekvO)

Changes

Stop generating misaligned operands (misaligned relative to the destination register's alignment).


Full diff: https://github.com/llvm/llvm-project/pull/162088.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp (+27-6)
  • (modified) llvm/test/CodeGen/AMDGPU/merge-flat-with-global-load-store.mir (+2-2)
diff --git a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
index f0d1117664983..1185fbf325932 100644
--- a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
@@ -1363,6 +1363,15 @@ SILoadStoreOptimizer::checkAndPrepareMerge(CombineInfo &CI,
   return Where;
 }
 
+static bool isCompatibleAlignSubReg(const TargetRegisterClass *RC, unsigned Idx,
+                                    const GCNSubtarget *STM,
+                                    const SIRegisterInfo *TRI) {
+  if (!STM->needsAlignedVGPRs() || !TRI->isVGPRClass(RC) ||
+      !TRI->isProperlyAlignedRC(*RC) || AMDGPU::getRegBitWidth(*RC) == 32)
+    return true;
+  return !(TRI->getChannelFromSubReg(Idx) & 0x1);
+}
+
 // Copy the merged load result from DestReg to the original dest regs of CI and
 // Paired.
 void SILoadStoreOptimizer::copyToDestRegs(
@@ -1411,12 +1420,24 @@ SILoadStoreOptimizer::copyFromSrcRegs(CombineInfo &CI, CombineInfo &Paired,
   const auto *Src0 = TII->getNamedOperand(*CI.I, OpName);
   const auto *Src1 = TII->getNamedOperand(*Paired.I, OpName);
 
-  BuildMI(*MBB, InsertBefore, DL, TII->get(AMDGPU::REG_SEQUENCE), SrcReg)
-      .add(*Src0)
-      .addImm(SubRegIdx0)
-      .add(*Src1)
-      .addImm(SubRegIdx1);
-
+  // Make sure the generated REG_SEQUENCE has sensibly aligned registers.
+  if (isCompatibleAlignSubReg(Paired.DataRC, SubRegIdx1, STM, TRI)) {
+    BuildMI(*MBB, InsertBefore, DL, TII->get(AMDGPU::REG_SEQUENCE), SrcReg)
+        .add(*Src0)
+        .addImm(SubRegIdx0)
+        .add(*Src1)
+        .addImm(SubRegIdx1);
+  } else {
+    auto BMI =
+        BuildMI(*MBB, InsertBefore, DL, TII->get(AMDGPU::REG_SEQUENCE), SrcReg)
+            .add(*Src0)
+            .addImm(SubRegIdx0);
+    for (unsigned i = 0; i < Paired.Width; ++i) {
+      unsigned PreviousChannel = TRI->getChannelFromSubReg(SubRegIdx0);
+      BMI.addReg(Src1->getReg(), /*flags=*/0, TRI->getSubRegFromChannel(i))
+          .addImm(TRI->getSubRegFromChannel(PreviousChannel + i + 1));
+    }
+  }
   return SrcReg;
 }
 
diff --git a/llvm/test/CodeGen/AMDGPU/merge-flat-with-global-load-store.mir b/llvm/test/CodeGen/AMDGPU/merge-flat-with-global-load-store.mir
index a67cf22bdd1ce..b45b69010141e 100644
--- a/llvm/test/CodeGen/AMDGPU/merge-flat-with-global-load-store.mir
+++ b/llvm/test/CodeGen/AMDGPU/merge-flat-with-global-load-store.mir
@@ -213,7 +213,7 @@ body:             |
     ; GCN: [[DEF:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
     ; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
     ; GCN-NEXT: [[DEF2:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
-    ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_96_align2 = REG_SEQUENCE [[DEF1]], %subreg.sub0, [[DEF2]], %subreg.sub1_sub2
+    ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_96_align2 = REG_SEQUENCE [[DEF1]], %subreg.sub0, [[DEF2]].sub0, %subreg.sub1, [[DEF2]].sub1, %subreg.sub2
     ; GCN-NEXT: FLAT_STORE_DWORDX3 [[DEF]], killed [[REG_SEQUENCE]], 0, 0, implicit $exec, implicit $flat_scr :: (store (s96) into `ptr poison`, align 4)
     %0:vreg_64_align2 = IMPLICIT_DEF
     %1:vgpr_32 = IMPLICIT_DEF
@@ -230,7 +230,7 @@ body:             |
     ; GCN: [[DEF:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
     ; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
     ; GCN-NEXT: [[DEF2:%[0-9]+]]:vreg_96_align2 = IMPLICIT_DEF
-    ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_128_align2 = REG_SEQUENCE [[DEF1]], %subreg.sub0, [[DEF2]], %subreg.sub1_sub2_sub3
+    ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_128_align2 = REG_SEQUENCE [[DEF1]], %subreg.sub0, [[DEF2]].sub0, %subreg.sub1, [[DEF2]].sub1, %subreg.sub2, [[DEF2]].sub2, %subreg.sub3
     ; GCN-NEXT: FLAT_STORE_DWORDX4 [[DEF]], killed [[REG_SEQUENCE]], 0, 0, implicit $exec, implicit $flat_scr :: (store (s128) into `ptr poison`, align 4)
     %0:vreg_64_align2 = IMPLICIT_DEF
     %1:vgpr_32 = IMPLICIT_DEF
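
For context on the constraint the new helper enforces: on subtargets where needsAlignedVGPRs() returns true (e.g., gfx90a), VGPR tuples of 64 bits or more must start at an even-numbered register, so a REG_SEQUENCE operand whose sub-register index starts at an odd channel (such as sub1_sub2) cannot be allocated. Below is a minimal standalone sketch of that rule; the helper name and the annotated rewrite are illustrative, not part of the patch:

// An align2 tuple operand is only usable if its starting channel is even:
//   REG_SEQUENCE %a, %subreg.sub0, %b, %subreg.sub1_sub2    ; starts at 1: bad
//   REG_SEQUENCE %a, %subreg.sub0, %b.sub0, %subreg.sub1,
//                %b.sub1, %subreg.sub2                      ; 32-bit parts: ok
static bool startsAtEvenChannel(unsigned StartChannel) {
  return (StartChannel & 0x1) == 0;
}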

Comment on lines 487 to 489
; GCN-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY [[DEF1]].sub1
; GCN-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[DEF1]].sub0
; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64_align2 = REG_SEQUENCE [[COPY]], %subreg.sub1, [[COPY1]], %subreg.sub0
Contributor:

This is a regression; you do not need these new copies.

Comment on lines 1429 to 1431
if (CompatRC0 != CI.DataRC)
BuildMI(*MBB, InsertBefore, DL, TII->get(TargetOpcode::COPY), Src0Reg)
.add(*Src0);
Contributor:

You shouldn't need new copies, and this is checking against the wrong register class? DataRC should already be a subregister class of SuperRC. Also, you should prefer to try constrainRegClass before falling back to a copy, but I don't think that will happen in any real case.
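
As a reference for the pattern being suggested here, a minimal sketch of constrain-before-copy, assuming MRI, the insertion point, DebugLoc, and TII as in the surrounding function (the helper name is hypothetical):

// Try to constrain the existing vreg in place; only fall back to a COPY if
// constrainRegClass fails (it returns null when Reg cannot satisfy NewRC).
static Register ensureRegClass(MachineRegisterInfo &MRI, Register Reg,
                               const TargetRegisterClass *NewRC,
                               MachineBasicBlock &MBB,
                               MachineBasicBlock::iterator InsertBefore,
                               const DebugLoc &DL, const SIInstrInfo *TII) {
  if (MRI.constrainRegClass(Reg, NewRC))
    return Reg; // Constrained in place; no copy needed.
  Register Tmp = MRI.createVirtualRegister(NewRC);
  BuildMI(MBB, InsertBefore, DL, TII->get(TargetOpcode::COPY), Tmp)
      .addReg(Reg);
  return Tmp;
}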

JanekvO (Author):
> this is checking against the wrong register class?

Yeah, it needs a better compatibility check.

> DataRC should already be a subregister class of SuperRC

I assumed so, but that might not be the case if the operand register has a subreg index.

> try constrainRegClass before falling back to copy

From trying it out, it seems to always keep the aligned variant (i.e., constrainRegClass(%1:vreg_64_align2, VReg_64) will continue using vreg_64_align2 instead of forcing VReg_64), so it may require a COPY. I might just try to construct the REG_SEQUENCE from vgpr_32 parts only, which is probably easier for further optimizations afterwards. (Do let me know if I'm missing low-hanging fruit on this end, though.)
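
An aside on why constrainRegClass behaves that way: it narrows a register's class to the common subclass of its current class and the requested one, and since the align2 class is a subclass of the unaligned class, the common subclass is the align2 class again. A sketch, assuming MRI and TRI are in scope:

// constrainRegClass never relaxes a class; it computes the common subclass.
const TargetRegisterClass *CurRC = MRI.getRegClass(Reg); // VReg_64_Align2
const TargetRegisterClass *Common =
    TRI->getCommonSubClass(CurRC, &AMDGPU::VReg_64RegClass);
// Common is VReg_64_Align2 again, so the align2 constraint is kept.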

Contributor:
> I assumed so but might not be the case if the operand register has subreg idx

Mostly no, but you should probably still use getMatchingSuperRegClass to find the common case. I still think this check is just wrong; none of the copies in the tests should be necessary.

Also, you need to add a new test case where you do need to copy.
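
A sketch of the suggested check, assuming the merged class SuperRC, the operand class DataRC, and the sub-register index SubRegIdx from the surrounding code:

// getMatchingSuperRegClass returns the subclass of SuperRC whose SubRegIdx
// sub-register lies in DataRC; a non-null result means the operand can be
// used as-is, with no copy.
if (TRI->getMatchingSuperRegClass(SuperRC, DataRC, SubRegIdx)) {
  // Compatible: feed the source operand directly into the REG_SEQUENCE.
}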

Contributor:
Actually, for this I think you should always be able to constrain; no copies required.

JanekvO (Author):
> Actually for this I think you should always be able to constrain, no copies required

Should I do something direct like setRegClass for this? Trying to constrain an align2 class to its non-align2 equivalent (e.g., constrainRegClass(%1:vreg_64_align2, VReg_64)) will always fall back on the align2 variant, which is exactly what I don't want when generating the REG_SEQUENCE operands here.

Contributor:
You shouldn't be trying to go in that direction. You cannot simply relax register classes without accounting for all the uses of the register, and transforms should generally work in a constraint-increasing direction to enable folds.
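
To make the contrast concrete, a sketch of the two directions (Reg and NewRC are hypothetical): setRegClass overwrites the class unconditionally and is only legal if every existing use and def of the register still satisfies the new class, while constrainRegClass only ever narrows and fails otherwise:

MRI.setRegClass(Reg, NewRC);   // Unchecked overwrite; the caller must prove
                               // all uses and defs still satisfy NewRC.
if (!MRI.constrainRegClass(Reg, NewRC)) {
  // Checked narrowing failed: NewRC shares no common subclass with the
  // current class, so relaxing through this path is impossible by design.
}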

arsenm (Contributor) left a comment:
Can you add the new test that produced the unaligned use? I'm not really sure how you ended up with this problem in the first place.

JanekvO (Author) commented Oct 10, 2025

> Can you add the new test that produced the unaligned use? I'm not really sure how you ended up with this problem in the first place.

It'll be similar to some of the tests I'm trying to add in #160547 (similarly, I won't be able to reproduce it end-to-end, as I reverted #154115 for now while I try to fix these VGPR alignment issues).

Edit: I misread an error from a different test (buffer-merge.mir); however, my statement above still holds: my combine PR did end up with a misaligned VGPR that won't work as a REG_SEQUENCE operand for align2 classes.

arsenm (Contributor) commented Oct 11, 2025

> Can you add the new test that produced the unaligned use? I'm not really sure how you ended up with this problem in the first place.

> similarly, won't be able to reproduce end-to-end

For the purposes of this PR, you don't need to. You should still add here the pre-merge MIR that induced the unaligned use afterwards.
