Conversation

@michaelselehov (Contributor) commented Aug 28, 2025

The PHI-node part was merged as part of PR #160909.

Extend isOpLegal to treat 8/16-bit vector add/sub/and/or/xor as profitable on SDWA targets (stores and intrinsics remain profitable). This repacks loop-carried values to i32 across BBs and restores SDWA lowering instead of scattered lshr/lshl/or sequences.

Testing:

  • Local: check-llvm-codegen-amdgpu is green (4314/4320 passed, 6 XFAIL).
  • Additional: validated in AMD internal CI

Fix a bug in isCoercionProfitable where the same-block filter checked
the def (II) instead of the user (CII), pruning valid paths. Also allow
same-BB non-lookthrough users when the def is a PHI, so loop headers
can be coerced across the backedge.

Extend isOpLegal to treat 8/16-bit vector add/sub/and/or/xor as
profitable on SDWA targets (stores and intrinsics remain profitable).
This repacks loop-carried values to i32 across BBs and restores SDWA
lowering instead of scattered lshr/lshl/or sequences.
@llvmbot (Member) commented Aug 28, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: None (michaelselehov)

Full diff: https://github.com/llvm/llvm-project/pull/155800.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp (+37-2)
  • (added) llvm/test/CodeGen/AMDGPU/lro-coerce-v4i8-phi-loop.ll (+67)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
index 38718c43a61dd..e4866405c6ad4 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
@@ -126,7 +126,37 @@ class LiveRegOptimizer {
     return LK.first != TargetLoweringBase::TypeLegal;
   }
 
-  bool isOpLegal(Instruction *I) { return isa<StoreInst, IntrinsicInst>(I); }
+  bool isOpLegal(Instruction *I) {
+    if (auto *Intr = dyn_cast<IntrinsicInst>(I))
+      return true; // FIXME: narrow to known native intrinsics (DOT/MFMA/tbuffer) or use TTI cost.
+
+    // Any store is a profitable sink (prevents flip-flopping)
+    if (isa<StoreInst>(I))
+      return true;
+
+    // Treat small-int vector binops as profitable when SDWA is available
+    if (auto *BO = dyn_cast<BinaryOperator>(I)) {
+      if (auto *VTy = dyn_cast<VectorType>(BO->getType())) {
+        Type *Elt = VTy->getElementType();
+        // Treat small-int vector binops as profitable when SDWA is available.
+        // We explicitly gate to 8/16-bit to avoid i1 vectors and keep behavior tight.
+        if (Elt->isIntegerTy(8) || (Elt->isIntegerTy(16) && ST.hasSDWA())) {
+          switch (BO->getOpcode()) {
+          case Instruction::Add:
+          case Instruction::Sub:
+          case Instruction::And:
+          case Instruction::Or:
+          case Instruction::Xor:
+            return true;
+          default:
+            break;
+          }
+        }
+      }
+    }
+
+    return false;
+  }
 
   bool isCoercionProfitable(Instruction *II) {
     SmallPtrSet<Instruction *, 4> CVisited;
@@ -150,7 +180,12 @@ class LiveRegOptimizer {
       if (!CVisited.insert(CII).second)
         continue;
 
-      if (CII->getParent() == II->getParent() && !IsLookThru(II))
+      // Allow same-BB non-lookthrough users when the def is a PHI:
+      // loop headers frequently consume the carried value in the header block
+      // (e.g. byte-wise vector binops). We *do* want to coerce across the backedge
+      // in that common case to enable packed i32 + SDWA lowering.
+      if (CII->getParent() == II->getParent() && !IsLookThru(CII) &&
+          !isa<PHINode>(II))
         continue;
 
       if (isOpLegal(CII))
diff --git a/llvm/test/CodeGen/AMDGPU/lro-coerce-v4i8-phi-loop.ll b/llvm/test/CodeGen/AMDGPU/lro-coerce-v4i8-phi-loop.ll
new file mode 100644
index 0000000000000..a37aaf154520b
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/lro-coerce-v4i8-phi-loop.ll
@@ -0,0 +1,67 @@
+; REQUIRES: amdgpu-registered-target
+; RUN: opt -S -passes=amdgpu-late-codegenprepare \
+; RUN:   -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a %s | FileCheck %s
+
+; Purpose:
+;  - Input has a loop-carried PHI of type <4 x i8> and byte-wise adds in the
+;    loop header (same basic block as the PHI).
+;  - After amdgpu-late-codegenprepare, the PHI must be coerced to i32 across
+;    the backedge, and a single dominating "bitcast i32 -> <4 x i8>" must be
+;    placed in the header (enabling SDWA-friendly lowering later).
+;
+; What we check:
+;  - PHI is i32 (no loop-carried <4 x i8> PHI remains).
+;  - A header-local bitcast i32 -> <4 x i8> exists and feeds the vector add.
+;  - The loopexit produces a bitcast <4 x i8> -> i32 for the backedge.
+
+target triple = "amdgcn-amd-amdhsa"
+
+define amdgpu_kernel void @lro_coerce_v4i8_phi(i8* nocapture %p, i32 %n) #0 {
+entry:
+  br label %loop
+
+loop:
+  ; Loop index
+  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
+
+  ; Loop-carried accumulator in vector-of-bytes form (problematic on input).
+  %acc = phi <4 x i8> [ zeroinitializer, %entry ], [ %acc.next, %loop ]
+
+  ; Make up four i8 values derived from %i to avoid memory noise.
+  %i0 = trunc i32 %i to i8
+  %i1i = add i32 %i, 1
+  %i1 = trunc i32 %i1i to i8
+  %i2i = add i32 %i, 2
+  %i2 = trunc i32 %i2i to i8
+  %i3i = add i32 %i, 3
+  %i3 = trunc i32 %i3i to i8
+
+  ; Pack them into <4 x i8>.
+  %v01 = insertelement <4 x i8> undef, i8 %i0, i32 0
+  %v02 = insertelement <4 x i8> %v01,  i8 %i1, i32 1
+  %v03 = insertelement <4 x i8> %v02,  i8 %i2, i32 2
+  %v   = insertelement <4 x i8> %v03,  i8 %i3, i32 3
+
+  ; Byte-wise add in the same block as the PHI (this must make coercion profitable).
+  %acc.next = add <4 x i8> %acc, %v
+
+  ; Loop control.
+  %i.next = add i32 %i, 4
+  %cond = icmp slt i32 %i.next, %n
+  br i1 %cond, label %loop, label %exit
+
+exit:
+  ret void
+}
+
+attributes #0 = { "target-cpu"="gfx90a" }
+
+; CHECK-LABEL: define amdgpu_kernel void @lro_coerce_v4i8_phi(
+; CHECK: loop:
+; CHECK: %i = phi i32
+; CHECK-NOT: phi <4 x i8>
+; CHECK: %[[ACCI32:[^ ]+]] = phi i32
+; CHECK-NEXT: %[[HDRCAST:[^ ]+]] = bitcast i32 %[[ACCI32]] to <4 x i8>
+; CHECK: add <4 x i8> %[[HDRCAST]],
+; CHECK: br i1
+

@michaelselehov (Contributor Author)

@choikwa please review and/or add other reviewers as you see fit

github-actions bot commented Aug 28, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

github-actions bot commented Aug 28, 2025

✅ With the latest revision this PR passed the undef deprecator.

@ronlieb requested review from arsenm, choikwa and jayfoad on August 28, 2025 09:44
; - A header-local bitcast i32 -> <4 x i8> exists and feeds the vector add.
; - The loopexit produces a bitcast <4 x i8> -> i32 for the backedge.

target triple = "amdgcn-amd-amdhsa"

Contributor

triple and target-cpu attribute are redundant with the command line, remove one or the other

Contributor Author

Fixed - now only in the command line


target triple = "amdgcn-amd-amdhsa"

define amdgpu_kernel void @lro_coerce_v4i8_phi(i8* nocapture %p, i32 %n) #0 {

Contributor

Use opaque pointers

Contributor Author

Fixed i8* -> ptr

bool isOpLegal(Instruction *I) { return isa<StoreInst, IntrinsicInst>(I); }
bool isOpLegal(Instruction *I) {
if (auto *Intr = dyn_cast<IntrinsicInst>(I))
return true; // FIXME: narrow to known native intrinsics (DOT/MFMA/tbuffer) or use TTI cost.

Contributor

Can't TTI directly handle the entire function already?

Contributor Author (@michaelselehov, Aug 28, 2025)

Thanks, Matt — agreed that relying on TTI is the right direction here.

Before I reshuffle the functional block, what do you think about rewriting
isOpLegal as a profitability check and making TTI the primary signal, with a
very narrow SDWA safety-net to avoid re-introducing the regression we just fixed?

bool isProfitableSink(Instruction *I) {
  // 1) Always-profitable sinks (status quo).
  if (isa<StoreInst>(I) || isa<IntrinsicInst>(I))
    return true;

  // 2) SDWA safety-net: tiny vector binops that are known to lower well.
  // Keeps the v4i8/v2i16 loop-header case profitable on gfx9+.
  if (auto *BO = dyn_cast<BinaryOperator>(I))
    if (auto *VT = dyn_cast<VectorType>(BO->getType()))
      if (Type *Elt = VT->getElementType(); Elt->isIntegerTy()) {
        TypeSize Bits = VT->getPrimitiveSizeInBits();
        if ((Elt->getIntegerBitWidth() == 8 ||
             (Elt->getIntegerBitWidth() == 16 && ST.hasSDWA())) &&
            !Bits.isScalable() && Bits.getFixedValue() <= 32) {
          switch (BO->getOpcode()) {
          case Instruction::Add:
          case Instruction::Sub:
          case Instruction::And:
          case Instruction::Or:
          case Instruction::Xor:
            return true;
          default:
            break;
          }
        }
      }

  // 3) Default: use TTI cost (same spirit as the earlier change).
  auto C = TTI.getInstructionCost(
      I, TargetTransformInfo::TargetCostKind::TCK_SizeAndLatency);
  return C.isValid() && C.getValue() >= 8;
}

Contributor

I think TTI already has its own collection of hacks for these costs? We shouldn't need to implement them twice?

Contributor Author

Thanks. I'll look at what TTI can provide. If TTI already provides a reliable signal for these cases, I'll switch the gate to be TTI-only. If it doesn't, I'm not sure the change to TTI should be part of this PR.

Contributor Author

Sorry, posted this reply in the wrong place. Reposting here.

@arsenm, I instrumented the TTI queries on gfx90a: add <4 x i8> comes out at cost 4 with TCK_SizeAndLatency (and likewise for getArithmeticInstrCost), which is below the previous profitability threshold (8). So switching to TTI-only would reintroduce the regression. I propose to keep the very narrow SDWA safety-net for v4i8/v2i16 (≤32b) here and look at improving AMDGPU TTI separately if needed.
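For reference, a minimal sketch of the kind of cost query described above, assuming a TargetTransformInfo reference is at hand; the helper name and the AddV4I8 parameter are illustrative placeholders, not code from the patch:

#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Ask TTI what a given instruction (e.g. an `add <4 x i8>`) costs under
// TCK_SizeAndLatency and print the result, mirroring the instrumentation
// mentioned above.
static void dumpSizeAndLatencyCost(const TargetTransformInfo &TTI,
                                   const Instruction *AddV4I8) {
  InstructionCost C = TTI.getInstructionCost(
      AddV4I8, TargetTransformInfo::TCK_SizeAndLatency);
  errs() << "cost (SizeAndLatency): " << C << "\n";
}

On gfx90a this is the query that reportedly returns 4 for add <4 x i8>, below the previous profitability threshold of 8.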

%i3 = trunc i32 %i3i to i8

; Pack them into <4 x i8>.
%v01 = insertelement <4 x i8> undef, i8 %i0, i32 0

Contributor

Suggested change
%v01 = insertelement <4 x i8> undef, i8 %i0, i32 0
%v01 = insertelement <4 x i8> poison, i8 %i0, i32 0

Contributor Author

Changed to zeroinitializer already. Can change to poison if you think it's better.

@jayfoad (Contributor) commented Aug 28, 2025

[AMDGPU][LRO] LRO fix PHI same-BB filter; treat i8/i16 binops as profitable

What is LRO? No need to mention it twice, anyway.

@michaelselehov changed the title from "[AMDGPU][LRO] LRO fix PHI same-BB filter; treat i8/i16 binops as profitable" to "[AMDGPU] LiveRegOptimizer: fix PHI same-BB filter; consider i8/i16 binops on SDWA" on Aug 28, 2025
@michaelselehov (Contributor Author)

[AMDGPU][LRO] LRO fix PHI same-BB filter; treat i8/i16 binops as profitable

What is LRO? No need to mention it twice, anyway.

Updated the title

if (auto *BO = dyn_cast<BinaryOperator>(I)) {
if (auto *VTy = dyn_cast<VectorType>(BO->getType())) {
Type *Elt = VTy->getElementType();
// Treat small-int vector binops as profitable when SDWA is available.

Contributor

duplicate comment

Contributor Author

Thank you! Fixed.

@choikwa (Contributor) commented Sep 4, 2025

LGTM, I will defer to others

// (e.g. byte-wise vector binops). We *do* want to coerce across the
// backedge in that common case to enable packed i32 + SDWA lowering.
if (CII->getParent() == II->getParent() && !IsLookThru(CII) &&
!isa<PHINode>(II))

Contributor

The addition of phi here feels like a separate change from the above?

Contributor Author

You’re right—there are two separate changes:

  • The II → CII fix is a correctness bug: we must examine the user when pruning same-BB paths, otherwise the walk can terminate prematurely.
  • The PHI exception is a small policy tweak: loop headers commonly consume the carried value in the header block via non-lookthrough ops (e.g., byte-wise vector binops). Without allowing that same-BB non-lookthrough use for PHI, the walk never reaches the profitable sink—even with a better cost model—so the regression remains.

Contributor

Can you split this into a separate PR? It's hard to see that all the cases are adequately tested when they're together

Contributor Author

Sure, I can. Given this PR will be for isOpLegal/isProfitableSink improvement, what would you like to see separated? II->CII fix? PHI exception? Or maybe two separate PRs, one for each?

Contributor Author

@arsenm, I instrumented the TTI queries on gfx90a: add <4 x i8> comes out at cost 4 with TCK_SizeAndLatency (and likewise for getArithmeticInstrCost), which is below the previous profitability threshold (8). So switching to TTI-only would reintroduce the regression. I propose to keep the very narrow SDWA safety-net for v4i8/v2i16 (≤32b) here and look at improving AMDGPU TTI separately if needed.

Contributor

Yes, TTI is a mess and we need to go fix all the costs, especially for illegal operations. In particular the costs for memory instructions are way too low


bool isOpLegal(Instruction *I) { return isa<StoreInst, IntrinsicInst>(I); }
bool isOpLegal(Instruction *I) {
if (dyn_cast<IntrinsicInst>(I))

Contributor

Suggested change
if (dyn_cast<IntrinsicInst>(I))
if (isa<IntrinsicInst>(I))

But probably should restrict this to target intrinsics
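A hedged sketch of what such a restriction could look like; the two dot-product intrinsic IDs are only examples of AMDGPU intrinsics that consume packed small-int data, not a list taken from the patch:

#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"

using namespace llvm;

// Accept only selected AMDGPU intrinsics instead of every IntrinsicInst.
static bool isWhitelistedTargetIntrinsic(const Instruction *I) {
  const auto *Intr = dyn_cast<IntrinsicInst>(I);
  if (!Intr)
    return false;
  switch (Intr->getIntrinsicID()) {
  case Intrinsic::amdgcn_sdot4: // llvm.amdgcn.sdot4
  case Intrinsic::amdgcn_udot4: // llvm.amdgcn.udot4
    return true;
  default:
    return false;
  }
}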

}

bool isOpLegal(Instruction *I) { return isa<StoreInst, IntrinsicInst>(I); }
bool isOpLegal(Instruction *I) {

Contributor

Suggested change
bool isOpLegal(Instruction *I) {
bool isOpLegal(const Instruction *I) {

// We explicitly gate to 8/16-bit to avoid i1 vectors and keep behavior
// tight.
// Require SDWA for both i8 and i16, and keep vectors within 32 bits.
std::optional<unsigned> Bits = VTy->getPrimitiveSizeInBits();

Contributor

If you're so specifically checking these types, might as well keep this in terms of element type + element count instead of checking the full width of the vector
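A possible shape for that check, as a hedged sketch rather than the patch's final form; HasSDWA stands in for the subtarget query the pass already uses:

#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/InstrTypes.h"

using namespace llvm;

// Gate on element type and element count directly: v4i8 and v2i16 are the two
// shapes that repack into a single 32-bit register.
static bool isSmallIntVecShape(const BinaryOperator *BO, bool HasSDWA) {
  const auto *VTy = dyn_cast<FixedVectorType>(BO->getType());
  if (!VTy || !HasSDWA)
    return false;
  Type *EltTy = VTy->getElementType();
  unsigned NumElts = VTy->getNumElements();
  return (EltTy->isIntegerTy(8) && NumElts == 4) ||
         (EltTy->isIntegerTy(16) && NumElts == 2);
}

This is equivalent to the 32-bit total-width gate for the shapes of interest, while being explicit about which shapes those are.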

// (e.g. byte-wise vector binops). We *do* want to coerce across the
// backedge in that common case to enable packed i32 + SDWA lowering.
if (CII->getParent() == II->getParent() && !IsLookThru(CII) &&
!isa<PHINode>(II))

Contributor

Yes, TTI is a mess and we need to go fix all the costs, especially for illegal operations. In particular the costs for memory instructions are way too low

@michaelselehov force-pushed the amdgpu-fix-lro-coerce-v4i8-phi-loop branch from fbda858 to e61e251 on September 30, 2025 12:50
@michaelselehov (Contributor Author)

@arsenm would you please come back to this PR? Thank you!

@ronlieb self-requested a review on October 9, 2025 12:37
@ronlieb (Contributor) left a comment

LGTM. Need @arsenm to approve as well.

@ronlieb requested a review from bcahoon on October 23, 2025 11:55