Changes from 1 commit
llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp (37 additions & 2 deletions)

@@ -126,7 +126,37 @@ class LiveRegOptimizer {
     return LK.first != TargetLoweringBase::TypeLegal;
   }

-  bool isOpLegal(Instruction *I) { return isa<StoreInst, IntrinsicInst>(I); }
+  bool isOpLegal(Instruction *I) {
Contributor:
Suggested change:
-  bool isOpLegal(Instruction *I) {
+  bool isOpLegal(const Instruction *I) {

+    if (isa<IntrinsicInst>(I))
+      return true; // FIXME: narrow to known native intrinsics (DOT/MFMA/tbuffer) or use TTI cost.
Contributor:
Can't TTI directly handle the entire function already?

Author (michaelselehov), Aug 28, 2025:

Thanks, Matt. Agreed that relying on TTI is the right direction here.

Before I reshuffle the functional block, what do you think about rewriting
isOpLegal as a profitability check and making TTI the primary signal, with a
very narrow SDWA safety-net to avoid re-introducing the regression we just fixed?

```cpp
bool isProfitableSink(Instruction *I) {
  // 1) Always-profitable sinks (status quo).
  if (isa<StoreInst>(I) || isa<IntrinsicInst>(I))
    return true;

  // 2) SDWA safety-net: tiny vector binops that are known to lower well.
  //    Keeps the v4i8/v2i16 loop-header case profitable on gfx9+.
  if (auto *BO = dyn_cast<BinaryOperator>(I))
    if (auto *VT = dyn_cast<FixedVectorType>(BO->getType())) {
      Type *Elt = VT->getElementType();
      if (Elt->isIntegerTy() &&
          (Elt->getIntegerBitWidth() == 8 ||
           (Elt->getIntegerBitWidth() == 16 && ST.hasSDWA())) &&
          VT->getPrimitiveSizeInBits().getFixedValue() <= 32) {
        switch (BO->getOpcode()) {
        case Instruction::Add:
        case Instruction::Sub:
        case Instruction::And:
        case Instruction::Or:
        case Instruction::Xor:
          return true;
        default:
          break;
        }
      }
    }

  // 3) Default: use TTI cost (same spirit as the earlier change).
  InstructionCost C = TTI.getInstructionCost(
      I, TargetTransformInfo::TCK_SizeAndLatency);
  return C.isValid() && C.getValue() >= 8;
}
```

Contributor:
I think TTI already has its own collection of hacks for these costs? We shouldn't need to implement them twice.

Author (michaelselehov):

Thanks. I'll look into what TTI can provide. If TTI already provides a reliable signal for these cases, I'll switch the gate to be TTI-only. If it doesn't, I'm not sure the TTI change should be part of this PR.

Author (michaelselehov):

Sorry, posted this reply in the wrong place. Reposting here.

@arsenm, I instrumented the TTI queries on gfx90a: add <4 x i8> comes out at cost 4 with TCK_SizeAndLatency (and likewise for getArithmeticInstrCost), which is below the previous profitability threshold (8). So switching to TTI-only would reintroduce the regression. I propose to keep the very narrow SDWA safety-net for v4i8/v2i16 (≤32b) here and look at improving AMDGPU TTI separately if needed.
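
[Editor's note: a minimal sketch of the kind of query behind those cost numbers. The PR does not show the actual instrumentation; the helper name dumpCoercionCosts and the errs() output format are illustrative, but both TTI calls exist with these signatures.]

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

// Illustrative only: query the same TCK_SizeAndLatency costs the pass would
// gate on, for a single instruction such as `add <4 x i8>` on gfx90a.
static void dumpCoercionCosts(const TargetTransformInfo &TTI, Instruction *I) {
  // Generic per-instruction cost, as used by the profitability threshold.
  InstructionCost Generic =
      TTI.getInstructionCost(I, TargetTransformInfo::TCK_SizeAndLatency);
  errs() << "getInstructionCost: " << Generic << " for " << *I << "\n";
  // Arithmetic-specific cost for binops, for comparison.
  if (auto *BO = dyn_cast<BinaryOperator>(I)) {
    InstructionCost Arith = TTI.getArithmeticInstrCost(
        BO->getOpcode(), BO->getType(),
        TargetTransformInfo::TCK_SizeAndLatency);
    errs() << "getArithmeticInstrCost: " << Arith << "\n";
  }
}
```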


+    // Any store is a profitable sink (prevents flip-flopping).
+    if (isa<StoreInst>(I))
+      return true;
+
+    // Treat small-int vector binops as profitable when SDWA is available.
+    if (auto *BO = dyn_cast<BinaryOperator>(I)) {
+      if (auto *VTy = dyn_cast<VectorType>(BO->getType())) {
+        Type *Elt = VTy->getElementType();
+        // Treat small-int vector binops as profitable when SDWA is available.
Contributor:
Duplicate comment.

Author (michaelselehov):
Thank you! Fixed.

+        // We explicitly gate to 8/16-bit to avoid i1 vectors and keep behavior tight.
+        if (Elt->isIntegerTy(8) || (Elt->isIntegerTy(16) && ST.hasSDWA())) {
+          switch (BO->getOpcode()) {
+          case Instruction::Add:
+          case Instruction::Sub:
+          case Instruction::And:
+          case Instruction::Or:
+          case Instruction::Xor:
+            return true;
+          default:
+            break;
+          }
+        }
+      }
+    }
+
+    return false;
+  }

   bool isCoercionProfitable(Instruction *II) {
     SmallPtrSet<Instruction *, 4> CVisited;

@@ -150,7 +180,12 @@ class LiveRegOptimizer {
       if (!CVisited.insert(CII).second)
         continue;

-      if (CII->getParent() == II->getParent() && !IsLookThru(II))
+      // Allow same-BB non-lookthrough users when the def is a PHI: loop
+      // headers frequently consume the carried value in the header block
+      // (e.g. byte-wise vector binops). We *do* want to coerce across the
+      // backedge in that common case to enable packed i32 + SDWA lowering.
+      if (CII->getParent() == II->getParent() && !IsLookThru(CII) &&
+          !isa<PHINode>(II))
Contributor:
The addition of phi here feels like a separate change from the above?

Author (michaelselehov):

You're right, there are two separate changes:

  • The II → CII fix is a correctness fix: we must examine the user when pruning same-BB paths, otherwise the walk can terminate prematurely.
  • The PHI exception is a small policy tweak: loop headers commonly consume the carried value in the header block via non-lookthrough ops (e.g., byte-wise vector binops). Without allowing that same-BB non-lookthrough use for PHI defs, the walk never reaches the profitable sink, even with a better cost model, so the regression remains.
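
[Editor's note: for concreteness, a minimal IR sketch of the pattern being described. It mirrors the test added below; the value names are illustrative.]

```llvm
loop:
  ; Loop-carried def: a PHI of <4 x i8> in the loop header.
  %acc = phi <4 x i8> [ zeroinitializer, %entry ], [ %acc.next, %loop ]
  ; Same-BB, non-lookthrough user of that PHI: a byte-wise vector add.
  ; Without the PHI exception, this use prunes the walk before it reaches
  ; a profitable sink, so %acc is never coerced to i32.
  %acc.next = add <4 x i8> %acc, %v
  br i1 %cond, label %loop, label %exit
```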

Contributor:
Can you split this into a separate PR? It's hard to see that all the cases are adequately tested when they're together.

Author (michaelselehov):

Sure, I can. Given this PR will be for the isOpLegal/isProfitableSink improvement, what would you like to see separated? The II->CII fix? The PHI exception? Or both, each in its own PR?


Contributor:
Yes, TTI is a mess and we need to go fix all the costs, especially for illegal operations. In particular, the costs for memory instructions are way too low.

         continue;

       if (isOpLegal(CII))
llvm/test/CodeGen/AMDGPU/lro-coerce-v4i8-phi-loop.ll (new file, 67 additions)

@@ -0,0 +1,67 @@
; REQUIRES: amdgpu-registered-target
; RUN: opt -S -passes=amdgpu-late-codegenprepare \
; RUN: -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a %s | FileCheck %s

; Purpose:
; - Input has a loop-carried PHI of type <4 x i8> and byte-wise adds in the
; loop header (same basic block as the PHI).
; - After amdgpu-late-codegenprepare, the PHI must be coerced to i32 across
; the backedge, and a single dominating "bitcast i32 -> <4 x i8>" must be
; placed in the header (enabling SDWA-friendly lowering later).
;
; What we check:
; - PHI is i32 (no loop-carried <4 x i8> PHI remains).
; - A header-local bitcast i32 -> <4 x i8> exists and feeds the vector add.
; - The loopexit produces a bitcast <4 x i8> -> i32 for the backedge.

target triple = "amdgcn-amd-amdhsa"
Contributor:
The triple and the target-cpu attribute are redundant with the command line; remove one or the other.

Author (michaelselehov):
Fixed: they are now only in the command line.


define amdgpu_kernel void @lro_coerce_v4i8_phi(i8* nocapture %p, i32 %n) #0 {
Contributor:
Use opaque pointers.

Author (michaelselehov):
Fixed: i8* -> ptr.

entry:
br label %loop

loop:
; Loop index
%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]

; Loop-carried accumulator in vector-of-bytes form (problematic on input).
%acc = phi <4 x i8> [ zeroinitializer, %entry ], [ %acc.next, %loop ]

; Make up four i8 values derived from %i to avoid memory noise.
%i0 = trunc i32 %i to i8
%i1i = add i32 %i, 1
%i1 = trunc i32 %i1i to i8
%i2i = add i32 %i, 2
%i2 = trunc i32 %i2i to i8
%i3i = add i32 %i, 3
%i3 = trunc i32 %i3i to i8

; Pack them into <4 x i8>.
%v01 = insertelement <4 x i8> undef, i8 %i0, i32 0
Contributor:
Suggested change:
-%v01 = insertelement <4 x i8> undef, i8 %i0, i32 0
+%v01 = insertelement <4 x i8> poison, i8 %i0, i32 0

Author (michaelselehov):
Changed to zeroinitializer already. Can change to poison if you think it's better.

%v02 = insertelement <4 x i8> %v01, i8 %i1, i32 1
%v03 = insertelement <4 x i8> %v02, i8 %i2, i32 2
%v = insertelement <4 x i8> %v03, i8 %i3, i32 3

; Byte-wise add in the same block as the PHI (this must make coercion profitable).
%acc.next = add <4 x i8> %acc, %v

; Loop control.
%i.next = add i32 %i, 4
%cond = icmp slt i32 %i.next, %n
br i1 %cond, label %loop, label %exit

exit:
ret void
}

attributes #0 = { "target-cpu"="gfx90a" }

; CHECK-LABEL: define amdgpu_kernel void @lro_coerce_v4i8_phi(
; CHECK: loop:
; CHECK: %i = phi i32
; CHECK-NOT: phi <4 x i8>
; CHECK: %[[ACCI32:[^ ]+]] = phi i32
; CHECK-NEXT: %[[HDRCAST:[^ ]+]] = bitcast i32 %[[ACCI32]] to <4 x i8>
; CHECK: add <4 x i8> %[[HDRCAST]],
; CHECK: br i1
