Skip to content

Conversation

@VigneshwarJ
Copy link
Contributor

@VigneshwarJ VigneshwarJ commented May 12, 2025

The order of if and else blocks can introduce unnecessary VGPR copies.
Consider the case of an if-else block where the incoming phi from the 'Else block' only contains zero-cost instructions, and the 'Then' block modifies some value. There would be no interference when coalescing because only one value is live at any point before structurization. However, in the structurized CFG, the Then value is live at 'Else' block due to the path if→flow→else, leading to additional VGPR copies.

This patch addresses the issue by:

  • Identifying PHI nodes with zero-cost incoming values from the Else block and hoisting those values to the nearest common dominator of the Then and Else blocks.
  • Updating Flow PHI nodes by replacing poison entries (on the if→flow edge) with the correct hoisted values.

Then and Else block order in SCC is arbitrary. But based on the
order, after structurization there are cases where there might be
extra VGPR copies due to interference during register coelescing.

This patch introduces heuristics to order the then and else block
based on the potential VGPR copies to maximize coelescing.
@llvmbot
Copy link
Member

llvmbot commented May 12, 2025

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-transforms

Author: Vigneshwar Jayakumar (VigneshwarJ)

Changes

Then and Else block order in SCC is arbitrary. But based on the order, after structurization there are cases where there might be extra VGPR copies due to interference during register coelescing.

This patch introduces heuristics to order the then and else block based on the potential VGPR copies to maximize coelescing.


Full diff: https://github.com/llvm/llvm-project/pull/139605.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/StructurizeCFG.cpp (+80)
  • (added) llvm/test/Transforms/StructurizeCFG/order-if-else.ll (+129)
diff --git a/llvm/lib/Transforms/Scalar/StructurizeCFG.cpp b/llvm/lib/Transforms/Scalar/StructurizeCFG.cpp
index eb22b50532695..ec54f53d6165b 100644
--- a/llvm/lib/Transforms/Scalar/StructurizeCFG.cpp
+++ b/llvm/lib/Transforms/Scalar/StructurizeCFG.cpp
@@ -307,6 +307,8 @@ class StructurizeCFG {
 
   RegionNode *PrevNode;
 
+  void reorderIfElseBlock(BasicBlock *BB, unsigned Idx);
+
   void orderNodes();
 
   void analyzeLoops(RegionNode *N);
@@ -409,6 +411,31 @@ class StructurizeCFGLegacyPass : public RegionPass {
 
 } // end anonymous namespace
 
+/// Helper function for heuristics to order if else block
+/// Checks whether an instruction is potential vector copy instruction, if so,
+/// checks if the operands are from different BB. if so, returns True.
+// Then there's a possibility of coelescing without interference when ordered
+// first.
+static bool hasAffectingInstructions(Instruction *I, BasicBlock *BB) {
+
+  if (!I || I->getParent() != BB)
+    return true;
+
+  // If the instruction is not a poterntial copy instructoin, return true.
+  if (!isa<ExtractElementInst>(*I) && !isa<ExtractValueInst>(*I))
+    return false;
+
+  // Check if any operands are instructions defined in the same block.
+  for (unsigned i = 0, e = I->getNumOperands(); i < e; ++i) {
+    if (auto *OpI = dyn_cast<Instruction>(I->getOperand(i))) {
+      if (OpI->getParent() == BB)
+        return false;
+    }
+  }
+
+  return true;
+}
+
 char StructurizeCFGLegacyPass::ID = 0;
 
 INITIALIZE_PASS_BEGIN(StructurizeCFGLegacyPass, "structurizecfg",
@@ -419,6 +446,58 @@ INITIALIZE_PASS_DEPENDENCY(RegionInfoPass)
 INITIALIZE_PASS_END(StructurizeCFGLegacyPass, "structurizecfg",
                     "Structurize the CFG", false, false)
 
+/// Then and Else block order in SCC is arbitrary. But based on the
+/// order, after structurization there are cases where there might be extra 
+/// VGPR copies due to interference during register coelescing.
+///  eg:- incoming phi values from Else block contains only vgpr copies and
+///  incoming phis in Then block has are some modification for the vgprs.
+/// after structurization, there would be interference when coelesing when Then
+/// block is ordered first. But those copies can be coelesced when Else is
+/// ordered first.
+///
+/// This function checks the incoming phi values in the merge block and
+/// orders based on the following heuristics  of Then and Else block. Checks
+/// whether an incoming phi can be potential copy instructions and if so 
+/// checks whether copy within the block or not. 
+/// Increases score if its a potential copy from outside the block. 
+/// the higher scored block is ordered first.
+void StructurizeCFG::reorderIfElseBlock(BasicBlock *BB, unsigned Idx) {
+  BranchInst *Term = dyn_cast<BranchInst>(BB->getTerminator());
+
+  if (Term && Term->isConditional()) {
+    BasicBlock *ThenBB = Term->getSuccessor(0);
+    BasicBlock *ElseBB = Term->getSuccessor(1);
+    BasicBlock *ThenSucc = ThenBB->getSingleSuccessor();
+
+    if (BB == ThenBB->getSinglePredecessor() &&
+        (ThenBB->getSinglePredecessor() == ElseBB->getSinglePredecessor()) &&
+        (ThenSucc && ThenSucc == ElseBB->getSingleSuccessor())) {
+      unsigned ThenScore = 0, ElseScore = 0;
+
+      for (PHINode &Phi : ThenSucc->phis()) {
+        Value *ThenVal = Phi.getIncomingValueForBlock(ThenBB);
+        Value *ElseVal = Phi.getIncomingValueForBlock(ElseBB);
+
+        if (auto *Inst = dyn_cast<Instruction>(ThenVal))
+          ThenScore += hasAffectingInstructions(Inst, ThenBB);
+        if (auto *Inst = dyn_cast<Instruction>(ElseVal))
+          ElseScore += hasAffectingInstructions(Inst, ElseBB);
+      }
+
+      if (ThenScore != ElseScore) {
+        if (ThenScore < ElseScore)
+          std::swap(ThenBB, ElseBB);
+
+        // reorder the last two inserted elements in Order
+        if (Idx >= 2 && Order[Idx - 1]->getEntry() == ElseBB &&
+            Order[Idx - 2]->getEntry() == ThenBB) {
+          std::swap(Order[Idx - 1], Order[Idx - 2]);
+        }
+      }
+    }
+  }
+}
+
 /// Build up the general order of nodes, by performing a topological sort of the
 /// parent region's nodes, while ensuring that there is no outer cycle node
 /// between any two inner cycle nodes.
@@ -452,6 +531,7 @@ void StructurizeCFG::orderNodes() {
       // Add the SCC nodes to the Order array.
       for (const auto &N : SCC) {
         assert(I < E && "SCC size mismatch!");
+        reorderIfElseBlock(N.first->getEntry(), I);
         Order[I++] = N.first;
       }
     }
diff --git a/llvm/test/Transforms/StructurizeCFG/order-if-else.ll b/llvm/test/Transforms/StructurizeCFG/order-if-else.ll
new file mode 100644
index 0000000000000..02641f405f3b4
--- /dev/null
+++ b/llvm/test/Transforms/StructurizeCFG/order-if-else.ll
@@ -0,0 +1,129 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -structurizecfg %s -o - | FileCheck %s
+; RUN: opt -S -passes=structurizecfg %s -o - | FileCheck %s
+
+define amdgpu_kernel void @test_extractelement_1(<4 x i32> %vec, i1 %cond, ptr %ptr) {
+; CHECK-LABEL: define amdgpu_kernel void @test_extractelement_1(
+; CHECK-SAME: <4 x i32> [[VEC:%.*]], i1 [[COND:%.*]], ptr [[PTR:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[COND_INV:%.*]] = xor i1 [[COND]], true
+; CHECK-NEXT:    br i1 [[COND_INV]], label %[[ELSE:.*]], label %[[FLOW:.*]]
+; CHECK:       [[FLOW]]:
+; CHECK-NEXT:    [[TMP0:%.*]] = phi i32 [ [[A:%.*]], %[[ELSE]] ], [ poison, %[[ENTRY]] ]
+; CHECK-NEXT:    [[TMP1:%.*]] = phi i1 [ false, %[[ELSE]] ], [ true, %[[ENTRY]] ]
+; CHECK-NEXT:    br i1 [[TMP1]], label %[[THEN:.*]], label %[[MERGE:.*]]
+; CHECK:       [[THEN]]:
+; CHECK-NEXT:    [[X:%.*]] = extractelement <4 x i32> [[VEC]], i32 0
+; CHECK-NEXT:    [[Z:%.*]] = add i32 [[X]], 1
+; CHECK-NEXT:    br label %[[MERGE]]
+; CHECK:       [[ELSE]]:
+; CHECK-NEXT:    [[A]] = extractelement <4 x i32> [[VEC]], i32 1
+; CHECK-NEXT:    br label %[[FLOW]]
+; CHECK:       [[MERGE]]:
+; CHECK-NEXT:    [[PHI:%.*]] = phi i32 [ [[TMP0]], %[[FLOW]] ], [ [[Z]], %[[THEN]] ]
+; CHECK-NEXT:    store i32 [[PHI]], ptr [[PTR]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  br i1 %cond, label %then, label %else
+
+then:
+  %x = extractelement <4 x i32> %vec, i32 0
+  %z = add i32 %x, 1
+  br label %merge
+
+else:
+  %a = extractelement <4 x i32> %vec, i32 1
+  br label %merge
+
+merge:
+  %phi = phi i32 [ %z, %then ], [ %a, %else ]
+  store i32 %phi, ptr  %ptr
+  ret void
+}
+
+define amdgpu_kernel void @test_extractelement_2(<4 x i32> %vec, i1 %cond, ptr %ptr) {
+; CHECK-LABEL: define amdgpu_kernel void @test_extractelement_2(
+; CHECK-SAME: <4 x i32> [[VEC:%.*]], i1 [[COND:%.*]], ptr [[PTR:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[COND_INV:%.*]] = xor i1 [[COND]], true
+; CHECK-NEXT:    br i1 [[COND_INV]], label %[[ELSE:.*]], label %[[FLOW:.*]]
+; CHECK:       [[FLOW]]:
+; CHECK-NEXT:    [[TMP0:%.*]] = phi i32 [ [[A:%.*]], %[[ELSE]] ], [ poison, %[[ENTRY]] ]
+; CHECK-NEXT:    [[TMP1:%.*]] = phi i1 [ false, %[[ELSE]] ], [ true, %[[ENTRY]] ]
+; CHECK-NEXT:    br i1 [[TMP1]], label %[[THEN:.*]], label %[[MERGE:.*]]
+; CHECK:       [[THEN]]:
+; CHECK-NEXT:    [[X:%.*]] = extractelement <4 x i32> [[VEC]], i32 1
+; CHECK-NEXT:    [[Y:%.*]] = add i32 [[X]], 1
+; CHECK-NEXT:    [[VEC1:%.*]] = insertelement <4 x i32> poison, i32 [[Y]], i32 0
+; CHECK-NEXT:    [[Z:%.*]] = extractelement <4 x i32> [[VEC1]], i32 0
+; CHECK-NEXT:    br label %[[MERGE]]
+; CHECK:       [[ELSE]]:
+; CHECK-NEXT:    [[A]] = extractelement <4 x i32> [[VEC]], i32 1
+; CHECK-NEXT:    br label %[[FLOW]]
+; CHECK:       [[MERGE]]:
+; CHECK-NEXT:    [[PHI:%.*]] = phi i32 [ [[TMP0]], %[[FLOW]] ], [ [[Z]], %[[THEN]] ]
+; CHECK-NEXT:    store i32 [[PHI]], ptr [[PTR]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  br i1 %cond, label %then, label %else
+
+then:
+  %x = extractelement <4 x i32> %vec, i32 1
+  %y = add i32 %x, 1
+  %vec1 =  insertelement <4 x i32>  poison, i32 %y,  i32 0
+  %z = extractelement <4 x i32> %vec1, i32 0
+  br label %merge
+
+else:
+  %a = extractelement <4 x i32> %vec, i32 1
+  br label %merge
+
+merge:
+  %phi = phi i32 [ %z, %then ], [ %a, %else ]
+  store i32 %phi, ptr  %ptr
+  ret void
+}
+
+%pair = type { i32, i32 }
+define amdgpu_kernel void @test_extractvalue(ptr %ptr, i1 %cond) {
+; CHECK-LABEL: define amdgpu_kernel void @test_extractvalue(
+; CHECK-SAME: ptr [[PTR:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[LOAD_THEN:%.*]] = load [[PAIR:%.*]], ptr [[PTR]], align 4
+; CHECK-NEXT:    br i1 [[COND]], label %[[THEN:.*]], label %[[FLOW:.*]]
+; CHECK:       [[THEN]]:
+; CHECK-NEXT:    [[A_THEN:%.*]] = extractvalue [[PAIR]] [[LOAD_THEN]], 0
+; CHECK-NEXT:    br label %[[FLOW]]
+; CHECK:       [[FLOW]]:
+; CHECK-NEXT:    [[TMP0:%.*]] = phi i32 [ [[A_THEN]], %[[THEN]] ], [ poison, %[[ENTRY]] ]
+; CHECK-NEXT:    [[TMP1:%.*]] = phi i1 [ false, %[[THEN]] ], [ true, %[[ENTRY]] ]
+; CHECK-NEXT:    br i1 [[TMP1]], label %[[ELSE:.*]], label %[[MERGE:.*]]
+; CHECK:       [[ELSE]]:
+; CHECK-NEXT:    [[A_ELSE:%.*]] = extractvalue [[PAIR]] [[LOAD_THEN]], 0
+; CHECK-NEXT:    [[SUM_ELSE:%.*]] = add i32 [[A_ELSE]], 1
+; CHECK-NEXT:    br label %[[MERGE]]
+; CHECK:       [[MERGE]]:
+; CHECK-NEXT:    [[PHI:%.*]] = phi i32 [ [[TMP0]], %[[FLOW]] ], [ [[SUM_ELSE]], %[[ELSE]] ]
+; CHECK-NEXT:    store i32 [[PHI]], ptr [[PTR]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  %load_then = load %pair, ptr %ptr
+  br i1 %cond, label %then, label %else
+
+then:
+  %a_then = extractvalue %pair %load_then, 0
+  br label %merge
+
+else:
+  %a_else = extractvalue %pair %load_then, 0
+  %sum_else = add i32 %a_else, 1
+  br label %merge
+
+merge:
+  %phi = phi i32  [ %a_then, %then ], [ %sum_else, %else ]
+  store i32 %phi, ptr  %ptr
+  ret void
+}

@github-actions
Copy link

github-actions bot commented May 12, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Comment on lines 425 to 426
if (!isa<ExtractElementInst>(*I) && !isa<ExtractValueInst>(*I))
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't understand looking for special case instruction types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best way these heuristics will work is if we know which are going to be copy instructions after lowering. Here I just look at these two because mostly these are getting lowered to copy instruction. Another way of calculating heuristics for ordering was to see whether the incoming phi value instruction defined by the operands of the same block or different block. but this is too broad and can have unintentional reorders.

Comment on lines 449 to 450
/// Then and Else block order in SCC is arbitrary. But based on the
/// order, after structurization there are cases where there might be extra
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just sort blocks in topological order or something? I'm not sure I understand this heuristic

Copy link
Contributor Author

@VigneshwarJ VigneshwarJ May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blocks are already sorted in the topological order. But in case of if-then-else, topologically then,else or else,then is same. But based on the order, there are extra copies that leads to spill in large cases.

This is specifically seen in case where one block just copies values and other block modifies values.
godbolt link

if:
  %vec = <4 x i32> ...
  br i1 %c %then %else
then:
  %x = extractelement <4 x i32> %vec, i32 0
  %z = add i32 %x, 1
  br label %merge
else:
  %a = extractelement <4 x i32> %vec, i32 0
  br label %merge
merge:
  %phi = phi i32 [ %z, %then ], [ %a, %else ]
  store i32 %phi, ptr  %ptr
  ret void

After structurization and machine code , if 'then' block is ordered first
then at register coalescer

if:
   [v0 - v4] = ...
    cond_branch_scc then
tmp:
    v5 = 0
    branch Flow
then:
   v5 = copy v0  ; eliminated in reg coalescer
   v5 = v5 + 1  ; v5 = v0 + 1
flow:
   v6 = v5 ; eliminated
   cond_branch final
else:
   v6 = copy v0 ;  v5 = copy v0 
 ; this copy v5 = copy v0 can't be coalesced further then->flow->else live range 
final:
   store v6 ; store v5

but at the same time, if 'else' block is ordered first,

if:
   [v0 - v4] = ...
    cond_branch_scc else
tmp:
    v5 = 0
    branch Flow
else:
   v5 = copy v0 ; eliminated
flow:
   v6 = copy v5 ; eliminated
   cond_branch final
then:
   v6 = copy v0  ; eliminated 
   v6 = v6 + 1  ; v6 = v0 + 1 ; can be coalesced to v0 = v0 + 1
final:
   store v6 ; coalesced to  store v0

Theres an extra copy just due to ordering in the first case. Though then and else block won't be live at same time due to structurization, we see this.

This heuristics just sees instructions that can be lowered to copy instructions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything can be lowered to copied, initially most cross block values emit copies. Does this really mean "0" cost instructions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, changed the code to check for zero cost instructions

@arsenm arsenm requested review from ruiling and vpykhtin May 12, 2025 19:04
@ruiling
Copy link
Contributor

ruiling commented May 14, 2025

It feels like we hit the limitation in the register coalescer that it cannot do coalescing in the optimal way, and it is hard to solve the problem there. But ordering the blocks does not feel like the optimal way to solve the problem. Think what if you have two simultaneous values, one is modified in then and the other is modified in else. Each order would only help one value and leave a copy for the other value, right?

If you look at the IR before structurization, there is only one value alive at any program point. But after structurization, we get something like:

%entry:
  ...
  br i1 %cond, label %then, label %Flow

then:                                             ; preds = %entry
  %x = extractelement <4 x i32> %vec.load2, i32 0
  %z = add i32 %x, 1
  br label %Flow

Flow:                                             ; preds = %then, %entry
  %3 = phi i32 [ %z, %then ], [ poison, %entry ]
  %4 = phi i1 [ false, %then ], [ true, %entry ]
  br i1 %4, label %else, label %merge

else:                                             ; preds = %Flow
  %a = extractelement <4 x i32> %vec.load2, i32 1
  br label %merge

merge:                                            ; preds = %else, %Flow
  %phi = phi i32 [ %3, %Flow ], [ %a, %else ]

This is quite like the problem I solved in #101301, but more challenging. Can we try to simplify the phi by moving the extractelemnt into dominator? Like:

%entry:
  ...
  %a = extractelement <4 x i32> %vec.load2, i32 1
  br i1 %cond, label %then, label %Flow

then:                                             ; preds = %entry
  %x = extractelement <4 x i32> %vec.load2, i32 0
  %z = add i32 %x, 1
  br label %Flow

Flow:                                             ; preds = %then, %entry
  %3 = phi i32 [ %z, %then ], [ %a, %entry ]
  %4 = phi i1 [ false, %then ], [ true, %entry ]
  br i1 %4, label %else, label %merge

else:                                             ; preds = %Flow
  ; the extractelement was hoisted into the dominator.
  br label %merge

merge:                                            ; preds = %else, %Flow
  %phi = phi i32 [ %3, %Flow ], [ %3, %else ] ; this is actually %3, it would be better we can avoid the phi, but it does not hurt to leave the phi.

By doing this, we still have only one copy of the value being alive on the CFG post-structurization.

If we allocate VGPR on thread-CFG, this would be trivially solved, but it is a large project.

@VigneshwarJ
Copy link
Contributor Author

VigneshwarJ commented Jun 10, 2025

@ruiling Thanks for the suggestion.
I tried your suggestion of hoisting the zero-cost instructions from the if or else block to the parent block. But there are still extra copies due to interference.
Consider this code,

if:
  %load_then = load %pair, ptr %ptr
  br i1 %cond, label %then, label %else

then:
  %a_then = extractvalue %pair %load_then, 0
  br label %merge

else:
  %a_else = extractvalue %pair %load_then, 0
  %sum_else = add i32 %a_else, 1
  br label %merge

merge:
  %phi = phi i32  [ %a_then, %then ], [ %sum_else, %else ]
  store i32 %phi, ptr  %ptr
  ret void

structurizeCFG in trunk - https://godbolt.org/z/Gc6Wbocef

ISA with structurizeCFG with hoisting zero-cost instruction change - https://godbolt.org/z/cEzs31dYq.
ISA with structurizeCFG with reorder change - https://godbolt.org/z/WMcYhG3Wv.

There is an extra copy even with hoisting change because, in the default SCC order, 'else' block comes first arbitrarily. Though the 'then' block doesn't contain any code now, 'copy' can't be coalesced due to interference (else->flow->then path) like I mentioned in Matt's comment.

In the above example, using a pass like gvnhoist/simplifycfg would have eliminated the 'then' block after hoisting and simplifying. But if that block contains any other instructions, then the block will not be eliminated, and the problem will still exist.

But yes, the reorder heuristics is not perfect and will not work for the case you mentioned. But I felt that's still a better trade-off than the current implementation.

@ruiling
Copy link
Contributor

ruiling commented Jun 12, 2025

ISA with structurizeCFG with hoisting zero-cost instruction change - https://godbolt.org/z/cEzs31dYq.

You missed something important. Hoisting is only part of the suggestion. The phi network also needs some change.
The key-point is there is no phi with poison incoming value in Flow block. It is assigned the hoisted value for threads taking the if->Flow edge.

%pair = type { i32, i32 }
define amdgpu_kernel void @test_extractelement_then_else(<4 x i32> %vec, i1 %cond, ptr %ptr) {

if:
  %cond.inv = xor i1 %cond, true
  %load_then = load %pair, ptr %ptr, align 4
  %a_then = extractvalue %pair %load_then, 0
  br i1 %cond.inv, label %else, label %Flow

else:                                             ; preds = %if
  %a_else = extractvalue %pair %load_then, 0
  %sum_else = add i32 %a_else, 1
  br label %Flow

Flow:                                             ; preds = %else, %if
  %phi = phi i32 [ %sum_else, %else ], [ %a_then, %if ]
  %1 = phi i1 [ false, %else ], [ true, %if ]
  br i1 %1, label %then, label %merge

then:                                             ; preds = %Flow
  br label %merge

merge:                                            ; preds = %then, %Flow
  store i32 %phi, ptr %ptr, align 4
  ret void
}

@VigneshwarJ VigneshwarJ changed the title [StructurizeCFG] Order IF Else block using Heuristics [StructurizeCFG] Hoist zero-cost instructions in Else phi values Jun 24, 2025
@VigneshwarJ VigneshwarJ changed the title [StructurizeCFG] Hoist zero-cost instructions in Else phi values [StructurizeCFG] Hoist and simplify zero-cost incoming else phi values Jun 24, 2025
@VigneshwarJ
Copy link
Contributor Author

@ruiling , Sorry that was my bad, I completely missed that point. Fixed it now.

if (Visited.count(Other) && !Loops.count(Other) &&
!Pred.count(Other) && !Pred.count(P)) {

hoistZeroCostElseBlockPhiValues(Succ, Other);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added hoisting logic here, because the else block for structurized CFG is getting defined here. And there's no point in hoisting then block zero-cost instruction phis.

Copy link
Contributor

@ruiling ruiling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this! Frankly speaking, I am not quite sure whether this would cause register pressure increase in some case. But I think this mostly looks reasonable to me. Just some small nits.

INITIALIZE_PASS_END(StructurizeCFGLegacyPass, "structurizecfg",
"Structurize the CFG", false, false)

/// Because the SCC order of Then and Else blocks is arbitrary, structurization
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talking about the arbitrary order of then/else block is a little bit confusing here. Because we treated the first visited successor as Then block, and the other one as Else block throughout the code implementation. Maybe just remove it to avoid such confusion.

@@ -0,0 +1,163 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt -S -structurizecfg %s -o - | FileCheck %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think -structurizecfg is the old way to run pass with opt? We only need the second run.

@VigneshwarJ VigneshwarJ requested a review from arsenm July 2, 2025 00:15
/// if its hoisted to predecessor block. So, this returns true.
static bool isHoistableInstruction(Instruction *I, BasicBlock *BB,
TargetTransformInfo *TTI) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

return false;

// If the instruction is not a zero cost instruction, return false.
auto Cost = TTI->getInstructionCost(I, TargetTransformInfo::TCK_CodeSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why code size metric?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we are looking for instructions that are just cross-block zero cost copies, I thought Latency or recipThroughput did not matter. But if I am wrong in my understanding, I can change the metric.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AMDGPU cost model has seen close to 0 effort on getting accurate code sizes. You'll probably get better results for the default latency

return false;

// Check if any operands are instructions defined in the same block.
for (unsigned i = 0, e = I->getNumOperands(); i < e; ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use range loop

break;
}
if (PoisonValBBIdx == -1 ||
!DT->dominates(HoistedValues[V],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double map lookup, save the iterator in a find above instead of using count

@@ -0,0 +1,180 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX900 %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX900 %s
; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX900 %s

Comment on lines 3 to 4


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

@@ -0,0 +1,162 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt -S -passes=structurizecfg %s -o - | FileCheck %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
; RUN: opt -S -passes=structurizecfg %s -o - | FileCheck %s
; RUN: opt -S -passes=structurizecfg < %s | FileCheck %s

@VigneshwarJ VigneshwarJ requested a review from arsenm July 2, 2025 16:09
ConstantInt *BoolFalse;
Value *BoolPoison;

TargetTransformInfo *TTI;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TargetTransformInfo *TTI;
const TargetTransformInfo *TTI;

return false;
}
Function *F = R->getEntry()->getParent();
TargetTransformInfo *TTI =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TargetTransformInfo *TTI =
const TargetTransformInfo *TTI =

@VigneshwarJ VigneshwarJ merged commit 8d3f497 into llvm:main Jul 10, 2025
9 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Jul 10, 2025

LLVM Buildbot has detected a new failure on builder clang-hip-vega20 running on hip-vega20-0 while building llvm at step 3 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/123/builds/23108

Here is the relevant piece of the build log for the reference
Step 3 (annotate) failure: '../llvm-zorg/zorg/buildbot/builders/annotated/hip-build.sh --jobs=' (failure)
...
[22/59] Generating testplan.py
[23/59] Building C object tools/CMakeFiles/timeit-target.dir/timeit.c.o
[24/59] [TEST_SUITE_HOST_CC] Linking host executable timeit
[25/59] [TEST_SUITE_HOST_CC] Linking host executable fpcmp
[26/59] Building C object tools/CMakeFiles/fpcmp-target.dir/fpcmp.c.o
[27/59] Linking C executable tools/timeit-target
[28/59] Linking C executable tools/fpcmp-target
[29/59] Generating new.hip
[30/59] Generating algorithm.hip
[31/59] Building CXX object External/HIP/CMakeFiles/memmove-hip-6.3.0.dir/memmove.hip.o
FAILED: External/HIP/CMakeFiles/memmove-hip-6.3.0.dir/memmove.hip.o 
/home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/llvm/bin/clang++ -DNDEBUG  -O3 -DNDEBUG   -w -Werror=date-time --rocm-path=/opt/botworker/llvm/External/hip/rocm-6.3.0 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx1030 --offload-arch=gfx1100 -MD -MT External/HIP/CMakeFiles/memmove-hip-6.3.0.dir/memmove.hip.o -MF External/HIP/CMakeFiles/memmove-hip-6.3.0.dir/memmove.hip.o.d -o External/HIP/CMakeFiles/memmove-hip-6.3.0.dir/memmove.hip.o -c /home/botworker/bbot/clang-hip-vega20/llvm-test-suite/External/HIP/memmove.hip
PHI nodes not grouped at top of basic block!
  %i.06.ph = phi i64 [ 0, %for.body.preheader ], [ %n.vec, %middle.block ]
label %for.body.preheader
in function _Z11init_kernelPhm
fatal error: error in backend: Broken function found, compilation aborted!
clang++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
clang version 21.0.0git (https://github.com/llvm/llvm-project.git 8d3f497eb834a84b954241b8c4293f8387e75576)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/llvm/bin
Build config: +assertions
clang++: note: diagnostic msg: Error generating preprocessed source(s).
[32/59] Building CXX object External/HIP/CMakeFiles/new-hip-6.3.0.dir/new.hip.o
FAILED: External/HIP/CMakeFiles/new-hip-6.3.0.dir/new.hip.o 
/home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/llvm/bin/clang++ -DNDEBUG -DSTDLIB_VERSION=9999  -O3 -DNDEBUG   -w -Werror=date-time --rocm-path=/opt/botworker/llvm/External/hip/rocm-6.3.0 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx1030 --offload-arch=gfx1100 -MD -MT External/HIP/CMakeFiles/new-hip-6.3.0.dir/new.hip.o -MF External/HIP/CMakeFiles/new-hip-6.3.0.dir/new.hip.o.d -o External/HIP/CMakeFiles/new-hip-6.3.0.dir/new.hip.o -c /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/External/HIP/new.hip
PHI nodes not grouped at top of basic block!
  %3103 = phi ptr addrspace(1) [ %3135, %3133 ], [ %3099, %3097 ]
label %3097
in function __ockl_dm_alloc
fatal error: error in backend: Broken function found, compilation aborted!
clang++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
clang version 21.0.0git (https://github.com/llvm/llvm-project.git 8d3f497eb834a84b954241b8c4293f8387e75576)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/llvm/bin
Build config: +assertions
clang++: note: diagnostic msg: Error generating preprocessed source(s).
[33/59] Building CXX object External/HIP/CMakeFiles/empty-hip-6.3.0.dir/empty.hip.o
[34/59] Building CXX object External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o
FAILED: External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o 
/home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/llvm/bin/clang++ -DNDEBUG  -O3 -DNDEBUG   -w -Werror=date-time --rocm-path=/opt/botworker/llvm/External/hip/rocm-6.3.0 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx1030 --offload-arch=gfx1100 -xhip -mfma -MD -MT External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o -MF External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o.d -o External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o -c /home/botworker/bbot/clang-hip-vega20/llvm-test-suite/External/HIP/workload/ray-tracing/TheNextWeek/main.cc
PHI nodes not grouped at top of basic block!
  %3103 = phi ptr addrspace(1) [ %3135, %3133 ], [ %3099, %3097 ]
label %3097
in function __ockl_dm_alloc
fatal error: error in backend: Broken function found, compilation aborted!
clang++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Step 11 (Building HIP test-suite) failure: Building HIP test-suite (failure)
...
[22/59] Generating testplan.py
[23/59] Building C object tools/CMakeFiles/timeit-target.dir/timeit.c.o
[24/59] [TEST_SUITE_HOST_CC] Linking host executable timeit
[25/59] [TEST_SUITE_HOST_CC] Linking host executable fpcmp
[26/59] Building C object tools/CMakeFiles/fpcmp-target.dir/fpcmp.c.o
[27/59] Linking C executable tools/timeit-target
[28/59] Linking C executable tools/fpcmp-target
[29/59] Generating new.hip
[30/59] Generating algorithm.hip
[31/59] Building CXX object External/HIP/CMakeFiles/memmove-hip-6.3.0.dir/memmove.hip.o
FAILED: External/HIP/CMakeFiles/memmove-hip-6.3.0.dir/memmove.hip.o 
/home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/llvm/bin/clang++ -DNDEBUG  -O3 -DNDEBUG   -w -Werror=date-time --rocm-path=/opt/botworker/llvm/External/hip/rocm-6.3.0 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx1030 --offload-arch=gfx1100 -MD -MT External/HIP/CMakeFiles/memmove-hip-6.3.0.dir/memmove.hip.o -MF External/HIP/CMakeFiles/memmove-hip-6.3.0.dir/memmove.hip.o.d -o External/HIP/CMakeFiles/memmove-hip-6.3.0.dir/memmove.hip.o -c /home/botworker/bbot/clang-hip-vega20/llvm-test-suite/External/HIP/memmove.hip
PHI nodes not grouped at top of basic block!
  %i.06.ph = phi i64 [ 0, %for.body.preheader ], [ %n.vec, %middle.block ]
label %for.body.preheader
in function _Z11init_kernelPhm
fatal error: error in backend: Broken function found, compilation aborted!
clang++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
clang version 21.0.0git (https://github.com/llvm/llvm-project.git 8d3f497eb834a84b954241b8c4293f8387e75576)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/llvm/bin
Build config: +assertions
clang++: note: diagnostic msg: Error generating preprocessed source(s).
[32/59] Building CXX object External/HIP/CMakeFiles/new-hip-6.3.0.dir/new.hip.o
FAILED: External/HIP/CMakeFiles/new-hip-6.3.0.dir/new.hip.o 
/home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/llvm/bin/clang++ -DNDEBUG -DSTDLIB_VERSION=9999  -O3 -DNDEBUG   -w -Werror=date-time --rocm-path=/opt/botworker/llvm/External/hip/rocm-6.3.0 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx1030 --offload-arch=gfx1100 -MD -MT External/HIP/CMakeFiles/new-hip-6.3.0.dir/new.hip.o -MF External/HIP/CMakeFiles/new-hip-6.3.0.dir/new.hip.o.d -o External/HIP/CMakeFiles/new-hip-6.3.0.dir/new.hip.o -c /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/External/HIP/new.hip
PHI nodes not grouped at top of basic block!
  %3103 = phi ptr addrspace(1) [ %3135, %3133 ], [ %3099, %3097 ]
label %3097
in function __ockl_dm_alloc
fatal error: error in backend: Broken function found, compilation aborted!
clang++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
clang version 21.0.0git (https://github.com/llvm/llvm-project.git 8d3f497eb834a84b954241b8c4293f8387e75576)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/llvm/bin
Build config: +assertions
clang++: note: diagnostic msg: Error generating preprocessed source(s).
[33/59] Building CXX object External/HIP/CMakeFiles/empty-hip-6.3.0.dir/empty.hip.o
[34/59] Building CXX object External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o
FAILED: External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o 
/home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/llvm/bin/clang++ -DNDEBUG  -O3 -DNDEBUG   -w -Werror=date-time --rocm-path=/opt/botworker/llvm/External/hip/rocm-6.3.0 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx1030 --offload-arch=gfx1100 -xhip -mfma -MD -MT External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o -MF External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o.d -o External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o -c /home/botworker/bbot/clang-hip-vega20/llvm-test-suite/External/HIP/workload/ray-tracing/TheNextWeek/main.cc
PHI nodes not grouped at top of basic block!
  %3103 = phi ptr addrspace(1) [ %3135, %3133 ], [ %3099, %3097 ]
label %3097
in function __ockl_dm_alloc
fatal error: error in backend: Broken function found, compilation aborted!
clang++: error: clang frontend command failed with exit code 70 (use -v to see invocation)

@Kewen12
Copy link
Contributor

Kewen12 commented Jul 10, 2025

Hello! This PR breaks our bot, could you please take a look? Thanks!

Bot: https://lab.llvm.org/buildbot/#/builders/10/builds/9154

VigneshwarJ added a commit to VigneshwarJ/llvm-project that referenced this pull request Jul 20, 2025
…hi values (llvm#139605)"

This relands commit b11523b494b with the fix for llvm-buildbot failures
"clang-hip-vega20" and "openmp-offload-amdgpu-runtime-2". The reland
prevents hoisting the phi node which fixes the issue.

Original PR description:

The order of if and else blocks can introduce unnecessary VGPR copies.
Consider the case of an if-else block where the incoming phi from the
'Else block' only contains zero-cost instructions, and the 'Then' block
modifies some value. There would be no interference when coalescing
because only one value is live at any point before structurization.
However, in the structurized CFG, the Then value is live at 'Else' block
due to the path if→flow→else, leading to additional VGPR copies.

This patch addresses the issue by:
- Identifying PHI nodes with zero-cost incoming values from the Else
block and hoisting those values to the nearest common dominator of the
Then and Else blocks.
- Updating Flow PHI nodes by replacing poison entries (on the if→flow
edge) with the correct hoisted values.
VigneshwarJ added a commit that referenced this pull request Jul 25, 2025
#149744)

…hi values (#139605)"

This relands commit b11523b494b with the fix for llvm-buildbot failures
"clang-hip-vega20" and "openmp-offload-amdgpu-runtime-2". The reland
prevents hoisting the phi node which fixes the issue.

Original PR description:

The order of if and else blocks can introduce unnecessary VGPR copies.
Consider the case of an if-else block where the incoming phi from the
'Else block' only contains zero-cost instructions, and the 'Then' block
modifies some value. There would be no interference when coalescing
because only one value is live at any point before structurization.
However, in the structurized CFG, the Then value is live at 'Else' block
due to the path if→flow→else, leading to additional VGPR copies.

This patch addresses the issue by:
- Identifying PHI nodes with zero-cost incoming values from the Else
block and hoisting those values to the nearest common dominator of the
Then and Else blocks.
- Updating Flow PHI nodes by replacing poison entries (on the if→flow
edge) with the correct hoisted values.
mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Jul 28, 2025
llvm#149744)

…hi values (llvm#139605)"

This relands commit b11523b494b with the fix for llvm-buildbot failures
"clang-hip-vega20" and "openmp-offload-amdgpu-runtime-2". The reland
prevents hoisting the phi node which fixes the issue.

Original PR description:

The order of if and else blocks can introduce unnecessary VGPR copies.
Consider the case of an if-else block where the incoming phi from the
'Else block' only contains zero-cost instructions, and the 'Then' block
modifies some value. There would be no interference when coalescing
because only one value is live at any point before structurization.
However, in the structurized CFG, the Then value is live at 'Else' block
due to the path if→flow→else, leading to additional VGPR copies.

This patch addresses the issue by:
- Identifying PHI nodes with zero-cost incoming values from the Else
block and hoisting those values to the nearest common dominator of the
Then and Else blocks.
- Updating Flow PHI nodes by replacing poison entries (on the if→flow
edge) with the correct hoisted values.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants