[AMDGPU][InstCombine] Fold ballot intrinsic based on llvm.assume hints #160670

TejaX-Alaghari · 2025-09-25T09:27:36Z

This PR aims to leverage the truth value hints provided by llvm.assume intrinsic and propagate them to amdgcn.ballot intrinsic in below scenarios -

Pattern 1: All lanes active

%cmp = icmp eq i32 %x, 0
%ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %cmp)
%all = icmp eq i64 %ballot, -1
call void @llvm.assume(i1 %all)
br i1 %cmp, label %foo, label %bar

; After optimization
br i1 true, label %foo, label %bar  ; %cmp replaced with true

Pattern 2: No lanes active

%cmp = icmp eq i32 %x, 0
%ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %cmp)
%none = icmp eq i64 %ballot, 0
call void @llvm.assume(i1 %none)
br i1 %cmp, label %foo, label %bar

; After optimization
br i1 false, label %foo, label %bar  ; %cmp replaced with false

Pattern 3: Exec mask (all active lanes)

%cmp = icmp eq i32 %x, 0
%ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %cmp)
%exec = call i64 @llvm.amdgcn.ballot.i64(i1 true)
%all = icmp eq i64 %ballot, %exec
call void @llvm.assume(i1 %all)
br i1 %cmp, label %foo, label %bar

; After optimization
br i1 true, label %foo, label %bar  ; %cmp replaced with true

Status: WIP

Gathering feedback on identifying the right pass

github-actions · 2025-09-25T09:27:54Z

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

llvmbot · 2025-09-25T09:28:27Z

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-amdgpu

Author: Teja Alaghari (TejaX-Alaghari)

Changes

Summary

This PR implements an InstCombine optimization that recognizes when AMDGPU ballot intrinsics are used with assumptions about uniformity, specifically the pattern assume(ballot(cmp) == -1).

Problem

In AMDGPU code, developers often use ballot intrinsics to test uniformity of conditions across a wavefront:

%cmp = icmp eq i32 %x, 0
%ballot = call i64 @<!-- -->llvm.amdgcn.ballot.i64(i1 %cmp)
%uniform = icmp eq i64 %ballot, -1
call void @<!-- -->llvm.assume(i1 %uniform)
br i1 %cmp, label %then, label %else

When ballot(cmp) == -1, we know that cmp evaluates to true on all active lanes, making it uniform. However, existing optimizations didn't recognize this pattern, leaving expensive ballot calls and potentially divergent control flow.

Solution

This optimization adds pattern matching in InstCombine's visitCallInst for llvm.assume intrinsics to detect:

assume(icmp eq (ballot(cmp), -1)) patterns
Both i32 and i64 ballot variants (@llvm.amdgcn.ballot.i32 and @llvm.amdgcn.ballot.i64)

When detected, it:

Replaces the ballot condition cmp with ConstantInt::getTrue()
Simplifies the assumption to always true
Enables subsequent passes (like SimplifyCFG) to eliminate dead branches

Example Transformation

Before:

%cmp = icmp eq i32 %x, 0
%ballot = call i64 @<!-- -->llvm.amdgcn.ballot.i64(i1 %cmp)
%uniform = icmp eq i64 %ballot, -1
call void @<!-- -->llvm.assume(i1 %uniform)

After InstCombine:

br i1 true, label %then, label %else

After SimplifyCFG:

; Direct branch to %then, %else eliminated

Full diff: https://github.com/llvm/llvm-project/pull/160670.diff

2 Files Affected:

(modified) llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp (+33)
(added) llvm/test/Transforms/InstCombine/amdgpu-assume-ballot-uniform.ll (+108)

diff --git a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
index 6ad493772d170..c23a4e3dfbaf3 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
@@ -3519,6 +3519,39 @@ Instruction *InstCombinerImpl::visitCallInst(CallInst &CI) {
       }
     }
 
+    // Optimize AMDGPU ballot uniformity assumptions:
+    // assume(icmp eq (ballot(cmp), -1)) implies that cmp is uniform and true
+    // This allows us to optimize away the ballot and replace cmp with true
+    Value *BallotInst;
+    if (match(IIOperand, m_SpecificICmp(ICmpInst::ICMP_EQ, m_Value(BallotInst),
+                                        m_AllOnes()))) {
+      // Check if this is an AMDGPU ballot intrinsic
+      if (auto *BallotCall = dyn_cast<IntrinsicInst>(BallotInst)) {
+        if (BallotCall->getIntrinsicID() == Intrinsic::amdgcn_ballot) {
+          Value *BallotCondition = BallotCall->getArgOperand(0);
+
+          // If ballot(cmp) == -1, then cmp is uniform across all lanes and
+          // evaluates to true We can safely replace BallotCondition with true
+          // since ballot == -1 implies all lanes are true
+          if (BallotCondition->getType()->isIntOrIntVectorTy(1) &&
+              !isa<Constant>(BallotCondition)) {
+
+            // Add the condition to the worklist for further optimization
+            Worklist.pushValue(BallotCondition);
+
+            // Replace BallotCondition with true
+            BallotCondition->replaceAllUsesWith(
+                ConstantInt::getTrue(BallotCondition->getType()));
+
+            // The assumption is now always true, so we can simplify it
+            replaceUse(II->getOperandUse(0),
+                       ConstantInt::getTrue(II->getContext()));
+            return II;
+          }
+        }
+      }
+    }
+
     // If there is a dominating assume with the same condition as this one,
     // then this one is redundant, and should be removed.
     KnownBits Known(1);
diff --git a/llvm/test/Transforms/InstCombine/amdgpu-assume-ballot-uniform.ll b/llvm/test/Transforms/InstCombine/amdgpu-assume-ballot-uniform.ll
new file mode 100644
index 0000000000000..3bf3b317b0771
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/amdgpu-assume-ballot-uniform.ll
@@ -0,0 +1,108 @@
+; RUN: opt < %s -passes=instcombine -S | FileCheck %s
+
+; Test case for optimizing AMDGPU ballot + assume patterns
+; When we assume that ballot(cmp) == -1, we know that cmp is uniform
+; This allows us to optimize away the ballot and directly branch
+
+define void @test_assume_ballot_uniform(i32 %x) {
+; CHECK-LABEL: @test_assume_ballot_uniform(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 true, label [[FOO:%.*]], label [[BAR:%.*]]
+; CHECK:       foo:
+; CHECK-NEXT:    ret void
+; CHECK:       bar:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %cmp = icmp eq i32 %x, 0
+  %ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %cmp)
+  %all = icmp eq i64 %ballot, -1
+  call void @llvm.assume(i1 %all)
+  br i1 %cmp, label %foo, label %bar
+
+foo:
+  ret void
+
+bar:
+  ret void
+}
+
+; Test case with partial optimization - only ballot removal without branch optimization
+define void @test_assume_ballot_partial(i32 %x) {
+; CHECK-LABEL: @test_assume_ballot_partial(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 true, label [[FOO:%.*]], label [[BAR:%.*]]
+; CHECK:       foo:
+; CHECK-NEXT:    ret void
+; CHECK:       bar:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %cmp = icmp eq i32 %x, 0
+  %ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %cmp)
+  %all = icmp eq i64 %ballot, -1
+  call void @llvm.assume(i1 %all)
+  br i1 %cmp, label %foo, label %bar
+
+foo:
+  ret void
+
+bar:
+  ret void
+}
+
+; Negative test - ballot not compared to -1
+define void @test_assume_ballot_not_uniform(i32 %x) {
+; CHECK-LABEL: @test_assume_ballot_not_uniform(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CMP:%.*]] = icmp eq i32 [[X:%.*]], 0
+; CHECK-NEXT:    [[BALLOT:%.*]] = call i64 @llvm.amdgcn.ballot.i64(i1 [[CMP]])
+; CHECK-NEXT:    [[SOME:%.*]] = icmp ne i64 [[BALLOT]], 0
+; CHECK-NEXT:    call void @llvm.assume(i1 [[SOME]])
+; CHECK-NEXT:    br i1 [[CMP]], label [[FOO:%.*]], label [[BAR:%.*]]
+; CHECK:       foo:
+; CHECK-NEXT:    ret void
+; CHECK:       bar:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %cmp = icmp eq i32 %x, 0
+  %ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %cmp)
+  %some = icmp ne i64 %ballot, 0
+  call void @llvm.assume(i1 %some)
+  br i1 %cmp, label %foo, label %bar
+
+foo:
+  ret void
+
+bar:
+  ret void
+}
+
+; Test with 32-bit ballot
+define void @test_assume_ballot_uniform_i32(i32 %x) {
+; CHECK-LABEL: @test_assume_ballot_uniform_i32(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 true, label [[FOO:%.*]], label [[BAR:%.*]]
+; CHECK:       foo:
+; CHECK-NEXT:    ret void
+; CHECK:       bar:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %cmp = icmp eq i32 %x, 0
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %cmp)
+  %all = icmp eq i32 %ballot, -1  
+  call void @llvm.assume(i1 %all)
+  br i1 %cmp, label %foo, label %bar
+
+foo:
+  ret void
+
+bar:
+  ret void
+}
+
+declare i64 @llvm.amdgcn.ballot.i64(i1)
+declare i32 @llvm.amdgcn.ballot.i32(i1)
+declare void @llvm.assume(i1)

arsenm · 2025-09-25T09:37:31Z

I don't understand why this is being folded in the context of the assume argument. If we want to do that fold, this is probably not the right place for it. The assume isn't doing anything here

In AMDGPU code, developers often use ballot intrinsics to test uniformity of conditions across a wavefront:

Do they? Does this actually do anything? I attempted to make use of this in https://reviews.llvm.org/D137142, but it was never completed. Did that get separately implemented?

assume(icmp eq (ballot(cmp), -1))

Isn't the ballot result anded with exec? Such that this assume is only correct for full dispatches at the entry to the kernel?

PankajDwivedi-25 · 2025-09-25T10:29:47Z

It's either way; here, uniformity will report values as divergent because it propagates the uniformity forward. The presence of compiler hint @llvm.assume here makes its operand True, which we know will hold for all the lanes. Hence, there is scope for pattern optimization if the @llvm.assume operand is derived from some operation that is used in the branch, because the branch will always choose the same path, which will be statically known in the presence of @llvm.assume.

PankajDwivedi-25 · 2025-09-25T10:32:11Z

This work should be part of the general pattern optimization, not sure if we have some dedicated pass instCombine for AMDGPU @arsenm.

PankajDwivedi-25 · 2025-09-26T11:48:12Z

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp

+          // evaluates to true We can safely replace BallotCondition with true
+          // since ballot == -1 implies all lanes are true
+          if (BallotCondition->getType()->isIntOrIntVectorTy(1) &&
+              !isa<Constant>(BallotCondition)) {


The ballot here can not be assumed to be uniform; the right way would be to query the UI, and similar work is already been done in patch #116953.

The point here is that even UI reports it as divergent because it doesn't consider the presence of @llvm.assume later.

better we apply after considering the presence of @llvm.assume.

Updated the code accordingly now.

arsenm

Previous questions unanswered. I don't understand why this would be performed in the context of an assume, and not on the free-form pattern. Also, I would expect the assume to inform uniformity analysis, not just be dropped here

PankajDwivedi-25 · 2025-09-29T10:18:27Z

assume(icmp eq (ballot(cmp), -1))

Isn't the ballot result anded with exec?

I don't know why this point is relevant here; all I can see is that the ballot result must be true for all lanes.

Such that this assume is only correct for full dispatches at the entry to the kernel?

I thought we just assume its operand is true irrespective of the context where it is present.

TejaX-Alaghari · 2025-09-29T12:21:39Z

Hey @arsenm,

Thanks for your quick response :)

The current PR description doesn't make justice to the problem that's being solved! which has caused a lot of confusion (my bad). I'll mark the PR as a draft until the implementation details are ironed out!

Let me re-iterate the problem statement and the approach I'm aiming for, in my view -

Per Uniformity Analysis (UA), %cmp value is considered divergent in the example kernel above. But llvm.assume flips that assumption if we follow backwards through the use-def chain (%uniform -> %ballot -> %cmp)

I don't understand why this is being folded in the context of the assume argument.

The presence of llvm.assume line makes us safely infer that %uniform, %ballot & %cmp are uniform. Which is why, the implementation is designed to fold relying on llvm.assume.

I attempted to make use of this in https://reviews.llvm.org/D137142, but it was never completed. Did that get separately implemented?

I'm not sure whether it got implemented (these patterns don't currently exist in meaningful numbers). The optimization I'm proposing is based on theoretical utility that the pattern could be useful for explicit uniformity assertions.

Also, I would expect the assume to inform uniformity analysis, not just be dropped here

Currently, there is a limitation with the UA that it cannot propagate the assumptions backward (like uniform -> ballot -> cmp). So, this implementation is trying to play the role of a temporary patch which can be removed when this functionality is implemented in the UA.

ssahasra · 2025-10-01T08:48:25Z

Isn't the ballot result anded with exec? Such that this assume is only correct for full dispatches at the entry to the kernel?

AMDGPU backend only cares about uniformity within the active lanes of a wave. I didn't really understand the second sentence.

ssahasra · 2025-10-01T08:53:38Z

I don't see why uniformity should be mentioned anywhere in this patch at all. If the result of a ballot is "-1", then its operand is "1" in every lane of the wave. But that's all that is needed. The further implication that the operand is uniform is not relevant to this optimization. It's an unnecessary conflation between the facts implied by an assume on the one hand and uniformity implied by those facts on the other hand. The latter is not needed and should be removed from this patch.

ssahasra · 2025-10-01T08:55:44Z

Isn't the ballot result anded with exec? Such that this assume is only correct for full dispatches at the entry to the kernel?

AMDGPU backend only cares about uniformity within the active lanes of a wave. I didn't really understand the second sentence.

On second thought, I think I know what you mean. Inside a divergent branch, if the program wants to check if a value is uniform for the active lanes, the result of the ballot will be checked against execmask and not "-1". That pattern should also be added to this patch.

nikic

InstCombine is not really the right place for this kind of transform, or at least not in this form.

If you want to propagate the condition being true or an icmp equality, the place for that would be the equality propagation in GVN.

TejaX-Alaghari · 2025-10-02T07:43:59Z

Updated Implementation - Addressing Reviewer Feedback

Thank you all for the valuable feedback! I've significantly revised the implementation based on your comments, particularly @ssahasra's critical insight about avoiding uniformity concepts.

Key Changes Made

1. Removed All Uniformity Concepts ✅
Following @ssahasra's guidance, the new implementation uses pure mathematical facts without any uniformity analysis:

Removed: Complex optimizeAssumedUniformValues() framework
Removed: All uniformity terminology and concepts
Added: Two focused, fact-based optimizations

2. Simple Mathematical Facts Only

The implementation now relies solely on basic mathematical properties:

// Optimization 1: Basic equality
// assume(x == 42) → dominated uses of x become 42
// Pure fact: if we assume x equals a constant, we can use that constant

// Optimization 2: AMDGPU ballot property  
// assume(ballot(cmp) == -1) → dominated uses of cmp become true
// Pure fact: ballot == -1 means cmp is true in all lanes (no uniformity needed)

3. Dominance-Based Safety

Both optimizations use isValidAssumeForContext() to ensure we only replace uses that are dominated by the assume, making the transformation safe and correct.

Testing

The implementation covers two independent patterns:

Generic equality pattern (works for any target):

assume(icmp eq %x, 42)
%use = add i32 %x, 1  ; → add i32 42, 1

AMDGPU ballot pattern (target-specific but using generic approach):

assume(icmp eq (ballot(%cmp), -1))
br i1 %cmp  ; → br i1 true

Addressing Specific Concerns

@ssahasra's requirement: Fully addressed - no uniformity concepts, just mathematical facts

@arsenm's exec mask concern: The mathematical fact holds regardless - if ballot(cmp) == -1 is assumed true, then cmp must be true in all lanes that contributed to that ballot result

Open Question: Architecture

@nikic, I'd like to understand the concern regarding InstCombine being the right place for this optimization better.

My reasoning for InstCombine:

InstCombine already processes assume intrinsics extensively
Early optimization enables constant propagation that benefits later passes
Simple pattern matching straightforward rewrites, avoiding complex dataflow analysis
InstCombine already does similar assume-based optimizations (see simple test case)

However, if the consensus is that equality propagation from assumes belongs in GVN, I'm happy to move it there. Could you clarify:

Is the concern specific to the assume-based equality propagation in general?
Or is it about the AMDGPU-specific ballot pattern being in generic InstCombine?
Would splitting this into two separate patches help (generic equality in GVN, ballot pattern in InstCombine)?

Summary

The implementation is now clean, focused, and uses only simple mathematical facts without uniformity concepts. I believe it addresses the main technical concerns, but I'd appreciate guidance on the architectural question before proceeding.

ssahasra

Still need to answer whether this is the right place for doing this transform. Also, another feature request for future work:

If ballot == -1 appears inside a conditional branch, then UA should assume that the branch condition is uniform.

This is a whole new enhancement, and not in scope for the current patch.

ssahasra · 2025-10-05T03:44:53Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

@@ -1341,6 +1336,15 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
      Call->takeName(&II);
      return IC.replaceInstUsesWith(II, Call);
    }
+
+    if (auto *Src = dyn_cast<ConstantInt>(Arg)) {


Why was this moved?

The ballot(false) → 0 optimization moved to execute after Wave32 conversion instead of before. This ensures:

Wave32 conversion happens first (i64 → i32 ballot + zext)

Then constant folding applies
Which feels like a more consistent optimization order

Is this the right approach, or should it stay before Wave32 conversion?

ssahasra · 2025-10-05T03:45:51Z

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp

@@ -3540,6 +3540,79 @@ Instruction *InstCombinerImpl::visitCallInst(CallInst &CI) {
      }
    }

+    // Basic assume equality optimization: assume(x == c) -> replace dominated uses of x with c


Suggested change

// Basic assume equality optimization: assume(x == c) -> replace dominated uses of x with c

// Basic assume equality optimization: assume(x == c) -> replace uses of x with c

LLVM IR is SSA ... a value always dominates all its uses.

Of course uses of x are dominated by the definition of x. This comment refers to uses of x dominated by the "assume".

Ouch! I totally missed that. Now I see that isValidAssumeForContext() actually checks dominance to make sure that the assume applies at the use of x.

ssahasra · 2025-10-05T03:47:49Z

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp

@@ -3540,6 +3540,79 @@ Instruction *InstCombinerImpl::visitCallInst(CallInst &CI) {
      }
    }

+    // Basic assume equality optimization: assume(x == c) -> replace dominated uses of x with c
+    if (auto *ICmp = dyn_cast<ICmpInst>(IIOperand)) {
+      if (ICmp->getPredicate() == ICmpInst::ICMP_EQ) {


Out of curiousity, does LLVM already have utilities to match patterns like this in the IR? Seems like rather repetitive work.

Currently using PatternMatch.h (m_SpecificICmp, m_Value, m_AllOnes). Are you aware of higher-level utilities that could simplify this? Happy to refactor if there's a better approach!

Can the pattern match be used here at this point?

ssahasra · 2025-10-05T03:49:55Z

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp

+        if (Variable && ConstantVal && Variable->hasUseList()) {
+          SmallVector<Use *, 8> DominatedUses;
+          for (Use &U : Variable->uses()) {
+            if (auto *UseInst = dyn_cast<Instruction>(U.getUser())) {


Out of curiosity, when a user of a value not an instruction? Should this be a static cast?

Using dyn_cast<Instruction> filters for instructions since we need the dominance check. Should I add a comment explaining this?

Rephrasing, I believe that the dyn_cast will always succeed, and hence it can be replaced with a cast. But I do need confirmation that my belief is right. Also, what dominance check?

ssahasra · 2025-10-05T03:55:05Z

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp

+          SmallVector<Use *, 8> DominatedUses;
+          for (Use &U : Variable->uses()) {
+            if (auto *UseInst = dyn_cast<Instruction>(U.getUser())) {
+              if (UseInst != II && UseInst != ICmp &&


We are deep inside a conditional where II is a call to @llvm.assume and its only operand is ICmp. I don't think we need to check if UseInst is II.

Removed UseInst != II check in the ICmp block (we're deep inside a conditional where II is the assume)

Kept UseInst != ICmp to avoid replacing uses in the comparison itself

Kept UseInst != IntrCall in ballot block to avoid replacing uses in the ballot call

ssahasra · 2025-10-05T03:56:58Z

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp

+
+          for (Use *U : DominatedUses) {
+            U->set(ConstantVal);
+            Worklist.pushValue(U->getUser());


I don't know how the whole traversal works, but do we need to push Variable as well as its uses? Will that result in duplicate work?

We push both:

U->getUser(): Instructions now using constants (enables further simplification)

Variable: The modified value (allows DCE if it becomes dead)

The worklist uses a set-based structure that deduplicates, so no duplicate work. Should I add a comment?

Deduplication at insertion into a set is itself duplicate work. If the uses will eventually be visited anyway, they should not be pushed here.

ssahasra · 2025-10-05T03:59:28Z

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp

+          Value *BallotArg = IntrCall->getArgOperand(0);
+          if (BallotArg->getType()->isIntegerTy(1) && BallotArg->hasUseList()) {
+            // Find dominated uses and replace with true
+            SmallVector<Use *, 8> DominatedUses;


Suggested change

SmallVector<Use *, 8> DominatedUses;

SmallVector<Use *, 8> Uses;

Everywhere in the code, dominated is always true and shouldn't be included in the name.

Updated comments: "replace uses of x with c" (not "replace dominated uses")

Renamed variables: Uses instead of DominatedUses

Updated test comments

ssahasra · 2025-10-05T04:02:52Z

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp

+    }
+
+    // Optimize AMDGPU ballot patterns in assumes:
+    // assume(ballot(cmp) == -1) means cmp is true on all active lanes


I am doubtful about the value of this comparison in real life. I would expect that in the vast majority of programs, the actual comparison is with execmask and not -1. If that is the case, this part needs to be enhanced to cover execmask.

You're right! Inside divergent branches, code would use:

%ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %cmp) %uniform = icmp eq i64 %ballot, %exec call void @llvm.assume(i1 %uniform)

@ssahasra, Can you please guide me in figuring out how to access the exec mask value in InstCombine?

ssahasra · 2025-10-05T04:05:00Z

llvm/test/Transforms/InstCombine/assume.ll

@@ -1034,6 +1034,35 @@ define i1 @neg_assume_trunc_eq_one(i8 %x) {
  ret i1 %q
 }

+; Test AMDGPU ballot pattern optimization  
+; assume(ballot(cmp) == -1) means cmp is true on all active lanes
+; so dominated uses of cmp can be replaced with true


Really please remove "dominated" from everywhere.

jayfoad

Looks good overall.

jayfoad · 2025-11-11T11:56:08Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+    // assume(ballot(x) == ballot(true)) -> x = true
+    // assume(ballot(x) == -1)           -> x = true
+    // assume(ballot(x) == 0)            -> x = false
+    if (Arg->getType()->isIntegerTy(1)) {


Don't need to check this. It is always true.

jayfoad · 2025-11-11T12:00:24Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+        } else if (match(CompareValue,
+                         m_Intrinsic<Intrinsic::amdgcn_ballot>(m_Zero()))) {


Don't need to do this since you can rely on InstCombine combining ballot(0) -> 0.

Similar to ballot(true), I tried to cover ballot(false) as another trivial case.
But if it gets folded to 0, then this logic is not necessary. I'll remove it.

jayfoad · 2025-11-11T12:06:23Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+        Arg->replaceUsesWithIf(ReplacementValue, [&](Use &U) {
+          Instruction *UserInst = dyn_cast<Instruction>(U.getUser());
+          bool Dominates =
+              UserInst && IC.getDominatorTree().dominates(Assume, U);


Don't quite understand why you need to check UserInst here. And if you do need to check it, do you also need to check that it belongs to the current function?

Added this to explicitly check for replacing the uses which are Instructions.
Because, I got a crash inside the dominates function here when the use U is not an Instruction.

I think it would be better to check up-front that Arg is either an Argument or an Instruction. We don't want to waste time doing this optimization if Arg is something else like a Constant.

jayfoad · 2025-11-11T12:08:23Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+    // assume(ballot(x) == ballot(true)) -> x = true
+    // assume(ballot(x) == -1)           -> x = true
+    // assume(ballot(x) == 0)            -> x = false


We allow ballot.i32 in wave64 mode, which would break these optimizations.

I was not aware of this! thanks for pointing out.

We allow ballot.i32 in wave64 mode, which would break these optimizations.

If you're referring to this condition in Instruction selection, doesn't it mean that "emitting i64 ballots in wave32 mode is supported" and not the other way around.

Please let me know if I'm missing something in understanding this logic.

Hmm. I thought there was also a plan to support "emitting i32 ballots in wave64 mode". Maybe that part was never implemented. I guess your patch is OK then, but personally I would add a check that the ballot size matches the wave size just for safety.

I added a check in a recent commit for proceeding with this optimization only when ballotWidth >= waveSize.

If you are going to all that trouble, then the tests should check all four combinations: i32 ballot in wave32, i32 ballot in wave64, etc.

jayfoad · 2025-11-11T12:08:56Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+          else if (CI->isZero())
+            InferredCondValue = false;
+        } else if (match(CompareValue,
+                         m_Intrinsic<Intrinsic::amdgcn_ballot>(m_One()))) {


Again you need to guard against ballot.i32 in wave64 mode.

PankajDwivedi-25 · 2025-11-12T13:01:42Z

llvm/test/Transforms/InstCombine/AMDGPU/llvm.amdgcn.ballot-assume-wave32.ll

+  %cmp = icmp eq i32 %x, 0
+  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %cmp)
+  %all = icmp eq i32 %ballot, -1
+  call void @llvm.assume(i1 %all)
+  br i1 %cmp, label %foo, label %bar
+
+foo:


should we add a run line for end-to-end? Which shows the branch that will not be considered will be eliminated? also, the redundant code in the function.

Added a separate case for E2E scenario. Please review and let me know if it matches the expected impact.

PankajDwivedi-25 · 2025-11-12T13:03:03Z

LGTM, except the above review if it's relevant here.

arsenm

I don't understand why this where it is. Why is this starting at the ballot, then looking at the downstream assume uses for assumes?

arsenm · 2025-11-13T02:49:26Z

llvm/test/Transforms/InstCombine/AMDGPU/llvm.amdgcn.ballot-assume-wave32.ll

@@ -0,0 +1,434 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt < %s -mtriple=amdgcn-amd-amdhsa -mattr=+wavefrontsize32 -passes=instcombine -S | FileCheck %s
+; RUN: opt < %s -mtriple=amdgcn-amd-amdhsa -mattr=+wavefrontsize32 -O2 -S | FileCheck %s --check-prefix=O2


This should not be running -O2

arsenm · 2025-11-13T02:54:12Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+        if (CI->isMinusOne())
+          InferredCondValue = true;
+        // ballot(x) == 0 means all lanes have x = false.
+        else if (CI->isZero())


Braces and fix indent of comment

arsenm · 2025-11-13T02:54:44Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+
+      // Extract the ballot and the value being compared against it.
+      Value *LHS = ICI->getOperand(0), *RHS = ICI->getOperand(1);
+      Value *CompareValue = (LHS == &II) ? RHS : (RHS == &II) ? LHS : nullptr;


Use the commutative icmp matcher?

arsenm · 2025-11-13T04:09:24Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+      }
+
+      if (!InferredCondValue)


else

Suggested change

}

if (!InferredCondValue)

} else

continue

Don't need InferredCondValue to be optional

arsenm · 2025-11-13T04:10:08Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+      break;
+
+    // Skip if ballot width < wave size (e.g., ballot.i32 on wave64).
+    if (ST->isWave64() && II.getType()->getIntegerBitWidth() == 32)


Suggested change

if (ST->isWave64() && II.getType()->getIntegerBitWidth() == 32)

if (WavefrontSize != II.getType()->getIntegerBitWidth())

TejaX-Alaghari · 2025-11-13T11:28:32Z

I don't understand why this where it is. Why is this starting at the ballot, then looking at the downstream assume uses for assumes?

Another approach I explored was to visit llvm.assume first and then check if its arg is a amdgcn.ballot. But this would require knowledge about target specific intrinsic (amdgcn.ballot) in generic assume code using a TTI callback.

Also, the approach of visiting the icmp condition (%cmp = icmp eq i32 %x, 0) first and then checking if each of those instances is used by ballot with assume didn't look efficient.

So, I decided to go with the current approach.

If you have any better suggestion for me to explore, please let me know!

jayfoad · 2025-11-13T12:22:45Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+    // Skip if Arg is not an instruction (e.g., constant, argument).
+    if (!isa<Instruction>(Arg))


I think it would make sense to do this for Arguments? Should have a test case for that.

When we encounter assume(ballot(cmp) == -1), we know that cmp is uniform across all lanes and evaluates to true. This optimization recognizes this pattern and replaces the condition with a constant true, allowing subsequent passes to eliminate dead code and optimize control flow. The optimization handles both i32 and i64 ballot intrinsics and only applies when the ballot result is compared against -1 (all lanes active). This is a conservative approach that ensures correctness while enabling significant optimizations for uniform control flow patterns.

Address reviewer feedback by implementing free-form ballot intrinsic optimization instead of assume-dependent patterns. This approach: 1. Optimizes ballot(constant) directly as a standard intrinsic optimization 2. Allows uniformity analysis to handle assumes through proper channels 3. Follows established AMDGPU intrinsic patterns (amdgcn_cos, amdgcn_sin) 4. Enables broader optimization opportunities beyond assume contexts Optimizations: - ballot(true) -> -1 (all lanes active) - ballot(false) -> 0 (no lanes active) This addresses the core reviewer concern about performing optimization in assume context rather than as a free-form pattern, and lets the uniformity analysis framework handle assumes as intended. Test cases focus on constant folding rather than assume-specific patterns, demonstrating the more general applicability of this approach.

Implement a comprehensive generic optimization for assume intrinsics that extracts uniformity information and optimizes dominated uses. The optimization recognizes multiple patterns that establish value uniformity and replaces dominated uses with uniform constants. Addresses uniformity analysis optimization opportunities identified in AMDGPU ballot/readfirstlane + assume patterns for improved code generation through constant propagation.

This commit implements two targeted optimizations for assume intrinsics: 1. Basic equality optimization: assume(x == c) replaces dominated uses of x with c 2. AMDGPU ballot optimization: assume(ballot(cmp) == -1) replaces dominated uses of cmp with true, since ballot == -1 means cmp is true on all active lanes Key design principles: - No uniformity analysis concepts - uses simple mathematical facts - Dominance-based replacement for correctness - Clean pattern matching without complex framework - Addresses reviewer feedback to keep it simple and focused Examples: assume(x == 42); use = add x, 1 → use = 43 assume(ballot(cmp) == -1); br cmp → br true This enables better optimization of GPU code patterns while remaining architecture-agnostic through the mathematical properties of the operations.

- Remove 'dominated' terminology from comments and variable names (SSA values always dominate their uses) - Rename DominatedUses -> Uses throughout - Remove redundant UseInst != II check in ICmp block - Fix code formatting (clang-format) - Split long comment lines - Remove extra blank lines at EOF

- Remove redundant const propagration (assume equality opt) from InstCombine. - Moved assume(ballot(cmp) == -1) optimization from InstCombine to GVN.

…ethod

1. Add logic to handle swapped operands in icmp 2. Introduce priliminary logic for identifying an exec mask 3. Add a separate test file and include comprehensive cases for ballot with assume

…nsic.cpp

…n Ballot arg is modified

…es wave sixe 2. Modify logic to make the InferredCondValue non-optional 3. Simplified cmp arg match logic 4. Added missing braces and comments indents 5. Removed O2 checks from the test cases

TejaX-Alaghari requested a review from nikic as a code owner September 25, 2025 09:27

llvmbot added backend:AMDGPU llvm:instcombine Covers the InstCombine, InstSimplify and AggressiveInstCombine passes llvm:transforms labels Sep 25, 2025

arsenm requested a review from PankajDwivedi-25 September 25, 2025 09:29

PankajDwivedi-25 reviewed Sep 26, 2025

View reviewed changes

arsenm requested changes Sep 29, 2025

View reviewed changes

TejaX-Alaghari marked this pull request as draft September 29, 2025 12:21

TejaX-Alaghari changed the title ~~[InstCombine] Optimize AMDGPU ballot + assume uniformity patterns~~ [WIP][Uniformity Analysis][Assume] Optimize AMDGPU ballot + assume uniformity patterns Sep 29, 2025

TejaX-Alaghari force-pushed the DA_Assume branch 3 times, most recently from 45e5946 to 0df4a7d Compare September 30, 2025 12:01

TejaX-Alaghari changed the title ~~[WIP][Uniformity Analysis][Assume] Optimize AMDGPU ballot + assume uniformity patterns~~ [WIP][Uniformity Analysis][Assume] Generic assume-based uniformity optimization Sep 30, 2025

nikic reviewed Oct 1, 2025

View reviewed changes

TejaX-Alaghari force-pushed the DA_Assume branch from 0df4a7d to 6999ef9 Compare October 2, 2025 07:44

TejaX-Alaghari changed the title ~~[WIP][Uniformity Analysis][Assume] Generic assume-based uniformity optimization~~ [WIP][Assume] Generic assume-based uniformity optimization Oct 2, 2025

TejaX-Alaghari force-pushed the DA_Assume branch from 6999ef9 to 770a011 Compare October 3, 2025 06:47

TejaX-Alaghari marked this pull request as ready for review October 3, 2025 06:47

ssahasra reviewed Oct 5, 2025

View reviewed changes

TejaX-Alaghari marked this pull request as ready for review November 11, 2025 12:02

jayfoad reviewed Nov 11, 2025

View reviewed changes

TejaX-Alaghari force-pushed the DA_Assume branch from e434695 to 08e5759 Compare November 11, 2025 17:48

PankajDwivedi-25 reviewed Nov 12, 2025

View reviewed changes

PankajDwivedi-25 approved these changes Nov 12, 2025

View reviewed changes

TejaX-Alaghari force-pushed the DA_Assume branch from 08e5759 to 987a0c1 Compare November 12, 2025 16:06

arsenm requested changes Nov 13, 2025

View reviewed changes

arsenm reviewed Nov 13, 2025

View reviewed changes

TejaX-Alaghari force-pushed the DA_Assume branch from 987a0c1 to 679f7b0 Compare November 13, 2025 10:45

jayfoad reviewed Nov 13, 2025

View reviewed changes

TejaX-Alaghari added 16 commits November 13, 2025 19:11

Address feedback on the location of the opt

b7a013c

- Remove redundant const propagration (assume equality opt) from InstCombine. - Moved assume(ballot(cmp) == -1) optimization from InstCombine to GVN.

Refactored the ballot optimization condition into propagateEquality m…

1fbc805

…ethod

Implement reviewer's suggestions to -

c7a1f6a

1. Add logic to handle swapped operands in icmp 2. Introduce priliminary logic for identifying an exec mask 3. Add a separate test file and include comprehensive cases for ballot with assume

Moved the assume based ballot folding logic to AMDGPUInstCombineIntri…

08b6db2

…nsic.cpp

Simplified the logic to check for CompareValue and return nullptr whe…

20ad9d2

…n Ballot arg is modified

Skip the optimization when ballot size < wave size

5e3d47e

Only proceed with the optiization if ballot arg is an instruction

9b6e186

Add additional check to demonstrate E2E impact of this optimization

0312bf5

Address feedback:1. Add the condition to make sure ballot width match…

bc961db

…es wave sixe 2. Modify logic to make the InferredCondValue non-optional 3. Simplified cmp arg match logic 4. Added missing braces and comments indents 5. Removed O2 checks from the test cases

Removed unnecessary temp varaible for holding wave front size

f4f1277

Skip only constant args

b19b765

TejaX-Alaghari force-pushed the DA_Assume branch from eb4a469 to b19b765 Compare November 13, 2025 13:42

	// Basic assume equality optimization: assume(x == c) -> replace dominated uses of x with c
	// Basic assume equality optimization: assume(x == c) -> replace uses of x with c

	SmallVector<Use *, 8> DominatedUses;
	SmallVector<Use *, 8> Uses;

		} else if (match(CompareValue,
		m_Intrinsic<Intrinsic::amdgcn_ballot>(m_Zero()))) {

	if (ST->isWave64() && II.getType()->getIntegerBitWidth() == 32)
	if (WavefrontSize != II.getType()->getIntegerBitWidth())

		// Skip if Arg is not an instruction (e.g., constant, argument).
		if (!isa<Instruction>(Arg))

[AMDGPU][InstCombine] Fold ballot intrinsic based on llvm.assume hints #160670

Are you sure you want to change the base?

[AMDGPU][InstCombine] Fold ballot intrinsic based on llvm.assume hints #160670

Conversation

TejaX-Alaghari commented Sep 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Sep 25, 2025

Uh oh!

llvmbot commented Sep 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Example Transformation

Uh oh!

arsenm commented Sep 25, 2025

Uh oh!

PankajDwivedi-25 commented Sep 25, 2025

Uh oh!

PankajDwivedi-25 commented Sep 25, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arsenm left a comment

Choose a reason for hiding this comment

Uh oh!

PankajDwivedi-25 commented Sep 29, 2025

Uh oh!

TejaX-Alaghari commented Sep 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ssahasra commented Oct 1, 2025

Uh oh!

ssahasra commented Oct 1, 2025

Uh oh!

ssahasra commented Oct 1, 2025

Uh oh!

nikic left a comment

Choose a reason for hiding this comment

Uh oh!

TejaX-Alaghari commented Oct 2, 2025

Updated Implementation - Addressing Reviewer Feedback

Key Changes Made

Testing

Addressing Specific Concerns

Open Question: Architecture

Summary

Uh oh!

ssahasra left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TejaX-Alaghari Oct 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TejaX-Alaghari commented Sep 25, 2025 •

edited

Loading

llvmbot commented Sep 25, 2025 •

edited

Loading

TejaX-Alaghari commented Sep 29, 2025 •

edited

Loading

TejaX-Alaghari Oct 5, 2025 •

edited

Loading