Conversation

PankajDwivedi-25
Contributor

@PankajDwivedi-25 PankajDwivedi-25 commented Nov 20, 2024

This pass introduces optimizations for AMDGPU intrinsics by leveraging the uniformity of their arguments. When an intrinsic's arguments are detected as uniform, redundant computations are eliminated, and the intrinsic calls are simplified accordingly.

By utilizing the UniformityInfo analysis, this pass identifies cases where intrinsic calls are uniform across all lanes, allowing transformations that reduce unnecessary operations and improve the IR's efficiency.

These changes enhance performance by streamlining intrinsic usage in uniform scenarios without altering the program's semantics.
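
As a rough illustration (a hand-written sketch, not one of the tests in this patch; the function and value names are made up), a lane intrinsic whose operand is already uniform, e.g. an SGPR kernel argument, can simply be replaced by that operand:

  declare i32 @llvm.amdgcn.readfirstlane.i32(i32)

  define amdgpu_kernel void @fold_uniform_readfirstlane(ptr addrspace(1) %out, i32 %src) {
    ; %src is a kernel argument, so UniformityInfo reports it as uniform.
    %v = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %src)
    store i32 %v, ptr addrspace(1) %out
    ret void
  }

  ; After the pass, every use of %v is replaced by %src, leaving the call dead:
  ;   store i32 %src, ptr addrspace(1) %out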

For background, see PR #99878

@llvmbot
Member

llvmbot commented Nov 20, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Pankaj Dwivedi (PankajDwivedi-25)

Changes

As per the draft PR #99878 discussion, it turned out that having UniformityAnalysis after InstCombine may not fully utilize all the scopes, but it is not going to cause dependency troubles.

This pass allows folding of lane* intrinsics for a subset of patterns.

Co-authored by @Acim-Maravic


Patch is 26.18 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/116953.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+30-19)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+23-20)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUUniformIntrinsicCombine.cpp (+184)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (added) llvm/test/CodeGen/AMDGPU/amdgpu-uniform-intrinsic-combine.ll (+221)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 95d0ad0f9dc96a..ca72ac6d3670ce 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -92,7 +92,7 @@ class SILowerI1CopiesPass : public PassInfoMixin<SILowerI1CopiesPass> {
 
 void initializeAMDGPUDAGToDAGISelLegacyPass(PassRegistry &);
 
-void initializeAMDGPUAlwaysInlinePass(PassRegistry&);
+void initializeAMDGPUAlwaysInlinePass(PassRegistry &);
 
 Pass *createAMDGPUAnnotateKernelFeaturesPass();
 Pass *createAMDGPUAttributorLegacyPass();
@@ -226,11 +226,11 @@ extern char &GCNRegPressurePrinterID;
 
 // Passes common to R600 and SI
 FunctionPass *createAMDGPUPromoteAlloca();
-void initializeAMDGPUPromoteAllocaPass(PassRegistry&);
+void initializeAMDGPUPromoteAllocaPass(PassRegistry &);
 extern char &AMDGPUPromoteAllocaID;
 
 FunctionPass *createAMDGPUPromoteAllocaToVector();
-void initializeAMDGPUPromoteAllocaToVectorPass(PassRegistry&);
+void initializeAMDGPUPromoteAllocaToVectorPass(PassRegistry &);
 extern char &AMDGPUPromoteAllocaToVectorID;
 
 struct AMDGPUPromoteAllocaPass : PassInfoMixin<AMDGPUPromoteAllocaPass> {
@@ -299,7 +299,7 @@ class AMDGPULateCodeGenPreparePass
   const GCNTargetMachine &TM;
 
 public:
-  AMDGPULateCodeGenPreparePass(const GCNTargetMachine &TM) : TM(TM) {};
+  AMDGPULateCodeGenPreparePass(const GCNTargetMachine &TM) : TM(TM){};
   PreservedAnalyses run(Function &, FunctionAnalysisManager &);
 };
 
@@ -325,7 +325,7 @@ class AMDGPUAttributorPass : public PassInfoMixin<AMDGPUAttributorPass> {
 
 public:
   AMDGPUAttributorPass(TargetMachine &TM, AMDGPUAttributorOptions Options = {})
-      : TM(TM), Options(Options) {};
+      : TM(TM), Options(Options){};
   PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
 };
 
@@ -339,7 +339,7 @@ class AMDGPUAnnotateUniformValuesPass
 FunctionPass *createAMDGPUAnnotateUniformValuesLegacy();
 
 ModulePass *createAMDGPUPrintfRuntimeBinding();
-void initializeAMDGPUPrintfRuntimeBindingPass(PassRegistry&);
+void initializeAMDGPUPrintfRuntimeBindingPass(PassRegistry &);
 extern char &AMDGPUPrintfRuntimeBindingID;
 
 void initializeAMDGPUResourceUsageAnalysisPass(PassRegistry &);
@@ -350,15 +350,15 @@ struct AMDGPUPrintfRuntimeBindingPass
   PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
 };
 
-ModulePass* createAMDGPUUnifyMetadataPass();
-void initializeAMDGPUUnifyMetadataPass(PassRegistry&);
+ModulePass *createAMDGPUUnifyMetadataPass();
+void initializeAMDGPUUnifyMetadataPass(PassRegistry &);
 extern char &AMDGPUUnifyMetadataID;
 
 struct AMDGPUUnifyMetadataPass : PassInfoMixin<AMDGPUUnifyMetadataPass> {
   PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
 };
 
-void initializeSIOptimizeExecMaskingPreRAPass(PassRegistry&);
+void initializeSIOptimizeExecMaskingPreRAPass(PassRegistry &);
 extern char &SIOptimizeExecMaskingPreRAID;
 
 void initializeSIOptimizeVGPRLiveRangePass(PassRegistry &);
@@ -367,7 +367,7 @@ extern char &SIOptimizeVGPRLiveRangeID;
 void initializeAMDGPUAnnotateUniformValuesLegacyPass(PassRegistry &);
 extern char &AMDGPUAnnotateUniformValuesLegacyPassID;
 
-void initializeAMDGPUCodeGenPreparePass(PassRegistry&);
+void initializeAMDGPUCodeGenPreparePass(PassRegistry &);
 extern char &AMDGPUCodeGenPrepareID;
 
 void initializeAMDGPURemoveIncompatibleFunctionsPass(PassRegistry &);
@@ -400,10 +400,10 @@ class SIAnnotateControlFlowPass
 void initializeSIAnnotateControlFlowLegacyPass(PassRegistry &);
 extern char &SIAnnotateControlFlowLegacyPassID;
 
-void initializeSIMemoryLegalizerPass(PassRegistry&);
+void initializeSIMemoryLegalizerPass(PassRegistry &);
 extern char &SIMemoryLegalizerID;
 
-void initializeSIModeRegisterPass(PassRegistry&);
+void initializeSIModeRegisterPass(PassRegistry &);
 extern char &SIModeRegisterID;
 
 void initializeAMDGPUInsertDelayAluPass(PassRegistry &);
@@ -412,25 +412,25 @@ extern char &AMDGPUInsertDelayAluID;
 void initializeSIInsertHardClausesPass(PassRegistry &);
 extern char &SIInsertHardClausesID;
 
-void initializeSIInsertWaitcntsPass(PassRegistry&);
+void initializeSIInsertWaitcntsPass(PassRegistry &);
 extern char &SIInsertWaitcntsID;
 
-void initializeSIFormMemoryClausesPass(PassRegistry&);
+void initializeSIFormMemoryClausesPass(PassRegistry &);
 extern char &SIFormMemoryClausesID;
 
-void initializeSIPostRABundlerPass(PassRegistry&);
+void initializeSIPostRABundlerPass(PassRegistry &);
 extern char &SIPostRABundlerID;
 
 void initializeGCNCreateVOPDPass(PassRegistry &);
 extern char &GCNCreateVOPDID;
 
-void initializeAMDGPUUnifyDivergentExitNodesPass(PassRegistry&);
+void initializeAMDGPUUnifyDivergentExitNodesPass(PassRegistry &);
 extern char &AMDGPUUnifyDivergentExitNodesID;
 
 ImmutablePass *createAMDGPUAAWrapperPass();
-void initializeAMDGPUAAWrapperPassPass(PassRegistry&);
+void initializeAMDGPUAAWrapperPassPass(PassRegistry &);
 ImmutablePass *createAMDGPUExternalAAWrapperPass();
-void initializeAMDGPUExternalAAWrapperPass(PassRegistry&);
+void initializeAMDGPUExternalAAWrapperPass(PassRegistry &);
 
 void initializeAMDGPUArgumentUsageInfoPass(PassRegistry &);
 
@@ -453,6 +453,17 @@ void initializeAMDGPUSetWavePriorityPass(PassRegistry &);
 void initializeGCNRewritePartialRegUsesPass(llvm::PassRegistry &);
 extern char &GCNRewritePartialRegUsesID;
 
+void initializeAMDGPUUniformIntrinsicCombinePass(PassRegistry &);
+extern char &AMDGPUUniformIntrinsicCombineID;
+FunctionPass *createAMDGPUUniformIntrinsicCombinePass();
+
+struct AMDGPUUniformIntrinsicCombinePass
+    : public PassInfoMixin<AMDGPUUniformIntrinsicCombinePass> {
+  const AMDGPUTargetMachine &TM;
+  AMDGPUUniformIntrinsicCombinePass(const AMDGPUTargetMachine &TM_) : TM(TM_) {}
+  PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
+};
+
 namespace AMDGPU {
 enum TargetIndex {
   TI_CONSTDATA_START,
@@ -488,7 +499,7 @@ static inline bool addrspacesMayAlias(unsigned AS1, unsigned AS2) {
   return ASAliasRules[AS1][AS2];
 }
 
-}
+} // namespace AMDGPU
 
 } // End namespace llvm
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 174a90f0aa419d..1f6a791ed73eb1 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -22,9 +22,9 @@ MODULE_PASS("amdgpu-lower-buffer-fat-pointers",
 MODULE_PASS("amdgpu-lower-ctor-dtor", AMDGPUCtorDtorLoweringPass())
 MODULE_PASS("amdgpu-sw-lower-lds", AMDGPUSwLowerLDSPass(*this))
 MODULE_PASS("amdgpu-lower-module-lds", AMDGPULowerModuleLDSPass(*this))
-MODULE_PASS("amdgpu-perf-hint",
-            AMDGPUPerfHintAnalysisPass(
-              *static_cast<const GCNTargetMachine *>(this)))
+MODULE_PASS(
+    "amdgpu-perf-hint",
+    AMDGPUPerfHintAnalysisPass(*static_cast<const GCNTargetMachine *>(this)))
 MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
 MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
 #undef MODULE_PASS
@@ -32,12 +32,11 @@ MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
 #ifndef MODULE_PASS_WITH_PARAMS
 #define MODULE_PASS_WITH_PARAMS(NAME, CLASS, CREATE_PASS, PARSER, PARAMS)
 #endif
-MODULE_PASS_WITH_PARAMS(
-    "amdgpu-attributor", "AMDGPUAttributorPass",
-    [=](AMDGPUAttributorOptions Options) {
-      return AMDGPUAttributorPass(*this, Options);
-    },
-    parseAMDGPUAttributorPassOptions, "closed-world")
+MODULE_PASS_WITH_PARAMS("amdgpu-attributor", "AMDGPUAttributorPass",
+                        [=](AMDGPUAttributorOptions Options) {
+                          return AMDGPUAttributorPass(*this, Options);
+                        },
+                        parseAMDGPUAttributorPassOptions, "closed-world")
 #undef MODULE_PASS_WITH_PARAMS
 
 #ifndef FUNCTION_PASS
@@ -47,9 +46,9 @@ FUNCTION_PASS("amdgpu-annotate-uniform", AMDGPUAnnotateUniformValuesPass())
 FUNCTION_PASS("amdgpu-codegenprepare", AMDGPUCodeGenPreparePass(*this))
 FUNCTION_PASS("amdgpu-image-intrinsic-opt",
               AMDGPUImageIntrinsicOptimizerPass(*this))
-FUNCTION_PASS("amdgpu-late-codegenprepare",
-              AMDGPULateCodeGenPreparePass(
-                *static_cast<const GCNTargetMachine *>(this)))
+FUNCTION_PASS(
+    "amdgpu-late-codegenprepare",
+    AMDGPULateCodeGenPreparePass(*static_cast<const GCNTargetMachine *>(this)))
 FUNCTION_PASS("amdgpu-lower-kernel-arguments",
               AMDGPULowerKernelArgumentsPass(*this))
 FUNCTION_PASS("amdgpu-lower-kernel-attributes",
@@ -64,7 +63,11 @@ FUNCTION_PASS("amdgpu-rewrite-undef-for-phi", AMDGPURewriteUndefForPHIPass())
 FUNCTION_PASS("amdgpu-unify-divergent-exit-nodes",
               AMDGPUUnifyDivergentExitNodesPass())
 FUNCTION_PASS("amdgpu-usenative", AMDGPUUseNativeCallsPass())
-FUNCTION_PASS("si-annotate-control-flow", SIAnnotateControlFlowPass(*static_cast<const GCNTargetMachine *>(this)))
+FUNCTION_PASS(
+    "si-annotate-control-flow",
+    SIAnnotateControlFlowPass(*static_cast<const GCNTargetMachine *>(this)))
+FUNCTION_PASS("amdgpu-uniformIntrinsic-combine",
+              AMDGPUUniformIntrinsicCombinePass(*this))
 #undef FUNCTION_PASS
 
 #ifndef FUNCTION_ANALYSIS
@@ -82,13 +85,13 @@ FUNCTION_ALIAS_ANALYSIS("amdgpu-aa", AMDGPUAA())
 #ifndef FUNCTION_PASS_WITH_PARAMS
 #define FUNCTION_PASS_WITH_PARAMS(NAME, CLASS, CREATE_PASS, PARSER, PARAMS)
 #endif
-FUNCTION_PASS_WITH_PARAMS(
-    "amdgpu-atomic-optimizer",
-    "AMDGPUAtomicOptimizerPass",
-    [=](ScanOptions Strategy) {
-      return AMDGPUAtomicOptimizerPass(*this, Strategy);
-    },
-    parseAMDGPUAtomicOptimizerStrategy, "strategy=dpp|iterative|none")
+FUNCTION_PASS_WITH_PARAMS("amdgpu-atomic-optimizer",
+                          "AMDGPUAtomicOptimizerPass",
+                          [=](ScanOptions Strategy) {
+                            return AMDGPUAtomicOptimizerPass(*this, Strategy);
+                          },
+                          parseAMDGPUAtomicOptimizerStrategy,
+                          "strategy=dpp|iterative|none")
 #undef FUNCTION_PASS_WITH_PARAMS
 
 #ifndef MACHINE_FUNCTION_PASS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUUniformIntrinsicCombine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUUniformIntrinsicCombine.cpp
new file mode 100644
index 00000000000000..d35c30cf42a7a6
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUUniformIntrinsicCombine.cpp
@@ -0,0 +1,184 @@
+//===-- AMDGPUUniformIntrinsicCombine.cpp
+//-----------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file
+/// This pass combines uniform intrinsic instructions.
+/// Uniform intrinsic combine uses pattern matching to identify and optimize
+/// redundant intrinsic instructions.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "GCNSubtarget.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
+#include "llvm/Analysis/UniformityAnalysis.h"
+#include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/InstVisitor.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Target/TargetMachine.h"
+#include "llvm/Transforms/Utils/BasicBlockUtils.h"
+
+#define DEBUG_TYPE "amdgpu-uniformIntrinsic-combine"
+
+using namespace llvm;
+using namespace llvm::AMDGPU;
+using namespace llvm::PatternMatch;
+
+namespace {
+
+class AMDGPUUniformIntrinsicCombine : public FunctionPass {
+public:
+  static char ID;
+  AMDGPUUniformIntrinsicCombine() : FunctionPass(ID) {}
+
+  bool runOnFunction(Function &F) override;
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addPreserved<DominatorTreeWrapperPass>();
+    AU.addRequired<UniformityInfoWrapperPass>();
+    AU.addRequired<TargetPassConfig>();
+  }
+};
+
+class AMDGPUUniformIntrinsicCombineImpl
+    : public InstVisitor<AMDGPUUniformIntrinsicCombineImpl> {
+private:
+  const UniformityInfo *UI;
+
+  void optimizeUniformIntrinsicInst(IntrinsicInst &II) const;
+
+public:
+  AMDGPUUniformIntrinsicCombineImpl() = delete;
+
+  AMDGPUUniformIntrinsicCombineImpl(const UniformityInfo *UI) : UI(UI) {}
+
+  bool run(Function &F);
+};
+
+} // namespace
+
+char AMDGPUUniformIntrinsicCombine::ID = 0;
+
+char &llvm::AMDGPUUniformIntrinsicCombineID = AMDGPUUniformIntrinsicCombine::ID;
+
+bool AMDGPUUniformIntrinsicCombine::runOnFunction(Function &F) {
+  if (skipFunction(F)) {
+    return false;
+  }
+
+  const UniformityInfo *UI =
+      &getAnalysis<UniformityInfoWrapperPass>().getUniformityInfo();
+
+  return AMDGPUUniformIntrinsicCombineImpl(UI).run(F);
+}
+
+PreservedAnalyses
+AMDGPUUniformIntrinsicCombinePass::run(Function &F,
+                                       FunctionAnalysisManager &AM) {
+
+  const auto *UI = &AM.getResult<UniformityInfoAnalysis>(F);
+  const DataLayout *DL = &F.getDataLayout();
+
+  // @todo check if it is required that this method must return bool, if so
+  // figure out what can be returned.
+  bool IsChanged = AMDGPUUniformIntrinsicCombineImpl(UI).run(F);
+
+  if (!IsChanged) {
+    return PreservedAnalyses::all();
+  }
+
+  PreservedAnalyses PA;
+  PA.preserve<DominatorTreeAnalysis>();
+  return PA;
+}
+
+bool AMDGPUUniformIntrinsicCombineImpl::run(Function &F) {
+
+  // @todo check if it is required that this method must return bool, if so
+  // figure out what can be returned.
+  const bool IsChanged{false};
+
+  // Iterate over each instruction in the function to get the desired intrinsic
+  // inst to check for optimization.
+  for (BasicBlock &BB : F) {
+    for (Instruction &I : BB) {
+      if (auto *Call = dyn_cast<CallInst>(&I)) {
+        if (auto *Intrinsic = dyn_cast<IntrinsicInst>(Call)) {
+          optimizeUniformIntrinsicInst(*Intrinsic);
+        }
+      }
+    }
+  }
+
+  return IsChanged;
+}
+
+void AMDGPUUniformIntrinsicCombineImpl::optimizeUniformIntrinsicInst(
+    IntrinsicInst &II) const {
+  llvm::Intrinsic::ID IID = II.getIntrinsicID();
+
+  switch (IID) {
+  case Intrinsic::amdgcn_permlane64: {
+    Value *Src = II.getOperand(0);
+    if (UI->isUniform(Src)) {
+      return II.replaceAllUsesWith(Src);
+    }
+    break;
+  }
+  case Intrinsic::amdgcn_readfirstlane:
+  case Intrinsic::amdgcn_readlane: {
+    Value *Srcv = II.getOperand(0);
+    if (UI->isUniform(Srcv)) {
+      return II.replaceAllUsesWith(Srcv);
+    }
+
+    // The rest of these may not be safe if the exec may not be the same between
+    // the def and use.
+    Value *Src = II.getArgOperand(0);
+    Instruction *SrcInst = dyn_cast<Instruction>(Src);
+    if (SrcInst && SrcInst->getParent() != II.getParent())
+      break;
+
+    // readfirstlane (readfirstlane x) -> readfirstlane x
+    // readlane (readfirstlane x), y -> readfirstlane x
+    if (match(Src,
+              PatternMatch::m_Intrinsic<Intrinsic::amdgcn_readfirstlane>())) {
+      return II.replaceAllUsesWith(Src);
+    }
+
+    if (IID == Intrinsic::amdgcn_readfirstlane) {
+      // readfirstlane (readlane x, y) -> readlane x, y
+      if (match(Src, PatternMatch::m_Intrinsic<Intrinsic::amdgcn_readlane>())) {
+        return II.replaceAllUsesWith(Src);
+      }
+    } else {
+      // readlane (readlane x, y), y -> readlane x, y
+      if (match(Src, PatternMatch::m_Intrinsic<Intrinsic::amdgcn_readlane>(
+                         PatternMatch::m_Value(),
+                         PatternMatch::m_Specific(II.getArgOperand(1))))) {
+        return II.replaceAllUsesWith(Src);
+      }
+    }
+    break;
+  }
+  }
+}
+
+INITIALIZE_PASS_BEGIN(AMDGPUUniformIntrinsicCombine, DEBUG_TYPE,
+                      "AMDGPU uniformIntrinsic Combine", false, false)
+INITIALIZE_PASS_DEPENDENCY(UniformityInfoWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(TargetPassConfig)
+INITIALIZE_PASS_END(AMDGPUUniformIntrinsicCombine, DEBUG_TYPE,
+                    "AMDGPU uniformIntrinsic Combine", false, false)
+
+FunctionPass *llvm::createAMDGPUUniformIntrinsicCombinePass() {
+  return new AMDGPUUniformIntrinsicCombine();
+}
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index fed29c3e14aae2..13e0fc61a82443 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -61,6 +61,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPUHSAMetadataStreamer.cpp
   AMDGPUInsertDelayAlu.cpp
   AMDGPUInstCombineIntrinsic.cpp
+  AMDGPUUniformIntrinsicCombine.cpp
   AMDGPUInstrInfo.cpp
   AMDGPUInstructionSelector.cpp
   AMDGPUISelDAGToDAG.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-uniform-intrinsic-combine.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-uniform-intrinsic-combine.ll
new file mode 100644
index 00000000000000..cdea0f9ec8fca5
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-uniform-intrinsic-combine.ll
@@ -0,0 +1,221 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1010 -passes="instcombine,amdgpu-uniformIntrinsic-combine" -S < %s | FileCheck %s --check-prefixes=GFX,GFX10
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 -passes="instcombine,amdgpu-uniformIntrinsic-combine" -S < %s | FileCheck %s --check-prefixes=GFX,GFX11
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1200 -passes="instcombine,amdgpu-uniformIntrinsic-combine" -S < %s | FileCheck %s --check-prefixes=GFX,GFX12
+
+define amdgpu_kernel void @permlane64_constant(ptr addrspace(1) %out) {
+; GFX-LABEL: define amdgpu_kernel void @permlane64_constant(
+; GFX-SAME: ptr addrspace(1) [[OUT:%.*]]) #[[ATTR0:[0-9]+]] {
+; GFX-NEXT:    store i32 77, ptr addrspace(1) [[OUT]], align 4
+; GFX-NEXT:    ret void
+;
+  %v = call i32 @llvm.amdgcn.permlane64(i32 77)
+  store i32 %v, ptr addrspace(1) %out
+  ret void
+}
+
+define amdgpu_kernel void @permlane64_undef(ptr addrspace(1) %out) {
+; GFX-LABEL: define amdgpu_kernel void @permlane64_undef(
+; GFX-SAME: ptr addrspace(1) [[OUT:%.*]]) #[[ATTR0]] {
+; GFX-NEXT:    ret void
+;
+  %v = call i32 @llvm.amdgcn.permlane64(i32 undef)
+  store i32 %v, ptr addrspace(1) %out
+  ret void
+}
+
+define amdgpu_kernel void @permlane64_sgpr(ptr addrspace(1) %out, i32 %src) {
+; GFX-LABEL: define amdgpu_kernel void @permlane64_sgpr(
+; GFX-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[SRC:%.*]]) #[[ATTR0]] {
+; GFX-NEXT:    ret void
+;
+  %v = call i32 @llvm.amdgcn.permlane64(i32 undef)
+  store i32 %v, ptr addrspace(1) %out
+  ret void
+}
+
+define amdgpu_kernel void @permlane64_vgpr(i32 addrspace(1)* %out) {
+; GFX-LABEL: define amdgpu_kernel void @permlane64_vgpr(
+; GFX-SAME: ptr addrspace(1) [[OUT:%.*]]) #[[ATTR0]] {
+; GFX-NEXT:    [[TID:%.*]] = call i32 @llvm.amdgcn.workitem.id.x()
+; GFX-NEXT:    [[V:%.*]] = call i32 @llvm.amdgcn.permlane64.i32(i32 [[TID]])
+; GFX-NEXT:    [[TMP1:%.*]] = sext i32 [[TID]] to i64
+; GFX-NEXT:    [[OUT_PTR:%.*]] = getelementptr i32, ptr addrspace(1) [[OUT]], i64 [[TMP1]]
+; GFX-NEXT:    store i32 [[V]], ptr addrspace(1) [[OUT_PTR]], align 4
+; GFX-NEXT:    ret void
+;
+  %tid = call i32 @llvm.amdgcn.workitem.id.x()
+  %v = call i32 @llvm.amdgcn.permlane64(i32 %tid)
+  %out_ptr = getelementptr i32, i32 addrspace(1)* %out, i32 %tid
+  store i32 %v, i32 addrspace(1)* %out_ptr
+  ret void
+}
+
+define amdgpu_kernel void @permlane64_vgpr_expression(i32 addrspace(1)* %out) {
+; GFX-LABEL: define amdgpu_kernel void @permlane64_vgpr_expression(
+; GFX-SAME: ptr addrspace(1) [[OUT:%.*]]) #[[ATTR0]] {
+; GFX-NEXT:    [[TID:%.*]] = call i32 @llvm.amdgcn.workitem.id.x()
+; GFX-NEXT:    [[TID2:%.*]] = add i32 [[TID]], 1
+; GFX-NEXT:    [[V:%.*]] = call i32 @llvm.amdgcn.permlane64.i32(i32 [[TID2]])
+; GFX-NEXT:    [[TMP1:%.*]] = sext i32 [[TID]] to i64
+; GFX-NEXT:    [[OUT_PTR:%.*]] = getelementptr i32, ptr addrspace(1) [[OUT]], i64 [[TMP1]]
+; GFX-NEXT:    store i32 [[V]], ptr addrspace(1) [[OUT_PTR]], align 4
+; GFX-NEXT:    ret void
+;
+  %tid = call i32 @llvm.amdgcn.workitem.id.x()
+  %tid2 = add i32 %tid, 1
+  %v = call i32 @llvm.amdgcn.permlane64(i32 %tid2)
+  %out_ptr = getelementptr i32, i32 addrspace(1)* %out, i32 %tid
+  store i32 %v, i32 addrspace(1)* %out_ptr
+  ret void
+}
+
+define amdgpu_kernel void @readlane_constant(ptr addrspace(1) %out) {
+; GFX-LABEL: define amdgpu_kernel void @readlan...
[truncated]


github-actions bot commented Nov 20, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@PankajDwivedi-25 PankajDwivedi-25 force-pushed the users/pkd-25/applyUniformityToInstCombine branch from 5ea4ecb to 1f450cf Compare November 21, 2024 07:17
Collaborator

@ssahasra ssahasra left a comment

This change should include an actual use of the pass in the AMDGPU pipeline using AMDGPUTargetMachine. The commandline option should be used to enable/disable the pass in the pipeline.

There need to be tests that demonstrate how this optimization enables other optimizations like CSE and hoisting/sinking since convergent calls are getting eliminated.

Also, there must be a test that shows how this pass can be used to eliminate a trivial waterfall loop. That is, a loop which depends on "wfall", where all threads in the wave finish their work in the first iteration itself.

@ssahasra
Collaborator

As per the draft PR #99878 discussion, it turned out that having UniformityAnalysis after InstCombine may not fully utilize all the scopes, but it is not going to cause dependency troubles.

I didn't understand this sentence. Can you please elaborate?

Collaborator

@ssahasra ssahasra left a comment

Also add a test where a loop has a divergent exit, and a value that is uniform inside the loop is used by one of these intrinsics outside the loop. It should not be optimized because of temporal divergence.
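
A minimal hand-written sketch of such a case (names are illustrative, not from the patch): the value is uniform on every iteration inside the loop, but lanes leave the loop on different iterations, so it is divergent at the use after the loop and must not be folded:

  declare i32 @llvm.amdgcn.workitem.id.x()
  declare i32 @llvm.amdgcn.readfirstlane.i32(i32)

  define amdgpu_kernel void @temporal_divergence(ptr addrspace(1) %out) {
  entry:
    %tid = call i32 @llvm.amdgcn.workitem.id.x()
    br label %loop

  loop:
    %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
    %i.next = add i32 %i, 1
    ; Divergent exit: each lane leaves the loop on a different iteration.
    %cond = icmp ult i32 %i.next, %tid
    br i1 %cond, label %loop, label %exit

  exit:
    ; %i.next is uniform inside the loop but divergent at this use because of
    ; temporal divergence, so the readfirstlane must not be removed.
    %v = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %i.next)
    store i32 %v, ptr addrspace(1) %out
    ret void
  }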

@ssahasra ssahasra requested a review from arsenm November 26, 2024 04:27
@ssahasra
Collaborator

This change should include an actual use of the pass in the AMDGPU pipeline using AMDGPUTargetMachine. The commandline option should be used enable/disable the pass in the pipeline.

This should be a separate commandline option like amdgpu-enable-uniformity-optzn or something like that.

@pravinjagtap pravinjagtap self-requested a review November 26, 2024 12:15
Contributor

@arsenm arsenm left a comment

I am concerned this does not have enough context to be definitively correct after later transformations. Consider a case like this:

  if (ballot_all(x)) {
    uniform_x = readfirstlane x
    speculatable_use(uniform_x)
    ...
  }

This transform will then produce:

  if (ballot_all(x)) {
    speculatable_use(x)
     ...
  }

Which may then later be incorrectly hoisted:

  speculatable_use(x) // x is not actually uniform anymore
  if (ballot_all(x)) { ... }

There needs to be something tying the uniformity to the use position

@ssahasra
Collaborator

ssahasra commented Dec 6, 2024

I am concerned this does not have enough context to be definitively correct after later transformations. Consider a case like this:

  if (ballot_all(x)) {
    uniform_x = readfirstlane x
    speculatable_use(uniform_x)
    ...
  }

If speculatable_use() depends on the uniformity of its argument, then shouldn't it be marked convergent? Clearly it has cross-thread semantics and should not be moved in the control flow.

Alternatively, @jayfoad , is this a use-case for readanylane? Perhaps this pass can replace readfirstlane with readanylane, which eventually becomes a nop if x is in an sreg, sufficiently far down in the codegen flow?

@jayfoad
Contributor

jayfoad commented Dec 6, 2024

If speculatable_use() depends on the uniformity of its argument, then shouldn't it be marked convergent? Clearly it has cross-thread semantics and should not be moved in the control flow.

Agreed. The only kind of speculatable_use I can think of that might be affected by this is a simple SALU-only logical operation like this one, where we choose to legalize VGPR inputs with readfirstlane instead of a waterfall loop:

// Lowers to S_BITREPLICATE_B64_B32.
// The argument must be uniform; otherwise, the result is undefined.
def int_amdgcn_s_bitreplicate :
  DefaultAttrsIntrinsic<[llvm_i64_ty], [llvm_i32_ty], [IntrNoMem, IntrConvergent]>;

But it is already marked as convergent, so I don't think there's a problem here after all.

@arsenm
Contributor

arsenm commented Dec 6, 2024

If speculatable_use() depends on the uniformity of its argument, then shouldn't it be marked convergent? Clearly it has cross-thread semantics and should not be moved in the control flow.

No, it could be anything, like a simple add. The transitive users may be convergent, but I don't think it really matters. I think there would need to be some kind of convergent operation inserted here to maintain the use position

@jayfoad
Contributor

jayfoad commented Dec 6, 2024

If speculatable_use() depends on the uniformity of its argument, then shouldn't it be marked convergent? Clearly it has cross-thread semantics and should not be moved in the control flow.

No, it could be anything, like a simple add. The transitive users may be convergent, but I don't think it really matters. I think there would need to be some kind of convergent operation inserted here to maintain the use position

If it was a simple add then hoisting it would not break anything. So there is no problem.

@ssahasra
Collaborator

ssahasra commented Dec 9, 2024

No, it could be anything, like a simple add. The transitive users may be convergent, but I don't think it really matters. I think there would need to be some kind of convergent operation inserted here to maintain the use position

This still sounds overly conservative. If a speculatable use depends on the value being uniform, then it has cross-thread semantics and needs to be marked convergent. Still not seeing a reason to always maintain the position of the use.

@ssahasra
Collaborator

  if (ballot_all(x)) {
    speculatable_use(x)
     ...
  }

Which may then later be incorrectly hoisted:

  speculatable_use(x) // x is not actually uniform anymore
  if (ballot_all(x)) { ... }

There needs to be something tying the uniformity to the use position

Actually there is more going on here than just a lack of convergent. ballot_all(x) is a uniform condition. It will allow convergent operations to be hoisted over it. So if speculatable_use(x) really does depend on x being uniform, then convergent is not enough to stop it. But should this use have speculatable in the first place? By explicitly putting the speculatable attribute, the user is saying that the compiler is free to hoist this use wherever other conditions are met. Perhaps they shouldn't be using speculatable in the absence of nuance about uniformity?

@PankajDwivedi-25
Contributor Author

Also, there must be a test that shows how this pass can be used to eliminate a trivial waterfall loop. That is, a loop which depends on "wfall", where all threads in the wave finish their work in the first iteration itself.

I see this case is not getting optimized in the checks generated & the loop structure remains intact.

@jayfoad
Contributor

jayfoad commented Dec 12, 2024

  if (ballot_all(x)) {
    speculatable_use(x)
     ...
  }

Which may then later be incorrectly hoisted:

  speculatable_use(x) // x is not actually uniform anymore
  if (ballot_all(x)) { ... }

There needs to be something tying the uniformity to the use position

Actually there is more going on here than just a lack of convergent. ballot_all(x) is a uniform condition. It will allow convergent operations to be hoisted over it. So if speculatable_use(x) really does depend on x being uniform, then convergent is not enough to stop it. But should this use have speculatable in the first place? By explicitly putting the speculatable attribute, the user is saying that the compiler is free to hoist this use wherever other conditions are met. Perhaps they shouldn't be using speculatable in the absence of nuance about uniformity?

I still don't see any problem here. I would assume that any speculatable use does not have side effects, so really it just computes some result, e.g.:

  if (ballot_all(x)) {
    y = speculatable_use(x) // convergent
    use(y)
  }

After hoisting:

  y = speculatable_use(x) // convergent
  if (ballot_all(x)) {
    use(y)
  }

So x may not be uniform when we call speculatable_use, but we only use y if x was uniform. This seems fine, even if speculatable_use returns poison when its argument is not uniform. (As long as it doesn't trigger undefined behavior...)

@ssahasra
Collaborator

After hoisting:

  y = speculatable_use(x) // convergent
  if (ballot_all(x)) {
    use(y)
  }

So x may not be uniform when we call speculatable_use, but we only use y if x was uniform. This seems fine, even if speculatable_use returns poison when its argument is not uniform. (As long as it doesn't trigger undefined behavior...)

So the bottom line is that both speculatable and convergent are orthogonal to this transformation.

@PankajDwivedi-25 PankajDwivedi-25 force-pushed the users/pkd-25/applyUniformityToInstCombine branch from 51eacb9 to 7ee0cca Compare February 24, 2025 06:44
Comment on lines 80 to 82
// If there are no ICmp users, return early.
if (none_of(II.users(), [](User *U) { return isa<ICmpInst>(U); }))
return false;
Contributor

I'd just drop this. It's an extra use list walk

}

/// Iterate over intrinsics in the module to optimise.
static bool runUniformIntrinsicCombine(Module &M, ModuleAnalysisManager &AM) {
Contributor

This should probably be a function pass

Contributor Author

@PankajDwivedi-25 PankajDwivedi-25 Sep 17, 2025

As discussed here: #116953 (comment), #116953 (comment)
We will be inspecting intrinsic declarations outside of the function.

Contributor

Looking at the intrinsic declarations is fine, but you can also just not use the function use list. You can just directly pattern match inside the function

Contributor Author

@PankajDwivedi-25 PankajDwivedi-25 Sep 17, 2025

you can also just not use the function use list

Inside a module pass it should be ok right?

You can just directly pattern match inside the function

Right, that is going to be costlier; we would have to iterate over all the instructions in the function.

Contributor Author

@PankajDwivedi-25 PankajDwivedi-25 Sep 17, 2025

We had a series of discussions on this and ended up making it a module pass.

Comment on lines 143 to 144
if (ParentF->isDeclaration())
continue;
Contributor

This cannot happen

using namespace llvm::PatternMatch;

/// Tracks uniformity of newly created instructions.
using UniformityTracker = ValueMap<const Value *, bool>;
Collaborator

This type is used to declare exactly one variable later. We don't really need to declare the type. You could just directly specify it in the declaration of Tracker itself.

Collaborator

I am not really sure if this should be a ValueMap. When something gets RAUW'ed, the ValueMap will update itself and discard the original information. But RAUW does not necessarily mean that the original Value will be destroyed. If some new use is introduced, then we do want to remember that it was uniform. At best, we want a map whose key type is a WeakVH.

Contributor Author

Correct, but that should not happen; if we replace all the uses with a new value, that means we don't want the old one, and it will probably be removed by DCE or an equivalent dead-code cleanup pass.

@PankajDwivedi-25
Contributor Author

Ping @arsenm: I have added a description of the above reasoning in the pass itself, explaining why it is a module pass.

Please let me know if we can proceed.

@PankajDwivedi-25
Contributor Author

Ping;
@ssahasra @jayfoad @arsenm
This work has been pending for some time. I have addressed all the review comments, including the justification for implementing it as a ModulePass.

Please review and approve if there are no further concerns.

@PankajDwivedi-25 PankajDwivedi-25 force-pushed the users/pkd-25/applyUniformityToInstCombine branch from 9ee3555 to e965482 Compare October 3, 2025 10:19
Collaborator

@ssahasra ssahasra left a comment

I do believe that we had already agreed to make it a ModulePass because traversing over the uses of the intrinsic declaration is cheaper than traversing over every function body looking for calls to very specific intrinsics.

LGTM, please wait 24 hours before submitting so that other reviewers have a chance to comment.

@PankajDwivedi-25 PankajDwivedi-25 enabled auto-merge (squash) October 7, 2025 07:15
@PankajDwivedi-25 PankajDwivedi-25 force-pushed the users/pkd-25/applyUniformityToInstCombine branch from 777951d to 6292fcc Compare October 7, 2025 09:18
@PankajDwivedi-25
Contributor Author

I am not sure what changes have been made to the function and module classes recently, but it affects what F.users() reports.

Isn't F.users(), where F is a function declaration, supposed to report the list of uses of this Function? @ssahasra @arsenm @jayfoad

@PankajDwivedi-25
Contributor Author

I am not sure what changes have been made to the function and module classes recently, but it affects what F.users() reports.

Isn't F.users(), where F is a function declaration, supposed to report the list of uses of this Function? @ssahasra @arsenm @jayfoad

Please ignore this.

Merging as the checks passed.

@PankajDwivedi-25 PankajDwivedi-25 merged commit 53aad35 into llvm:main Oct 9, 2025
9 checks passed
@llvm-ci
Collaborator

llvm-ci commented Oct 9, 2025

LLVM Buildbot has detected a new failure on builder clang-armv8-quick running on linaro-clang-armv8-quick while building llvm at step 5 "ninja check 1".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/154/builds/22525

Here is the relevant piece of the build log for reference:
Step 5 (ninja check 1) failure: stage 1 checked (failure)
******************** TEST 'lit :: max-time.py' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 5
env -u FILECHECK_OPTS "/usr/bin/python3.10" /home/tcwg-buildbot/worker/clang-armv8-quick/llvm/llvm/utils/lit/lit.py -j1 --order=lexical Inputs/max-time --max-time=5 2>&1  |  FileCheck /home/tcwg-buildbot/worker/clang-armv8-quick/stage1/utils/lit/tests/max-time.py
# executed command: env -u FILECHECK_OPTS /usr/bin/python3.10 /home/tcwg-buildbot/worker/clang-armv8-quick/llvm/llvm/utils/lit/lit.py -j1 --order=lexical Inputs/max-time --max-time=5
# executed command: FileCheck /home/tcwg-buildbot/worker/clang-armv8-quick/stage1/utils/lit/tests/max-time.py
# .---command stderr------------
# | /home/tcwg-buildbot/worker/clang-armv8-quick/stage1/utils/lit/tests/max-time.py:8:10: error: CHECK: expected string not found in input
# | # CHECK: Skipped: 1
# |          ^
# | <stdin>:2:51: note: scanning from here
# | warning: reached timeout, skipping remaining tests
# |                                                   ^
# | <stdin>:7:2: note: possible intended match here
# |  Skipped: 2 (100.00%)
# |  ^
# | 
# | Input file: <stdin>
# | Check file: /home/tcwg-buildbot/worker/clang-armv8-quick/stage1/utils/lit/tests/max-time.py
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |            1: -- Testing: 2 tests, 1 workers -- 
# |            2: warning: reached timeout, skipping remaining tests 
# | check:8'0                                                       X error: no match found
# |            3:  
# | check:8'0     ~
# |            4: Testing Time: 6.22s 
# | check:8'0     ~~~~~~~~~~~~~~~~~~~~
# |            5:  
# | check:8'0     ~
# |            6: Total Discovered Tests: 2 
# | check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~
# |            7:  Skipped: 2 (100.00%) 
# | check:8'0     ~~~~~~~~~~~~~~~~~~~~~~
# | check:8'1      ?                     possible intended match
# | >>>>>>
# `-----------------------------
# error: command failed with exit status: 1

--

********************


svkeerthy pushed a commit that referenced this pull request Oct 9, 2025
… uniform AMDGPU lane Intrinsics. (#116953)

This pass introduces optimizations for AMDGPU intrinsics by leveraging
the uniformity of their arguments. When an intrinsic's arguments are
detected as uniform, redundant computations are eliminated, and the
intrinsic calls are simplified accordingly.

By utilizing the UniformityInfo analysis, this pass identifies cases
where intrinsic calls are uniform across all lanes, allowing
transformations that reduce unnecessary operations and improve the IR's
efficiency.

These changes enhance performance by streamlining intrinsic usage in
uniform scenarios without altering the program's semantics.

For background, see PR #99878
clingfei pushed a commit to clingfei/llvm-project that referenced this pull request Oct 10, 2025
… uniform AMDGPU lane Intrinsics. (llvm#116953)

This pass introduces optimizations for AMDGPU intrinsics by leveraging
the uniformity of their arguments. When an intrinsic's arguments are
detected as uniform, redundant computations are eliminated, and the
intrinsic calls are simplified accordingly.

By utilizing the UniformityInfo analysis, this pass identifies cases
where intrinsic calls are uniform across all lanes, allowing
transformations that reduce unnecessary operations and improve the IR's
efficiency.

These changes enhance performance by streamlining intrinsic usage in
uniform scenarios without altering the program's semantics.

For background, see PR llvm#99878