artagnon
Contributor

@artagnon artagnon commented Aug 3, 2025

Introduce a simple common-subexpression-elimination pass at the VPlan-level, running late during the execution of the VPlan. The long-term vision is to get rid of the legacy non-VPlan-based cse routine in LV, but this patch doesn't yet fully subsume it.
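
For readers unfamiliar with the technique, here is a minimal sketch of hash-based CSE on a toy IR in plain C++. It is not the VPlan implementation: `Node`, `runToyCSE`, and the `NoWrap` field are hypothetical names made up for illustration, with `NoWrap` standing in for a poison-generating flag. The sketch assumes nodes are stored in definition order, so every operand refers to an earlier node.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical toy IR node (not an LLVM type).
struct Node {
  unsigned Opcode;
  std::vector<int> Operands; // indices of earlier nodes
  bool NoWrap;               // stand-in for a poison-generating flag
};

// For each node, returns the index of the node that should be used in its
// place. A node that maps to itself is kept; any other node is redundant.
std::vector<int> runToyCSE(std::vector<Node> &Nodes) {
  std::map<std::pair<unsigned, std::vector<int>>, int> Seen;
  std::vector<int> Replacement(Nodes.size());
  for (int I = 0, E = static_cast<int>(Nodes.size()); I != E; ++I) {
    Node &N = Nodes[I];
    // Rewrite operands first, so that uses of already-eliminated nodes
    // compare equal to uses of the node that replaced them.
    for (int &Op : N.Operands) {
      assert(Op >= 0 && Op < I && "operands must refer to earlier nodes");
      Op = Replacement[Op];
    }
    auto Key = std::make_pair(N.Opcode, N.Operands);
    auto [It, Inserted] = Seen.try_emplace(Key, I);
    // When reusing an earlier node, conservatively drop its flag: the
    // eliminated node may not have carried the same guarantee.
    if (!Inserted)
      Nodes[It->second].NoWrap = false;
    Replacement[I] = It->second;
  }
  return Replacement;
}
```

Given two identical `add` nodes over the same inputs, the second maps to the first, and a later user of both ends up with the same operand twice. The toy uses an ordered `std::map` for brevity; the patch itself uses a `DenseMap` with a custom `DenseMapInfo`.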

github-actions bot commented Aug 3, 2025

✅ With the latest revision this PR passed the undef deprecator.

@artagnon artagnon requested a review from ayalz August 3, 2025 17:37
@artagnon artagnon force-pushed the vplan-cse branch 2 times, most recently from 71938b9 to 771fefc Compare August 3, 2025 21:06
@artagnon artagnon requested a review from nikic August 3, 2025 23:26
@artagnon artagnon force-pushed the vplan-cse branch 2 times, most recently from f8f64e9 to 70925b7 Compare August 4, 2025 17:49
@artagnon
Contributor Author

artagnon commented Aug 4, 2025

Current status: there are lots of failing tests to update, and there is a DenseMap-already-present-value-when-growing crash in Transforms/LoopVectorize/X86/induction-costs.ll, which is turning out to be very hard to debug.

@llvmbot
Member

llvmbot commented Aug 4, 2025

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-backend-risc-v

@llvm/pr-subscribers-backend-powerpc

Author: Ramkumar Ramachandra (artagnon)

Changes

Introduce a simple and limited common-subexpression-elimination pass at the VPlan-level, running late after recipes are executed. The long-term vision is to get rid of the legacy non-VPlan-based cse routine in LV, but this patch doesn't yet fully subsume it.


Patch is 207.23 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/151872.diff

56 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+5)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+70)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+4)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanUtils.h (+17)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll (+63-69)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/drop-poison-generating-flags.ll (+2-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-widen-inductions.ll (+2-3)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll (+71-34)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/force-target-instruction-cost.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll (+3-6)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/interleave-with-gaps.ll (+4-8)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/licm-calls.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/masked-call-scalarize.ll (+4-11)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll (+9-12)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/outer_loop_test1_no_explicit_vect_width.ll (+66-58)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-dot-product.ll (-12)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/reduction-recurrence-costs-sve.ll (+2-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/store-costs-sve.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-epilog-vect.ll (+16-16)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-vscale-based-trip-counts.ll (+28-28)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-widen-extractvalue.ll (+32-10)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-widen-phi.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/transform-narrow-interleave-to-widen-memory-with-wide-ops.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopVectorize/ARM/mve-reduction-types.ll (+5-10)
  • (modified) llvm/test/Transforms/LoopVectorize/PowerPC/vectorize-bswap.ll (+2-4)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll (+2-4)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/scalable-tailfold.ll (+2-3)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/strided-accesses.ll (+4-6)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/cost-constant-known-via-scev.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/cost-model.ll (-7)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/induction-costs.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/interleave-cost.ll (+2-8)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/load-deref-pred.ll (+6-33)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/outer_loop_test1_no_explicit_vect_width.ll (+66-57)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/replicate-uniform-call.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/scatter_crash.ll (+6-12)
  • (modified) llvm/test/Transforms/LoopVectorize/assume.ll (+104-16)
  • (modified) llvm/test/Transforms/LoopVectorize/dead_instructions.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/dereferenceable-info-from-assumption-constant-size.ll (+3-6)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-complex.ll (+2-4)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-multiply-recurrences.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll (+15-27)
  • (modified) llvm/test/Transforms/LoopVectorize/induction.ll (+2-4)
  • (modified) llvm/test/Transforms/LoopVectorize/interleave-with-i65-induction.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopVectorize/interleaved-accesses-different-insert-position.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopVectorize/opaque-ptr.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/outer_loop_test1.ll (+36-29)
  • (modified) llvm/test/Transforms/LoopVectorize/pr36983-multiple-lcssa.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopVectorize/pr59319-loop-access-info-invalidation.ll (+12-13)
  • (modified) llvm/test/Transforms/LoopVectorize/pseudoprobe.ll (+2-3)
  • (modified) llvm/test/Transforms/LoopVectorize/reverse_induction.ll (+3-6)
  • (modified) llvm/test/Transforms/LoopVectorize/scalable-assume.ll (+148-17)
  • (modified) llvm/test/Transforms/LoopVectorize/single-value-blend-phis.ll (+5-7)
  • (modified) llvm/test/Transforms/LoopVectorize/uniform_across_vf_induction2.ll (+9-10)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index d04317bd8822d..b78017027dbf1 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7309,6 +7309,7 @@ DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan(
   VPlanTransforms::narrowInterleaveGroups(
       BestVPlan, BestVF,
       TTI.getRegisterBitWidth(TargetTransformInfo::RGK_FixedWidthVector));
+  VPlanTransforms::cse(BestVPlan, *Legal->getWidestInductionType());
   VPlanTransforms::removeDeadRecipes(BestVPlan);
 
   VPlanTransforms::convertToConcreteRecipes(BestVPlan,
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 39f5e3651e9bb..7929d30a1c9f5 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -897,6 +897,11 @@ struct VPRecipeWithIRFlags : public VPSingleDefRecipe, public VPIRFlags {
     return R && classof(R);
   }
 
+  static inline bool classof(const VPSingleDefRecipe *U) {
+    auto *R = dyn_cast<VPRecipeBase>(U);
+    return R && classof(R);
+  }
+
   void execute(VPTransformState &State) override = 0;
 };
 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 3ecffc7593d49..5e7815f26e9c0 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -1753,6 +1753,76 @@ void VPlanTransforms::clearReductionWrapFlags(VPlan &Plan) {
   }
 }
 
+/// Hash the underlying data of a VPSingleDefRecipe pointer, instead of hashing
+/// the pointer itself.
+namespace {
+struct VPCSEDenseMapInfo : public DenseMapInfo<VPSingleDefRecipe *> {
+  static bool isSentinel(const VPSingleDefRecipe *Def) {
+    return Def == getEmptyKey() || Def == getTombstoneKey();
+  }
+
+  static bool canHandle(const VPSingleDefRecipe *Def) {
+    return isa<VPInstruction, VPWidenRecipe, VPWidenCastRecipe,
+               VPWidenSelectRecipe, VPHistogramRecipe, VPPartialReductionRecipe,
+               VPReplicateRecipe, VPWidenIntrinsicRecipe>(Def);
+  }
+
+  static unsigned getHashValue(const VPSingleDefRecipe *Def) {
+    return hash_combine(Def->getVPDefID(), vputils::getOpcode(*Def),
+                        vputils::isSingleScalar(Def),
+                        hash_combine_range(Def->operands()));
+  }
+
+  static bool isEqual(const VPSingleDefRecipe *L, const VPSingleDefRecipe *R) {
+    if (isSentinel(L) || isSentinel(R))
+      return L == R;
+    bool Result = L->getVPDefID() == R->getVPDefID() &&
+           vputils::getOpcode(*L) == vputils::getOpcode(*R) &&
+           vputils::isSingleScalar(L) == vputils::isSingleScalar(R) &&
+           equal(L->operands(), R->operands());
+    assert(!Result || getHashValue(L) == getHashValue(R));
+    return Result;
+  }
+};
+} // end anonymous namespace
+
+/// Perform a common-subexpression-elimination of VPSingleDefRecipes on the \p
+/// Plan.
+void VPlanTransforms::cse(VPlan &Plan, Type &CanonicalIVTy) {
+  VPRegionBlock *LoopRegion = Plan.getVectorLoopRegion();
+  if (!LoopRegion)
+    return;
+  auto VPBBsOutsideLoopRegion = VPBlockUtils::blocksOnly<VPBasicBlock>(
+      vp_depth_first_shallow(Plan.getEntry()));
+  auto VPBBsInsideLoopRegion = VPBlockUtils::blocksOnly<VPBasicBlock>(
+      vp_depth_first_shallow(LoopRegion->getEntry()));
+
+  // There is existing logic to sink instructions into replicate regions, and
+  // we'd be undoing that work if we went through replicate regions. Hence,
+  // don't CSE in replicate regions.
+  DenseMap<VPSingleDefRecipe *, VPSingleDefRecipe *, VPCSEDenseMapInfo> CSEMap;
+  VPTypeAnalysis TypeInfo(&CanonicalIVTy);
+  for (VPBasicBlock *VPBB :
+       concat<VPBasicBlock *>(VPBBsOutsideLoopRegion, VPBBsInsideLoopRegion)) {
+    for (VPRecipeBase &R : make_early_inc_range(*VPBB)) {
+      auto *Def = dyn_cast<VPSingleDefRecipe>(&R);
+      if (!Def || !VPCSEDenseMapInfo::canHandle(Def))
+        continue;
+      if (VPSingleDefRecipe *V = CSEMap.lookup(Def)) {
+        if (TypeInfo.inferScalarType(Def) != TypeInfo.inferScalarType(V))
+          continue;
+        // Drop poison-generating flags when reusing a value.
+        if (auto *RFlags = dyn_cast<VPRecipeWithIRFlags>(V))
+          RFlags->dropPoisonGeneratingFlags();
+        Def->replaceAllUsesWith(V);
+        Def->eraseFromParent();
+        continue;
+      }
+      CSEMap[Def] = Def;
+    }
+  }
+}
+
 /// Move loop-invariant recipes out of the vector loop region in \p Plan.
 static void licm(VPlan &Plan) {
   VPBasicBlock *Preheader = Plan.getVectorPreheader();
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index 5943684e17a76..9e99c781022d7 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -240,6 +240,10 @@ struct VPlanTransforms {
   /// removing dead edges to their successors.
   static void removeBranchOnConst(VPlan &Plan);
 
+  /// Perform common-subexpression-elimination, which is best done after the \p
+  /// Plan is executed.
+  static void cse(VPlan &Plan, Type &CanonicalIVType);
+
   /// If there's a single exit block, optimize its phi recipes that use exiting
   /// IV values by feeding them precomputed end values instead, possibly taken
   /// one step backwards.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanUtils.h b/llvm/lib/Transforms/Vectorize/VPlanUtils.h
index 8dcd57f1b3598..f0a6540a91915 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanUtils.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanUtils.h
@@ -10,6 +10,7 @@
 #define LLVM_TRANSFORMS_VECTORIZE_VPLANUTILS_H
 
 #include "VPlan.h"
+#include "llvm/ADT/TypeSwitch.h"
 
 namespace llvm {
 class ScalarEvolution;
@@ -37,6 +38,22 @@ VPValue *getOrCreateVPValueForSCEVExpr(VPlan &Plan, const SCEV *Expr,
 /// SCEV expression could be constructed.
 const SCEV *getSCEVExprForVPValue(VPValue *V, ScalarEvolution &SE);
 
+/// Get any instruction opcode data embedded in recipe \p R. Returns an optional
+/// pair, where the first element indicates whether it is an intrinsic ID.
+inline std::optional<std::pair<bool, unsigned>>
+getOpcode(const VPRecipeBase &R) {
+  return TypeSwitch<const VPRecipeBase *,
+                    std::optional<std::pair<bool, unsigned>>>(&R)
+      .Case<VPInstruction, VPWidenRecipe, VPWidenCastRecipe,
+            VPWidenSelectRecipe, VPHistogramRecipe, VPPartialReductionRecipe,
+            VPReplicateRecipe>(
+          [](auto *I) { return std::make_pair(false, I->getOpcode()); })
+      .Case<VPWidenIntrinsicRecipe>([](auto *I) {
+        return std::make_pair(true, I->getVectorIntrinsicID());
+      })
+      .Default([](auto *) { return std::nullopt; });
+}
+
 /// Returns true if \p VPV is a single scalar, either because it produces the
 /// same value for all lanes or only has its first lane used.
 inline bool isSingleScalar(const VPValue *VPV) {
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
index 0232d88347d0a..cefb191f74c3e 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
@@ -1032,8 +1032,8 @@ define void @test_conditional_interleave_group (ptr noalias %src.1, ptr noalias
 ; DEFAULT-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]
 ; DEFAULT-NEXT:    br label %[[VECTOR_BODY:.*]]
 ; DEFAULT:       [[VECTOR_BODY]]:
-; DEFAULT-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE27:.*]] ]
-; DEFAULT-NEXT:    [[VEC_IND:%.*]] = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[PRED_STORE_CONTINUE27]] ]
+; DEFAULT-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE25:.*]] ]
+; DEFAULT-NEXT:    [[VEC_IND:%.*]] = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[PRED_STORE_CONTINUE25]] ]
 ; DEFAULT-NEXT:    [[TMP15:%.*]] = load float, ptr [[SRC_1]], align 4
 ; DEFAULT-NEXT:    [[BROADCAST_SPLATINSERT8:%.*]] = insertelement <8 x float> poison, float [[TMP15]], i64 0
 ; DEFAULT-NEXT:    [[BROADCAST_SPLAT9:%.*]] = shufflevector <8 x float> [[BROADCAST_SPLATINSERT8]], <8 x float> poison, <8 x i32> zeroinitializer
@@ -1046,10 +1046,7 @@ define void @test_conditional_interleave_group (ptr noalias %src.1, ptr noalias
 ; DEFAULT-NEXT:    [[BROADCAST_SPLATINSERT10:%.*]] = insertelement <8 x float> poison, float [[TMP19]], i64 0
 ; DEFAULT-NEXT:    [[BROADCAST_SPLAT11:%.*]] = shufflevector <8 x float> [[BROADCAST_SPLATINSERT10]], <8 x float> poison, <8 x i32> zeroinitializer
 ; DEFAULT-NEXT:    [[TMP20:%.*]] = call <8 x float> @llvm.fmuladd.v8f32(<8 x float> [[BROADCAST_SPLAT11]], <8 x float> zeroinitializer, <8 x float> [[TMP18]])
-; DEFAULT-NEXT:    [[TMP21:%.*]] = load float, ptr [[SRC_3]], align 4
-; DEFAULT-NEXT:    [[BROADCAST_SPLATINSERT12:%.*]] = insertelement <8 x float> poison, float [[TMP21]], i64 0
-; DEFAULT-NEXT:    [[BROADCAST_SPLAT13:%.*]] = shufflevector <8 x float> [[BROADCAST_SPLATINSERT12]], <8 x float> poison, <8 x i32> zeroinitializer
-; DEFAULT-NEXT:    [[TMP22:%.*]] = fcmp ogt <8 x float> [[TMP20]], [[BROADCAST_SPLAT13]]
+; DEFAULT-NEXT:    [[TMP22:%.*]] = fcmp ogt <8 x float> [[TMP20]], [[BROADCAST_SPLAT11]]
 ; DEFAULT-NEXT:    [[TMP23:%.*]] = getelementptr { [4 x float] }, ptr [[DST]], <8 x i64> [[VEC_IND]]
 ; DEFAULT-NEXT:    [[TMP24:%.*]] = extractelement <8 x i1> [[TMP22]], i32 0
 ; DEFAULT-NEXT:    br i1 [[TMP24]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
@@ -1067,8 +1064,8 @@ define void @test_conditional_interleave_group (ptr noalias %src.1, ptr noalias
 ; DEFAULT-NEXT:    br label %[[PRED_STORE_CONTINUE]]
 ; DEFAULT:       [[PRED_STORE_CONTINUE]]:
 ; DEFAULT-NEXT:    [[TMP31:%.*]] = extractelement <8 x i1> [[TMP22]], i32 1
-; DEFAULT-NEXT:    br i1 [[TMP31]], label %[[PRED_STORE_IF14:.*]], label %[[PRED_STORE_CONTINUE15:.*]]
-; DEFAULT:       [[PRED_STORE_IF14]]:
+; DEFAULT-NEXT:    br i1 [[TMP31]], label %[[PRED_STORE_IF12:.*]], label %[[PRED_STORE_CONTINUE13:.*]]
+; DEFAULT:       [[PRED_STORE_IF12]]:
 ; DEFAULT-NEXT:    [[TMP32:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 1
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP32]], align 4
 ; DEFAULT-NEXT:    [[TMP33:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 1
@@ -1079,11 +1076,11 @@ define void @test_conditional_interleave_group (ptr noalias %src.1, ptr noalias
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP36]], align 4
 ; DEFAULT-NEXT:    [[TMP37:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 1
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP37]], align 4
-; DEFAULT-NEXT:    br label %[[PRED_STORE_CONTINUE15]]
-; DEFAULT:       [[PRED_STORE_CONTINUE15]]:
+; DEFAULT-NEXT:    br label %[[PRED_STORE_CONTINUE13]]
+; DEFAULT:       [[PRED_STORE_CONTINUE13]]:
 ; DEFAULT-NEXT:    [[TMP38:%.*]] = extractelement <8 x i1> [[TMP22]], i32 2
-; DEFAULT-NEXT:    br i1 [[TMP38]], label %[[PRED_STORE_IF16:.*]], label %[[PRED_STORE_CONTINUE17:.*]]
-; DEFAULT:       [[PRED_STORE_IF16]]:
+; DEFAULT-NEXT:    br i1 [[TMP38]], label %[[PRED_STORE_IF14:.*]], label %[[PRED_STORE_CONTINUE15:.*]]
+; DEFAULT:       [[PRED_STORE_IF14]]:
 ; DEFAULT-NEXT:    [[TMP39:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 2
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP39]], align 4
 ; DEFAULT-NEXT:    [[TMP40:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 2
@@ -1094,11 +1091,11 @@ define void @test_conditional_interleave_group (ptr noalias %src.1, ptr noalias
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP43]], align 4
 ; DEFAULT-NEXT:    [[TMP44:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 2
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP44]], align 4
-; DEFAULT-NEXT:    br label %[[PRED_STORE_CONTINUE17]]
-; DEFAULT:       [[PRED_STORE_CONTINUE17]]:
+; DEFAULT-NEXT:    br label %[[PRED_STORE_CONTINUE15]]
+; DEFAULT:       [[PRED_STORE_CONTINUE15]]:
 ; DEFAULT-NEXT:    [[TMP45:%.*]] = extractelement <8 x i1> [[TMP22]], i32 3
-; DEFAULT-NEXT:    br i1 [[TMP45]], label %[[PRED_STORE_IF18:.*]], label %[[PRED_STORE_CONTINUE19:.*]]
-; DEFAULT:       [[PRED_STORE_IF18]]:
+; DEFAULT-NEXT:    br i1 [[TMP45]], label %[[PRED_STORE_IF16:.*]], label %[[PRED_STORE_CONTINUE17:.*]]
+; DEFAULT:       [[PRED_STORE_IF16]]:
 ; DEFAULT-NEXT:    [[TMP46:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 3
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP46]], align 4
 ; DEFAULT-NEXT:    [[TMP47:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 3
@@ -1109,11 +1106,11 @@ define void @test_conditional_interleave_group (ptr noalias %src.1, ptr noalias
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP50]], align 4
 ; DEFAULT-NEXT:    [[TMP51:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 3
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP51]], align 4
-; DEFAULT-NEXT:    br label %[[PRED_STORE_CONTINUE19]]
-; DEFAULT:       [[PRED_STORE_CONTINUE19]]:
+; DEFAULT-NEXT:    br label %[[PRED_STORE_CONTINUE17]]
+; DEFAULT:       [[PRED_STORE_CONTINUE17]]:
 ; DEFAULT-NEXT:    [[TMP52:%.*]] = extractelement <8 x i1> [[TMP22]], i32 4
-; DEFAULT-NEXT:    br i1 [[TMP52]], label %[[PRED_STORE_IF20:.*]], label %[[PRED_STORE_CONTINUE21:.*]]
-; DEFAULT:       [[PRED_STORE_IF20]]:
+; DEFAULT-NEXT:    br i1 [[TMP52]], label %[[PRED_STORE_IF18:.*]], label %[[PRED_STORE_CONTINUE19:.*]]
+; DEFAULT:       [[PRED_STORE_IF18]]:
 ; DEFAULT-NEXT:    [[TMP53:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 4
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP53]], align 4
 ; DEFAULT-NEXT:    [[TMP54:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 4
@@ -1124,11 +1121,11 @@ define void @test_conditional_interleave_group (ptr noalias %src.1, ptr noalias
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP57]], align 4
 ; DEFAULT-NEXT:    [[TMP58:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 4
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP58]], align 4
-; DEFAULT-NEXT:    br label %[[PRED_STORE_CONTINUE21]]
-; DEFAULT:       [[PRED_STORE_CONTINUE21]]:
+; DEFAULT-NEXT:    br label %[[PRED_STORE_CONTINUE19]]
+; DEFAULT:       [[PRED_STORE_CONTINUE19]]:
 ; DEFAULT-NEXT:    [[TMP59:%.*]] = extractelement <8 x i1> [[TMP22]], i32 5
-; DEFAULT-NEXT:    br i1 [[TMP59]], label %[[PRED_STORE_IF22:.*]], label %[[PRED_STORE_CONTINUE23:.*]]
-; DEFAULT:       [[PRED_STORE_IF22]]:
+; DEFAULT-NEXT:    br i1 [[TMP59]], label %[[PRED_STORE_IF20:.*]], label %[[PRED_STORE_CONTINUE21:.*]]
+; DEFAULT:       [[PRED_STORE_IF20]]:
 ; DEFAULT-NEXT:    [[TMP60:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 5
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP60]], align 4
 ; DEFAULT-NEXT:    [[TMP61:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 5
@@ -1139,11 +1136,11 @@ define void @test_conditional_interleave_group (ptr noalias %src.1, ptr noalias
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP64]], align 4
 ; DEFAULT-NEXT:    [[TMP65:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 5
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP65]], align 4
-; DEFAULT-NEXT:    br label %[[PRED_STORE_CONTINUE23]]
-; DEFAULT:       [[PRED_STORE_CONTINUE23]]:
+; DEFAULT-NEXT:    br label %[[PRED_STORE_CONTINUE21]]
+; DEFAULT:       [[PRED_STORE_CONTINUE21]]:
 ; DEFAULT-NEXT:    [[TMP66:%.*]] = extractelement <8 x i1> [[TMP22]], i32 6
-; DEFAULT-NEXT:    br i1 [[TMP66]], label %[[PRED_STORE_IF24:.*]], label %[[PRED_STORE_CONTINUE25:.*]]
-; DEFAULT:       [[PRED_STORE_IF24]]:
+; DEFAULT-NEXT:    br i1 [[TMP66]], label %[[PRED_STORE_IF22:.*]], label %[[PRED_STORE_CONTINUE23:.*]]
+; DEFAULT:       [[PRED_STORE_IF22]]:
 ; DEFAULT-NEXT:    [[TMP67:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 6
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP67]], align 4
 ; DEFAULT-NEXT:    [[TMP68:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 6
@@ -1154,11 +1151,11 @@ define void @test_conditional_interleave_group (ptr noalias %src.1, ptr noalias
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP71]], align 4
 ; DEFAULT-NEXT:    [[TMP72:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 6
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP72]], align 4
-; DEFAULT-NEXT:    br label %[[PRED_STORE_CONTINUE25]]
-; DEFAULT:       [[PRED_STORE_CONTINUE25]]:
+; DEFAULT-NEXT:    br label %[[PRED_STORE_CONTINUE23]]
+; DEFAULT:       [[PRED_STORE_CONTINUE23]]:
 ; DEFAULT-NEXT:    [[TMP73:%.*]] = extractelement <8 x i1> [[TMP22]], i32 7
-; DEFAULT-NEXT:    br i1 [[TMP73]], label %[[PRED_STORE_IF26:.*]], label %[[PRED_STORE_CONTINUE27]]
-; DEFAULT:       [[PRED_STORE_IF26]]:
+; DEFAULT-NEXT:    br i1 [[TMP73]], label %[[PRED_STORE_IF24:.*]], label %[[PRED_STORE_CONTINUE25]]
+; DEFAULT:       [[PRED_STORE_IF24]]:
 ; DEFAULT-NEXT:    [[TMP74:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 7
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP74]], align 4
 ; DEFAULT-NEXT:    [[TMP75:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 7
@@ -1169,8 +1166,8 @@ define void @test_conditional_interleave_group (ptr noalias %src.1, ptr noalias
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP78]], align 4
 ; DEFAULT-NEXT:    [[TMP79:%.*]] = extractelement <8 x ptr> [[TMP23]], i32 7
 ; DEFAULT-NEXT:    store float 0.000000e+00, ptr [[TMP79]], align 4
-; DEFAULT-NEXT:    br label %[[PRED_STORE_CONTINUE27]]
-; DEFAULT:       [[PRED_STORE_CONTINUE27]]:
+; DEFAULT-NEXT:    br label %[[PRED_STORE_CONTINUE25]]
+; DEFAULT:       [[PRED_STORE_CONTINUE25]]:
 ; DEFAULT-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
 ; DEFAULT-NEXT:    [[VEC_IND_NEXT]] = add <8 x i64> [[VEC_IND]], splat (i64 8)
 ; DEFAULT-NEXT:    [[TMP80:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
@@ -1251,9 +1248,9 @@ define void @test_conditional_interleave_group (ptr noalias %src.1, ptr noalias
 ; PRED-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i64(i64 0, i64 [[TMP0]])
 ; PRED-NEXT:    br label %[[VECTOR_BODY:.*]]
 ; PRED:       [[VECTOR_BODY]]:
-; PRED-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE27:.*]] ]
-; PRED-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <8 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[PRED_STORE_CONTINUE27]] ]
-; PRED-NEXT:    [[VEC_IND:%.*]] = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[PRED_STORE_CONTINUE27]] ]
+; PRED-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE25:.*]] ]
+; PRED-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <8 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[PRED_STORE_CONTINUE25]] ]
+; PRED-NEXT:    [[VEC_IND:%.*]] = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[PRED_STORE_CONTINUE25]] ]
 ; PRED-NEXT:    [[TMP18:%.*]] = load float, ptr [[SRC_1]], align 4
 ; PRED-NEXT:    [[BROADCAST_SPLATINSERT8:%.*]] = insertelement <8 x float> poison, float [[TMP18]], i64 0
 ; PRED-NEXT:    [[BROADCAST_SPLAT9:%.*]] = shufflevector <8 x float> [[BROADCAST_SPLATINSERT8]], <8 x float> poison, <8 x i32> zeroinitializer
@@ -1266,10 +1263,7 @@ define void @test_conditional_interleave_group (ptr noalias %src.1, ptr noalias
 ; PRED-NEXT:    [[BROADCAST_SPLATINSERT10:%.*]] = insertelement <8 x float> poison, float [[TMP22]], i64 0
 ; PRED-NEXT:    [[BROADCAST_SPLAT11:%.*]] = shufflevector <8 x float> [[BROADCAST_SPLATINSERT10]], <8 x float> poison, <8 x i32> zeroinitializer
 ; PRED-NEXT:    [[TMP23:%.*]] = call <8 x float> @llvm.fmuladd.v8f32(<8 x float> [[BROADCAST_SPLAT11]], <8 x float> zeroinitializer, <8 x float> [[TMP21]])
-; PRED-NEXT:    [[TMP24:%.*]] = load float, ptr [[SRC_3]], align 4
-; PRED-NEXT:    [[BROADCAST_SPLATINSERT12:%.*]] = insertelement <8 x float> poison, float [[TMP24]], i64 0
-; PRED-NEXT:    [[BROADCAST_SPLAT13:%.*]] = shufflevector <8 x float> [[BROADCAST_SPLATINSERT12]], <8 x float> poison, <8 x i32> zeroinitializer
-; PRED-NEXT:    [[...
[truncated]

github-actions bot commented Aug 4, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@artagnon
Contributor Author

artagnon commented Aug 4, 2025

Current status: managed to limit the cse to some recipes. All tests have been updated, and there is no crash. This patch is now ready to review.
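
The diff above restricts the pass to an allow-list of recipe kinds (`canHandle`) and skips a match when the inferred scalar types differ. A rough sketch of that filter-then-verify lookup in plain C++ follows. All names (`ToyKind`, `ToyDef`, `canHandleToy`, `lookupOrRegister`) are hypothetical, `ResultBits` is only a stand-in for the inferred scalar type, and the opcode numbers used with it are arbitrary.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <tuple>
#include <vector>

enum class ToyKind { Widen, Cast, Load };

// Hypothetical single-operand definition (not an LLVM type).
struct ToyDef {
  ToyKind Kind;
  unsigned Opcode;
  int Operand;         // index of the single operand
  unsigned ResultBits; // stand-in for the inferred scalar type
};

// Allow-list: only kinds that are known to be safe to deduplicate.
bool canHandleToy(const ToyDef &D) {
  return D.Kind == ToyKind::Widen || D.Kind == ToyKind::Cast;
}

using ToyKey = std::tuple<ToyKind, unsigned, int>;

// Returns the index of an earlier equivalent definition to reuse, or
// std::nullopt if Defs[I] must be kept.
std::optional<int> lookupOrRegister(std::map<ToyKey, int> &Seen,
                                    const std::vector<ToyDef> &Defs, int I) {
  assert(I >= 0 && I < static_cast<int>(Defs.size()));
  const ToyDef &D = Defs[I];
  if (!canHandleToy(D))
    return std::nullopt;
  ToyKey Key{D.Kind, D.Opcode, D.Operand};
  auto It = Seen.find(Key);
  if (It == Seen.end()) {
    Seen.emplace(Key, I);
    return std::nullopt;
  }
  // Same kind, opcode and operand, but a different result type (for example
  // two truncations to different widths): not interchangeable.
  if (Defs[It->second].ResultBits != D.ResultBits)
    return std::nullopt;
  return It->second;
}
```

As in the patch, a definition that matches on the key but fails the type check is neither replaced nor registered, so the first definition seen for a key stays the candidate for later matches.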

Contributor

@fhahn fhahn left a comment

Looks like there are a number of rename-only changes; could you go over the test diffs and remove those to reduce the diff? I highlighted some but didn't go through the whole diff.

@artagnon artagnon removed the request for review from nikic August 19, 2025 10:11
Contributor

@lukel97 lukel97 left a comment

LGTM, I checked the diffs and they seem fine, and I can't think of anything else we would need to hash in the list in canHandle. As always though I'll defer to Florian for the final review :)
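
On the question of what needs to be hashed: the general contract for a map keyed on the contents behind a pointer is that two keys that compare equal must produce the same hash, which is what the `assert` in `isEqual` in the diff checks. The sketch below illustrates that contract with `std::unordered_map` in place of `DenseMap`. `ToyRecipe`, `ToyRecipeHash`, `ToyRecipeEqual`, and `findOrInsert` are hypothetical names, and the hash-mixing constant is just a common choice, not something taken from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical recipe (not an LLVM type).
struct ToyRecipe {
  unsigned Kind;
  unsigned Opcode;
  std::vector<const ToyRecipe *> Operands;
};

// Hash the data behind the pointer rather than the pointer itself.
struct ToyRecipeHash {
  std::size_t operator()(const ToyRecipe *R) const {
    std::size_t H = std::hash<unsigned>{}(R->Kind);
    auto Mix = [&H](std::size_t V) {
      H ^= V + 0x9e3779b9 + (H << 6) + (H >> 2);
    };
    Mix(std::hash<unsigned>{}(R->Opcode));
    for (const ToyRecipe *Op : R->Operands)
      Mix(std::hash<const ToyRecipe *>{}(Op));
    return H;
  }
};

// Compare exactly the fields that the hash covers, so that equal keys
// always hash alike.
struct ToyRecipeEqual {
  bool operator()(const ToyRecipe *L, const ToyRecipe *R) const {
    return L->Kind == R->Kind && L->Opcode == R->Opcode &&
           L->Operands == R->Operands;
  }
};

using ToyCSEMap = std::unordered_map<const ToyRecipe *, const ToyRecipe *,
                                     ToyRecipeHash, ToyRecipeEqual>;

// Returns an earlier structurally identical recipe if one is registered;
// otherwise registers R and returns it.
const ToyRecipe *findOrInsert(ToyCSEMap &Map, const ToyRecipe *R) {
  assert(R && "expected a non-null recipe");
  return Map.try_emplace(R, R).first->second;
}
```

One general caveat with this design is that a key's hashed fields must not change while the key is in the map, since the stored hash would then no longer match the key's contents.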

@artagnon artagnon force-pushed the vplan-cse branch 4 times, most recently from 039adf6 to 4ff0326 Compare August 26, 2025 13:12
@artagnon
Contributor Author

Gentle ping.

Contributor

@fhahn fhahn left a comment

This should be fine now I think. Will run some final checks today

Contributor

@fhahn fhahn left a comment

LGTM, thanks

@artagnon artagnon merged commit d8fd511 into llvm:main Sep 2, 2025
9 checks passed
@artagnon artagnon deleted the vplan-cse branch September 2, 2025 11:23
@llvm-ci
Collaborator

llvm-ci commented Sep 2, 2025

LLVM Buildbot has detected a new failure on builder llvm-clang-aarch64-darwin running on doug-worker-4 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/190/builds/26522

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'Clang-Unit :: ./AllClangUnitTests/21/48' FAILED ********************
Script(shard):
--
GTEST_OUTPUT=json:/Volumes/RAMDisk/buildbot-root/aarch64-darwin/build/tools/clang/unittests/./AllClangUnitTests-Clang-Unit-80732-21-48.json GTEST_SHUFFLE=0 GTEST_TOTAL_SHARDS=48 GTEST_SHARD_INDEX=21 /Volumes/RAMDisk/buildbot-root/aarch64-darwin/build/tools/clang/unittests/./AllClangUnitTests
--

Script:
--
/Volumes/RAMDisk/buildbot-root/aarch64-darwin/build/tools/clang/unittests/./AllClangUnitTests --gtest_filter=TimeProfilerTest.ConstantEvaluationCxx20
--
/Users/buildbot/buildbot-root/aarch64-darwin/llvm-project/clang/unittests/Support/TimeProfilerTest.cpp:247: Failure
Expected equality of these values:
  R"(
Frontend (test.cc)
| ParseDeclarationOrFunctionDefinition (test.cc:2:1)
| ParseDeclarationOrFunctionDefinition (test.cc:6:1)
| | ParseFunctionDefinition (slow_func)
| | | EvaluateAsRValue (<test.cc:8:21>)
| | | EvaluateForOverflow (<test.cc:8:21, col:25>)
| | | EvaluateForOverflow (<test.cc:8:30, col:32>)
| | | EvaluateAsRValue (<test.cc:9:14>)
| | | EvaluateForOverflow (<test.cc:9:9, col:14>)
| | | isPotentialConstantExpr (slow_namespace::slow_func)
| | | EvaluateAsBooleanCondition (<test.cc:8:21, col:25>)
| | | | EvaluateAsRValue (<test.cc:8:21, col:25>)
| | | EvaluateAsBooleanCondition (<test.cc:8:21, col:25>)
| | | | EvaluateAsRValue (<test.cc:8:21, col:25>)
| ParseDeclarationOrFunctionDefinition (test.cc:16:1)
| | ParseFunctionDefinition (slow_test)
| | | EvaluateAsInitializer (slow_value)
| | | EvaluateAsConstantExpr (<test.cc:17:33, col:59>)
| | | EvaluateAsConstantExpr (<test.cc:18:11, col:37>)
| ParseDeclarationOrFunctionDefinition (test.cc:22:1)
| | EvaluateAsConstantExpr (<test.cc:23:31, col:57>)
| | EvaluateAsRValue (<test.cc:22:14, line:23:58>)
| ParseDeclarationOrFunctionDefinition (test.cc:25:1)
| | EvaluateAsInitializer (slow_init_list)
| PerformPendingInstantiations
)"
    Which is: "\nFrontend (test.cc)\n| ParseDeclarationOrFunctionDefinition (test.cc:2:1)\n| ParseDeclarationOrFunctionDefinition (test.cc:6:1)\n| | ParseFunctionDefinition (slow_func)\n| | | EvaluateAsRValue (<test.cc:8:21>)\n| | | EvaluateForOverflow (<test.cc:8:21, col:25>)\n| | | EvaluateForOverflow (<test.cc:8:30, col:32>)\n| | | EvaluateAsRValue (<test.cc:9:14>)\n| | | EvaluateForOverflow (<test.cc:9:9, col:14>)\n| | | isPotentialConstantExpr (slow_namespace::slow_func)\n| | | EvaluateAsBooleanCondition (<test.cc:8:21, col:25>)\n| | | | EvaluateAsRValue (<test.cc:8:21, col:25>)\n| | | EvaluateAsBooleanCondition (<test.cc:8:21, col:25>)\n| | | | EvaluateAsRValue (<test.cc:8:21, col:25>)\n| ParseDeclarationOrFunctionDefinition (test.cc:16:1)\n| | ParseFunctionDefinition (slow_test)\n| | | EvaluateAsInitializer (slow_value)\n| | | EvaluateAsConstantExpr (<test.cc:17:33, col:59>)\n| | | EvaluateAsConstantExpr (<test.cc:18:11, col:37>)\n| ParseDeclarationOrFunctionDefinition (test.cc:22:1)\n| | EvaluateAsConstantExpr (<test.cc:23:31, col:57>)\n| | EvaluateAsRValue (<test.cc:22:14, line:23:58>)\n| ParseDeclarationOrFunctionDefinition (test.cc:25:1)\n| | EvaluateAsInitializer (slow_init_list)\n| PerformPendingInstantiations\n"
  buildTraceGraph(Json)
    Which is: "\nFrontend (test.cc)\n| ParseDeclarationOrFunctionDefinition (test.cc:2:1)\n| ParseDeclarationOrFunctionDefinition (test.cc:6:1)\n| | ParseFunctionDefinition (slow_func)\n| | | EvaluateAsRValue (<test.cc:8:21>)\n| | | EvaluateForOverflow (<test.cc:8:21, col:25>)\n| | | EvaluateForOverflow (<test.cc:8:30, col:32>)\n| | | EvaluateAsRValue (<test.cc:9:14>)\n| | | EvaluateForOverflow (<test.cc:9:9, col:14>)\n| | | isPotentialConstantExpr (slow_namespace::slow_func)\n| | | EvaluateAsBooleanCondition (<test.cc:8:21, col:25>)\n| | | | EvaluateAsRValue (<test.cc:8:21, col:25>)\n| | | | EvaluateAsBooleanCondition (<test.cc:8:21, col:25>)\n| | | | | EvaluateAsRValue (<test.cc:8:21, col:25>)\n| ParseDeclarationOrFunctionDefinition (test.cc:16:1)\n| | ParseFunctionDefinition (slow_test)\n| | | EvaluateAsInitializer (slow_value)\n| | | EvaluateAsConstantExpr (<test.cc:17:33, col:59>)\n| | | EvaluateAsConstantExpr (<test.cc:18:11, col:37>)\n| ParseDeclarationOrFunctionDefinition (test.cc:22:1)\n| | EvaluateAsConstantExpr (<test.cc:23:31, col:57>)\n| | EvaluateAsRValue (<test.cc:22:14, line:23:58>)\n| ParseDeclarationOrFunctionDefinition (test.cc:25:1)\n| | EvaluateAsInitializer (slow_init_list)\n| PerformPendingInstantiations\n"
With diff:
@@ -12,6 +12,6 @@
 | | | EvaluateAsBooleanCondition (<test.cc:8:21, col:25>)
 | | | | EvaluateAsRValue (<test.cc:8:21, col:25>)
-| | | EvaluateAsBooleanCondition (<test.cc:8:21, col:25>)
-| | | | EvaluateAsRValue (<test.cc:8:21, col:25>)
+| | | | EvaluateAsBooleanCondition (<test.cc:8:21, col:25>)
+| | | | | EvaluateAsRValue (<test.cc:8:21, col:25>)
...

@akuegel
Member

akuegel commented Sep 16, 2025

@artagnon it looks like there might be a problem with the new CSE pass. We tracked down a test failure to this revision, and I dumped the function after the pass, both with CSE enabled and with CSE disabled. Here are the two dumps:

With CSE enabled:

define noalias noundef ptr @transpose_copy_fusion(ptr readonly captures(none) %0) local_unnamed_addr #0 {
  %2 = getelementptr inbounds nuw i8, ptr %0, i64 24
  %3 = load ptr, ptr %2, align 8, !invariant.load !3
  %4 = load ptr, ptr %3, align 8, !invariant.load !3, !dereferenceable !4
  %5 = getelementptr inbounds nuw i8, ptr %3, i64 16
  %6 = load ptr, ptr %5, align 8, !invariant.load !3, !dereferenceable !4
  tail call void @llvm.experimental.noalias.scope.decl(metadata !5)
  tail call void @llvm.experimental.noalias.scope.decl(metadata !8)
  br i1 false, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %1
  br label %vector.body

vector.body:                                      ; preds = %vector.ph
  %7 = getelementptr i8, ptr %4, i64 12
  %8 = getelementptr float, ptr %7, i32 0
  %9 = getelementptr float, ptr %8, i32 -3
  %wide.load = load <4 x float>, ptr %9, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  %reverse = shufflevector <4 x float> %wide.load, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %10 = getelementptr float, ptr %4, i64 4
  %11 = getelementptr i8, ptr %10, i64 12
  %12 = getelementptr float, ptr %11, i32 0
  %13 = getelementptr float, ptr %12, i32 -3
  %wide.load1 = load <4 x float>, ptr %13, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  %reverse2 = shufflevector <4 x float> %wide.load1, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %14 = getelementptr float, ptr %4, i64 8
  %15 = getelementptr i8, ptr %14, i64 12
  %16 = getelementptr float, ptr %15, i32 0
  %17 = getelementptr float, ptr %16, i32 -3
  %wide.load3 = load <4 x float>, ptr %17, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  %reverse4 = shufflevector <4 x float> %wide.load3, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %18 = getelementptr float, ptr %4, i64 12
  %19 = getelementptr i8, ptr %18, i64 12
  %20 = getelementptr float, ptr %19, i32 0
  %21 = getelementptr float, ptr %20, i32 -3
  %wide.load5 = load <4 x float>, ptr %21, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  %reverse6 = shufflevector <4 x float> %wide.load5, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %22 = shufflevector <4 x float> %reverse, <4 x float> %reverse2, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %23 = shufflevector <4 x float> %reverse4, <4 x float> %reverse6, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %24 = shufflevector <8 x float> %22, <8 x float> %23, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %interleaved.vec = shufflevector <16 x float> %24, <16 x float> poison, <16 x i32> <i32 0, i32 4, i32 8, i32 12, i32 1, i32 5, i32 9, i32 13, i32 2, i32 6, i32 10, i32 14, i32 3, i32 7, i32 11, i32 15>
  store <16 x float> %interleaved.vec, ptr %6, align 4, !alias.scope !8, !noalias !5
  br label %middle.block

middle.block:                                     ; preds = %vector.body
  br label %transpose_copy_fusion_wrapped.exit

scalar.ph:                                        ; preds = %1
  br label %.preheader

.preheader:                                       ; preds = %scalar.ph, %.preheader
  %25 = phi i64 [ 0, %scalar.ph ], [ %46, %.preheader ]
  %.idx = shl i64 %25, 4
  %26 = getelementptr i8, ptr %6, i64 %.idx
  %27 = sub nsw i64 0, %25
  %28 = getelementptr float, ptr %4, i64 %27
  %29 = getelementptr i8, ptr %28, i64 12
  %30 = load float, ptr %29, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  store float %30, ptr %26, align 4, !alias.scope !8, !noalias !5
  %31 = sub nuw nsw i64 4, %25
  %32 = getelementptr float, ptr %4, i64 %31
  %33 = getelementptr i8, ptr %32, i64 12
  %34 = load float, ptr %33, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  %35 = getelementptr i8, ptr %26, i64 4
  store float %34, ptr %35, align 4, !alias.scope !8, !noalias !5
  %36 = sub nuw nsw i64 8, %25
  %37 = getelementptr float, ptr %4, i64 %36
  %38 = getelementptr i8, ptr %37, i64 12
  %39 = load float, ptr %38, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  %40 = getelementptr i8, ptr %26, i64 8
  store float %39, ptr %40, align 4, !alias.scope !8, !noalias !5
  %41 = sub nuw nsw i64 12, %25
  %42 = getelementptr float, ptr %4, i64 %41
  %43 = getelementptr i8, ptr %42, i64 12
  %44 = load float, ptr %43, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  %45 = getelementptr i8, ptr %26, i64 12
  store float %44, ptr %45, align 4, !alias.scope !8, !noalias !5
  %46 = add nuw nsw i64 %25, 1
  %exitcond.not = icmp eq i64 %46, 4
  br i1 %exitcond.not, label %transpose_copy_fusion_wrapped.exit, label %.preheader, !llvm.loop !10

transpose_copy_fusion_wrapped.exit:               ; preds = %middle.block, %.preheader
  ret ptr null
}

; Function Attrs: uwtable
define noalias noundef ptr @convert_element_type.1_kernel(ptr readonly captures(none) %0) local_unnamed_addr #0 {
  %args_gep = getelementptr inbounds nuw i8, ptr %0, i64 24
  %args = load ptr, ptr %args_gep, align 8
  %arg0 = load ptr, ptr %args, align 8, !invariant.load !3, !dereferenceable !4, !align !5
  %arg1_gep = getelementptr i8, ptr %args, i64 16
  %arg1 = load ptr, ptr %arg1_gep, align 8, !invariant.load !3, !dereferenceable !5, !align !5
  br label %convert_element_type.1.loop_header.dim.1.preheader

convert_element_type.1.loop_header.dim.1.preheader: ; preds = %1, %convert_element_type.1.loop_header.dim.1.preheader
  %convert_element_type.1.invar_address.dim.0.06 = phi i64 [ 0, %1 ], [ %invar.inc, %convert_element_type.1.loop_header.dim.1.preheader ]
  %.split = getelementptr inbounds nuw [4 x float], ptr %arg0, i64 %convert_element_type.1.invar_address.dim.0.06
  %.split4 = getelementptr inbounds nuw [4 x bfloat], ptr %arg1, i64 %convert_element_type.1.invar_address.dim.0.06
  %2 = load float, ptr %.split, align 4, !invariant.load !3, !noalias !6
  %3 = tail call bfloat @xla.fptrunc.f32.to.bf16(float %2) #1
  store bfloat %3, ptr %.split4, align 2, !alias.scope !6
  %4 = getelementptr inbounds nuw i8, ptr %.split, i64 4
  %5 = load float, ptr %4, align 4, !invariant.load !3, !noalias !6
  %6 = tail call bfloat @xla.fptrunc.f32.to.bf16(float %5) #1
  %7 = getelementptr inbounds nuw i8, ptr %.split4, i64 2
  store bfloat %6, ptr %7, align 2, !alias.scope !6
  %8 = getelementptr inbounds nuw i8, ptr %.split, i64 8
  %9 = load float, ptr %8, align 4, !invariant.load !3, !noalias !6
  %10 = tail call bfloat @xla.fptrunc.f32.to.bf16(float %9) #1
  %11 = getelementptr inbounds nuw i8, ptr %.split4, i64 4
  store bfloat %10, ptr %11, align 2, !alias.scope !6
  %12 = getelementptr inbounds nuw i8, ptr %.split, i64 12
  %13 = load float, ptr %12, align 4, !invariant.load !3, !noalias !6
  %14 = tail call bfloat @xla.fptrunc.f32.to.bf16(float %13) #1
  %15 = getelementptr inbounds nuw i8, ptr %.split4, i64 6
  store bfloat %14, ptr %15, align 2, !alias.scope !6
  %invar.inc = add nuw nsw i64 %convert_element_type.1.invar_address.dim.0.06, 1
  %exitcond = icmp eq i64 %invar.inc, 4
  br i1 %exitcond, label %return, label %convert_element_type.1.loop_header.dim.1.preheader, !llvm.loop !9

return:                                           ; preds = %convert_element_type.1.loop_header.dim.1.preheader
  ret ptr null
}

With CSE disabled:

define noalias noundef ptr @transpose_copy_fusion(ptr readonly captures(none) %0) local_unnamed_addr #0 {
  %2 = getelementptr inbounds nuw i8, ptr %0, i64 24
  %3 = load ptr, ptr %2, align 8, !invariant.load !3
  %4 = load ptr, ptr %3, align 8, !invariant.load !3, !dereferenceable !4
  %5 = getelementptr inbounds nuw i8, ptr %3, i64 16
  %6 = load ptr, ptr %5, align 8, !invariant.load !3, !dereferenceable !4
  tail call void @llvm.experimental.noalias.scope.decl(metadata !5)
  tail call void @llvm.experimental.noalias.scope.decl(metadata !8)
  br i1 false, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %1
  br label %vector.body

vector.body:                                      ; preds = %vector.ph
  %7 = getelementptr i8, ptr %4, i64 12
  %8 = getelementptr float, ptr %7, i32 0
  %9 = getelementptr float, ptr %8, i32 -3
  %wide.load = load <4 x float>, ptr %9, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  %reverse = shufflevector <4 x float> %wide.load, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %10 = getelementptr float, ptr %4, i64 4
  %11 = getelementptr i8, ptr %10, i64 12
  %12 = getelementptr float, ptr %11, i32 0
  %13 = getelementptr float, ptr %12, i32 -3
  %wide.load1 = load <4 x float>, ptr %13, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  %reverse2 = shufflevector <4 x float> %wide.load1, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %14 = getelementptr float, ptr %4, i64 8
  %15 = getelementptr i8, ptr %14, i64 12
  %16 = getelementptr float, ptr %15, i32 0
  %17 = getelementptr float, ptr %16, i32 -3
  %wide.load3 = load <4 x float>, ptr %17, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  %reverse4 = shufflevector <4 x float> %wide.load3, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %18 = getelementptr i8, ptr %7, i64 12
  %19 = getelementptr float, ptr %18, i32 0
  %20 = getelementptr float, ptr %19, i32 -3
  %wide.load5 = load <4 x float>, ptr %20, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  %reverse6 = shufflevector <4 x float> %wide.load5, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %21 = shufflevector <4 x float> %reverse, <4 x float> %reverse2, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %22 = shufflevector <4 x float> %reverse4, <4 x float> %reverse6, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %23 = shufflevector <8 x float> %21, <8 x float> %22, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %interleaved.vec = shufflevector <16 x float> %23, <16 x float> poison, <16 x i32> <i32 0, i32 4, i32 8, i32 12, i32 1, i32 5, i32 9, i32 13, i32 2, i32 6, i32 10, i32 14, i32 3, i32 7, i32 11, i32 15>
  store <16 x float> %interleaved.vec, ptr %6, align 4, !alias.scope !8, !noalias !5
  br label %middle.block

middle.block:                                     ; preds = %vector.body
  br label %transpose_copy_fusion_wrapped.exit

scalar.ph:                                        ; preds = %1
  br label %.preheader

.preheader:                                       ; preds = %scalar.ph, %.preheader
  %24 = phi i64 [ 0, %scalar.ph ], [ %45, %.preheader ]
  %.idx = shl i64 %24, 4
  %25 = getelementptr i8, ptr %6, i64 %.idx
  %26 = sub nsw i64 0, %24
  %27 = getelementptr float, ptr %4, i64 %26
  %28 = getelementptr i8, ptr %27, i64 12
  %29 = load float, ptr %28, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  store float %29, ptr %25, align 4, !alias.scope !8, !noalias !5
  %30 = sub nuw nsw i64 4, %24
  %31 = getelementptr float, ptr %4, i64 %30
  %32 = getelementptr i8, ptr %31, i64 12
  %33 = load float, ptr %32, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  %34 = getelementptr i8, ptr %25, i64 4
  store float %33, ptr %34, align 4, !alias.scope !8, !noalias !5
  %35 = sub nuw nsw i64 8, %24
  %36 = getelementptr float, ptr %4, i64 %35
  %37 = getelementptr i8, ptr %36, i64 12
  %38 = load float, ptr %37, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  %39 = getelementptr i8, ptr %25, i64 8
  store float %38, ptr %39, align 4, !alias.scope !8, !noalias !5
  %40 = sub nuw nsw i64 12, %24
  %41 = getelementptr float, ptr %4, i64 %40
  %42 = getelementptr i8, ptr %41, i64 12
  %43 = load float, ptr %42, align 4, !invariant.load !3, !alias.scope !5, !noalias !8
  %44 = getelementptr i8, ptr %25, i64 12
  store float %43, ptr %44, align 4, !alias.scope !8, !noalias !5
  %45 = add nuw nsw i64 %24, 1
  %exitcond.not = icmp eq i64 %45, 4
  br i1 %exitcond.not, label %transpose_copy_fusion_wrapped.exit, label %.preheader, !llvm.loop !10

transpose_copy_fusion_wrapped.exit:               ; preds = %middle.block, %.preheader
  ret ptr null
}

; Function Attrs: uwtable
define noalias noundef ptr @convert_element_type.1_kernel(ptr readonly captures(none) %0) local_unnamed_addr #0 {
  %args_gep = getelementptr inbounds nuw i8, ptr %0, i64 24
  %args = load ptr, ptr %args_gep, align 8
  %arg0 = load ptr, ptr %args, align 8, !invariant.load !3, !dereferenceable !4, !align !5
  %arg1_gep = getelementptr i8, ptr %args, i64 16
  %arg1 = load ptr, ptr %arg1_gep, align 8, !invariant.load !3, !dereferenceable !5, !align !5
  br label %convert_element_type.1.loop_header.dim.1.preheader

convert_element_type.1.loop_header.dim.1.preheader: ; preds = %1, %convert_element_type.1.loop_header.dim.1.preheader
  %convert_element_type.1.invar_address.dim.0.06 = phi i64 [ 0, %1 ], [ %invar.inc, %convert_element_type.1.loop_header.dim.1.preheader ]
  %.split = getelementptr inbounds nuw [4 x float], ptr %arg0, i64 %convert_element_type.1.invar_address.dim.0.06
  %.split4 = getelementptr inbounds nuw [4 x bfloat], ptr %arg1, i64 %convert_element_type.1.invar_address.dim.0.06
  %2 = load float, ptr %.split, align 4, !invariant.load !3, !noalias !6
  %3 = tail call bfloat @xla.fptrunc.f32.to.bf16(float %2) #1
  store bfloat %3, ptr %.split4, align 2, !alias.scope !6
  %4 = getelementptr inbounds nuw i8, ptr %.split, i64 4
  %5 = load float, ptr %4, align 4, !invariant.load !3, !noalias !6
  %6 = tail call bfloat @xla.fptrunc.f32.to.bf16(float %5) #1
  %7 = getelementptr inbounds nuw i8, ptr %.split4, i64 2
  store bfloat %6, ptr %7, align 2, !alias.scope !6
  %8 = getelementptr inbounds nuw i8, ptr %.split, i64 8
  %9 = load float, ptr %8, align 4, !invariant.load !3, !noalias !6
  %10 = tail call bfloat @xla.fptrunc.f32.to.bf16(float %9) #1
  %11 = getelementptr inbounds nuw i8, ptr %.split4, i64 4
  store bfloat %10, ptr %11, align 2, !alias.scope !6
  %12 = getelementptr inbounds nuw i8, ptr %.split, i64 12
  %13 = load float, ptr %12, align 4, !invariant.load !3, !noalias !6
  %14 = tail call bfloat @xla.fptrunc.f32.to.bf16(float %13) #1
  %15 = getelementptr inbounds nuw i8, ptr %.split4, i64 6
  store bfloat %14, ptr %15, align 2, !alias.scope !6
  %invar.inc = add nuw nsw i64 %convert_element_type.1.invar_address.dim.0.06, 1
  %exitcond = icmp eq i64 %invar.inc, 4
  br i1 %exitcond, label %return, label %convert_element_type.1.loop_header.dim.1.preheader, !llvm.loop !9

return:                                           ; preds = %convert_element_type.1.loop_header.dim.1.preheader
  ret ptr null
}

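You can see that with CSE enabled, the pass removes the line %18 = getelementptr float, ptr %4, i64 12
and changes the next one to %18 = getelementptr i8, ptr %7, i64 12 instead of %19 = getelementptr i8, ptr %18, i64 12. (The dump that shows this is the second one above, so the two labels appear to be swapped.)

I believe this is a bug: the pass seems to CSE the removed line with %7 = getelementptr i8, ptr %4, i64 12, but that GEP has a different source element type (i8 rather than float), so it computes a different address (a 12-byte offset from %4 instead of 48 bytes).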
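
For readers following along, here is a minimal sketch of the failure mode described above. It is not the VPlan code from this PR: `ToyGEP`, `byteOffset`, and `toyCSE` are hypothetical names invented for illustration, and the element type is modeled only by its size in bytes. The point is that a CSE key built from the base pointer and index alone treats a GEP over i8 and a GEP over float as the same value, even though they address different bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a GEP-like recipe. Not an LLVM or VPlan type.
struct ToyGEP {
  int Base;            // id of the base pointer value (e.g. 4 for %4)
  unsigned ElemBytes;  // size of the source element type (i8 = 1, float = 4)
  std::int64_t Index;  // constant index operand
};

// Byte offset that the GEP adds to its base pointer.
std::int64_t byteOffset(const ToyGEP &G) {
  return static_cast<std::int64_t>(G.ElemBytes) * G.Index;
}

// Toy CSE: for each entry, returns the position of the earlier entry that
// replaces it, or its own position if it is kept. When KeyOnElemType is
// false, the lookup key omits the source element type (assumed bug shape).
std::vector<std::size_t> toyCSE(const std::vector<ToyGEP> &Gs,
                                bool KeyOnElemType) {
  std::map<std::tuple<int, unsigned, std::int64_t>, std::size_t> Seen;
  std::vector<std::size_t> Leader;
  for (std::size_t I = 0; I < Gs.size(); ++I) {
    unsigned Elem = KeyOnElemType ? Gs[I].ElemBytes : 0;
    auto Key = std::make_tuple(Gs[I].Base, Elem, Gs[I].Index);
    // emplace leaves an existing entry untouched and returns it.
    auto It = Seen.emplace(Key, I).first;
    Leader.push_back(It->second);
  }
  return Leader;
}
```

Feeding it `{4, 1, 12}` (like `getelementptr i8, ptr %4, i64 12`) followed by `{4, 4, 12}` (like `getelementptr float, ptr %4, i64 12`) merges the second entry into the first when the element type is left out of the key, and keeps both when it is included.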

@artagnon
Contributor Author

@akuegel Thanks for the report. We are aware of this problem, and it should be fixed once #156699 lands.

@akuegel
Member

akuegel commented Sep 16, 2025

> @akuegel Thanks for the report. We realized this problem, and it should be fixed when #156699 lands.

Thank you, good to know that you already have a pending fix.
