Conversation

@ElvisWang123
Contributor

This patch gives an overview of a VPlan transformation that creates conditional VPBBs; it will be split into multiple patches later.

RFC: https://discourse.llvm.org/t/rfc-lv-generating-conditional-vpbb-that-will-be-skip-when-the-mask-is-inactive-in-vplan/86591

This patch adds a transformation that converts flattened control flow
into conditional vector basic blocks.
The transformation lets the program skip masked operations when the
mask has no active lanes.

First, the transformation collects all masked stores and their operands
bottom-up and puts these masked operations into a new vector basic
block.

Second, it splits the original vector loop, inserts the new basic block
between the split blocks, and updates the conditional branches in the
original blocks.

E.g.
Before: {
  vector.loop:
    ...
    BranchOnCount %IV, %TC
  Successors middle.block, vector.loop
}

After: {
  vector.loop:
    ...
    %any.active.mask = any-of(%mask)
    BranchOnCount %any.active.mask, 0
  Successors vector.loop.split, vector.if.bb

  vector.if.bb:
    ... (Masked operations)
  Successors vector.loop.split

  vector.loop.split:
    ...
    BranchOnCount %IV, %TC
  Successors middle.block, vector.loop
}

Co-authored-by: nikolaypanchenko [email protected]

@llvmbot llvmbot added backend:RISC-V, vectorizers, llvm:analysis, llvm:transforms labels May 29, 2025
@llvmbot
Member

llvmbot commented May 29, 2025

@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-backend-risc-v

@llvm/pr-subscribers-llvm-analysis

Author: Elvis Wang (ElvisWang123)

Changes


Patch is 20.32 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/141900.diff

9 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+4)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
  • (modified) llvm/include/llvm/CodeGen/BasicTTIImpl.h (+4)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h (+2)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+8)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+176)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+23)
  • (added) llvm/test/Transforms/LoopVectorize/RISCV/vplan-conditional-basic-block.ll (+127)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 8f4ce80ada5ed..20c4b52098770 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1822,6 +1822,10 @@ class TargetTransformInfo {
   /// otherwise scalar epilogue loop.
   LLVM_ABI bool preferEpilogueVectorization() const;
 
+  /// Return true if the loop vectorizer should consider vectorizing with
+  /// flattened control flow; otherwise create conditional vector basic blocks.
+  bool preferFlattenControlFlow() const;
+
   /// \returns True if the target wants to expand the given reduction intrinsic
   /// into a shuffle sequence.
   LLVM_ABI bool shouldExpandReduction(const IntrinsicInst *II) const;
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index a80b4c5179bad..1211c80f8ff51 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -1091,6 +1091,8 @@ class TargetTransformInfoImplBase {
 
   virtual bool preferEpilogueVectorization() const { return true; }
 
+  virtual bool preferFlattenControlFlow() const { return true; }
+
   virtual bool shouldExpandReduction(const IntrinsicInst *II) const {
     return true;
   }
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index ff8778168686d..5000bef968213 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -777,6 +777,10 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
     return BaseT::preferPredicateOverEpilogue(TFI);
   }
 
+  bool preferFlattenControlFlow() const override {
+    return thisT()->preferFlattenControlFlow();
+  }
+
   TailFoldingStyle
   getPreferredTailFoldingStyle(bool IVUpdateMayOverflow = true) const override {
     return BaseT::getPreferredTailFoldingStyle(IVUpdateMayOverflow);
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 0f857399660fe..d455e1dc63c13 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -371,6 +371,10 @@ bool TargetTransformInfo::preferPredicateOverEpilogue(
   return TTIImpl->preferPredicateOverEpilogue(TFI);
 }
 
+bool TargetTransformInfo::preferFlattenControlFlow() const {
+  return TTIImpl->preferFlattenControlFlow();
+}
+
 TailFoldingStyle TargetTransformInfo::getPreferredTailFoldingStyle(
     bool IVUpdateMayOverflow) const {
   return TTIImpl->getPreferredTailFoldingStyle(IVUpdateMayOverflow);
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index 0a784461d67bf..412dd8ae2b991 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -146,6 +146,8 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
     return false;
   }
 
+  bool preferFlattenControlFlow() const { return false; }
+
   InstructionCost
   getMaskedMemoryOpCost(unsigned Opcode, Type *Src, Align Alignment,
                         unsigned AddressSpace,
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 2fe59a464457f..a834b6106f028 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -351,6 +351,11 @@ static cl::opt<bool> PreferPredicatedReductionSelect(
     cl::desc(
         "Prefer predicating a reduction operation over an after loop select."));
 
+static cl::opt<bool>
+    PreferFlattenControlFlow("prefer-flatten-control-flow", cl::init(true),
+                             cl::Hidden,
+                             cl::desc("Prefer flatten control flow."));
+
 cl::opt<bool> llvm::EnableVPlanNativePath(
     "enable-vplan-native-path", cl::Hidden,
     cl::desc("Enable VPlan-native vectorization path with "
@@ -9287,6 +9292,9 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(
   }
   VPlanTransforms::optimizeInductionExitUsers(*Plan, IVEndValues);
 
+  if (!PreferFlattenControlFlow && !TTI.preferFlattenControlFlow())
+    VPlanTransforms::optimizeConditionalVPBB(*Plan);
+
   assert(verifyVPlanIsValid(*Plan) && "VPlan is invalid");
   return Plan;
 }
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index dc5be520505eb..e1c8690986599 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -2087,6 +2087,23 @@ void VPlanTransforms::addActiveLaneMask(
     HeaderMask->replaceAllUsesWith(LaneMask);
 }
 
+static bool replaceHeaderMaskToEVL(VPValue *HeaderMask, VPRecipeBase *R) {
+  using namespace llvm::VPlanPatternMatch;
+  VPValue *EdgeMask;
+  if (!R)
+    return false;
+  if (match(R, m_Binary<VPInstruction::BranchOnCount>(
+                   m_VPInstruction<VPInstruction::AnyOf>(
+                       m_Binary<VPInstruction::LogicalAnd>(
+                           m_Specific(HeaderMask), m_VPValue(EdgeMask))),
+                   m_VPValue()))) {
+
+    cast<VPInstruction>(R->getOperand(0))->setOperand(0, EdgeMask);
+    return true;
+  }
+  return false;
+}
+
 /// Try to convert \p CurRecipe to a corresponding EVL-based recipe. Returns
 /// nullptr if no EVL-based recipe could be created.
 /// \p HeaderMask  Header Mask.
@@ -2202,6 +2219,8 @@ static void transformRecipestoEVLRecipes(VPlan &Plan, VPValue &EVL) {
   for (VPValue *HeaderMask : collectAllHeaderMasks(Plan)) {
     for (VPUser *U : collectUsersRecursively(HeaderMask)) {
       auto *CurRecipe = cast<VPRecipeBase>(U);
+      if (replaceHeaderMaskToEVL(HeaderMask, CurRecipe))
+        continue;
       VPRecipeBase *EVLRecipe = createEVLRecipe(
           HeaderMask, *CurRecipe, TypeInfo, *AllOneMask, EVL, PrevEVL);
       if (!EVLRecipe)
@@ -3202,3 +3221,160 @@ void VPlanTransforms::narrowInterleaveGroups(VPlan &Plan, ElementCount VF,
       Plan.getOrAddLiveIn(ConstantInt::get(CanIV->getScalarType(), 1)));
   removeDeadRecipes(Plan);
 }
+
+void VPlanTransforms::optimizeConditionalVPBB(VPlan &Plan) {
+
+  VPDominatorTree VPDT;
+  VPDT.recalculate(Plan);
+
+  SmallVector<VPValue *> HeaderMasks = collectAllHeaderMasks(Plan);
+
+  // Get the mask from the store recipes.
+  auto GetMask = [&HeaderMasks](VPRecipeBase &R) -> VPValue * {
+    using namespace llvm::VPlanPatternMatch;
+    if (isa<VPWidenStoreRecipe, VPWidenStoreEVLRecipe>(R)) {
+      VPValue *OrigMask = cast<VPWidenMemoryRecipe>(R).getMask();
+      if (!OrigMask)
+        return OrigMask;
+
+      if (any_of(HeaderMasks, [OrigMask](VPValue *HeaderMask) {
+            return OrigMask == HeaderMask;
+          }))
+        return nullptr;
+
+      // Match active.lane.mask.
+      if (match(OrigMask, m_VPInstruction<VPInstruction::ActiveLaneMask>(
+                              m_VPValue(), m_VPValue())))
+        return nullptr;
+
+      return OrigMask;
+    }
+    return nullptr;
+  };
+
+  // First, collect all masked stores.
+  SmallVector<std::pair<VPRecipeBase *, VPValue *>> MaskedStores;
+  ReversePostOrderTraversal<VPBlockDeepTraversalWrapper<VPBlockBase *>> RPOT(
+      Plan.getEntry());
+  for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(RPOT)) {
+    for (VPRecipeBase &R : *VPBB) {
+      if (VPValue *Mask = GetMask(R))
+        MaskedStores.emplace_back(&R, Mask);
+    }
+  }
+
+  DenseSet<VPRecipeBase *> Candidates;
+  auto AddOperandsToCandidates = [&Candidates](VPRecipeBase *R) {
+    for (VPValue *Op : R->operands())
+      if (VPRecipeBase *OpR = Op->getDefiningRecipe())
+        Candidates.insert(OpR);
+  };
+
+  SmallVector<SetVector<VPRecipeBase *>> Tries;
+  while (!MaskedStores.empty()) {
+    auto [LR, M] = MaskedStores.pop_back_val();
+    Candidates.clear();
+    AddOperandsToCandidates(LR);
+
+    SetVector<VPRecipeBase *> CurrentTree;
+    CurrentTree.insert(LR);
+
+    VPBasicBlock *MaskBlock =
+        M->hasDefiningRecipe() ? M->getDefiningRecipe()->getParent() : nullptr;
+    auto End = MaskBlock == LR->getParent()
+                   ? M->getDefiningRecipe()->getReverseIterator()
+                   : LR->getParent()->getFirstNonPhi()->getReverseIterator();
+    // Greedily add all recipes that are used to compute stored value to the
+    // tree. All users of the added recipe must dominate the store
+    // recipe.
+    for (VPRecipeBase &R : make_range(LR->getReverseIterator(), End)) {
+      // Recipe is not a part of the tree
+      if (!Candidates.contains(&R))
+        continue;
+
+      if (any_of(R.definedValues(), [&LR = LR, &VPDT](VPValue *Def) {
+            for (VPUser *U : Def->users()) {
+              if (auto *UR = dyn_cast<VPRecipeBase>(U)) {
+                if (UR == LR || VPDT.properlyDominates(UR, LR))
+                  continue;
+              }
+              return true;
+            }
+            return false;
+          }))
+        continue;
+
+      CurrentTree.insert(&R);
+      AddOperandsToCandidates(&R);
+    }
+    // Previous traversal could add recipes that are used by non-added recipes,
+    // thus need to be removed from the list.
+    DenseSet<VPRecipeBase *> ToRemove;
+    bool Changed;
+    do {
+      Changed = false;
+      for (VPRecipeBase *R : CurrentTree) {
+        if (ToRemove.contains(R))
+          continue;
+        if (any_of(R->definedValues(), [&](VPValue *Def) {
+              for (VPUser *U : Def->users()) {
+                if (auto *UR = dyn_cast<VPRecipeBase>(U))
+                  if (!CurrentTree.contains(UR) || ToRemove.contains(UR))
+                    return true;
+              }
+              return false;
+            })) {
+          Changed = true;
+          ToRemove.insert(R);
+        }
+      }
+    } while (Changed);
+
+    for (VPRecipeBase *R : ToRemove)
+      CurrentTree.remove(R);
+
+    if (CurrentTree.size() > 1)
+      Tries.push_back(CurrentTree);
+  }
+  for (const auto &List : Tries) {
+    VPRecipeBase *LR = List.front();
+    VPValue *M = cast<VPWidenMemoryRecipe>(LR)->getMask();
+    assert(M && "Mask VPValue must exist at this point");
+    auto Recipes = reverse(List.getArrayRef());
+
+    // Split the current basic block at LR so that the conditional VPBB can
+    // be added in between.
+    VPBasicBlock *ParentBB = LR->getParent();
+    VPBasicBlock *ContBB = ParentBB->splitAt(LR->getIterator());
+
+    // Create VPBB and insert it between ParentBB and ContBB.
+    VPBasicBlock *IfBB = Plan.createVPBasicBlock("vector.if.bb");
+    VPBlockUtils::insertBlockAfter(IfBB, ParentBB);
+    if (ContBB->getNumSuccessors() == 0)
+      ParentBB->getEnclosingLoopRegion()->setExiting(ContBB);
+
+    // Copy recipes into conditional block.
+    for (VPRecipeBase *R : Recipes)
+      R->moveBefore(*IfBB, IfBB->end());
+
+    // Add the condition and branch to the parent block.
+    auto *ActiveLane =
+        new VPInstruction(VPInstruction::AnyOf, {M}, nullptr, "any.of.mask");
+
+    Type *CanonicalIVType = Plan.getCanonicalIV()->getScalarType();
+    LLVMContext &Ctx = CanonicalIVType->getContext();
+    VPValue *Zero =
+        Plan.getOrAddLiveIn(ConstantInt::get(Type::getInt1Ty(Ctx), 0));
+
+    auto *BranchOnCount =
+        new VPInstruction(VPInstruction::BranchOnCount, {ActiveLane, Zero});
+    ParentBB->appendRecipe(ActiveLane);
+    ParentBB->appendRecipe(BranchOnCount);
+
+    // Set proper predecessors and successors for the modified basic blocks.
+    ParentBB->clearSuccessors();
+    ParentBB->setTwoSuccessors(ContBB, IfBB);
+    ContBB->clearPredecessors();
+    ContBB->setPredecessors({ParentBB, IfBB});
+  }
+}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index 34e2de4eb3b74..0e535d6bf7c36 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -234,6 +234,29 @@ struct VPlanTransforms {
   /// removed in the future.
   static DenseMap<VPBasicBlock *, VPValue *>
   introduceMasksAndLinearize(VPlan &Plan, bool FoldTail);
+
+  /// Try to convert flattened control flow into conditional vector basic
+  /// blocks. If the mask has no active bits, all masked operations are
+  /// skipped. This transformation collects all masked operations bottom-up
+  /// from the masked stores and puts them into a new vector basic block. The
+  /// original vector.loop is split and the newly created basic block is
+  /// inserted in between.
+  ///
+  /// After the transformation, the VPlan looks like:
+  /// vector.loop:
+  ///   ...
+  ///   %any.active.mask = any-of(%Mask)
+  ///   Branch-On-Count %any.active.mask, 0
+  /// successors vector.loop.split, vector.if.bb
+  ///
+  /// vector.if.bb:
+  ///   (Masked operations)
+  /// successors vector.loop.split
+  ///
+  /// vector.loop.split:
+  ///   ...
+  /// successors middle.block, vector.loop
+  static void optimizeConditionalVPBB(VPlan &Plan);
 };
 
 } // namespace llvm
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/vplan-conditional-basic-block.ll b/llvm/test/Transforms/LoopVectorize/RISCV/vplan-conditional-basic-block.ll
new file mode 100644
index 0000000000000..2b8f2542ab544
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/vplan-conditional-basic-block.ll
@@ -0,0 +1,127 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -p loop-vectorize -force-vector-width=4 -S -mtriple=riscv64 -mattr=+v -prefer-flatten-control-flow=false %s | FileCheck %s
+
+define void @test(i32 %control1, i32 %control2, i32 %target, i32 %reg.4.val, ptr %reg.24.val) {
+; CHECK-LABEL: define void @test(
+; CHECK-SAME: i32 [[CONTROL1:%.*]], i32 [[CONTROL2:%.*]], i32 [[TARGET:%.*]], i32 [[REG_4_VAL:%.*]], ptr [[REG_24_VAL:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[CMP1:%.*]] = icmp sgt i32 [[REG_4_VAL]], 0
+; CHECK-NEXT:    br i1 [[CMP1]], label %[[FOR_BODY_LR_PH:.*]], label %[[FOR_END:.*]]
+; CHECK:       [[FOR_BODY_LR_PH]]:
+; CHECK-NEXT:    [[SH_PROM:%.*]] = zext nneg i32 [[CONTROL1]] to i64
+; CHECK-NEXT:    [[SHL:%.*]] = shl nuw i64 1, [[SH_PROM]]
+; CHECK-NEXT:    [[SH_PROM5:%.*]] = zext nneg i32 [[CONTROL2]] to i64
+; CHECK-NEXT:    [[SHL6:%.*]] = shl nuw i64 1, [[SH_PROM5]]
+; CHECK-NEXT:    [[SH_PROM10:%.*]] = zext nneg i32 [[TARGET]] to i64
+; CHECK-NEXT:    [[SHL11:%.*]] = shl nuw nsw i64 1, [[SH_PROM10]]
+; CHECK-NEXT:    [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[REG_4_VAL]] to i64
+; CHECK-NEXT:    [[TMP0:%.*]] = freeze i64 [[SHL6]]
+; CHECK-NEXT:    [[TMP1:%.*]] = or i64 [[SHL]], [[TMP0]]
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[WIDE_TRIP_COUNT]], 8
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[WIDE_TRIP_COUNT]], 8
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[WIDE_TRIP_COUNT]], [[N_MOD_VF]]
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[SHL11]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <4 x i64> poison, i64 [[TMP1]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT1]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[IF_THEN9_SPLIT:.*]] ]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr i64, ptr [[REG_24_VAL]], i64 [[INDEX]]
+; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr inbounds i64, ptr [[TMP2]], i32 0
+; CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i64, ptr [[TMP2]], i32 4
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[TMP3]], align 8
+; CHECK-NEXT:    [[WIDE_LOAD3:%.*]] = load <4 x i64>, ptr [[TMP4]], align 8
+; CHECK-NEXT:    [[TMP5:%.*]] = and <4 x i64> [[WIDE_LOAD]], [[BROADCAST_SPLAT2]]
+; CHECK-NEXT:    [[TMP6:%.*]] = and <4 x i64> [[WIDE_LOAD3]], [[BROADCAST_SPLAT2]]
+; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq <4 x i64> [[TMP5]], [[BROADCAST_SPLAT2]]
+; CHECK-NEXT:    [[TMP8:%.*]] = icmp eq <4 x i64> [[TMP6]], [[BROADCAST_SPLAT2]]
+; CHECK-NEXT:    [[TMP9:%.*]] = xor <4 x i64> [[WIDE_LOAD]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP10:%.*]] = xor <4 x i64> [[WIDE_LOAD3]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP13:%.*]] = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> [[TMP7]])
+; CHECK-NEXT:    [[TMP14:%.*]] = icmp eq i1 [[TMP13]], false
+; CHECK-NEXT:    br i1 [[TMP14]], label %[[IF_THEN9_SPLIT]], label %[[VECTOR_IF_BB:.*]]
+; CHECK:       [[VECTOR_IF_BB]]:
+; CHECK-NEXT:    [[TMP11:%.*]] = getelementptr i64, ptr [[TMP2]], i32 0
+; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr i64, ptr [[TMP2]], i32 4
+; CHECK-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> [[TMP9]], ptr [[TMP11]], i32 8, <4 x i1> [[TMP7]])
+; CHECK-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> [[TMP10]], ptr [[TMP12]], i32 8, <4 x i1> [[TMP8]])
+; CHECK-NEXT:    br label %[[IF_THEN9_SPLIT]]
+; CHECK:       [[IF_THEN9_SPLIT]]:
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
+; CHECK-NEXT:    [[TMP26:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP26]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[WIDE_TRIP_COUNT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[FOR_END_LOOPEXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[FOR_BODY_LR_PH]] ]
+; CHECK-NEXT:    br label %[[FOR_BODY:.*]]
+; CHECK:       [[FOR_BODY]]:
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_INC:.*]] ]
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i64, ptr [[REG_24_VAL]], i64 [[INDVARS_IV]]
+; CHECK-NEXT:    [[TMP27:%.*]] = load i64, ptr [[ARRAYIDX]], align 8
+; CHECK-NEXT:    [[TMP28:%.*]] = and i64 [[TMP27]], [[TMP1]]
+; CHECK-NEXT:    [[OR_COND_NOT:%.*]] = icmp eq i64 [[TMP28]], [[TMP1]]
+; CHECK-NEXT:    br i1 [[OR_COND_NOT]], label %[[IF_THEN9:.*]], label %[[FOR_INC]]
+; CHECK:       [[IF_THEN9]]:
+; CHECK-NEXT:    [[XOR:%.*]] = xor i64 [[TMP27]], [[SHL11]]
+; CHECK-NEXT:    store i64 [[XOR]], ptr [[ARRAYIDX]], align 8
+; CHECK-NEXT:    br label %[[FOR_INC]]
+; CHECK:       [[FOR_INC]]:
+; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[WIDE_TRIP_COUNT]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[FOR_END_LOOPEXIT]], label %[[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       [[FOR_END_LOOPEXIT]]:
+; CHECK-NEXT:    br label %[[FOR_END]]
+; CHECK:       [[FOR_END]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %cmp1 = icmp sgt i32 %reg.4.val, 0
+  br i1 %cmp1, label %for.body.lr.ph, label %for.end
+
+for.body.lr.ph:
+  %sh_prom = zext nneg i32 %control1 to i64
+  %shl = shl nuw i64 1, %sh_prom
+  %sh_prom5 = zext nneg i32 %control2 to i64
+  %shl6 = shl nuw i64 1, %sh_prom5
+  %sh_prom10 = zext nneg i32 %target to i64
+  %shl11 = shl nuw nsw i64 1, %sh_prom10
+  %wide.trip.count = zext nneg i32 %reg.4.val to i64
+  %0 = freeze i64 %shl6
+  %1 = or i64 %shl, %0
+  br label %for.body
+
+for.body:
+  %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.inc ]
+  %arrayidx = getelementptr inbounds i64, ptr %reg.24.val, i64 %indvars.iv
+  %2 = load i64, ptr %arrayidx, align 8
+  %3 = and i64 %2, %1
+  %or.cond.not = icmp eq i64 %3, %1
+  br i1 %or.cond.not, label %if.then9, label %for.inc
+
+if.then9:
+  %xor = xor i64 %2, %shl11
+  store i64 %xor, ptr %arrayidx, align 8
+  br label %for.inc
+
+for.inc:
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
+  br i1 %exitcond.not, label %for.end.loopexit, label %for.body
+
+for.end.loope...
[truncated]

Contributor

Could profiling data help guide the best vectorisation strategy here too? This could come from PGO, or from likely/unlikely C-level branch hints.
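
For reference, a rough sketch of what such a hint ends up as in LLVM IR once LowerExpectIntrinsic has run (branch and value names borrowed from the test case in this patch; the weights are made up):

  ; __builtin_expect(cond, 0) on the guarding branch lowers to
  ; branch_weights profile metadata that the vectorizer could consult:
  %or.cond.not = icmp eq i64 %3, %1
  br i1 %or.cond.not, label %if.then9, label %for.inc, !prof !0

  !0 = !{!"branch_weights", i32 1, i32 99}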

Contributor Author

Thanks for your comment!

I think the challenge here is that the branch-taken probability also changes with the VF.
I am not very familiar with PGO. Can PGO collect branch probabilities after vectorization?

Contributor

That really depends on what is collected by PGO. Naive hit-rate information won't be sufficient to make the best decision. For example, a 25% hit rate of the condition within a loop of 100 iterations could mean either:

  1. The condition is true for the first 25 iterations and false for the other 75. In this case, generating control flow within the vector loop will be good.
  2. The condition is true on every 4th iteration, so the extra control flow will add overhead (assuming a 128-bit vector register and 32-bit data).

Having a more fine-grained profile, i.e. the hit rate within a VLEN, could be useful, especially for VLS.
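
To make the two cases above concrete, a rough back-of-the-envelope with VF=4 over those 100 iterations (my numbers, not measured):

  Case 1 (first 25 iterations true): only vector iterations 0..6 contain an
    active lane, so 18 of the 25 vector iterations skip the if-block entirely.
  Case 2 (every 4th iteration true): every VF=4 group contains exactly one
    active lane, so all 25 vector iterations enter the if-block and the
    any-of check is pure overhead.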

Contributor

@fhahn fhahn left a comment

Thanks for sharing. I'm curious if you have any performance data you could share?

@ElvisWang123
Contributor Author

Thanks for sharing. I'm curious if you have any performance data you could share?

Yes. We got ~30% performance improvement on 462.libquantum on sifive-p470 with this optimization.

for (VPValue *HeaderMask : collectAllHeaderMasks(Plan)) {
for (VPUser *U : collectUsersRecursively(HeaderMask)) {
auto *CurRecipe = cast<VPRecipeBase>(U);
if (replaceHeaderMaskToEVL(HeaderMask, CurRecipe))
Collaborator

JFYI, this line is unsound. The next line can return without replacing a recipe with an EVL recipe, and thus you've just dropped a mask without adding any form of predication.

Contributor Author

I think when this function is invoked, the loop will be EVL tail-folded.

This function focuses on peeling the HeaderMask (which will be replaced by EVL) out of the BlockInMasks when generating the VPInstruction::BranchOnCount %Masks for the conditional basic block.

Currently only loads/stores and reductions generate EVL recipes, so I think it is fine to skip the following EVL replacements.
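
In VPlan terms the peeling is roughly the following (my sketch of the pattern replaceHeaderMaskToEVL matches; the value names are made up):

Before: {
  %m = logical-and %header.mask, %edge.mask
  %any.active.mask = any-of %m
  BranchOnCount %any.active.mask, 0
}

After (the EVL transform already limits the active lanes, so the header mask is redundant): {
  %any.active.mask = any-of %edge.mask
  BranchOnCount %any.active.mask, 0
}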

Contributor

Would it be safer to do this optimisation in optimizeConditionalVPBB, and move optimizeConditionalVPBB until after the EVL transform has happened?

I think we should generally try to do the EVL transform as early as possible, rather than teaching it to work around various other optimisations.

Contributor Author

Yes, I moved the optimizeConditionalVPBB() pass after the EVL transformation pass. Thanks!

Contributor Author

@ElvisWang123 ElvisWang123 left a comment

Rebased and ping :)

@fhahn Do you have any concerns about this patch? I just want to make sure this optimization does not break any rules in LV and VPlan.


@ElvisWang123 ElvisWang123 requested a review from lukel97 July 31, 2025 03:01
Contributor

@lukel97 lukel97 left a comment

Out of curiosity, did you see any regressions in any workloads with this flag enabled? I.e. loops where the mask is true most of the time, so it doesn't amortise the cost of the vector.reduce.or.

Contributor

Nit, would this flag be better inverted, like prefer-control-flow or something?

Contributor Author

Inverted, thanks!

Contributor

Is it possible to do this later in VPlanTransforms::optimize?

Contributor Author

Moved after the EVL insertion pass. This helps remove replaceHeaderMaskToEVL(). Thanks!


Contributor

For EVL tail folding, do we not still need to convert the any-of into a vp.reduce.or or something? Otherwise we will be counting lanes in the edge mask that are outside the EVL.

Contributor Author

The AnyOf(%mask) can be lowered to vcpop.m. I'm not sure whether VPOptimizer can handle this properly.
If not, we need to put this pass before the EVL transformations and use vp.reduce.or.
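
For illustration, the two forms at the IR level would be roughly as follows (a hedged sketch; the test in this patch currently emits the first form):

; reduces across all VF lanes, including ones past the EVL:
%any = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> %mask)

; reduces only the first %evl lanes:
%any.evl = call i1 @llvm.vp.reduce.or.v4i1(i1 false, <4 x i1> %mask,
                                           <4 x i1> splat (i1 true), i32 %evl)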

Contributor Author

@ElvisWang123 ElvisWang123 left a comment

Rebased and addressed comments.


@ElvisWang123
Contributor Author

Gentle ping :)

@vzakhari
Contributor

vzakhari commented Sep 5, 2025

Hi @ElvisWang123 and @npanchen, thank you for working on this!

I just want to post some information to support this change. I tried this patch with the Flang compiler on zen4, setting PreferControlFlow to true. The 481.wrf benchmark sped up by 36%. I guess this might be related to the following loop, but I did not confirm that:

DO 823 II=N1STAR,N1END
  IF ( icmask(II,JJ) ) THEN
    OV(II,JJ)=(MXM(II,JJ)-W(II,JJ))/(-PN(F(II,JJ,1))+         &
      PP(F(II,JJ,0))+EP)
    UN(II,JJ)=(W(II,JJ)-MN(II,JJ))/(PP(F(II,JJ,1))-             &
      PN(F(II,JJ,0))+EP)
  ENDIF
823 CONTINUE

It looks like the AOCC compiler also has a similar optimization, which they call BOSCC (https://reviews.llvm.org/D139074). Some performance results of BOSCC for 481.wrf are also reported in #137213.

So this change seems to be profitable for multiple benchmarks and multiple architectures, and it would be great to land it in llvm.

On the downside, I got a few failures on other benchmarks, and they are all related to the "VPlan cost model and legacy cost model disagreed" assertion. I can provide reproducers if you would like to look at them.

@ElvisWang123
Copy link
Contributor Author

@vzakhari Thanks for your support and review!
I probably missed some changes in the legacy cost model. Reproducers would be great to help find the problem!

@github-actions

github-actions bot commented Oct 16, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Contributor

@fhahn fhahn left a comment

@ElvisWang123 @vzakhari is there still a crash with this version as mentioned in #157542?

@vzakhari
Contributor

@ElvisWang123 @vzakhari is there still a crash with this version as mentioned in #157542?

Sorry, I won't be able to check it any time soon. @alexey-bataev, would you have bandwidth to check this patch with our testing?

This patch adds a new attribute to TTI to let LV know which is better.

The default value of preferControlFlow is false to match the current TTI
implementation.

If preferFlattenControlFlow() returns false, LV will try to generate
conditional VPBBs if possible.
This patch adds the transformation that converts flattened control flow
into conditional vector basic blocks.
This transformation can help the program skip masked operations when
there are no active lanes.

First, this transformation collects all masked stores and operands
bottom-up and puts these masked operations into a new vector basic
block.

Second, this transformation splits the original vector loop, inserts
the new basic block between the split blocks, and updates the
conditional branch in the original blocks.

E.g.
Before: {
  vector.loop:
    ...
    BranchOnCount %IV, %TC
  Successors middle.block, vector.loop
}

After: {
  vector.loop:
    ...
    %any.active.mask = any-of(%mask)
    BranchOnCond %any.active.mask
  Successors vector.if.bb, vector.loop.split

  vector.if.bb:
    ... (Masked operations)
  Successors vector.loop.split

  vector.loop.split:
    ...
    BranchOnCount %IV, %TC
  Successors middle.block, vector.loop
}
The `any-of (%mask)` is generated while creating the conditional VPBB in
VPlan transforms. The cost of the any-of should also be accounted for by the
legacy model to prevent cost-model mismatch issues.
@ElvisWang123
Contributor Author

ElvisWang123 commented Oct 20, 2025

@ElvisWang123 @vzakhari is there still a crash with this version as mentioned in #157542?

It seems that the issue disappeared after rebasing; it was probably fixed (or masked) by the cost model updates.
I've added the cost of the AnyOf generated by the conditional VPBB to the legacy cost model to fix the potential cost-model mismatch in b945b7e
