[LV] Vectorize conditional scalar assignments #158088

huntergr-arm · 2025-09-11T14:35:56Z

Based on Michael Maitland's previous work:
#121222

This PR reuses the existing recurrence code instead of introducing a
new pass just for conditional scalar assignment (CSA) autovectorization.
I've also made the new recipes more generic.
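
For reference, here is a minimal sketch of the kind of loop this is
meant to catch. The function name and the `> threshold` condition are
made up for illustration and are not taken from the tests in this PR;
the point is only the shape `select(cmp, x, phi)`, where the scalar keeps
its old value unless the condition holds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Conditional scalar assignment: `result` is overwritten only on iterations
// where the condition holds, so after the loop it holds the value from the
// last matching iteration, or `init` if nothing matched.
// (Hypothetical example; names are illustrative only.)
int findLastGreaterThan(const std::vector<int> &data, int threshold,
                        int init) {
  int result = init;
  for (std::size_t i = 0; i < data.size(); ++i)
    result = data[i] > threshold ? data[i] : result; // select(cmp, x, phi)
  return result;
}

// Small self-check illustrating the expected results.
void checkFindLast() {
  assert(findLastGreaterThan({1, 5, 2, 7, 3}, 4, -1) == 7);
  assert(findLastGreaterThan({1, 2, 3}, 4, -1) == -1);
}
```

Neither the phi nor the select is used elsewhere inside the loop, which
is the restriction `isFindLastPattern` currently checks for.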

I've enabled it by default to see the impact on tests; if there are
regressions we can put it behind a CLI option. I haven't corrected
all the comments in the tests yet; I'll wait until we decide whether
to keep it enabled by default.

I will be doing some performance runs on AArch64 to work out the
cost model, since we mostly treat vector selects as per-lane
operations rather than as selecting a whole vector at once.
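
To make the "whole vector" point concrete, here is a rough scalar model
of how I think about the vectorized form. It is a sketch under
assumptions, not the generated code: the fixed width `VF`, the helper
names, and the "no scalar tail" restriction are all invented for the
example. Each vector iteration computes a mask; if any lane is active,
the data vector and its mask are both replaced as a unit by a single
scalar condition. After the loop, the result is the last active lane of
the saved data, or the start value if no lane was ever active, which is
the behaviour of `llvm.experimental.vector.extract.last.active`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t VF = 4; // hypothetical fixed vector width

using Vec = std::array<int, VF>;
using Mask = std::array<bool, VF>;

// Models extract-last-active: the element in the highest-numbered active
// lane, or `fallback` if no lane is active.
int extractLastActive(const Vec &data, const Mask &mask, int fallback) {
  for (std::size_t lane = VF; lane-- > 0;)
    if (mask[lane])
      return data[lane];
  return fallback;
}

// Chunked model of the find-last recurrence (illustrative names only).
int findLastGreaterThanChunked(const std::vector<int> &input, int threshold,
                               int init) {
  assert(input.size() % VF == 0 && "sketch assumes no scalar tail");
  Vec lastData{};  // data from the most recent chunk that had a match
  Mask lastMask{}; // its mask; all-false until some chunk matches
  for (std::size_t base = 0; base < input.size(); base += VF) {
    Vec data;
    Mask mask;
    bool anyActive = false;
    for (std::size_t lane = 0; lane < VF; ++lane) {
      data[lane] = input[base + lane];
      mask[lane] = data[lane] > threshold;
      anyActive |= mask[lane];
    }
    // Whole-vector select: one scalar condition picks entire vectors,
    // rather than choosing lane by lane.
    if (anyActive) {
      lastData = data;
      lastMask = mask;
    }
  }
  return extractLastActive(lastData, lastMask, init);
}
```

The select inside the loop has a scalar condition and vector operands,
so costing it as a per-lane select is likely to misjudge it.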

llvmbot · 2025-09-11T14:36:28Z

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-llvm-analysis

Author: Graham Hunter (huntergr-arm)

Changes

Patch is 204.91 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/158088.diff

19 Files Affected:

(modified) llvm/include/llvm/Analysis/IVDescriptors.h (+15)
(modified) llvm/lib/Analysis/IVDescriptors.cpp (+44-1)
(modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+8)
(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+32-1)
(modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+3)
(modified) llvm/lib/Transforms/Vectorize/VPlan.h (+44)
(modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (+7-2)
(modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+69-3)
(modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+67)
(modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+7)
(modified) llvm/lib/Transforms/Vectorize/VPlanValue.h (+1)
(added) llvm/test/Transforms/LoopVectorize/AArch64/conditional-scalar-assignment.ll (+155)
(added) llvm/test/Transforms/LoopVectorize/conditional-scalar-assignment-vplan.ll (+138)
(modified) llvm/test/Transforms/LoopVectorize/iv-select-cmp-decreasing.ll (+294-38)
(modified) llvm/test/Transforms/LoopVectorize/iv-select-cmp-no-wrap.ll (+80-8)
(modified) llvm/test/Transforms/LoopVectorize/iv-select-cmp-non-const-iv-start.ll (+321-52)
(modified) llvm/test/Transforms/LoopVectorize/iv-select-cmp-trunc.ll (+450-100)
(modified) llvm/test/Transforms/LoopVectorize/iv-select-cmp.ll (+171-19)
(modified) llvm/test/Transforms/LoopVectorize/select-cmp.ll (+121-20)

diff --git a/llvm/include/llvm/Analysis/IVDescriptors.h b/llvm/include/llvm/Analysis/IVDescriptors.h
index f9e6da6d0846a..afa175704a7b1 100644
--- a/llvm/include/llvm/Analysis/IVDescriptors.h
+++ b/llvm/include/llvm/Analysis/IVDescriptors.h
@@ -70,6 +70,9 @@ enum class RecurKind {
   FindLastIVUMax, ///< FindLast reduction with select(cmp(),x,y) where one of
                   ///< (x,y) is increasing loop induction, and both x and y
                   ///< are integer type, producing a UMax reduction.
+  FindLast,       ///< FindLast reduction with select(cmp(),x,y) where x and y
+                  ///< can be any scalar type, one is the current recurrence
+                  ///< value, and the other is an arbitrary value.
   // clang-format on
   // TODO: Any_of and FindLast reduction need not be restricted to integer type
   // only.
@@ -183,6 +186,12 @@ class RecurrenceDescriptor {
                                            PHINode *OrigPhi, Instruction *I,
                                            ScalarEvolution &SE);
 
+  /// Returns a struct describing whether the instruction is of the form
+  ///  Select(Cmp(A, B), X, Y)
+  /// where one of (X, Y) is the Phi value and the other is an arbitrary value.
+  LLVM_ABI static InstDesc isFindLastPattern(Instruction *I, PHINode *Phi,
+                                             Loop *TheLoop);
+
   /// Returns a struct describing if the instruction is a
   /// Select(FCmp(X, Y), (Z = X op PHINode), PHINode) instruction pattern.
   LLVM_ABI static InstDesc isConditionalRdxPattern(Instruction *I);
@@ -299,6 +308,12 @@ class RecurrenceDescriptor {
            isFindLastIVRecurrenceKind(Kind);
   }
 
+  /// Returns true if the recurrence kind is of the form
+  ///   select(cmp(),x,y) where one of (x,y) is an arbitrary value.
+  static bool isFindLastRecurrenceKind(RecurKind Kind) {
+    return Kind == RecurKind::FindLast;
+  }
+
   /// Returns the type of the recurrence. This type can be narrower than the
   /// actual type of the Phi if the recurrence has been type-promoted.
   Type *getRecurrenceType() const { return RecurrenceType; }
diff --git a/llvm/lib/Analysis/IVDescriptors.cpp b/llvm/lib/Analysis/IVDescriptors.cpp
index b8c540ce4b99d..bd87e9de46bd5 100644
--- a/llvm/lib/Analysis/IVDescriptors.cpp
+++ b/llvm/lib/Analysis/IVDescriptors.cpp
@@ -56,6 +56,8 @@ bool RecurrenceDescriptor::isIntegerRecurrenceKind(RecurKind Kind) {
   case RecurKind::FindFirstIVUMin:
   case RecurKind::FindLastIVSMax:
   case RecurKind::FindLastIVUMax:
+  // TODO: Make type-agnostic.
+  case RecurKind::FindLast:
     return true;
   }
   return false;
@@ -426,6 +428,8 @@ bool RecurrenceDescriptor::AddReductionVar(
       ++NumCmpSelectPatternInst;
     if (isAnyOfRecurrenceKind(Kind) && IsASelect)
       ++NumCmpSelectPatternInst;
+    if (isFindLastRecurrenceKind(Kind) && IsASelect)
+      ++NumCmpSelectPatternInst;
 
     // Check  whether we found a reduction operator.
     FoundReduxOp |= !IsAPhi && Cur != Start;
@@ -789,6 +793,38 @@ RecurrenceDescriptor::isFindIVPattern(RecurKind Kind, Loop *TheLoop,
   return InstDesc(false, I);
 }
 
+RecurrenceDescriptor::InstDesc
+RecurrenceDescriptor::isFindLastPattern(Instruction *I, PHINode *Phi,
+                                        Loop *TheLoop) {
+  // Must be a scalar.
+  Type *Type = Phi->getType();
+  if (!Type->isIntegerTy() && !Type->isFloatingPointTy() &&
+      !Type->isPointerTy())
+    return InstDesc(false, I);
+
+  SelectInst *Select = dyn_cast<SelectInst>(I);
+  if (!Select)
+    return InstDesc(false, I);
+
+  // FIXME: Support more complex patterns, including multiple selects.
+  // Phi or Select must be used only outside the loop,
+  // except for each other.
+  auto IsOnlyUsedOutsideLoop = [&](Value *V, Value *Ignore) {
+    return all_of(V->users(), [Ignore, TheLoop](User *U) {
+      if (U == Ignore)
+        return true;
+      if (auto *I = dyn_cast<Instruction>(U))
+        return !TheLoop->contains(I);
+      return false;
+    });
+  };
+  if (!IsOnlyUsedOutsideLoop(Phi, Select) ||
+      !IsOnlyUsedOutsideLoop(Select, Phi))
+    return InstDesc(false, I);
+
+  return InstDesc(I, RecurKind::FindLast);
+}
+
 RecurrenceDescriptor::InstDesc
 RecurrenceDescriptor::isMinMaxPattern(Instruction *I, RecurKind Kind,
                                       const InstDesc &Prev) {
@@ -927,6 +963,8 @@ RecurrenceDescriptor::InstDesc RecurrenceDescriptor::isRecurrenceInstr(
       return isConditionalRdxPattern(I);
     if (isFindIVRecurrenceKind(Kind) && SE)
       return isFindIVPattern(Kind, L, OrigPhi, I, *SE);
+    if (isFindLastRecurrenceKind(Kind))
+      return isFindLastPattern(I, OrigPhi, L);
     [[fallthrough]];
   case Instruction::FCmp:
   case Instruction::ICmp:
@@ -1123,7 +1161,11 @@ bool RecurrenceDescriptor::isReductionPHI(PHINode *Phi, Loop *TheLoop,
                       << "\n");
     return true;
   }
-
+  if (AddReductionVar(Phi, RecurKind::FindLast, TheLoop, FMF, RedDes, DB, AC, DT,
+                      SE)) {
+    LLVM_DEBUG(dbgs() << "Found a FindLast reduction PHI." << *Phi << "\n");
+    return true;
+  }
   // Not a reduction of known type.
   return false;
 }
@@ -1245,6 +1287,7 @@ unsigned RecurrenceDescriptor::getOpcode(RecurKind Kind) {
   case RecurKind::SMin:
   case RecurKind::UMax:
   case RecurKind::UMin:
+  case RecurKind::FindLast:
     return Instruction::ICmp;
   case RecurKind::FMax:
   case RecurKind::FMin:
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 92321a76dbd80..6595c6e770be0 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -1004,6 +1004,13 @@ AArch64TTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
     }
     break;
   }
+  case Intrinsic::experimental_vector_extract_last_active:
+    if (ST->isSVEAvailable()) {
+      auto [LegalCost, _] = getTypeLegalizationCost(ICA.getArgTypes()[0]);
+      // This should turn into chained clastb instructions.
+      return LegalCost;
+    }
+    break;
   default:
     break;
   }
@@ -5325,6 +5332,7 @@ bool AArch64TTIImpl::isLegalToVectorizeReduction(
   case RecurKind::FMax:
   case RecurKind::FMulAdd:
   case RecurKind::AnyOf:
+  case RecurKind::FindLast:
     return true;
   default:
     return false;
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index b4acda80cfb93..ea85685cdf7b8 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4047,6 +4047,7 @@ static bool willGenerateVectors(VPlan &Plan, ElementCount VF,
       case VPDef::VPWidenIntrinsicSC:
       case VPDef::VPWidenSC:
       case VPDef::VPWidenSelectSC:
+      case VPDef::VPWidenSelectVectorSC:
       case VPDef::VPBlendSC:
       case VPDef::VPFirstOrderRecurrencePHISC:
       case VPDef::VPHistogramSC:
@@ -4546,6 +4547,11 @@ LoopVectorizationPlanner::selectInterleaveCount(VPlan &Plan, ElementCount VF,
       any_of(Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis(),
              IsaPred<VPReductionPHIRecipe>);
 
+  // FIXME: implement interleaving for FindLast transform correctly.
+  for (auto &[_, RdxDesc] : Legal->getReductionVars())
+    if (RecurrenceDescriptor::isFindLastRecurrenceKind(RdxDesc.getRecurrenceKind()))
+      return 1;
+
   // If we did not calculate the cost for VF (because the user selected the VF)
   // then we calculate the cost of VF here.
   if (LoopCost == 0) {
@@ -8687,6 +8693,10 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(
                                 *Plan, Builder))
     return nullptr;
 
+  // Create whole-vector selects for find-last recurrences.
+  VPlanTransforms::runPass(VPlanTransforms::convertFindLastRecurrences,
+                           *Plan, RecipeBuilder, Legal);
+
   if (useActiveLaneMask(Style)) {
     // TODO: Move checks to VPlanTransforms::addActiveLaneMask once
     // TailFoldingStyle is visible there.
@@ -8779,6 +8789,7 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
 
     RecurKind Kind = PhiR->getRecurrenceKind();
     assert(
+        !RecurrenceDescriptor::isFindLastRecurrenceKind(Kind) &&
         !RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind) &&
         !RecurrenceDescriptor::isFindIVRecurrenceKind(Kind) &&
         "AnyOf and FindIV reductions are not allowed for in-loop reductions");
@@ -8987,6 +8998,10 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
       FinalReductionResult =
           Builder.createNaryOp(VPInstruction::ComputeAnyOfResult,
                                {PhiR, Start, NewExitingVPV}, ExitDL);
+    } else if (RecurrenceDescriptor::isFindLastRecurrenceKind(
+             RdxDesc.getRecurrenceKind())) {
+      FinalReductionResult = Builder.createNaryOp(
+          VPInstruction::ExtractLastActive, {NewExitingVPV}, ExitDL);
     } else {
       VPIRFlags Flags =
           RecurrenceDescriptor::isFloatingPointRecurrenceKind(RecurrenceKind)
@@ -9076,7 +9091,8 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
     RecurKind RK = RdxDesc.getRecurrenceKind();
     if ((!RecurrenceDescriptor::isAnyOfRecurrenceKind(RK) &&
          !RecurrenceDescriptor::isFindIVRecurrenceKind(RK) &&
-         !RecurrenceDescriptor::isMinMaxRecurrenceKind(RK))) {
+         !RecurrenceDescriptor::isMinMaxRecurrenceKind(RK) &&
+         !RecurrenceDescriptor::isFindLastRecurrenceKind(RK))) {
       VPBuilder PHBuilder(Plan->getVectorPreheader());
       VPValue *Iden = Plan->getOrAddLiveIn(
           getRecurrenceIdentity(RK, PhiTy, RdxDesc.getFastMathFlags()));
@@ -10069,6 +10085,21 @@ bool LoopVectorizePass::processLoop(Loop *L) {
   // Override IC if user provided an interleave count.
   IC = UserIC > 0 ? UserIC : IC;
 
+  // FIXME: Enable interleaving for last_active reductions.
+  if (any_of(LVL.getReductionVars(), [&](auto &Reduction) -> bool {
+    const RecurrenceDescriptor &RdxDesc = Reduction.second;
+    return RecurrenceDescriptor::isFindLastRecurrenceKind(RdxDesc.getRecurrenceKind());
+  })) {
+    LLVM_DEBUG(dbgs() << "LV: Not interleaving without vectorization due "
+                      << "to conditional scalar assignments.\n");
+    IntDiagMsg = {
+        "ConditionalAssignmentPreventsScalarInterleaving",
+        "Unable to interleave without vectorization due to conditional "
+        "assignments"};
+    InterleaveLoop = false;
+    IC = 1;
+  }
+
   // Emit diagnostic messages, if any.
   const char *VAPassName = Hints.vectorizeAnalysisPassName();
   if (!VectorizeLoop && !InterleaveLoop) {
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 75cace77ec534..7b25731af19d8 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -24868,6 +24868,7 @@ class HorizontalReduction {
         case RecurKind::FindFirstIVUMin:
         case RecurKind::FindLastIVSMax:
         case RecurKind::FindLastIVUMax:
+        case RecurKind::FindLast:
         case RecurKind::FMaxNum:
         case RecurKind::FMinNum:
         case RecurKind::FMaximumNum:
@@ -25009,6 +25010,7 @@ class HorizontalReduction {
     case RecurKind::FindFirstIVUMin:
     case RecurKind::FindLastIVSMax:
     case RecurKind::FindLastIVUMax:
+    case RecurKind::FindLast:
     case RecurKind::FMaxNum:
     case RecurKind::FMinNum:
     case RecurKind::FMaximumNum:
@@ -25115,6 +25117,7 @@ class HorizontalReduction {
     case RecurKind::FindFirstIVUMin:
     case RecurKind::FindLastIVSMax:
     case RecurKind::FindLastIVUMax:
+    case RecurKind::FindLast:
     case RecurKind::FMaxNum:
     case RecurKind::FMinNum:
     case RecurKind::FMaximumNum:
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 53291a931530f..2ffe68fedee05 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -548,6 +548,7 @@ class VPSingleDefRecipe : public VPRecipeBase, public VPValue {
     case VPRecipeBase::VPWidenIntrinsicSC:
     case VPRecipeBase::VPWidenSC:
     case VPRecipeBase::VPWidenSelectSC:
+    case VPRecipeBase::VPWidenSelectVectorSC:
     case VPRecipeBase::VPBlendSC:
     case VPRecipeBase::VPPredInstPHISC:
     case VPRecipeBase::VPCanonicalIVPHISC:
@@ -1059,6 +1060,8 @@ class LLVM_ABI_FOR_TEST VPInstruction : public VPRecipeWithIRFlags,
     ResumeForEpilogue,
     /// Returns the value for vscale.
     VScale,
+    /// Extracts the last active lane based on a predicate vector operand.
+    ExtractLastActive,
   };
 
 private:
@@ -1749,6 +1752,47 @@ struct LLVM_ABI_FOR_TEST VPWidenSelectRecipe : public VPRecipeWithIRFlags,
 
   unsigned getOpcode() const { return Instruction::Select; }
 
+  VPValue *getCond() const { return getOperand(0); }
+
+  bool isInvariantCond() const {
+    return getCond()->isDefinedOutsideLoopRegions();
+  }
+
+  /// Returns true if the recipe only uses the first lane of operand \p Op.
+  bool onlyFirstLaneUsed(const VPValue *Op) const override {
+    assert(is_contained(operands(), Op) &&
+           "Op must be an operand of the recipe");
+    return Op == getCond() && isInvariantCond();
+  }
+};
+
+/// A recipe for selecting whole vector values.
+struct VPWidenSelectVectorRecipe : public VPRecipeWithIRFlags {
+  VPWidenSelectVectorRecipe(ArrayRef<VPValue *> Operands)
+      : VPRecipeWithIRFlags(VPDef::VPWidenSelectVectorSC, Operands) {}
+
+  ~VPWidenSelectVectorRecipe() override = default;
+
+  VPWidenSelectVectorRecipe *clone() override {
+    SmallVector<VPValue *, 3> Operands(operands());
+    return new VPWidenSelectVectorRecipe(Operands);
+  }
+
+  VP_CLASSOF_IMPL(VPDef::VPWidenSelectVectorSC)
+
+  /// Produce a widened version of the select instruction.
+  void execute(VPTransformState &State) override;
+
+  /// Return the cost of this VPWidenSelectVectorRecipe.
+  InstructionCost computeCost(ElementCount VF,
+                              VPCostContext &Ctx) const override;
+
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+  /// Print the recipe.
+  void print(raw_ostream &O, const Twine &Indent,
+             VPSlotTracker &SlotTracker) const override;
+#endif
+
   VPValue *getCond() const {
     return getOperand(0);
   }
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index d400ceff7797c..a299ab8593a2f 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -115,7 +115,8 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
   case VPInstruction::FirstActiveLane:
     return Type::getIntNTy(Ctx, 64);
   case VPInstruction::ExtractLastElement:
-  case VPInstruction::ExtractPenultimateElement: {
+  case VPInstruction::ExtractPenultimateElement:
+  case VPInstruction::ExtractLastActive: {
     Type *BaseTy = inferScalarType(R->getOperand(0));
     if (auto *VecTy = dyn_cast<VectorType>(BaseTy))
       return VecTy->getElementType();
@@ -308,7 +309,11 @@ Type *VPTypeAnalysis::inferScalarType(const VPValue *V) {
           })
           .Case<VPExpressionRecipe>([this](const auto *R) {
             return inferScalarType(R->getOperandOfResultType());
-          });
+          })
+          .Case<VPWidenSelectVectorRecipe>(
+              [this](const VPWidenSelectVectorRecipe *R) {
+                return inferScalarType(R->getOperand(1));
+              });
 
   assert(ResultTy && "could not infer type for the given VPValue");
   CachedTypes[V] = ResultTy;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index bf51489543098..598fa4888fe8a 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -86,7 +86,8 @@ bool VPRecipeBase::mayWriteToMemory() const {
   case VPWidenLoadSC:
   case VPWidenPHISC:
   case VPWidenSC:
-  case VPWidenSelectSC: {
+  case VPWidenSelectSC:
+  case VPWidenSelectVectorSC: {
     const Instruction *I =
         dyn_cast_or_null<Instruction>(getVPSingleValue()->getUnderlyingValue());
     (void)I;
@@ -134,7 +135,8 @@ bool VPRecipeBase::mayReadFromMemory() const {
   case VPWidenIntOrFpInductionSC:
   case VPWidenPHISC:
   case VPWidenSC:
-  case VPWidenSelectSC: {
+  case VPWidenSelectSC:
+  case VPWidenSelectVectorSC: {
     const Instruction *I =
         dyn_cast_or_null<Instruction>(getVPSingleValue()->getUnderlyingValue());
     (void)I;
@@ -177,7 +179,8 @@ bool VPRecipeBase::mayHaveSideEffects() const {
   case VPWidenPHISC:
   case VPWidenPointerInductionSC:
   case VPWidenSC:
-  case VPWidenSelectSC: {
+  case VPWidenSelectSC:
+  case VPWidenSelectVectorSC: {
     const Instruction *I =
         dyn_cast_or_null<Instruction>(getVPSingleValue()->getUnderlyingValue());
     (void)I;
@@ -522,6 +525,7 @@ unsigned VPInstruction::getNumOperandsForOpcode(unsigned Opcode) {
   case VPInstruction::ActiveLaneMask:
   case VPInstruction::ComputeAnyOfResult:
   case VPInstruction::ReductionStartVector:
+  case VPInstruction::ExtractLastActive:
     return 3;
   case VPInstruction::ComputeFindIVResult:
     return 4;
@@ -983,6 +987,17 @@ Value *VPInstruction::generate(VPTransformState &State) {
   }
   case VPInstruction::ResumeForEpilogue:
     return State.get(getOperand(0), true);
+  case VPInstruction::ExtractLastActive: {
+    Value *Data = State.get(getOperand(0));
+    Value *Mask = State.get(getOperand(1));
+    Value *Default = State.get(getOperand(2), /*IsScalar=*/true);
+    Type *VTy = Data->getType();
+
+    Module *M = State.Builder.GetInsertBlock()->getModule();
+    Function *ExtractLast = Intrinsic::getOrInsertDeclaration(
+        M, Intrinsic::experimental_vector_extract_last_active, {VTy});
+    return Builder.CreateCall(ExtractLast, {Data, Mask, Default});
+  }
   default:
     llvm_unreachable("Unsupported opcode for instruction");
   }
@@ -1119,6 +1134,14 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
                                   {PredTy, Type::getInt1Ty(Ctx.LLVMCtx)});
     return Ctx.TTI.getIntrinsicInstrCost(Attrs, Ctx.CostKind);
   }
+  case VPInstruction::ExtractLastActive: {
+    Type *ScalarTy = Ctx.Types.inferScalarType(this);
+    Type *VecTy = toVectorTy(ScalarTy, VF);
+    Type *MaskTy = toVectorTy(Type::getInt1Ty(Ctx.LLVMCtx), VF);
+    IntrinsicCostAttributes ICA(Intrinsic::experimental_vector_extract_last_active,
+                                ScalarTy, {VecTy, MaskTy, ScalarTy});
+    return Ctx.TTI.getIntrinsicInstrCost(ICA, Ctx.CostKind);
+  }
   case VPInstruction::FirstOrderRecurrenceSplice: {
     assert(VF.isVector() && "Scalar FirstOrderRecurrenceSplice?");
     SmallVector<int> Mask(VF.getKnownMinValue());
@@ -1174,6 +1197,7 @@ bool VPInstruction::isVectorToScalar() const {
          getOpcode() == VPInstruction::FirstActiveLane ||
          getOpcode() == VPInstruction::ComputeAnyOfResult ||
          getOpcode() == VPInstruction::ComputeFindIVResult ||
+         getOpcode() == VPInstruction::ExtractLastActive ||
          getOpcode() == VPInstruction::ComputeReductionResult ||
          getOpcode() == VPInstruction::AnyOf;
 }
@@ -1243,6 +1267,7 @@ bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
   case VPInstruction::ExtractLastElement:
   case VPInstruction::ExtractPenultimateElement:
   case VPInstruction::FirstActiveLane:
+  case VPInstruction::ExtractLastActive:
   case VPInstruction::FirstOrderRecurrenceSplice:
   case VPInstruction::LogicalAnd:
   case VPInstruction::Not:
@@ -1414,6 +1439,9 @@ void VPInstruction::print(raw_ostream &O, const Twine &Indent,
   case VPInstruction::ResumeForEpilogue:
     O << "resume-for-epilogue";
     break;
+  case VPInstruction::ExtractLastActive:
+    O << "extract-last-active";
+    break;
   default:
     O << Instruction::getOpcodeName(getOpcode());
   }
@@ -1927,7 +1955,9 @@ void VPHistogramRecipe::print(raw_ostream &O, const Twine &Indent,
     Mask->printAsOperand(O, SlotTracker);
   }
 }
+#endif
 
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
 void VPWidenSelectRecipe::print(raw_ostream &O, const Twine &Indent,
                                 VPSlotTracker &SlotTracker) const {
   O << Indent << "WIDEN-SELECT ";
@@ -2002,6 +2032,42 @@ InstructionCost VPWidenS...
[truncated]

llvmbot · 2025-09-11T14:36:28Z

@llvm/pr-subscribers-vectorizers

         !RecurrenceDescriptor::isFindIVRecurrenceKind(Kind) &&
         "AnyOf and FindIV reductions are not allowed for in-loop reductions");
@@ -8987,6 +8998,10 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
       FinalReductionResult =
           Builder.createNaryOp(VPInstruction::ComputeAnyOfResult,
                                {PhiR, Start, NewExitingVPV}, ExitDL);
+    } else if (RecurrenceDescriptor::isFindLastRecurrenceKind(
+             RdxDesc.getRecurrenceKind())) {
+      FinalReductionResult = Builder.createNaryOp(
+          VPInstruction::ExtractLastActive, {NewExitingVPV}, ExitDL);
     } else {
       VPIRFlags Flags =
           RecurrenceDescriptor::isFloatingPointRecurrenceKind(RecurrenceKind)
@@ -9076,7 +9091,8 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
     RecurKind RK = RdxDesc.getRecurrenceKind();
     if ((!RecurrenceDescriptor::isAnyOfRecurrenceKind(RK) &&
          !RecurrenceDescriptor::isFindIVRecurrenceKind(RK) &&
-         !RecurrenceDescriptor::isMinMaxRecurrenceKind(RK))) {
+         !RecurrenceDescriptor::isMinMaxRecurrenceKind(RK) &&
+         !RecurrenceDescriptor::isFindLastRecurrenceKind(RK))) {
       VPBuilder PHBuilder(Plan->getVectorPreheader());
       VPValue *Iden = Plan->getOrAddLiveIn(
           getRecurrenceIdentity(RK, PhiTy, RdxDesc.getFastMathFlags()));
@@ -10069,6 +10085,21 @@ bool LoopVectorizePass::processLoop(Loop *L) {
   // Override IC if user provided an interleave count.
   IC = UserIC > 0 ? UserIC : IC;
 
+  // FIXME: Enable interleaving for last_active reductions.
+  if (any_of(LVL.getReductionVars(), [&](auto &Reduction) -> bool {
+    const RecurrenceDescriptor &RdxDesc = Reduction.second;
+    return RecurrenceDescriptor::isFindLastRecurrenceKind(RdxDesc.getRecurrenceKind());
+  })) {
+    LLVM_DEBUG(dbgs() << "LV: Not interleaving without vectorization due "
+                      << "to conditional scalar assignments.\n");
+    IntDiagMsg = {
+        "ConditionalAssignmentPreventsScalarInterleaving",
+        "Unable to interleave without vectorization due to conditional "
+        "assignments"};
+    InterleaveLoop = false;
+    IC = 1;
+  }
+
   // Emit diagnostic messages, if any.
   const char *VAPassName = Hints.vectorizeAnalysisPassName();
   if (!VectorizeLoop && !InterleaveLoop) {
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 75cace77ec534..7b25731af19d8 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -24868,6 +24868,7 @@ class HorizontalReduction {
         case RecurKind::FindFirstIVUMin:
         case RecurKind::FindLastIVSMax:
         case RecurKind::FindLastIVUMax:
+        case RecurKind::FindLast:
         case RecurKind::FMaxNum:
         case RecurKind::FMinNum:
         case RecurKind::FMaximumNum:
@@ -25009,6 +25010,7 @@ class HorizontalReduction {
     case RecurKind::FindFirstIVUMin:
     case RecurKind::FindLastIVSMax:
     case RecurKind::FindLastIVUMax:
+    case RecurKind::FindLast:
     case RecurKind::FMaxNum:
     case RecurKind::FMinNum:
     case RecurKind::FMaximumNum:
@@ -25115,6 +25117,7 @@ class HorizontalReduction {
     case RecurKind::FindFirstIVUMin:
     case RecurKind::FindLastIVSMax:
     case RecurKind::FindLastIVUMax:
+    case RecurKind::FindLast:
     case RecurKind::FMaxNum:
     case RecurKind::FMinNum:
     case RecurKind::FMaximumNum:
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 53291a931530f..2ffe68fedee05 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -548,6 +548,7 @@ class VPSingleDefRecipe : public VPRecipeBase, public VPValue {
     case VPRecipeBase::VPWidenIntrinsicSC:
     case VPRecipeBase::VPWidenSC:
     case VPRecipeBase::VPWidenSelectSC:
+    case VPRecipeBase::VPWidenSelectVectorSC:
     case VPRecipeBase::VPBlendSC:
     case VPRecipeBase::VPPredInstPHISC:
     case VPRecipeBase::VPCanonicalIVPHISC:
@@ -1059,6 +1060,8 @@ class LLVM_ABI_FOR_TEST VPInstruction : public VPRecipeWithIRFlags,
     ResumeForEpilogue,
     /// Returns the value for vscale.
     VScale,
+    // Extracts the last active lane based on a predicate vector operand.
+    ExtractLastActive,
   };
 
 private:
@@ -1749,6 +1752,47 @@ struct LLVM_ABI_FOR_TEST VPWidenSelectRecipe : public VPRecipeWithIRFlags,
 
   unsigned getOpcode() const { return Instruction::Select; }
 
+  VPValue *getCond() const { return getOperand(0); }
+
+  bool isInvariantCond() const {
+    return getCond()->isDefinedOutsideLoopRegions();
+  }
+
+  /// Returns true if the recipe only uses the first lane of operand \p Op.
+  bool onlyFirstLaneUsed(const VPValue *Op) const override {
+    assert(is_contained(operands(), Op) &&
+           "Op must be an operand of the recipe");
+    return Op == getCond() && isInvariantCond();
+  }
+};
+
+/// A recipe for selecting whole vector values.
+struct VPWidenSelectVectorRecipe : public VPRecipeWithIRFlags {
+  VPWidenSelectVectorRecipe(ArrayRef<VPValue *> Operands)
+      : VPRecipeWithIRFlags(VPDef::VPWidenSelectVectorSC, Operands) {}
+
+  ~VPWidenSelectVectorRecipe() override = default;
+
+  VPWidenSelectVectorRecipe *clone() override {
+    SmallVector<VPValue *, 3> Operands(operands());
+    return new VPWidenSelectVectorRecipe(Operands);
+  }
+
+  VP_CLASSOF_IMPL(VPDef::VPWidenSelectVectorSC)
+
+  /// Produce a widened version of the select instruction.
+  void execute(VPTransformState &State) override;
+
+  /// Return the cost of this VPWidenSelectVectorRecipe.
+  InstructionCost computeCost(ElementCount VF,
+                              VPCostContext &Ctx) const override;
+
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+  /// Print the recipe.
+  void print(raw_ostream &O, const Twine &Indent,
+             VPSlotTracker &SlotTracker) const override;
+#endif
+
   VPValue *getCond() const {
     return getOperand(0);
   }
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index d400ceff7797c..a299ab8593a2f 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -115,7 +115,8 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
   case VPInstruction::FirstActiveLane:
     return Type::getIntNTy(Ctx, 64);
   case VPInstruction::ExtractLastElement:
-  case VPInstruction::ExtractPenultimateElement: {
+  case VPInstruction::ExtractPenultimateElement:
+  case VPInstruction::ExtractLastActive: {
     Type *BaseTy = inferScalarType(R->getOperand(0));
     if (auto *VecTy = dyn_cast<VectorType>(BaseTy))
       return VecTy->getElementType();
@@ -308,7 +309,11 @@ Type *VPTypeAnalysis::inferScalarType(const VPValue *V) {
           })
           .Case<VPExpressionRecipe>([this](const auto *R) {
             return inferScalarType(R->getOperandOfResultType());
-          });
+          })
+          .Case<VPWidenSelectVectorRecipe>(
+              [this](const VPWidenSelectVectorRecipe *R) {
+                return inferScalarType(R->getOperand(1));
+              });
 
   assert(ResultTy && "could not infer type for the given VPValue");
   CachedTypes[V] = ResultTy;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index bf51489543098..598fa4888fe8a 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -86,7 +86,8 @@ bool VPRecipeBase::mayWriteToMemory() const {
   case VPWidenLoadSC:
   case VPWidenPHISC:
   case VPWidenSC:
-  case VPWidenSelectSC: {
+  case VPWidenSelectSC:
+  case VPWidenSelectVectorSC: {
     const Instruction *I =
         dyn_cast_or_null<Instruction>(getVPSingleValue()->getUnderlyingValue());
     (void)I;
@@ -134,7 +135,8 @@ bool VPRecipeBase::mayReadFromMemory() const {
   case VPWidenIntOrFpInductionSC:
   case VPWidenPHISC:
   case VPWidenSC:
-  case VPWidenSelectSC: {
+  case VPWidenSelectSC:
+  case VPWidenSelectVectorSC: {
     const Instruction *I =
         dyn_cast_or_null<Instruction>(getVPSingleValue()->getUnderlyingValue());
     (void)I;
@@ -177,7 +179,8 @@ bool VPRecipeBase::mayHaveSideEffects() const {
   case VPWidenPHISC:
   case VPWidenPointerInductionSC:
   case VPWidenSC:
-  case VPWidenSelectSC: {
+  case VPWidenSelectSC:
+  case VPWidenSelectVectorSC: {
     const Instruction *I =
         dyn_cast_or_null<Instruction>(getVPSingleValue()->getUnderlyingValue());
     (void)I;
@@ -522,6 +525,7 @@ unsigned VPInstruction::getNumOperandsForOpcode(unsigned Opcode) {
   case VPInstruction::ActiveLaneMask:
   case VPInstruction::ComputeAnyOfResult:
   case VPInstruction::ReductionStartVector:
+  case VPInstruction::ExtractLastActive:
     return 3;
   case VPInstruction::ComputeFindIVResult:
     return 4;
@@ -983,6 +987,17 @@ Value *VPInstruction::generate(VPTransformState &State) {
   }
   case VPInstruction::ResumeForEpilogue:
     return State.get(getOperand(0), true);
+  case VPInstruction::ExtractLastActive: {
+    Value *Data = State.get(getOperand(0));
+    Value *Mask = State.get(getOperand(1));
+    Value *Default = State.get(getOperand(2), /*IsScalar=*/true);
+    Type *VTy = Data->getType();
+
+    Module *M = State.Builder.GetInsertBlock()->getModule();
+    Function *ExtractLast = Intrinsic::getOrInsertDeclaration(
+        M, Intrinsic::experimental_vector_extract_last_active, {VTy});
+    return Builder.CreateCall(ExtractLast, {Data, Mask, Default});
+  }
   default:
     llvm_unreachable("Unsupported opcode for instruction");
   }
@@ -1119,6 +1134,14 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
                                   {PredTy, Type::getInt1Ty(Ctx.LLVMCtx)});
     return Ctx.TTI.getIntrinsicInstrCost(Attrs, Ctx.CostKind);
   }
+  case VPInstruction::ExtractLastActive: {
+    Type *ScalarTy = Ctx.Types.inferScalarType(this);
+    Type *VecTy = toVectorTy(ScalarTy, VF);
+    Type *MaskTy = toVectorTy(Type::getInt1Ty(Ctx.LLVMCtx), VF);
+    IntrinsicCostAttributes ICA(Intrinsic::experimental_vector_extract_last_active,
+                                ScalarTy, {VecTy, MaskTy, ScalarTy});
+    return Ctx.TTI.getIntrinsicInstrCost(ICA, Ctx.CostKind);
+  }
   case VPInstruction::FirstOrderRecurrenceSplice: {
     assert(VF.isVector() && "Scalar FirstOrderRecurrenceSplice?");
     SmallVector<int> Mask(VF.getKnownMinValue());
@@ -1174,6 +1197,7 @@ bool VPInstruction::isVectorToScalar() const {
          getOpcode() == VPInstruction::FirstActiveLane ||
          getOpcode() == VPInstruction::ComputeAnyOfResult ||
          getOpcode() == VPInstruction::ComputeFindIVResult ||
+         getOpcode() == VPInstruction::ExtractLastActive ||
          getOpcode() == VPInstruction::ComputeReductionResult ||
          getOpcode() == VPInstruction::AnyOf;
 }
@@ -1243,6 +1267,7 @@ bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
   case VPInstruction::ExtractLastElement:
   case VPInstruction::ExtractPenultimateElement:
   case VPInstruction::FirstActiveLane:
+  case VPInstruction::ExtractLastActive:
   case VPInstruction::FirstOrderRecurrenceSplice:
   case VPInstruction::LogicalAnd:
   case VPInstruction::Not:
@@ -1414,6 +1439,9 @@ void VPInstruction::print(raw_ostream &O, const Twine &Indent,
   case VPInstruction::ResumeForEpilogue:
     O << "resume-for-epilogue";
     break;
+  case VPInstruction::ExtractLastActive:
+    O << "extract-last-active";
+    break;
   default:
     O << Instruction::getOpcodeName(getOpcode());
   }
@@ -1927,7 +1955,9 @@ void VPHistogramRecipe::print(raw_ostream &O, const Twine &Indent,
     Mask->printAsOperand(O, SlotTracker);
   }
 }
+#endif
 
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
 void VPWidenSelectRecipe::print(raw_ostream &O, const Twine &Indent,
                                 VPSlotTracker &SlotTracker) const {
   O << Indent << "WIDEN-SELECT ";
@@ -2002,6 +2032,42 @@ InstructionCost VPWidenS...
[truncated]
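
For readers following the `ExtractLastActive` case above: `VPInstruction::generate` reads three operands (data vector, mask vector, scalar default) and passes them to `llvm.experimental.vector.extract.last.active`. Below is a minimal scalar model of that behaviour, assuming the intrinsic returns the element at the highest-numbered active lane and falls back to the default when no lane is active. `extractLastActive` is a hypothetical helper for illustration, not an LLVM API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar model (not an LLVM API): return the element of `data`
// at the highest-indexed lane whose mask bit is set, or `fallback` when no
// lane is active.
int extractLastActive(const std::vector<int> &data,
                      const std::vector<bool> &mask, int fallback) {
  assert(data.size() == mask.size() && "data and mask need the same lanes");
  // Walk the lanes from highest to lowest so the first hit is the last
  // active lane.
  for (std::size_t i = data.size(); i-- > 0;)
    if (mask[i])
      return data[i];
  return fallback;
}
```

With data `{10, 20, 30, 40}` and mask `{0, 1, 1, 0}` the model returns 30; with an all-false mask it returns the default.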

llvmbot · 2025-09-11T14:36:29Z

@llvm/pr-subscribers-backend-aarch64

Author: Graham Hunter (huntergr-arm)

Changes

Patch is 204.91 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/158088.diff
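
As a rough illustration of the pattern this PR targets, the sketch below shows a conditional scalar assignment loop and a lane-by-lane emulation of one way to vectorize it with whole-vector selects. The function names, the fixed `VF` of 4 and the chunking scheme are assumptions made for illustration. The actual transform lives in the `VPlanTransforms.cpp` changes, which are not shown in the truncated diff.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar form: `result` is only overwritten when the condition holds, so the
// loop computes "the last element that matched, else the initial value".
int lastMatchScalar(const std::vector<int> &a, int threshold, int init) {
  int result = init;
  for (int x : a)
    if (x > threshold)
      result = x; // select(cmp(x, threshold), x, result)
  return result;
}

// Illustrative vectorization factor (an assumption, not taken from the PR).
constexpr std::size_t VF = 4;

// Emulated vector form: keep the most recent chunk that had any active lane,
// together with its mask, and extract the last active lane after the loop.
int lastMatchChunked(const std::vector<int> &a, int threshold, int init) {
  std::array<int, VF> data{};  // data vector carried across iterations
  std::array<bool, VF> mask{}; // mask vector carried across iterations
  std::size_t i = 0;
  for (; i + VF <= a.size(); i += VF) {
    std::array<int, VF> newData;
    std::array<bool, VF> newMask;
    bool anyActive = false;
    for (std::size_t lane = 0; lane < VF; ++lane) {
      newData[lane] = a[i + lane];
      newMask[lane] = a[i + lane] > threshold;
      anyActive |= newMask[lane];
    }
    // Whole-vector select: replace both vectors together, or keep both.
    if (anyActive) {
      data = newData;
      mask = newMask;
    }
  }
  // Extract the last active lane, defaulting to the initial value.
  int result = init;
  for (std::size_t lane = 0; lane < VF; ++lane)
    if (mask[lane])
      result = data[lane];
  // Scalar remainder for elements that do not fill a whole chunk.
  for (; i < a.size(); ++i)
    if (a[i] > threshold)
      result = a[i];
  return result;
}
```

Both functions should agree on any input. The chunked version selects between whole vectors inside the loop rather than per lane, which is the distinction the description draws when discussing the cost model.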

19 Files Affected:

(modified) llvm/include/llvm/Analysis/IVDescriptors.h (+15)
(modified) llvm/lib/Analysis/IVDescriptors.cpp (+44-1)
(modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+8)
(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+32-1)
(modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+3)
(modified) llvm/lib/Transforms/Vectorize/VPlan.h (+44)
(modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (+7-2)
(modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+69-3)
(modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+67)
(modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+7)
(modified) llvm/lib/Transforms/Vectorize/VPlanValue.h (+1)
(added) llvm/test/Transforms/LoopVectorize/AArch64/conditional-scalar-assignment.ll (+155)
(added) llvm/test/Transforms/LoopVectorize/conditional-scalar-assignment-vplan.ll (+138)
(modified) llvm/test/Transforms/LoopVectorize/iv-select-cmp-decreasing.ll (+294-38)
(modified) llvm/test/Transforms/LoopVectorize/iv-select-cmp-no-wrap.ll (+80-8)
(modified) llvm/test/Transforms/LoopVectorize/iv-select-cmp-non-const-iv-start.ll (+321-52)
(modified) llvm/test/Transforms/LoopVectorize/iv-select-cmp-trunc.ll (+450-100)
(modified) llvm/test/Transforms/LoopVectorize/iv-select-cmp.ll (+171-19)
(modified) llvm/test/Transforms/LoopVectorize/select-cmp.ll (+121-20)

diff --git a/llvm/include/llvm/Analysis/IVDescriptors.h b/llvm/include/llvm/Analysis/IVDescriptors.h
index f9e6da6d0846a..afa175704a7b1 100644
--- a/llvm/include/llvm/Analysis/IVDescriptors.h
+++ b/llvm/include/llvm/Analysis/IVDescriptors.h
@@ -70,6 +70,9 @@ enum class RecurKind {
   FindLastIVUMax, ///< FindLast reduction with select(cmp(),x,y) where one of
                   ///< (x,y) is increasing loop induction, and both x and y
                   ///< are integer type, producing a UMax reduction.
+  FindLast,       ///< FindLast reduction with select(cmp(),x,y) where x and y
+                  ///< can be any scalar type, one is the current recurrence
+                  ///< value, and the other is an arbitrary value.
   // clang-format on
   // TODO: Any_of and FindLast reduction need not be restricted to integer type
   // only.
@@ -183,6 +186,12 @@ class RecurrenceDescriptor {
                                            PHINode *OrigPhi, Instruction *I,
                                            ScalarEvolution &SE);
 
+  /// Returns a struct describing whether the instruction is of the form
+  ///  Select(Cmp(A, B), X, Y)
+  /// where one of (X, Y) is the Phi value and the other is an arbitrary value.
+  LLVM_ABI static InstDesc isFindLastPattern(Instruction *I, PHINode *Phi,
+                                             Loop *TheLoop);
+
   /// Returns a struct describing if the instruction is a
   /// Select(FCmp(X, Y), (Z = X op PHINode), PHINode) instruction pattern.
   LLVM_ABI static InstDesc isConditionalRdxPattern(Instruction *I);
@@ -299,6 +308,12 @@ class RecurrenceDescriptor {
            isFindLastIVRecurrenceKind(Kind);
   }
 
+  /// Returns true if the recurrence kind is of the form
+  ///   select(cmp(),x,y) where one of (x,y) is an arbitrary value.
+  static bool isFindLastRecurrenceKind(RecurKind Kind) {
+    return Kind == RecurKind::FindLast;
+  }
+
   /// Returns the type of the recurrence. This type can be narrower than the
   /// actual type of the Phi if the recurrence has been type-promoted.
   Type *getRecurrenceType() const { return RecurrenceType; }
diff --git a/llvm/lib/Analysis/IVDescriptors.cpp b/llvm/lib/Analysis/IVDescriptors.cpp
index b8c540ce4b99d..bd87e9de46bd5 100644
--- a/llvm/lib/Analysis/IVDescriptors.cpp
+++ b/llvm/lib/Analysis/IVDescriptors.cpp
@@ -56,6 +56,8 @@ bool RecurrenceDescriptor::isIntegerRecurrenceKind(RecurKind Kind) {
   case RecurKind::FindFirstIVUMin:
   case RecurKind::FindLastIVSMax:
   case RecurKind::FindLastIVUMax:
+  // TODO: Make type-agnostic.
+  case RecurKind::FindLast:
     return true;
   }
   return false;
@@ -426,6 +428,8 @@ bool RecurrenceDescriptor::AddReductionVar(
       ++NumCmpSelectPatternInst;
     if (isAnyOfRecurrenceKind(Kind) && IsASelect)
       ++NumCmpSelectPatternInst;
+    if (isFindLastRecurrenceKind(Kind) && IsASelect)
+      ++NumCmpSelectPatternInst;
 
     // Check  whether we found a reduction operator.
     FoundReduxOp |= !IsAPhi && Cur != Start;
@@ -789,6 +793,38 @@ RecurrenceDescriptor::isFindIVPattern(RecurKind Kind, Loop *TheLoop,
   return InstDesc(false, I);
 }
 
+RecurrenceDescriptor::InstDesc
+RecurrenceDescriptor::isFindLastPattern(Instruction *I, PHINode *Phi,
+                                        Loop *TheLoop) {
+  // Must be a scalar.
+  Type *Type = Phi->getType();
+  if (!Type->isIntegerTy() && !Type->isFloatingPointTy() &&
+      !Type->isPointerTy())
+    return InstDesc(false, I);
+
+  SelectInst *Select = dyn_cast<SelectInst>(I);
+  if (!Select)
+    return InstDesc(false, I);
+
+  // FIXME: Support more complex patterns, including multiple selects.
+  // Phi or Select must be used only outside the loop,
+  // except for each other.
+  auto IsOnlyUsedOutsideLoop = [&](Value *V, Value *Ignore) {
+    return all_of(V->users(), [Ignore, TheLoop](User *U) {
+      if (U == Ignore)
+        return true;
+      if (auto *I = dyn_cast<Instruction>(U))
+        return !TheLoop->contains(I);
+      return false;
+    });
+  };
+  if (!IsOnlyUsedOutsideLoop(Phi, Select) ||
+      !IsOnlyUsedOutsideLoop(Select, Phi))
+    return InstDesc(false, I);
+
+  return InstDesc(I, RecurKind::FindLast);
+}
+
 RecurrenceDescriptor::InstDesc
 RecurrenceDescriptor::isMinMaxPattern(Instruction *I, RecurKind Kind,
                                       const InstDesc &Prev) {
@@ -927,6 +963,8 @@ RecurrenceDescriptor::InstDesc RecurrenceDescriptor::isRecurrenceInstr(
       return isConditionalRdxPattern(I);
     if (isFindIVRecurrenceKind(Kind) && SE)
       return isFindIVPattern(Kind, L, OrigPhi, I, *SE);
+    if (isFindLastRecurrenceKind(Kind))
+      return isFindLastPattern(I, OrigPhi, L);
     [[fallthrough]];
   case Instruction::FCmp:
   case Instruction::ICmp:
@@ -1123,7 +1161,11 @@ bool RecurrenceDescriptor::isReductionPHI(PHINode *Phi, Loop *TheLoop,
                       << "\n");
     return true;
   }
-
+  if (AddReductionVar(Phi, RecurKind::FindLast, TheLoop, FMF, RedDes, DB, AC, DT,
+                      SE)) {
+    LLVM_DEBUG(dbgs() << "Found a FindLast reduction PHI." << *Phi << "\n");
+    return true;
+  }
   // Not a reduction of known type.
   return false;
 }
@@ -1245,6 +1287,7 @@ unsigned RecurrenceDescriptor::getOpcode(RecurKind Kind) {
   case RecurKind::SMin:
   case RecurKind::UMax:
   case RecurKind::UMin:
+  case RecurKind::FindLast:
     return Instruction::ICmp;
   case RecurKind::FMax:
   case RecurKind::FMin:
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 92321a76dbd80..6595c6e770be0 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -1004,6 +1004,13 @@ AArch64TTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
     }
     break;
   }
+  case Intrinsic::experimental_vector_extract_last_active:
+    if (ST->isSVEAvailable()) {
+      auto [LegalCost, _] = getTypeLegalizationCost(ICA.getArgTypes()[0]);
+      // This should turn into chained clastb instructions.
+      return LegalCost;
+    }
+    break;
   default:
     break;
   }
@@ -5325,6 +5332,7 @@ bool AArch64TTIImpl::isLegalToVectorizeReduction(
   case RecurKind::FMax:
   case RecurKind::FMulAdd:
   case RecurKind::AnyOf:
+  case RecurKind::FindLast:
     return true;
   default:
     return false;
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index b4acda80cfb93..ea85685cdf7b8 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4047,6 +4047,7 @@ static bool willGenerateVectors(VPlan &Plan, ElementCount VF,
       case VPDef::VPWidenIntrinsicSC:
       case VPDef::VPWidenSC:
       case VPDef::VPWidenSelectSC:
+      case VPDef::VPWidenSelectVectorSC:
       case VPDef::VPBlendSC:
       case VPDef::VPFirstOrderRecurrencePHISC:
       case VPDef::VPHistogramSC:
@@ -4546,6 +4547,11 @@ LoopVectorizationPlanner::selectInterleaveCount(VPlan &Plan, ElementCount VF,
       any_of(Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis(),
              IsaPred<VPReductionPHIRecipe>);
 
+  // FIXME: implement interleaving for FindLast transform correctly.
+  for (auto &[_, RdxDesc] : Legal->getReductionVars())
+    if (RecurrenceDescriptor::isFindLastRecurrenceKind(RdxDesc.getRecurrenceKind()))
+      return 1;
+
   // If we did not calculate the cost for VF (because the user selected the VF)
   // then we calculate the cost of VF here.
   if (LoopCost == 0) {
@@ -8687,6 +8693,10 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(
                                 *Plan, Builder))
     return nullptr;
 
+  // Create whole-vector selects for find-last recurrences.
+  VPlanTransforms::runPass(VPlanTransforms::convertFindLastRecurrences,
+                           *Plan, RecipeBuilder, Legal);
+
   if (useActiveLaneMask(Style)) {
     // TODO: Move checks to VPlanTransforms::addActiveLaneMask once
     // TailFoldingStyle is visible there.
@@ -8779,6 +8789,7 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
 
     RecurKind Kind = PhiR->getRecurrenceKind();
     assert(
+        !RecurrenceDescriptor::isFindLastRecurrenceKind(Kind) &&
         !RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind) &&
         !RecurrenceDescriptor::isFindIVRecurrenceKind(Kind) &&
         "AnyOf and FindIV reductions are not allowed for in-loop reductions");
@@ -8987,6 +8998,10 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
       FinalReductionResult =
           Builder.createNaryOp(VPInstruction::ComputeAnyOfResult,
                                {PhiR, Start, NewExitingVPV}, ExitDL);
+    } else if (RecurrenceDescriptor::isFindLastRecurrenceKind(
+             RdxDesc.getRecurrenceKind())) {
+      FinalReductionResult = Builder.createNaryOp(
+          VPInstruction::ExtractLastActive, {NewExitingVPV}, ExitDL);
     } else {
       VPIRFlags Flags =
           RecurrenceDescriptor::isFloatingPointRecurrenceKind(RecurrenceKind)
@@ -9076,7 +9091,8 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
     RecurKind RK = RdxDesc.getRecurrenceKind();
     if ((!RecurrenceDescriptor::isAnyOfRecurrenceKind(RK) &&
          !RecurrenceDescriptor::isFindIVRecurrenceKind(RK) &&
-         !RecurrenceDescriptor::isMinMaxRecurrenceKind(RK))) {
+         !RecurrenceDescriptor::isMinMaxRecurrenceKind(RK) &&
+         !RecurrenceDescriptor::isFindLastRecurrenceKind(RK))) {
       VPBuilder PHBuilder(Plan->getVectorPreheader());
       VPValue *Iden = Plan->getOrAddLiveIn(
           getRecurrenceIdentity(RK, PhiTy, RdxDesc.getFastMathFlags()));
@@ -10069,6 +10085,21 @@ bool LoopVectorizePass::processLoop(Loop *L) {
   // Override IC if user provided an interleave count.
   IC = UserIC > 0 ? UserIC : IC;
 
+  // FIXME: Enable interleaving for last_active reductions.
+  if (any_of(LVL.getReductionVars(), [&](auto &Reduction) -> bool {
+    const RecurrenceDescriptor &RdxDesc = Reduction.second;
+    return RecurrenceDescriptor::isFindLastRecurrenceKind(RdxDesc.getRecurrenceKind());
+  })) {
+    LLVM_DEBUG(dbgs() << "LV: Not interleaving without vectorization due "
+                      << "to conditional scalar assignments.\n");
+    IntDiagMsg = {
+        "ConditionalAssignmentPreventsScalarInterleaving",
+        "Unable to interleave without vectorization due to conditional "
+        "assignments"};
+    InterleaveLoop = false;
+    IC = 1;
+  }
+
   // Emit diagnostic messages, if any.
   const char *VAPassName = Hints.vectorizeAnalysisPassName();
   if (!VectorizeLoop && !InterleaveLoop) {
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 75cace77ec534..7b25731af19d8 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -24868,6 +24868,7 @@ class HorizontalReduction {
         case RecurKind::FindFirstIVUMin:
         case RecurKind::FindLastIVSMax:
         case RecurKind::FindLastIVUMax:
+        case RecurKind::FindLast:
         case RecurKind::FMaxNum:
         case RecurKind::FMinNum:
         case RecurKind::FMaximumNum:
@@ -25009,6 +25010,7 @@ class HorizontalReduction {
     case RecurKind::FindFirstIVUMin:
     case RecurKind::FindLastIVSMax:
     case RecurKind::FindLastIVUMax:
+    case RecurKind::FindLast:
     case RecurKind::FMaxNum:
     case RecurKind::FMinNum:
     case RecurKind::FMaximumNum:
@@ -25115,6 +25117,7 @@ class HorizontalReduction {
     case RecurKind::FindFirstIVUMin:
     case RecurKind::FindLastIVSMax:
     case RecurKind::FindLastIVUMax:
+    case RecurKind::FindLast:
     case RecurKind::FMaxNum:
     case RecurKind::FMinNum:
     case RecurKind::FMaximumNum:
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 53291a931530f..2ffe68fedee05 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -548,6 +548,7 @@ class VPSingleDefRecipe : public VPRecipeBase, public VPValue {
     case VPRecipeBase::VPWidenIntrinsicSC:
     case VPRecipeBase::VPWidenSC:
     case VPRecipeBase::VPWidenSelectSC:
+    case VPRecipeBase::VPWidenSelectVectorSC:
     case VPRecipeBase::VPBlendSC:
     case VPRecipeBase::VPPredInstPHISC:
     case VPRecipeBase::VPCanonicalIVPHISC:
@@ -1059,6 +1060,8 @@ class LLVM_ABI_FOR_TEST VPInstruction : public VPRecipeWithIRFlags,
     ResumeForEpilogue,
     /// Returns the value for vscale.
     VScale,
+    // Extracts the last active lane based on a predicate vector operand.
+    ExtractLastActive,
   };
 
 private:
@@ -1749,6 +1752,47 @@ struct LLVM_ABI_FOR_TEST VPWidenSelectRecipe : public VPRecipeWithIRFlags,
 
   unsigned getOpcode() const { return Instruction::Select; }
 
+  VPValue *getCond() const { return getOperand(0); }
+
+  bool isInvariantCond() const {
+    return getCond()->isDefinedOutsideLoopRegions();
+  }
+
+  /// Returns true if the recipe only uses the first lane of operand \p Op.
+  bool onlyFirstLaneUsed(const VPValue *Op) const override {
+    assert(is_contained(operands(), Op) &&
+           "Op must be an operand of the recipe");
+    return Op == getCond() && isInvariantCond();
+  }
+};
+
+/// A recipe for selecting whole vector values.
+struct VPWidenSelectVectorRecipe : public VPRecipeWithIRFlags {
+  VPWidenSelectVectorRecipe(ArrayRef<VPValue *> Operands)
+      : VPRecipeWithIRFlags(VPDef::VPWidenSelectVectorSC, Operands) {}
+
+  ~VPWidenSelectVectorRecipe() override = default;
+
+  VPWidenSelectVectorRecipe *clone() override {
+    SmallVector<VPValue *, 3> Operands(operands());
+    return new VPWidenSelectVectorRecipe(Operands);
+  }
+
+  VP_CLASSOF_IMPL(VPDef::VPWidenSelectVectorSC)
+
+  /// Produce a widened version of the select instruction.
+  void execute(VPTransformState &State) override;
+
+  /// Return the cost of this VPWidenSelectVectorRecipe.
+  InstructionCost computeCost(ElementCount VF,
+                              VPCostContext &Ctx) const override;
+
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+  /// Print the recipe.
+  void print(raw_ostream &O, const Twine &Indent,
+             VPSlotTracker &SlotTracker) const override;
+#endif
+
   VPValue *getCond() const {
     return getOperand(0);
   }
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index d400ceff7797c..a299ab8593a2f 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -115,7 +115,8 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
   case VPInstruction::FirstActiveLane:
     return Type::getIntNTy(Ctx, 64);
   case VPInstruction::ExtractLastElement:
-  case VPInstruction::ExtractPenultimateElement: {
+  case VPInstruction::ExtractPenultimateElement:
+  case VPInstruction::ExtractLastActive: {
     Type *BaseTy = inferScalarType(R->getOperand(0));
     if (auto *VecTy = dyn_cast<VectorType>(BaseTy))
       return VecTy->getElementType();
@@ -308,7 +309,11 @@ Type *VPTypeAnalysis::inferScalarType(const VPValue *V) {
           })
           .Case<VPExpressionRecipe>([this](const auto *R) {
             return inferScalarType(R->getOperandOfResultType());
-          });
+          })
+          .Case<VPWidenSelectVectorRecipe>(
+              [this](const VPWidenSelectVectorRecipe *R) {
+                return inferScalarType(R->getOperand(1));
+              });
 
   assert(ResultTy && "could not infer type for the given VPValue");
   CachedTypes[V] = ResultTy;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index bf51489543098..598fa4888fe8a 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -86,7 +86,8 @@ bool VPRecipeBase::mayWriteToMemory() const {
   case VPWidenLoadSC:
   case VPWidenPHISC:
   case VPWidenSC:
-  case VPWidenSelectSC: {
+  case VPWidenSelectSC:
+  case VPWidenSelectVectorSC: {
     const Instruction *I =
         dyn_cast_or_null<Instruction>(getVPSingleValue()->getUnderlyingValue());
     (void)I;
@@ -134,7 +135,8 @@ bool VPRecipeBase::mayReadFromMemory() const {
   case VPWidenIntOrFpInductionSC:
   case VPWidenPHISC:
   case VPWidenSC:
-  case VPWidenSelectSC: {
+  case VPWidenSelectSC:
+  case VPWidenSelectVectorSC: {
     const Instruction *I =
         dyn_cast_or_null<Instruction>(getVPSingleValue()->getUnderlyingValue());
     (void)I;
@@ -177,7 +179,8 @@ bool VPRecipeBase::mayHaveSideEffects() const {
   case VPWidenPHISC:
   case VPWidenPointerInductionSC:
   case VPWidenSC:
-  case VPWidenSelectSC: {
+  case VPWidenSelectSC:
+  case VPWidenSelectVectorSC: {
     const Instruction *I =
         dyn_cast_or_null<Instruction>(getVPSingleValue()->getUnderlyingValue());
     (void)I;
@@ -522,6 +525,7 @@ unsigned VPInstruction::getNumOperandsForOpcode(unsigned Opcode) {
   case VPInstruction::ActiveLaneMask:
   case VPInstruction::ComputeAnyOfResult:
   case VPInstruction::ReductionStartVector:
+  case VPInstruction::ExtractLastActive:
     return 3;
   case VPInstruction::ComputeFindIVResult:
     return 4;
@@ -983,6 +987,17 @@ Value *VPInstruction::generate(VPTransformState &State) {
   }
   case VPInstruction::ResumeForEpilogue:
     return State.get(getOperand(0), true);
+  case VPInstruction::ExtractLastActive: {
+    Value *Data = State.get(getOperand(0));
+    Value *Mask = State.get(getOperand(1));
+    Value *Default = State.get(getOperand(2), /*IsScalar=*/true);
+    Type *VTy = Data->getType();
+
+    Module *M = State.Builder.GetInsertBlock()->getModule();
+    Function *ExtractLast = Intrinsic::getOrInsertDeclaration(
+        M, Intrinsic::experimental_vector_extract_last_active, {VTy});
+    return Builder.CreateCall(ExtractLast, {Data, Mask, Default});
+  }
   default:
     llvm_unreachable("Unsupported opcode for instruction");
   }
@@ -1119,6 +1134,14 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
                                   {PredTy, Type::getInt1Ty(Ctx.LLVMCtx)});
     return Ctx.TTI.getIntrinsicInstrCost(Attrs, Ctx.CostKind);
   }
+  case VPInstruction::ExtractLastActive: {
+    Type *ScalarTy = Ctx.Types.inferScalarType(this);
+    Type *VecTy = toVectorTy(ScalarTy, VF);
+    Type *MaskTy = toVectorTy(Type::getInt1Ty(Ctx.LLVMCtx), VF);
+    IntrinsicCostAttributes ICA(Intrinsic::experimental_vector_extract_last_active,
+                                ScalarTy, {VecTy, MaskTy, ScalarTy});
+    return Ctx.TTI.getIntrinsicInstrCost(ICA, Ctx.CostKind);
+  }
   case VPInstruction::FirstOrderRecurrenceSplice: {
     assert(VF.isVector() && "Scalar FirstOrderRecurrenceSplice?");
     SmallVector<int> Mask(VF.getKnownMinValue());
@@ -1174,6 +1197,7 @@ bool VPInstruction::isVectorToScalar() const {
          getOpcode() == VPInstruction::FirstActiveLane ||
          getOpcode() == VPInstruction::ComputeAnyOfResult ||
          getOpcode() == VPInstruction::ComputeFindIVResult ||
+         getOpcode() == VPInstruction::ExtractLastActive ||
          getOpcode() == VPInstruction::ComputeReductionResult ||
          getOpcode() == VPInstruction::AnyOf;
 }
@@ -1243,6 +1267,7 @@ bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
   case VPInstruction::ExtractLastElement:
   case VPInstruction::ExtractPenultimateElement:
   case VPInstruction::FirstActiveLane:
+  case VPInstruction::ExtractLastActive:
   case VPInstruction::FirstOrderRecurrenceSplice:
   case VPInstruction::LogicalAnd:
   case VPInstruction::Not:
@@ -1414,6 +1439,9 @@ void VPInstruction::print(raw_ostream &O, const Twine &Indent,
   case VPInstruction::ResumeForEpilogue:
     O << "resume-for-epilogue";
     break;
+  case VPInstruction::ExtractLastActive:
+    O << "extract-last-active";
+    break;
   default:
     O << Instruction::getOpcodeName(getOpcode());
   }
@@ -1927,7 +1955,9 @@ void VPHistogramRecipe::print(raw_ostream &O, const Twine &Indent,
     Mask->printAsOperand(O, SlotTracker);
   }
 }
+#endif
 
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
 void VPWidenSelectRecipe::print(raw_ostream &O, const Twine &Indent,
                                 VPSlotTracker &SlotTracker) const {
   O << Indent << "WIDEN-SELECT ";
@@ -2002,6 +2032,42 @@ InstructionCost VPWidenS...
[truncated]

github-actions · 2025-09-11T14:39:39Z

✅ With the latest revision this PR passed the C/C++ code formatter.

Mel-Chen

Thank you. I've reviewed part of the code, and will continue the review in October, as I'll be on vacation until the end of September.

Mel-Chen · 2025-09-16T09:48:44Z

llvm/include/llvm/Analysis/IVDescriptors.h

                  ///< (x,y) is increasing loop induction, and both x and y
                  ///< are integer type, producing a UMax reduction.
+  FindLast,       ///< FindLast reduction with select(cmp(),x,y) where x and y
+                  ///< can be any scalar type, one is the current recurrence


Since isIntegerRecurrenceKind returns true for FindLast, I suggest

Suggested change

///< can be any scalar type, one is the current recurrence

///< are integer type, one is the current recurrence

Mel-Chen · 2025-09-16T09:51:33Z

llvm/lib/Analysis/IVDescriptors.cpp

+    if (isFindLastRecurrenceKind(Kind) && IsASelect)
+      ++NumCmpSelectPatternInst;


Why do we need this?

Mel-Chen · 2025-09-16T09:54:42Z

llvm/lib/Analysis/IVDescriptors.cpp

+RecurrenceDescriptor::isFindLastPattern(Instruction *I, PHINode *Phi,
+                                        Loop *TheLoop) {


Could we reuse RecurrenceDescriptor::isFindIVPattern?

Mel-Chen · 2025-10-07T09:51:39Z

llvm/lib/Analysis/IVDescriptors.cpp

  case RecurKind::SMin:
  case RecurKind::UMax:
  case RecurKind::UMin:
+  case RecurKind::FindLast:


#162252
icmp and fcmp shouldn’t be part of the recurrence chain. I think the opcode should return Instruction::Select instead.

huntergr-arm · 2025-10-21T15:14:13Z

Rebased, addressed comments. Thanks for reviewing :)

MacDue

Some initial comments (mostly nitpicks), I've not looked at everything in full detail yet, but the general concept makes sense to me.

llvm/lib/Analysis/IVDescriptors.cpp

MacDue · 2025-10-22T14:31:17Z

llvm/lib/Analysis/IVDescriptors.cpp

                                     m_Value(NonRdxPhi)))))
    return InstDesc(false, I);

+  if (isFindLastRecurrenceKind(Kind)) {


Above (not this line), this function is called isFindIVPattern and has a large comment explaining that with an example. Maybe update that documentation to explain the FindLast recurrence case (and possibly rename the function to account for that too?).

There's also the comment above starting "We are looking for selects of the form:", which should now include the FindLast case.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/lib/Transforms/Vectorize/VPlan.h

llvm/include/llvm/Analysis/IVDescriptors.h

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

fhahn

I will be doing some performance runs on AArch64 to figure out
the cost model, as we mostly regard vector selects as per-lane
instead of selecting the whole vector at once.

Did you already get a chance to run preliminary benchmarks? Would be very curious what the impact is, possibly for some microbenchmarks (https://github.com/llvm/llvm-test-suite/tree/main/MicroBenchmarks/LoopVectorization would be a good place)

fhahn · 2025-10-29T03:26:55Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

    }
    break;
  }
+  case Intrinsic::experimental_vector_extract_last_active:


Could this be split off, with dedicated cost model tests?

Split out as #165739

fhahn · 2025-10-29T04:09:20Z

llvm/lib/Transforms/Vectorize/VPlan.h

  }
 };

+/// A recipe for selecting whole vector values based on a scalar condition.


Do we need a dedicated recipe for this? If we have a scalar condition, broadcasting should be fine, although more verbose than needed?

We should be able to handle single-scalar conditions as well, which may be enough? #165506 (comment)

No, we don't need a dedicated recipe. I'll rebase on top of that other PR when available.

llvm/include/llvm/Analysis/IVDescriptors.h

MacDue · 2025-10-31T13:06:21Z

llvm/lib/Analysis/IVDescriptors.cpp

                                     m_Value(NonRdxPhi)))))
    return InstDesc(false, I);

+  if (isFindLastRecurrenceKind(Kind)) {


There's also the comment above starting "We are looking for selects of the form:", which should now include the FindLast case.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

huntergr-arm · 2025-10-31T16:32:51Z

Rebased on top of the cost model change, and I removed the extra select recipe in favor of the broadcast approach. We can change that to the scalar condition later.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/lib/Analysis/IVDescriptors.cpp

llvm/test/Transforms/LoopVectorize/iv-select-cmp-decreasing.ll

llvm/test/Transforms/LoopVectorize/iv-select-cmp-trunc.ll

MacDue · 2025-11-03T11:09:31Z

llvm/test/Transforms/LoopVectorize/iv-select-cmp-trunc.ll

-; CHECK-VF4IC1:       [[FOR_BODY]]:
-; CHECK-VF4IC1-NEXT:    [[IV:%.*]] = phi i64 [ 4294967294, %[[ENTRY]] ], [ [[INC:%.*]], %[[FOR_BODY]] ]
-; CHECK-VF4IC1-NEXT:    [[RDX:%.*]] = phi i32 [ 331, %[[ENTRY]] ], [ [[SPEC_SELECT:%.*]], %[[FOR_BODY]] ]
+; CHECK-VF4IC1-NEXT:  [[ENTRY:.*:]]


Again here.

I think the name/comment for not_vectorized_select_icmp_truncated_iv_out_of_bound missed being updated.

MacDue · 2025-11-03T11:13:35Z

llvm/test/Transforms/LoopVectorize/AArch64/conditional-scalar-assignment.ll

+; RUN: opt -passes=loop-vectorize,instcombine -mattr=+sve -S < %s 2>&1 | FileCheck %s --check-prefix=SVE
+
+target triple = "aarch64-linux-gnu"
+


How about testing a pointer type for the data phi, and some negative tests for the cases we bail out for in isFindPattern (e.g. multiple users in the loop)?

Done. Added tests for pointers and floats, then multiple in-loop users of the select, multiple users of the compare, and chained selects.

huntergr-arm · 2025-11-07T15:50:51Z

I will be doing some performance runs on AArch64 to figure out
the cost model, as we mostly regard vector selects as per-lane
instead of selecting the whole vector at once.

Did you already get a chance to run preliminary benchmarks? Would be very curious what the impact is, possibly for some microbenchmarks (https://github.com/llvm/llvm-test-suite/tree/main/MicroBenchmarks/LoopVectorization would be a good place)

So running with SVE enabled, we vectorize 7 loops in find-last.cpp from the test suite. I don't see any noticeable performance impact from the vectorization, but total execution time is on the order of a couple of milliseconds, so I maybe need to increase the amount of data for a microbenchmark to see anything.

That said, this is the more generic form of this reduction. I intend to follow this part up with an in-loop version of the reduction, so that targets with SVE can just use clastb inside the loop instead of maintaining the extra phi for the mask.
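To make the in-loop idea concrete, here is a minimal scalar sketch, not the PR's implementation. It assumes a fixed width of 4 lanes, a trip count that is a multiple of that width, and a hypothetical helper `lastActiveOr` that models a CLASTB-style step (take the last active lane, otherwise keep the incoming scalar). With that step inside the loop, the running scalar is the only loop-carried state and no mask phi is needed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t VF = 4; // assumed fixed vector width for this sketch

// Hypothetical helper modelling a CLASTB-style step: return the element in
// the last active lane, or the incoming scalar when no lane is active.
int lastActiveOr(const std::array<int, VF> &Data,
                 const std::array<bool, VF> &Mask, int Fallback) {
  for (std::size_t Lane = VF; Lane-- > 0;)
    if (Mask[Lane])
      return Data[Lane];
  return Fallback;
}

// In-loop FindLast: R is updated once per "vector iteration".
int findLastInLoop(const std::vector<int> &Src, int Threshold, int Start) {
  assert(Src.size() % VF == 0 && "sketch assumes no scalar tail");
  int R = Start;
  for (std::size_t I = 0; I < Src.size(); I += VF) {
    std::array<int, VF> Data{};
    std::array<bool, VF> Mask{};
    for (std::size_t Lane = 0; Lane < VF; ++Lane) {
      Data[Lane] = Src[I + Lane];
      Mask[Lane] = Src[I + Lane] > Threshold;
    }
    R = lastActiveOr(Data, Mask, R);
  }
  return R;
}
```

The trade-off hinted at in the comment above is that this form puts an extract on the loop-carried dependency chain, which is only attractive when the target has a cheap instruction for it.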

I'll try some spec runs and see if there's any interesting results there.

MacDue

Generally LGTM. Thanks for addressing my comments!

Needing to add the VPLastActiveMaskPHIRecipe is a bit unfortunate, but I don't have another suggestion (maybe other reviewers do).

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

sdesmalen-arm

Left a few nits, but otherwise looks reasonable to me.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

sdesmalen-arm · 2025-11-11T10:17:00Z

llvm/lib/Analysis/IVDescriptors.cpp

  case RecurKind::FindFirstIVUMin:
  case RecurKind::FindLastIVSMax:
  case RecurKind::FindLastIVUMax:
+  // TODO: Make type-agnostic.


If this is not type-agnostic, should this be reflected in the name of the recurrence kind?

It isn't for the other FindFirst/FindLast (though that might be inferred by U/S Min/Max) or AnyOf. I think it's just the fp-based reduction types that are prefixed with an extra F.

I did experiment with treating FindLast separately in AddReductionVar when it checks the type and everything was fine, but decided to leave that out of the initial patch.

sdesmalen-arm · 2025-11-11T10:21:33Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

  // Override IC if user provided an interleave count.
  IC = UserIC > 0 ? UserIC : IC;

+  // FIXME: Enable interleaving for last_active reductions.


What would be required to enable interleaving?

Completing the final reduction outside of the loop, especially the mask phis.

llvm/lib/Transforms/Vectorize/VPlan.h

fhahn

So running with SVE enabled, we vectorize 7 loops in find-last.cpp from the test suite. I don't see any noticeable performance impact from the vectorization, but total execution time is on the order of a couple of milliseconds, so I maybe need to increase the amount of data for a microbenchmark to see anything.

Yes that is only intended to test correctness; as I mentioned earlier, it would be good to add some MicroBenchmarks, to make this easier to measure

fhahn · 2025-11-11T10:30:07Z

llvm/lib/Analysis/IVDescriptors.cpp

+    for (User *U : I->users()) {
+      if (U == OrigPhi)
+        continue;
+      if (auto *UI = dyn_cast<Instruction>(U); UI && !TheLoop->contains(UI))


I think we generally avoid this kind of pattern in LV/IVDesc.

Removed. This was basically a (mutated) leftover from Michael's code, since it had entirely separate analysis logic.

Is the FIXME still relevant then?

// FIXME: Support more complex patterns, including multiple selects.
// The Select must be used only outside the loop and by the PHI.

Yes, though some (all?) of the changes might not be in that area of code. The extra unit tests cover some of the patterns we might want.

fhahn · 2025-11-11T10:32:32Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

          continue;
        }
      }
+    } else if (isa<VPLastActiveMaskPHIRecipe>(R)) {


why is this needed? Epilogue vectorization is disabled, right?

I added that before disabling it entirely. Removed (along with the recipe)

fhahn · 2025-11-11T10:34:46Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

+  for (const auto &[Phi, RdxDesc] : Legal->getReductionVars()) {
+    if (RecurrenceDescriptor::isFindLastRecurrenceKind(
+            RdxDesc.getRecurrenceKind())) {
+      VPRecipeBase *PhiR = RecipeBuilder.getRecipe(Phi);


Please avoid going through Legal/RecipeBuilder here if possible and instead use the information directly available in VPlan.

fhahn · 2025-11-11T10:35:23Z

llvm/lib/Transforms/Vectorize/VPlan.h

 #endif
 };

+// TODO: Can we unify the PHI recipe hierarchy a bit? VPPredInstPHISC is close


This just lowers to a wide phi, right? Can you just use VPWidenPHIRecipe?

It does, but VPWidenPHIRecipe::clone casts the underlying value to a PHINode. Since there isn't one, that will assert if/when we enable e.g. interleaving. I think that's the same for VPActiveLaneMaskPHIRecipe, which has a comment noting it would be good to drop the specialized recipe in favour of VPWidenPHIRecipe.

I could adjust ::clone() to work (using cast_if_present) if that works for you? I'm not sure if there are any other uses of the underlying value for those recipes; I'll try it and see if any test fails.

It seems that there isn't a computeCost method for VPWidenPHIRecipe yet either.

done; assuming you're happy with VPWidenPHIRecipe no longer requiring an underlying value.

fhahn · 2025-11-11T10:36:02Z

llvm/test/Transforms/LoopVectorize/conditional-scalar-assignment-vplan.ll

+; CHECK-NEXT:     EMIT vp<%3> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
+; CHECK-NEXT:     WIDEN-REDUCTION-PHI ir<%data.phi> = phi ir<-1>, vp<%9>
+; CHECK-NEXT:     LAST-ACTIVE-MASK-PHI vp<%4> = phi ir<false>, vp<%8>
+; CHECK-NEXT:     vp<%5> = SCALAR-STEPS vp<%3>, ir<1>, vp<%0>
+; CHECK-NEXT:     CLONE ir<%ld.addr> = getelementptr inbounds ir<%data>, vp<%5>
+; CHECK-NEXT:     vp<%6> = vector-pointer ir<%ld.addr>
+; CHECK-NEXT:     WIDEN ir<%ld> = load vp<%6>
+; CHECK-NEXT:     WIDEN ir<%select.cmp> = icmp slt ir<%a>, ir<%ld>
+; CHECK-NEXT:     EMIT vp<%7> = any-of ir<%select.cmp>
+; CHECK-NEXT:     EMIT vp<%8> = select vp<%7>, ir<%select.cmp>, vp<%4>
+; CHECK-NEXT:     EMIT vp<%9> = select vp<%7>, ir<%ld>, ir<%data.phi>


Please use patterns for the unnamed vpvalues throughout
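For readers following the VPlan dump above, the sketch below models the same dataflow in plain C++. It is an illustration under assumptions, not generated code: the width is fixed at 4, the trip count is a multiple of the width, and the names `DataPhi`, `MaskPhi` and `findLastWithMaskPhi` are made up here. The two whole-vector selects keyed on the scalar `any-of` result become a single `if`, and the final extraction after the loop plays the role of the extract-last-active step.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t VF = 4; // assumed fixed vector width
using IntVec = std::array<int, VF>;
using MaskVec = std::array<bool, VF>;

int findLastWithMaskPhi(const std::vector<int> &Data, int A, int Start) {
  assert(Data.size() % VF == 0 && "sketch assumes no scalar tail");
  IntVec DataPhi;
  DataPhi.fill(Start); // data phi starts at the reduction start value
  MaskVec MaskPhi{};   // mask phi starts as all-false

  for (std::size_t I = 0; I < Data.size(); I += VF) {
    IntVec Ld{};
    MaskVec Cmp{};
    bool AnyOf = false;
    for (std::size_t Lane = 0; Lane < VF; ++Lane) {
      Ld[Lane] = Data[I + Lane];
      Cmp[Lane] = A < Ld[Lane]; // icmp slt %a, %ld
      AnyOf = AnyOf || Cmp[Lane];
    }
    // Whole-vector selects on the scalar any-of condition: either both phis
    // take this iteration's values, or both keep the previous ones.
    if (AnyOf) {
      MaskPhi = Cmp;
      DataPhi = Ld;
    }
  }

  // After the loop: take the last active lane, or the start value if the
  // condition never held.
  for (std::size_t Lane = VF; Lane-- > 0;)
    if (MaskPhi[Lane])
      return DataPhi[Lane];
  return Start;
}
```

Keeping the mask and data vectors in lockstep is what lets the final extraction run once, outside the loop.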

huntergr-arm · 2025-11-18T12:53:38Z

Any follow-up comments now that there's a microbenchmark for this? (See llvm/llvm-test-suite#295)

I'd like to commit this week if possible.

fhahn · 2025-11-18T22:06:39Z

llvm/test/Transforms/LoopVectorize/AArch64/conditional-scalar-assignment.ll

+
+;; The following run line caused an ICE before using a dedicated FindLast PHI recipe.
+;; We're not looking at the resulting IR, just confirming it doesn't crash.
+; RUN: opt -passes=loop-vectorize,instcombine -mattr=+sve -epilogue-vectorization-force-VF=4 -S < %s 2>&1 > /dev/null


For this, it seems a phase ordering test would be more appropriate?

fhahn · 2025-11-18T22:06:22Z

llvm/test/Transforms/LoopVectorize/AArch64/conditional-scalar-assignment.ll

+; RUN: opt -passes=loop-vectorize,instcombine -S < %s 2>&1 | FileCheck %s --check-prefix=NEON
+; RUN: opt -passes=loop-vectorize,instcombine -mattr=+sve -S < %s 2>&1 | FileCheck %s --check-prefix=SVE


Could we remove instcombine, to avoid the test being impacted by unrelated instcombine changes?

fhahn · 2025-11-18T21:50:32Z

llvm/include/llvm/Analysis/IVDescriptors.h

  FindLastIVUMax, ///< FindLast reduction with select(cmp(),x,y) where one of
                  ///< (x,y) is increasing loop induction, and both x and y
                  ///< are integer type, producing a UMax reduction.
+  FindLast,       ///< FindLast reduction with select(cmp(),x,y) where x and y


I think at least loop unrolling also needs to be taught about the new kind; I'm seeing crashes when building the test suite currently. I think to reproduce you can just add a loop with CSA to https://github.com/llvm/llvm-project/blob/main/llvm/test/Transforms/LoopUnroll/partial-unroll-reductions.ll.

done. I think I left it alone originally since I made FindLast a generic RecurKind that could handle int, float, and pointer types (and it therefore didn't appear in the isIntegerRecurrenceKind list.) I figured that could be a follow-up PR (and could potentially convert AnyOf at the same time).

fhahn · 2025-11-18T21:51:12Z

llvm/test/Transforms/LoopVectorize/conditional-scalar-assignment-vplan.ll

@@ -0,0 +1,123 @@
+; RUN: opt -passes=loop-vectorize -debug-only=loop-vectorize \
+; RUN:   -scalable-vectorization=on -force-target-supports-scalable-vectors \
+; RUN:   -disable-output 2>&1 < %s | FileCheck %s


Does this test need scalable vectors?

no. switched to a fixed width VF (forced to 4)

fhahn · 2025-11-18T21:51:51Z

llvm/test/Transforms/LoopVectorize/conditional-scalar-assignment-vplan.ll

+; CHECK-NEXT:   IR   [[LDADDR]] = getelementptr inbounds i32, ptr %data, i64 [[IV]]
+; CHECK-NEXT:   IR   [[LD]] = load i32, ptr [[LDADDR]], align 4
+; CHECK-NEXT:   IR   [[SELECTCMP]] = icmp slt i32 %a, [[LD]]
+; CHECK-NEXT:   IR   [[SELECTDATA]] = select i1 [[SELECTCMP]], i32 [[LD]], i32 [[DATAPHI]]
+; CHECK-NEXT:   IR   [[IVNEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:   IR   [[EXITCMP:%.*]] = icmp eq i64 [[IVNEXT]], [[ORIGTC]]
+; CHECK-NEXT: No successors


I think you can strip those

fhahn · 2025-11-18T21:52:15Z

llvm/test/Transforms/LoopVectorize/conditional-scalar-assignment-vplan.ll

+; CHECK-NEXT: No successors
+; CHECK-NEXT: }
+
+; CHECK: Cost of 1 for VF vscale x 1: induction instruction   [[IVNEXT]] = add nuw nsw i64 [[IV]], 1


Could this also be removed?

huntergr-arm · 2025-12-02T17:26:06Z

Rebased (which pulls in #169925) and tidied a little.

I now see csa_with_arith vectorization being a net win for NEON (neutral for i64, better for i32 and i8) on a Neoverse core.

sdesmalen-arm

Just a few nits, otherwise looks good.

llvm/lib/Analysis/IVDescriptors.cpp

sdesmalen-arm · 2025-12-03T10:03:14Z

llvm/lib/Analysis/IVDescriptors.cpp

+//   int r = 0;
+//   for (int i = 0; i < n; i++) {
+//     if (a[i] > 3)
+//       r = a[i];


This is a generalisation of the r = i case, but then with an arbitrary value. Can the same mechanism be used for r = i, possibly with some combines to get similar codegen?

It is a more generic approach, but I'm not sure if all targets have instructions suitable for an efficient implementation of it.

For SVE at least (and possibly NEON?) I think we will want to use the generic approach for all Find recurrences.

sdesmalen-arm · 2025-12-03T10:03:57Z

llvm/lib/Analysis/IVDescriptors.cpp

+// When searching for an induction variable (i), the reduction value after the
+// loop will be the maximum (increasing induction) or minimum (decreasing
+// induction) value of 'i' that the condition (src[i] > 3) is satisfied, or the
+// start value (0 in the example above). When the start value of the induction
+// variable 'i' is greater than the minimum (increasing induction) or maximum
+// (decreasing induction) value of the data type, we can use the minimum
+// (increasing induction) or maximum (decreasing induction) value of the data
+// type as a sentinel value to replace the start value. This allows us to
+// perform a single reduction max (increasing induction) or min (decreasing
+// induction) operation to obtain the final reduction result.


I know this isn't your writing, but this is very difficult to follow.
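A small worked example may make the sentinel paragraph easier to digest. The sketch below covers only the increasing-induction case and assumes `int` indices starting at 0, so `INT_MIN` can never be a real value of `i`. The function names are invented for illustration. The second function replaces the start value with the sentinel, performs a plain max reduction, and restores the start value afterwards if the sentinel survived.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Scalar reference: r = i whenever src[i] > 3.
int findLastIndexScalar(const std::vector<int> &Src, int Start) {
  assert(Src.size() <= static_cast<std::size_t>(INT_MAX));
  int R = Start;
  const int N = static_cast<int>(Src.size());
  for (int I = 0; I < N; ++I)
    if (Src[I] > 3)
      R = I;
  return R;
}

// Same result via a single max reduction over a sentinel.
int findLastIndexSentinel(const std::vector<int> &Src, int Start) {
  assert(Src.size() <= static_cast<std::size_t>(INT_MAX));
  constexpr int Sentinel = INT_MIN; // smaller than any value i can take
  int MaxIdx = Sentinel;
  const int N = static_cast<int>(Src.size());
  for (int I = 0; I < N; ++I)
    MaxIdx = std::max(MaxIdx, Src[I] > 3 ? I : Sentinel);
  // If no iteration matched, the sentinel is still there: use the start value.
  return MaxIdx == Sentinel ? Start : MaxIdx;
}
```

The max reduction works because an increasing induction makes "last matching index" and "largest matching index" the same thing; for a decreasing induction the roles of min and max swap.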

llvm/lib/Analysis/IVDescriptors.cpp

sdesmalen-arm · 2025-12-03T12:45:48Z

llvm/lib/Transforms/Vectorize/VPlan.h

-    auto *C = new VPWidenPHIRecipe(cast<PHINode>(getUnderlyingValue()),
-                                   getOperand(0), getDebugLoc(), Name);
+    auto *C =
+        new VPWidenPHIRecipe(cast_if_present<PHINode>(getUnderlyingValue()),


When I change the cast to cast<PHINode>(getUnderlyingValue()) none of the tests change. Is this still necessary?

We don't have any interleaving tests yet, but those would rely on being able to clone the recipe. And handleFindLastReductions does create a VPWidenPHIRecipe with a nullptr for the underlying value, so I'm just safeguarding against future failures. We will need to do some additional work when interleaving though, so maybe we would end up skipping the clone?

llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp

fhahn · 2025-12-04T10:13:39Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    Module *M = State.Builder.GetInsertBlock()->getModule();
+    Function *ExtractLast = Intrinsic::getOrInsertDeclaration(
+        M, Intrinsic::experimental_vector_extract_last_active, {VTy});
+    return Builder.CreateCall(ExtractLast, {Data, Mask, Default});


Can we use CreateIntrinsic here?

fhahn · 2025-12-04T10:14:23Z

llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp

+    // Find the condition for the select
+    auto *SelectR = cast<VPSingleDefRecipe>(&PhiR->getBackedgeRecipe());
+    VPValue *Cond = nullptr;
+    if (auto *WidenR = dyn_cast<VPWidenSelectRecipe>(SelectR))
+      Cond = WidenR->getCond();
+    else if (auto *RepR = dyn_cast<VPReplicateRecipe>(SelectR)) {
+      auto *SI = dyn_cast<SelectInst>(RepR->getUnderlyingInstr());
+      if (!SI)
+        return false;
+      auto *CmpI = dyn_cast<Instruction>(SI->getCondition());
+      if (!CmpI)
+        return false;
+      Cond = RecipeBuilder.getRecipe(CmpI)->getVPSingleValue();


I think you can use m_Select here

fhahn · 2025-12-04T10:15:10Z

llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp

+      auto *SI = dyn_cast<SelectInst>(RepR->getUnderlyingInstr());
+      if (!SI)
+        return false;
+      auto *CmpI = dyn_cast<Instruction>(SI->getCondition());


This shouldn't check the underlying IR, it should be possible to just check the recipe

done via the m_Select

fhahn · 2025-12-04T10:16:43Z

llvm/lib/Transforms/Vectorize/VPlan.h

+    /// Extracts the last active lane based on a predicate vector operand.
+    ExtractLastActive,


how does this compare to extract + last-active-lane? Should we lower one to the other assuming all backends can handle it reasonably well?

So one difference would be the default value used if no lanes are active. I've updated the comment for it.

We could add extra operations select(anyof(mask), extract_result, default) to gain equivalent behaviour. I have no idea what the codegen for that is like right now though.
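The difference described above can be spelled out with a scalar model. This is a sketch under assumptions: 4 fixed lanes, and helper names (`extractLastActive`, `anyOf`, `lastActiveLane`, `extractThenSelect`) chosen here for illustration rather than taken from LLVM. The first function models the three-operand form with a default; the last one models the `select(anyof(mask), extract_result, default)` alternative built from separate steps.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t VF = 4; // assumed fixed vector width
using IntVec = std::array<int, VF>;
using MaskVec = std::array<bool, VF>;

// Three-operand form: data, mask, and a default for the all-false case.
int extractLastActive(const IntVec &Data, const MaskVec &Mask, int Default) {
  for (std::size_t Lane = VF; Lane-- > 0;)
    if (Mask[Lane])
      return Data[Lane];
  return Default;
}

bool anyOf(const MaskVec &Mask) {
  for (bool Active : Mask)
    if (Active)
      return true;
  return false;
}

// Index of the last active lane. This sketch leaves the all-false case
// undefined, which is why the caller below guards it with anyOf.
std::size_t lastActiveLane(const MaskVec &Mask) {
  assert(anyOf(Mask) && "no active lane");
  std::size_t Lane = VF - 1;
  while (!Mask[Lane])
    --Lane;
  return Lane;
}

// select(anyof(mask), extract(data, last-active-lane(mask)), default).
// Note: the C++ conditional only evaluates the extract when a lane is
// active, whereas an IR select would compute both operands.
int extractThenSelect(const IntVec &Data, const MaskVec &Mask, int Default) {
  return anyOf(Mask) ? Data[lastActiveLane(Mask)] : Default;
}
```

Both functions agree on every input in this model; what the thread leaves open is how the composed form would lower on real targets.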

sdesmalen-arm

I'd prefer to put the cast<PHINode>(getUnderlyingValue()) back if possible (and leave this to be fixed when we enable interleaving), but otherwise LGTM!

fhahn · 2025-12-05T09:46:48Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

  // Override IC if user provided an interleave count.
  IC = UserIC > 0 ? UserIC : IC;

+  // FIXME: Enable interleaving for last_active reductions.


Suggested change

// FIXME: Enable interleaving for last_active reductions.

// FIXME: Enable interleaving for FindLast reductions.

fhahn · 2025-12-05T09:47:09Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+            RdxDesc.getRecurrenceKind());
+      })) {
+    LLVM_DEBUG(dbgs() << "LV: Not interleaving without vectorization due "
+                      << "to conditional scalar assignments.\n");


Suggested change

<< "to conditional scalar assignments.\n");

<< "to FindLast reduction.\n");

for consistency with the reduction naming

fhahn · 2025-12-05T09:47:39Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+    LLVM_DEBUG(dbgs() << "LV: Not interleaving without vectorization due "
+                      << "to conditional scalar assignments.\n");
+    IntDiagMsg = {
+        "ConditionalAssignmentPreventsScalarInterleaving",


Suggested change

"ConditionalAssignmentPreventsScalarInterleaving",

"FindLastPreventsScalarInterleaving",

fhahn · 2025-12-05T09:48:54Z

llvm/lib/Transforms/Vectorize/VPlan.h

+    /// Extracts the last active lane based on a predicate vector operand, or
+    /// returns the default if no lanes were active.


It is not clear what the default is here, could you clarify the operands for the operation?

I've added explicit mentions of the operands in the description, hopefully that helps.

fhahn · 2025-12-05T10:00:06Z

llvm/test/Transforms/LoopVectorize/AArch64/conditional-scalar-assignment.ll

+
+exit:
+  ret i32 %select.data
+}


Thanks, looks like that fixed the regressions.

I also checked on X86 and there the only regression seems to be BENCHMARK_multi_csa_only_novec_uint8_t_ (vectorized version is ~16% slower). It would probably be good to add some tests for X86 too, to have some coverage of the cost model there.

fhahn · 2025-12-05T10:03:21Z

llvm/lib/Analysis/IVDescriptors.cpp

+//   int r = 0;
+//   for (int i = 0; i < n; i++) {
+//     if (src[i] > 3)
+//       r = src[i];


Suggested change

// r = src[i];

// r = src[i];

IIUC there are no restrictions here, so perhaps it is better to use something more generic for the term. Also, the condition does not have to be related to the selected value.
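As an illustration of that point, a generic scalar form might look like the sketch below, where the condition and the assigned value come from unrelated arrays. The function name `lastSelected` and its signature are hypothetical; the body is simply the `select(cond, x, r)` recurrence written out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Generic conditional scalar assignment: keep the value from the last
// iteration whose condition held, or Start if none did.
template <typename T>
T lastSelected(const std::vector<bool> &Cond, const std::vector<T> &Val,
               T Start) {
  assert(Cond.size() == Val.size() && "one condition per value");
  T R = Start;
  for (std::size_t I = 0; I < Cond.size(); ++I)
    R = Cond[I] ? Val[I] : R; // select(cond, x, r)
  return R;
}
```

Nothing here ties `Cond` to `Val`, and `T` is left open because the thread mentions integer, floating-point and pointer data.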

fhahn · 2025-12-05T10:03:43Z

llvm/lib/Analysis/IVDescriptors.cpp

 // The reduction value (r) is derived from either the values of an induction
-// variable (i) sequence, or from the start value (0). The LLVM IR generated for
-// such loops would be as follows:
+// variable (i) sequence, an arbitrary value (a[i]), or from the start value


Suggested change

// variable (i) sequence, an arbitrary value (a[i]), or from the start value

// variable (i) sequence, an arbitrary value (src[i]), or from the start value

(or the more generic term if updated above)

fhahn · 2025-12-05T10:05:40Z

llvm/lib/Analysis/IVDescriptors.cpp

 // value of the data type or a non-constant value by using mask and multiple
 // reduction operations.
+//
+// When searching for an arbitrary value (such as 'a[i]'), the reduction value


It would be good to keep the term (currently a[i]) in sync with what the example uses.

fhahn · 2025-12-05T16:04:32Z

llvm/test/Transforms/LoopVectorize/X86/conditional-scalar-assignment.ll

+  ret i32 %select.data
+}
+
+;; Add more work to the loop besides the CSA to check cost modelling for NEON.


Suggested change

;; Add more work to the loop besides the CSA to check cost modelling for NEON.

;; Add more work to the loop besides the CSA to check cost modelling.

fhahn · 2025-12-05T16:09:30Z

llvm/test/Transforms/LoopVectorize/AArch64/conditional-scalar-assignment.ll

+
+exit:
+  ret i32 %select.data
+}


Hm, I tried to check the lowering on X86 to see if anything is particularly bad, but it looks like SelectionDAG runs into an infinite loop when lowering @llvm.experimental.vector.extract.last.active, unless AVX512 is available: https://llvm.godbolt.org/z/4h7Phr6Pc

huntergr-arm requested review from Mel-Chen, david-arm and fhahn September 11, 2025 14:35

llvmbot added backend:AArch64 vectorizers llvm:analysis Includes value tracking, cost tables and constant folding llvm:transforms labels Sep 11, 2025

Mel-Chen reviewed Sep 16, 2025

Mel-Chen reviewed Oct 7, 2025

huntergr-arm force-pushed the findlast-recurrence branch from d066b28 to 1ff7778 Compare October 21, 2025 14:59

MacDue reviewed Oct 22, 2025

huntergr-arm force-pushed the findlast-recurrence branch from 1ff7778 to e6fc0cf Compare October 28, 2025 13:24

fhahn reviewed Oct 29, 2025

MacDue reviewed Oct 31, 2025

huntergr-arm force-pushed the findlast-recurrence branch from e6fc0cf to abea562 Compare October 31, 2025 16:30

MacDue reviewed Nov 3, 2025

huntergr-arm force-pushed the findlast-recurrence branch from abea562 to 049428d Compare November 7, 2025 15:42

MacDue approved these changes Nov 7, 2025

sdesmalen-arm reviewed Nov 11, 2025

fhahn reviewed Nov 11, 2025

huntergr-arm mentioned this pull request Nov 13, 2025

[Microbenchmarks] Add benchmark for conditional scalar assignment autovec llvm/llvm-test-suite#295

Merged

fhahn reviewed Nov 18, 2025

huntergr-arm added 17 commits December 2, 2025 14:45

Use find_if

f1b8bcd

Handle FindLast properly in unrolling, test

5e75d92

Remove instcombine from AArch64 FindLast runlines

2cfbbfb

Switched vplan print test to use fixed VF instead of scalable

436fd2f

Moved vectorized epilogue ICE test to separate file, removed unnecessary check lines

74f0351

Updated check output after rebase

efd9d17

Move epilogue vectorization test back to LV

72a1b8c

Improve IVDesc comments

9fe68d2

Remove traces of dedicated FindLast phi recipe

9cbdf21

Move and improve convertFindLastRecurrences

bb8106f

Add test with extra user for select

5d1be36

Call xform earlier

3708486

Use Plan.getFalse()

2b43298

Add a test case with extra arithmetic operations

87b837b

Adjust tests post-rebase

570f0e3

Unify Find recurrence detection.

5df3db6

Clean up any_of with values() iterator

962a23c

huntergr-arm force-pushed the findlast-recurrence branch from fce52d0 to 962a23c Compare December 2, 2025 17:22

Formatting

12a7cd8

sdesmalen-arm reviewed Dec 3, 2025

fhahn reviewed Dec 4, 2025

Cleanups

c7b50ac

sdesmalen-arm approved these changes Dec 4, 2025

fhahn reviewed Dec 5, 2025

huntergr-arm added 3 commits December 5, 2025 12:13

Improve comments, restore original cast

d787a5c

Add test operating on bytes, then add an equivalent test file for X86

8cf2168

Formatting

c8159ce

fhahn reviewed Dec 5, 2025

Remove mention of NEON from X86 test file

42ea80d

	///< can be any scalar type, one is the current recurrence
	///< are integer type, one is the current recurrence

		if (isFindLastRecurrenceKind(Kind) && IsASelect)
		++NumCmpSelectPatternInst;

		RecurrenceDescriptor::isFindLastPattern(Instruction *I, PHINode *Phi,
		Loop *TheLoop) {

		; RUN: opt -passes=loop-vectorize,instcombine -mattr=+sve -S < %s 2>&1 | FileCheck %s --check-prefix=SVE

		target triple = "aarch64-linux-gnu"

		/// Extracts the last active lane based on a predicate vector operand.
		ExtractLastActive,

	// FIXME: Enable interleaving for last_active reductions.
	// FIXME: Enable interleaving for FindLast reductions.

	<< "to conditional scalar assignments.\n");
	<< "to FindLast reduction.\n");

	"ConditionalAssignmentPreventsScalarInterleaving",
	"FindLastPreventsScalarInterleaving",

		/// Extracts the last active lane based on a predicate vector operand, or
		/// returns the default if no lanes were active.
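For readers unfamiliar with the operation described in the comment above, here is a minimal scalar model of "extract the last active lane, or return the default if no lanes were active". It is only a sketch of the semantics as worded in that comment: the function name `extractLastActive`, its signature, and the use of `std::array` are assumptions made for illustration and are not taken from the patch.

```cpp
#include <array>
#include <cstddef>

// Hypothetical scalar model of the semantics described above; this is not
// LLVM's implementation and the name/signature are illustrative only.
template <typename T, std::size_t N>
T extractLastActive(const std::array<T, N> &Data,
                    const std::array<bool, N> &Mask, T Default) {
  T Result = Default; // returned unchanged if no lane is active
  for (std::size_t Lane = 0; Lane < N; ++Lane)
    if (Mask[Lane])
      Result = Data[Lane]; // a later active lane overrides an earlier one
  return Result;
}
```

With data `{10, 20, 30, 40}` and mask `{false, true, true, false}`, the model picks the value in lane 2; with an all-false mask it returns the default.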

	// variable (i) sequence, an arbitrary value (a[i]), or from the start value
	// variable (i) sequence, an arbitrary value (src[i]), or from the start value

	;; Add more work to the loop besides the CSA to check cost modelling for NEON.
	;; Add more work to the loop besides the CSA to check cost modelling.
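As a reminder of the source-level shape under discussion, a conditional scalar assignment loop might look roughly like the sketch below: a scalar is overwritten with an arbitrary value (`Src[I]`) only on iterations where a condition holds, and otherwise keeps its previous value, which is initially the start value. The function name `lastValueAbove` and the specific comparison are assumptions for illustration; the exact set of patterns the patch accepts is not shown in this excerpt.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example of a conditional scalar assignment (CSA) loop.
// Result keeps the start value unless some element satisfies the condition,
// in which case it ends up holding the last such element.
int lastValueAbove(const std::vector<int> &Src, int Threshold, int Start) {
  int Result = Start;
  for (std::size_t I = 0; I < Src.size(); ++I)
    if (Src[I] > Threshold)
      Result = Src[I]; // the conditional scalar assignment
  return Result;
}
```

Because each iteration either keeps the previous value or takes a new one, the final result depends only on the last iteration whose condition was true, which is what makes a "find last" style reduction applicable.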

huntergr-arm commented Sep 11, 2025

llvmbot commented Sep 11, 2025

github-actions bot commented Sep 11, 2025

Mel-Chen left a comment

huntergr-arm commented Oct 21, 2025

MacDue left a comment

fhahn left a comment

huntergr-arm commented Oct 31, 2025

huntergr-arm commented Nov 7, 2025

MacDue left a comment
