
Conversation

@fhahn fhahn (Contributor) commented Nov 27, 2025

For scalable vectors, VPScalarIVStepsRecipe cannot create all scalar step values. At the moment, it creates a vector value in addition to the first lane. The only supported case for this is when only the last lane is used. A recipe should not set both scalar and vector values.

Instead, we can simply use a vector induction. It would also be possible to preserve the current vector code-gen by creating VPInstructions based on the first lane of VPScalarIVStepsRecipe, but using a vector induction seems simpler.
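In IR terms, materializing such a vector induction for a scalable VF boils down to a stepvector plus splats in the preheader, with a phi in the loop. Below is a minimal IRBuilder sketch of the preheader part, matching the CHECK lines further down; the builder B, the Start/Step values, and the <vscale x 4 x i64> type are illustrative assumptions, not code from the patch:

  #include "llvm/IR/IRBuilder.h"
  using namespace llvm;

  // Preheader: Induction = splat(Start) + stepvector * splat(Step).
  // In the loop, the induction becomes a phi bumped by splat(Step * VF)
  // each iteration, instead of being recomputed from the scalar index.
  Value *buildVectorInduction(IRBuilder<> &B, Value *Start, Value *Step) {
    ElementCount VF = ElementCount::getScalable(4); // <vscale x 4 x i64>
    Type *VecTy = VectorType::get(Start->getType(), VF);
    Value *StepVec = B.CreateStepVector(VecTy);     // <0, 1, 2, ...>
    Value *Scaled = B.CreateMul(StepVec, B.CreateVectorSplat(VF, Step));
    return B.CreateAdd(B.CreateVectorSplat(VF, Start), Scaled);
  }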

@llvmbot llvmbot (Member) commented Nov 27, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: Florian Hahn (fhahn)

Changes



Patch is 24.02 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/169796.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (-22)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+10-5)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/pr60831-sve-inv-store-crash.ll (+58-99)
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 0ba8cec008555..7e21d17fc89e5 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -2367,15 +2367,6 @@ void VPScalarIVStepsRecipe::execute(VPTransformState &State) {
   // Compute the scalar steps and save the results in State.
   Type *IntStepTy =
       IntegerType::get(BaseIVTy->getContext(), BaseIVTy->getScalarSizeInBits());
-  Type *VecIVTy = nullptr;
-  Value *UnitStepVec = nullptr, *SplatStep = nullptr, *SplatIV = nullptr;
-  if (!FirstLaneOnly && State.VF.isScalable()) {
-    VecIVTy = VectorType::get(BaseIVTy, State.VF);
-    UnitStepVec =
-        Builder.CreateStepVector(VectorType::get(IntStepTy, State.VF));
-    SplatStep = Builder.CreateVectorSplat(State.VF, Step);
-    SplatIV = Builder.CreateVectorSplat(State.VF, BaseIV);
-  }
 
   unsigned StartLane = 0;
   unsigned EndLane = FirstLaneOnly ? 1 : State.VF.getKnownMinValue();
@@ -2396,19 +2387,6 @@ void VPScalarIVStepsRecipe::execute(VPTransformState &State) {
     StartIdx0 = Builder.CreateSExtOrTrunc(StartIdx0, IntStepTy);
   }
 
-  if (!FirstLaneOnly && State.VF.isScalable()) {
-    auto *SplatStartIdx = Builder.CreateVectorSplat(State.VF, StartIdx0);
-    auto *InitVec = Builder.CreateAdd(SplatStartIdx, UnitStepVec);
-    if (BaseIVTy->isFloatingPointTy())
-      InitVec = Builder.CreateSIToFP(InitVec, VecIVTy);
-    auto *Mul = Builder.CreateBinOp(MulOp, InitVec, SplatStep);
-    auto *Add = Builder.CreateBinOp(AddOp, SplatIV, Mul);
-    State.set(this, Add);
-    // It's useful to record the lane values too for the known minimum number
-    // of elements so we do those below. This improves the code quality when
-    // trying to extract the first element, for example.
-  }
-
   if (BaseIVTy->isFloatingPointTy())
     StartIdx0 = Builder.CreateSIToFP(StartIdx0, BaseIVTy);
 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 4b7f90118374b..f3f422fdce933 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -785,12 +785,17 @@ static void legalizeAndOptimizeInductions(VPlan &Plan) {
         WideIV->getDebugLoc(), Builder);
 
     // Update scalar users of IV to use Step instead.
-    if (!HasOnlyVectorVFs)
+    if (!HasOnlyVectorVFs) {
       WideIV->replaceAllUsesWith(Steps);
-    else
-      WideIV->replaceUsesWithIf(Steps, [WideIV](VPUser &U, unsigned) {
-        return U.usesScalars(WideIV);
-      });
+    } else {
+      bool HasScalableVF = Plan.hasScalableVF();
+      WideIV->replaceUsesWithIf(Steps,
+                                [WideIV, HasScalableVF](VPUser &U, unsigned) {
+                                  if (HasScalableVF)
+                                    return U.usesFirstLaneOnly(WideIV);
+                                  return U.usesScalars(WideIV);
+                                });
+    }
   }
 }
 
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/pr60831-sve-inv-store-crash.ll b/llvm/test/Transforms/LoopVectorize/AArch64/pr60831-sve-inv-store-crash.ll
index 88e035ebf3be8..131b3d1b02727 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/pr60831-sve-inv-store-crash.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/pr60831-sve-inv-store-crash.ll
@@ -15,25 +15,22 @@ define void @test_invar_gep(ptr %dst) #0 {
 ; CHECK-NEXT:    [[TMP3:%.*]] = mul nuw i64 [[TMP2]], 4
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 100, [[TMP3]]
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 100, [[N_MOD_VF]]
+; CHECK-NEXT:    [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
+; CHECK-NEXT:    [[TMP4:%.*]] = mul nsw <vscale x 4 x i64> [[TMP5]], splat (i64 1)
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add nsw <vscale x 4 x i64> zeroinitializer, [[TMP4]]
+; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP3]], i64 0
+; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
-; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[INDEX]], i64 0
-; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; CHECK-NEXT:    [[TMP10:%.*]] = add <vscale x 4 x i64> zeroinitializer, [[TMP5]]
-; CHECK-NEXT:    [[TMP4:%.*]] = mul <vscale x 4 x i64> [[TMP10]], splat (i64 1)
-; CHECK-NEXT:    [[TMP9:%.*]] = add <vscale x 4 x i64> [[DOTSPLAT]], [[TMP4]]
-; CHECK-NEXT:    [[TMP6:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT:    [[TMP7:%.*]] = add i64 [[INDEX]], 1
-; CHECK-NEXT:    [[TMP8:%.*]] = add i64 [[INDEX]], 2
-; CHECK-NEXT:    [[TMP11:%.*]] = add i64 [[INDEX]], 3
+; CHECK-NEXT:    [[TMP9:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP15:%.*]] = call i32 @llvm.vscale.i32()
 ; CHECK-NEXT:    [[TMP16:%.*]] = mul nuw i32 [[TMP15]], 4
 ; CHECK-NEXT:    [[TMP17:%.*]] = sub i32 [[TMP16]], 1
 ; CHECK-NEXT:    [[TMP18:%.*]] = extractelement <vscale x 4 x i64> [[TMP9]], i32 [[TMP17]]
 ; CHECK-NEXT:    store i64 [[TMP18]], ptr [[TMP14:%.*]], align 1
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP3]]
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nsw <vscale x 4 x i64> [[TMP9]], [[DOTSPLAT]]
 ; CHECK-NEXT:    [[TMP19:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; CHECK:       middle.block:
@@ -60,38 +57,26 @@ define void @test_invar_gep(ptr %dst) #0 {
 ; IC2:       vector.ph:
 ; IC2-NEXT:    [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
 ; IC2-NEXT:    [[TMP11:%.*]] = mul nuw i64 [[TMP2]], 4
+; IC2-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP11]], i64 0
+; IC2-NEXT:    [[TMP21:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
 ; IC2-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP11]], 2
 ; IC2-NEXT:    [[N_MOD_VF:%.*]] = urem i64 100, [[TMP3]]
 ; IC2-NEXT:    [[N_VEC:%.*]] = sub i64 100, [[N_MOD_VF]]
+; IC2-NEXT:    [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
+; IC2-NEXT:    [[TMP12:%.*]] = mul nsw <vscale x 4 x i64> [[TMP5]], splat (i64 1)
+; IC2-NEXT:    [[INDUCTION:%.*]] = add nsw <vscale x 4 x i64> zeroinitializer, [[TMP12]]
 ; IC2-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; IC2:       vector.body:
 ; IC2-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; IC2-NEXT:    [[BROADCAST_SPLAT:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
-; IC2-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[INDEX]], i64 0
-; IC2-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; IC2-NEXT:    [[DOTSPLATINSERT1:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP11]], i64 0
-; IC2-NEXT:    [[VEC_IND:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT1]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; IC2-NEXT:    [[TMP5:%.*]] = add <vscale x 4 x i64> [[VEC_IND]], [[BROADCAST_SPLAT]]
-; IC2-NEXT:    [[TMP21:%.*]] = mul <vscale x 4 x i64> [[TMP5]], splat (i64 1)
+; IC2-NEXT:    [[DOTSPLAT:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; IC2-NEXT:    [[TMP22:%.*]] = add <vscale x 4 x i64> [[DOTSPLAT]], [[TMP21]]
-; IC2-NEXT:    [[TMP23:%.*]] = add i64 [[TMP11]], 0
-; IC2-NEXT:    [[TMP24:%.*]] = mul i64 [[TMP23]], 1
-; IC2-NEXT:    [[TMP25:%.*]] = add i64 [[INDEX]], [[TMP24]]
-; IC2-NEXT:    [[TMP12:%.*]] = add i64 [[TMP11]], 1
-; IC2-NEXT:    [[TMP13:%.*]] = mul i64 [[TMP12]], 1
-; IC2-NEXT:    [[TMP14:%.*]] = add i64 [[INDEX]], [[TMP13]]
-; IC2-NEXT:    [[TMP15:%.*]] = add i64 [[TMP11]], 2
-; IC2-NEXT:    [[TMP16:%.*]] = mul i64 [[TMP15]], 1
-; IC2-NEXT:    [[TMP17:%.*]] = add i64 [[INDEX]], [[TMP16]]
-; IC2-NEXT:    [[TMP18:%.*]] = add i64 [[TMP11]], 3
-; IC2-NEXT:    [[TMP19:%.*]] = mul i64 [[TMP18]], 1
-; IC2-NEXT:    [[TMP20:%.*]] = add i64 [[INDEX]], [[TMP19]]
 ; IC2-NEXT:    [[TMP6:%.*]] = call i32 @llvm.vscale.i32()
 ; IC2-NEXT:    [[TMP7:%.*]] = mul nuw i32 [[TMP6]], 4
 ; IC2-NEXT:    [[TMP8:%.*]] = sub i32 [[TMP7]], 1
 ; IC2-NEXT:    [[TMP9:%.*]] = extractelement <vscale x 4 x i64> [[TMP22]], i32 [[TMP8]]
 ; IC2-NEXT:    store i64 [[TMP9]], ptr [[DST:%.*]], align 1
 ; IC2-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP3]]
+; IC2-NEXT:    [[VEC_IND_NEXT]] = add nsw <vscale x 4 x i64> [[TMP22]], [[TMP21]]
 ; IC2-NEXT:    [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; IC2-NEXT:    br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; IC2:       middle.block:
@@ -139,26 +124,24 @@ define void @test_invar_gep_var_start(i64 %start, ptr %dst) #0 {
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], [[TMP4]]
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = add i64 [[START]], [[N_VEC]]
+; CHECK-NEXT:    [[TMP6:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
+; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[START]], i64 0
+; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP13:%.*]] = mul nsw <vscale x 4 x i64> [[TMP6]], splat (i64 1)
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add nsw <vscale x 4 x i64> [[DOTSPLAT]], [[TMP13]]
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP4]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT1]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = add i64 [[START]], [[INDEX]]
-; CHECK-NEXT:    [[TMP6:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
-; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[OFFSET_IDX]], i64 0
-; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; CHECK-NEXT:    [[TMP14:%.*]] = add <vscale x 4 x i64> zeroinitializer, [[TMP6]]
-; CHECK-NEXT:    [[TMP15:%.*]] = mul <vscale x 4 x i64> [[TMP14]], splat (i64 1)
-; CHECK-NEXT:    [[TMP7:%.*]] = add <vscale x 4 x i64> [[DOTSPLAT]], [[TMP15]]
-; CHECK-NEXT:    [[TMP16:%.*]] = add i64 [[OFFSET_IDX]], 0
-; CHECK-NEXT:    [[TMP17:%.*]] = add i64 [[OFFSET_IDX]], 1
-; CHECK-NEXT:    [[TMP18:%.*]] = add i64 [[OFFSET_IDX]], 2
-; CHECK-NEXT:    [[TMP13:%.*]] = add i64 [[OFFSET_IDX]], 3
+; CHECK-NEXT:    [[TMP7:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP8:%.*]] = call i32 @llvm.vscale.i32()
 ; CHECK-NEXT:    [[TMP9:%.*]] = mul nuw i32 [[TMP8]], 4
 ; CHECK-NEXT:    [[TMP10:%.*]] = sub i32 [[TMP9]], 1
 ; CHECK-NEXT:    [[TMP11:%.*]] = extractelement <vscale x 4 x i64> [[TMP7]], i32 [[TMP10]]
 ; CHECK-NEXT:    store i64 [[TMP11]], ptr [[DST:%.*]], align 1
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP4]]
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nsw <vscale x 4 x i64> [[TMP7]], [[BROADCAST_SPLAT2]]
 ; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
 ; CHECK:       middle.block:
@@ -187,40 +170,29 @@ define void @test_invar_gep_var_start(i64 %start, ptr %dst) #0 {
 ; IC2:       vector.ph:
 ; IC2-NEXT:    [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
 ; IC2-NEXT:    [[TMP4:%.*]] = mul nuw i64 [[TMP3]], 4
+; IC2-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP4]], i64 0
+; IC2-NEXT:    [[TMP9:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
 ; IC2-NEXT:    [[TMP5:%.*]] = mul i64 [[TMP4]], 2
 ; IC2-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], [[TMP5]]
 ; IC2-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]
 ; IC2-NEXT:    [[TMP6:%.*]] = add i64 [[START]], [[N_VEC]]
+; IC2-NEXT:    [[BROADCAST_SPLAT:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
+; IC2-NEXT:    [[DOTSPLATINSERT1:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[START]], i64 0
+; IC2-NEXT:    [[VEC_IND:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT1]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
+; IC2-NEXT:    [[TMP8:%.*]] = mul nsw <vscale x 4 x i64> [[BROADCAST_SPLAT]], splat (i64 1)
+; IC2-NEXT:    [[INDUCTION:%.*]] = add nsw <vscale x 4 x i64> [[VEC_IND]], [[TMP8]]
 ; IC2-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; IC2:       vector.body:
 ; IC2-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; IC2-NEXT:    [[OFFSET_IDX:%.*]] = add i64 [[START]], [[INDEX]]
-; IC2-NEXT:    [[BROADCAST_SPLAT:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
-; IC2-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[OFFSET_IDX]], i64 0
-; IC2-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; IC2-NEXT:    [[DOTSPLATINSERT1:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP4]], i64 0
-; IC2-NEXT:    [[VEC_IND:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT1]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; IC2-NEXT:    [[TMP11:%.*]] = add <vscale x 4 x i64> [[VEC_IND]], [[BROADCAST_SPLAT]]
-; IC2-NEXT:    [[TMP9:%.*]] = mul <vscale x 4 x i64> [[TMP11]], splat (i64 1)
+; IC2-NEXT:    [[DOTSPLAT:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; IC2-NEXT:    [[TMP10:%.*]] = add <vscale x 4 x i64> [[DOTSPLAT]], [[TMP9]]
-; IC2-NEXT:    [[TMP23:%.*]] = add i64 [[TMP4]], 0
-; IC2-NEXT:    [[TMP24:%.*]] = mul i64 [[TMP23]], 1
-; IC2-NEXT:    [[TMP25:%.*]] = add i64 [[OFFSET_IDX]], [[TMP24]]
-; IC2-NEXT:    [[TMP26:%.*]] = add i64 [[TMP4]], 1
-; IC2-NEXT:    [[TMP27:%.*]] = mul i64 [[TMP26]], 1
-; IC2-NEXT:    [[TMP28:%.*]] = add i64 [[OFFSET_IDX]], [[TMP27]]
-; IC2-NEXT:    [[TMP17:%.*]] = add i64 [[TMP4]], 2
-; IC2-NEXT:    [[TMP18:%.*]] = mul i64 [[TMP17]], 1
-; IC2-NEXT:    [[TMP19:%.*]] = add i64 [[OFFSET_IDX]], [[TMP18]]
-; IC2-NEXT:    [[TMP20:%.*]] = add i64 [[TMP4]], 3
-; IC2-NEXT:    [[TMP21:%.*]] = mul i64 [[TMP20]], 1
-; IC2-NEXT:    [[TMP22:%.*]] = add i64 [[OFFSET_IDX]], [[TMP21]]
 ; IC2-NEXT:    [[TMP12:%.*]] = call i32 @llvm.vscale.i32()
 ; IC2-NEXT:    [[TMP13:%.*]] = mul nuw i32 [[TMP12]], 4
 ; IC2-NEXT:    [[TMP14:%.*]] = sub i32 [[TMP13]], 1
 ; IC2-NEXT:    [[TMP15:%.*]] = extractelement <vscale x 4 x i64> [[TMP10]], i32 [[TMP14]]
 ; IC2-NEXT:    store i64 [[TMP15]], ptr [[DST:%.*]], align 1
 ; IC2-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP5]]
+; IC2-NEXT:    [[VEC_IND_NEXT]] = add nsw <vscale x 4 x i64> [[TMP10]], [[TMP9]]
 ; IC2-NEXT:    [[TMP16:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; IC2-NEXT:    br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
 ; IC2:       middle.block:
@@ -269,36 +241,34 @@ define void @test_invar_gep_var_start_step_2(i64 %start, ptr %dst) #0 {
 ; CHECK-NEXT:    [[TMP6:%.*]] = mul nuw i64 [[TMP5]], 4
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], [[TMP6]]
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
-; CHECK-NEXT:    [[TMP7:%.*]] = mul i64 [[N_VEC]], 2
-; CHECK-NEXT:    [[TMP8:%.*]] = add i64 [[START]], [[TMP7]]
-; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
-; CHECK:       vector.body:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[TMP9:%.*]] = mul i64 [[INDEX]], 2
+; CHECK-NEXT:    [[TMP9:%.*]] = mul i64 [[N_VEC]], 2
 ; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = add i64 [[START]], [[TMP9]]
 ; CHECK-NEXT:    [[TMP10:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
-; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[OFFSET_IDX]], i64 0
+; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[START]], i64 0
 ; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; CHECK-NEXT:    [[TMP11:%.*]] = add <vscale x 4 x i64> zeroinitializer, [[TMP10]]
-; CHECK-NEXT:    [[TMP18:%.*]] = mul <vscale x 4 x i64> [[TMP11]], splat (i64 2)
-; CHECK-NEXT:    [[TMP12:%.*]] = add <vscale x 4 x i64> [[DOTSPLAT]], [[TMP18]]
-; CHECK-NEXT:    [[TMP19:%.*]] = add i64 [[OFFSET_IDX]], 0
-; CHECK-NEXT:    [[TMP20:%.*]] = add i64 [[OFFSET_IDX]], 2
-; CHECK-NEXT:    [[TMP21:%.*]] = add i64 [[OFFSET_IDX]], 4
-; CHECK-NEXT:    [[TMP22:%.*]] = add i64 [[OFFSET_IDX]], 6
+; CHECK-NEXT:    [[TMP18:%.*]] = mul nsw <vscale x 4 x i64> [[TMP10]], splat (i64 2)
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add nsw <vscale x 4 x i64> [[DOTSPLAT]], [[TMP18]]
+; CHECK-NEXT:    [[TMP11:%.*]] = mul nsw i64 2, [[TMP6]]
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP11]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT1]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP12:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP13:%.*]] = call i32 @llvm.vscale.i32()
 ; CHECK-NEXT:    [[TMP14:%.*]] = mul nuw i32 [[TMP13]], 4
 ; CHECK-NEXT:    [[TMP15:%.*]] = sub i32 [[TMP14]], 1
 ; CHECK-NEXT:    [[TMP16:%.*]] = extractelement <vscale x 4 x i64> [[TMP12]], i32 [[TMP15]]
 ; CHECK-NEXT:    store i64 [[TMP16]], ptr [[DST:%.*]], align 1
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP6]]
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nsw <vscale x 4 x i64> [[TMP12]], [[BROADCAST_SPLAT2]]
 ; CHECK-NEXT:    [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP7:![0-9]+]]
 ; CHECK:       middle.block:
 ; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; CHECK:       scalar.ph:
-; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[TMP8]], [[MIDDLE_BLOCK]] ], [ [[START]], [[ENTRY:%.*]] ]
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[OFFSET_IDX]], [[MIDDLE_BLOCK]] ], [ [[START]], [[ENTRY:%.*]] ]
 ; CHECK-NEXT:    br label [[LOOP:%.*]]
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
@@ -322,49 +292,38 @@ define void @test_invar_gep_var_start_step_2(i64 %start, ptr %dst) #0 {
 ; IC2:       vector.ph:
 ; IC2-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
 ; IC2-NEXT:    [[TMP6:%.*]] = mul nuw i64 [[TMP5]], 4
+; IC2-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP6]], i64 0
+; IC2-NEXT:    [[BROADCAST_SPLAT1:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison,...
[truncated]

fhahn added a commit to fhahn/llvm-project that referenced this pull request Nov 30, 2025
Extend replicateByVF to also handle VPScalarIVStepsRecipe. To do so, the
patch adds a new lane operand to VPScalarIVStepsRecipe, which is only
added when replicating. This enables removing a number of lane 0
computations. The lane operand will also be used to explicitly replicate
replicate regions in a follow-up.

Depends on llvm#169796 (included in
PR).
@fhahn fhahn force-pushed the vplan-handle-scalable-users-of-ivsteps branch from 0ec3c3b to 6b2f03e on December 1, 2025 09:50
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 100, [[TMP3]]
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 100, [[N_MOD_VF]]
; CHECK-NEXT: [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
; CHECK-NEXT: [[TMP4:%.*]] = mul nsw <vscale x 4 x i64> [[TMP5]], splat (i64 1)
Contributor

Just an observation, but looks like we're missing a simple fold here - this should be a nop. I might take a look and see what's going on ...
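For illustration, the missing fold would be a mul-by-one simplification along these lines (an IR-level sketch using llvm::PatternMatch; in practice the fold would presumably live in VPlan's recipe simplification, so placement and the helper name are assumptions):

  #include "llvm/IR/PatternMatch.h"
  using namespace llvm;
  using namespace llvm::PatternMatch;

  // mul %x, splat(1) --> %x. m_One() matches a scalar 1 as well as a
  // vector splat of 1, covering the `mul nsw ... splat (i64 1)` above.
  static Value *simplifyMulByOne(Value *V) {
    Value *X;
    if (match(V, m_Mul(m_Value(X), m_One())))
      return X;
    return nullptr;
  }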

@lukel97 lukel97 (Contributor) commented Dec 1, 2025

I think all the loops in this test will get folded away by LICM. Were we seeing something in the wild where we needed to generate the last lane of a widened induction variable with a scalable VF?

@fhahn fhahn (Contributor, Author)

Yes, I think there are cases where we currently may not be able to sink. The tests cover (the only?) code path where we extract the last lane of an IV in the loop for scalable vectors.
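For context, extracting the last lane of a scalable vector needs a small runtime index computation, since the element count is only known as a multiple of vscale. A sketch mirroring the llvm.vscale/extractelement sequence in the CHECK lines (B and VecIV are assumed):

  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/Intrinsics.h"
  using namespace llvm;

  // Last lane of a <vscale x 4 x i64> value: index = vscale * 4 - 1.
  Value *extractLastLane(IRBuilder<> &B, Value *VecIV) {
    Value *VScale =
        B.CreateIntrinsic(Intrinsic::vscale, {B.getInt32Ty()}, {});
    Value *NumElts = B.CreateMul(VScale, B.getInt32(4));
    Value *LastIdx = B.CreateSub(NumElts, B.getInt32(1));
    return B.CreateExtractElement(VecIV, LastIdx);
  }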

bool HasScalableVF = Plan.hasScalableVF();
WideIV->replaceUsesWithIf(Steps,
[WideIV, HasScalableVF](VPUser &U, unsigned) {
if (HasScalableVF)
Contributor

Are plans containing scalable VFs always guaranteed to never include a fixed-width VF? I think that's probably true in practice, but this might give strange behaviour if not.

@fhahn fhahn (Contributor, Author)

Yes, currently we build the plans for scalable/fixed vectors completely separately, so they should never mix.
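That invariant could be spelled out as a check over the plan's VFs, e.g. (illustrative only; the helper name is made up and assumes VPlan's vectorFactors() accessor):

  // Illustrative only: a plan should never mix scalable and fixed VFs.
  static bool mixesScalableAndFixedVFs(const VPlan &Plan) {
    bool HasScalable = false, HasFixed = false;
    for (ElementCount VF : Plan.vectorFactors()) {
      HasScalable |= VF.isScalable();
      HasFixed |= VF.isFixed();
    }
    return HasScalable && HasFixed;
  }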


// Update scalar users of IV to use Step instead.
if (!HasOnlyVectorVFs)
if (!HasOnlyVectorVFs) {
Contributor

Is it worth adding an assert here that the plan doesn't have a scalable VF? i.e.

  if (!HasOnlyVectorVFs) {
    assert(!Plan.hasScalableVF() && ...);
...

@fhahn fhahn (Contributor, Author)

Added, thanks. If the plan contains the scalar VF, it should never include scalable VFs.
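Putting the patch and this suggestion together, the guarded replacement ends up roughly as follows (a sketch assembled from the diff and the review comments above, not the verbatim committed source):

  // Update scalar users of IV to use Step instead.
  if (!HasOnlyVectorVFs) {
    // Plans that include the scalar VF never also contain scalable VFs.
    assert(!Plan.hasScalableVF() &&
           "plans with a scalar VF should not have scalable VFs");
    WideIV->replaceAllUsesWith(Steps);
  } else {
    bool HasScalableVF = Plan.hasScalableVF();
    WideIV->replaceUsesWithIf(
        Steps, [WideIV, HasScalableVF](VPUser &U, unsigned) {
          // For scalable VFs, the steps recipe only provides the first
          // lane, so only rewrite users that use the first lane alone.
          return HasScalableVF ? U.usesFirstLaneOnly(WideIV)
                               : U.usesScalars(WideIV);
        });
  }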

fhahn added 2 commits December 1, 2025 14:30
For scalable vectors, VPScalarIVStepsRecipe cannot create all scalar
step values. At the moment, it creates a vector value in addition to the
first lane. The only supported case for this is when only the last lane
is used. A recipe should not set both scalar and vector values.

Instead, we can simply use a vector induction. It would also be possible
to preserve the current vector code-gen by creating VPInstructions
based on the first lane of VPScalarIVStepsRecipe, but using a vector
induction seems simpler.
@fhahn fhahn force-pushed the vplan-handle-scalable-users-of-ivsteps branch from 6b2f03e to 87cbef7 on December 1, 2025 14:37
@david-arm david-arm (Contributor) left a comment

LGTM!


@fhahn fhahn enabled auto-merge (squash) December 1, 2025 17:07
@fhahn fhahn merged commit 25ab47b into llvm:main Dec 1, 2025
9 of 10 checks passed
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Dec 1, 2025
…able vectors. (#169796)

@fhahn fhahn deleted the vplan-handle-scalable-users-of-ivsteps branch December 1, 2025 20:24
fhahn added a commit to fhahn/llvm-project that referenced this pull request Dec 1, 2025
augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Dec 3, 2025
…s. (llvm#169796)

kcloudy0717 pushed a commit to kcloudy0717/llvm-project that referenced this pull request Dec 4, 2025
…s. (llvm#169796)

fhahn added a commit to fhahn/llvm-project that referenced this pull request Dec 4, 2025
fhahn added a commit to fhahn/llvm-project that referenced this pull request Dec 5, 2025
fhahn added a commit to fhahn/llvm-project that referenced this pull request Dec 5, 2025
fhahn added a commit to fhahn/llvm-project that referenced this pull request Dec 5, 2025
honeygoyal pushed a commit to honeygoyal/llvm-project that referenced this pull request Dec 9, 2025
…s. (llvm#169796)
