
Conversation

Contributor

ElvisWang123 commented Sep 23, 2024

Currently the EVL recipes fold the tail mask into the EVL, but in the legacy
cost model the mask still exists and its instruction cost is included.
To remove this difference between the VPlan-based cost model and the legacy
cost model, we always include the instruction cost of the mask in the EVL recipes.

Note that the mask cost in the EVL recipes should be removed once we no longer
need to compare against the legacy cost model.

This patch also fixes #109468.

This patch makes VPWidenMemoryRecipe and the legacy model skip the reverse
shuffle cost when the stored value is loop invariant.

This patch also fixes llvm#109468.
Member

llvmbot commented Sep 23, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Elvis Wang (ElvisWang123)

Changes

This patch makes VPWidenMemoryRecipe and the legacy model skip the reverse shuffle cost when the stored value is loop invariant.

This patch also fixes #109468.


Full diff: https://github.com/llvm/llvm-project/pull/109644.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+4-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+3-1)
  • (added) llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-uniform-store.ll (+160)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 806dbd5bb7304e..ec43cafc685a9e 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5666,7 +5666,10 @@ LoopVectorizationCostModel::getConsecutiveMemOpCost(Instruction *I,
   }
 
   bool Reverse = ConsecutiveStride < 0;
-  if (Reverse)
+  const StoreInst *SI = dyn_cast<StoreInst>(I);
+  bool IsLoopInvariantStoreValue =
+      SI && Legal->isInvariant(const_cast<StoreInst *>(SI)->getValueOperand());
+  if (Reverse && !IsLoopInvariantStoreValue)
     Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, {},
                                CostKind, 0);
   return Cost;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 9a2cfbc35cb84f..8bd67109403d35 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -2253,7 +2253,9 @@ InstructionCost VPWidenMemoryRecipe::computeCost(ElementCount VF,
     Cost += Ctx.TTI.getMemoryOpCost(Ingredient.getOpcode(), Ty, Alignment, AS,
                                     CostKind, OpInfo, &Ingredient);
   }
-  if (!Reverse)
+  // If the store value is a live-in scalar value which is uniform, we don't
+  // need to calculate the reverse cost.
+  if (!Reverse || (isa<StoreInst>(Ingredient) && getOperand(1)->isLiveIn()))
     return Cost;
 
   return Cost += Ctx.TTI.getShuffleCost(TargetTransformInfo::SK_Reverse,
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-uniform-store.ll b/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-uniform-store.ll
new file mode 100644
index 00000000000000..7fd9b50a53836d
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-uniform-store.ll
@@ -0,0 +1,160 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f -S| FileCheck %s
+; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f -force-tail-folding-style=data-with-evl -S| FileCheck %s --check-prefixes=EVL
+; COM: From issue #109468
+target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128"
+target triple = "riscv64-unknown-linux-gnu"
+
+define void @lshift_significand(i32 %n, ptr nocapture writeonly %0) local_unnamed_addr #0 {
+; CHECK-LABEL: define void @lshift_significand(
+; CHECK-SAME: i32 [[N:%.*]], ptr nocapture writeonly [[TMP0:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[CMP1_PEEL:%.*]] = icmp eq i32 [[N]], 0
+; CHECK-NEXT:    [[SPEC_SELECT:%.*]] = select i1 [[CMP1_PEEL]], i64 2, i64 0
+; CHECK-NEXT:    [[TMP1:%.*]] = sub i64 3, [[SPEC_SELECT]]
+; CHECK-NEXT:    [[TMP2:%.*]] = sub i64 -1, [[TMP1]]
+; CHECK-NEXT:    [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP4:%.*]] = mul i64 [[TMP3]], 2
+; CHECK-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP2]], [[TMP4]]
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP7:%.*]] = mul i64 [[TMP6]], 2
+; CHECK-NEXT:    [[TMP8:%.*]] = sub i64 [[TMP7]], 1
+; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i64 [[TMP1]], [[TMP8]]
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP7]]
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT:    [[IND_END:%.*]] = add i64 [[SPEC_SELECT]], [[N_VEC]]
+; CHECK-NEXT:    [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP10:%.*]] = mul i64 [[TMP9]], 2
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = add i64 [[SPEC_SELECT]], [[INDEX]]
+; CHECK-NEXT:    [[TMP11:%.*]] = add i64 [[OFFSET_IDX]], 0
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[INDEX]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP12:%.*]] = call <vscale x 2 x i64> @llvm.stepvector.nxv2i64()
+; CHECK-NEXT:    [[TMP13:%.*]] = add <vscale x 2 x i64> zeroinitializer, [[TMP12]]
+; CHECK-NEXT:    [[VEC_IV:%.*]] = add <vscale x 2 x i64> [[BROADCAST_SPLAT]], [[TMP13]]
+; CHECK-NEXT:    [[TMP14:%.*]] = extractelement <vscale x 2 x i64> [[VEC_IV]], i32 0
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[TMP14]], i64 [[TMP1]])
+; CHECK-NEXT:    [[IDXPROM12:%.*]] = sub nuw nsw i64 1, [[TMP11]]
+; CHECK-NEXT:    [[ARRAYIDX13:%.*]] = getelementptr [3 x i64], ptr [[TMP0]], i64 0, i64 [[IDXPROM12]]
+; CHECK-NEXT:    [[TMP17:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP18:%.*]] = mul i64 [[TMP17]], 2
+; CHECK-NEXT:    [[TMP19:%.*]] = mul i64 0, [[TMP18]]
+; CHECK-NEXT:    [[TMP20:%.*]] = sub i64 1, [[TMP18]]
+; CHECK-NEXT:    [[TMP21:%.*]] = getelementptr i64, ptr [[ARRAYIDX13]], i64 [[TMP19]]
+; CHECK-NEXT:    [[TMP22:%.*]] = getelementptr i64, ptr [[TMP21]], i64 [[TMP20]]
+; CHECK-NEXT:    [[REVERSE:%.*]] = call <vscale x 2 x i1> @llvm.vector.reverse.nxv2i1(<vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT:    [[REVERSE1:%.*]] = call <vscale x 2 x i64> @llvm.vector.reverse.nxv2i64(<vscale x 2 x i64> zeroinitializer)
+; CHECK-NEXT:    call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[REVERSE1]], ptr [[TMP22]], i32 8, <vscale x 2 x i1> [[REVERSE]])
+; CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP10]]
+; CHECK-NEXT:    [[TMP23:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP23]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br i1 true, label %[[FOR_END16:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], %[[MIDDLE_BLOCK]] ], [ [[SPEC_SELECT]], %[[ENTRY]] ]
+; CHECK-NEXT:    br label %[[FOR_BODY9:.*]]
+; CHECK:       [[FOR_BODY9]]:
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY9]] ]
+; CHECK-NEXT:    [[TMP24:%.*]] = sub nuw nsw i64 1, [[INDVARS_IV]]
+; CHECK-NEXT:    [[ARRAYIDX14:%.*]] = getelementptr [3 x i64], ptr [[TMP0]], i64 0, i64 [[TMP24]]
+; CHECK-NEXT:    store i64 0, ptr [[ARRAYIDX14]], align 8
+; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 3
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[FOR_END16]], label %[[FOR_BODY9]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       [[FOR_END16]]:
+; CHECK-NEXT:    ret void
+;
+; EVL-LABEL: define void @lshift_significand(
+; EVL-SAME: i32 [[N:%.*]], ptr nocapture writeonly [[TMP0:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; EVL-NEXT:  [[ENTRY:.*]]:
+; EVL-NEXT:    [[CMP1_PEEL:%.*]] = icmp eq i32 [[N]], 0
+; EVL-NEXT:    [[SPEC_SELECT:%.*]] = select i1 [[CMP1_PEEL]], i64 2, i64 0
+; EVL-NEXT:    [[TMP1:%.*]] = sub i64 3, [[SPEC_SELECT]]
+; EVL-NEXT:    [[TMP2:%.*]] = sub i64 -1, [[TMP1]]
+; EVL-NEXT:    [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; EVL-NEXT:    [[TMP4:%.*]] = mul i64 [[TMP3]], 2
+; EVL-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP2]], [[TMP4]]
+; EVL-NEXT:    br i1 [[TMP5]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; EVL:       [[VECTOR_PH]]:
+; EVL-NEXT:    [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
+; EVL-NEXT:    [[TMP7:%.*]] = mul i64 [[TMP6]], 2
+; EVL-NEXT:    [[TMP8:%.*]] = sub i64 [[TMP7]], 1
+; EVL-NEXT:    [[N_RND_UP:%.*]] = add i64 [[TMP1]], [[TMP8]]
+; EVL-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP7]]
+; EVL-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; EVL-NEXT:    [[IND_END:%.*]] = add i64 [[SPEC_SELECT]], [[N_VEC]]
+; EVL-NEXT:    [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
+; EVL-NEXT:    [[TMP10:%.*]] = mul i64 [[TMP9]], 2
+; EVL-NEXT:    br label %[[VECTOR_BODY:.*]]
+; EVL:       [[VECTOR_BODY]]:
+; EVL-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; EVL-NEXT:    [[EVL_BASED_IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_EVL_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; EVL-NEXT:    [[TMP11:%.*]] = sub i64 [[TMP1]], [[EVL_BASED_IV]]
+; EVL-NEXT:    [[SUB11:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[TMP11]], i32 2, i1 true)
+; EVL-NEXT:    [[OFFSET_IDX:%.*]] = add i64 [[SPEC_SELECT]], [[EVL_BASED_IV]]
+; EVL-NEXT:    [[TMP13:%.*]] = add i64 [[OFFSET_IDX]], 0
+; EVL-NEXT:    [[TMP14:%.*]] = sub nuw nsw i64 1, [[TMP13]]
+; EVL-NEXT:    [[TMP15:%.*]] = getelementptr [3 x i64], ptr [[TMP0]], i64 0, i64 [[TMP14]]
+; EVL-NEXT:    [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
+; EVL-NEXT:    [[TMP17:%.*]] = mul i64 [[TMP16]], 2
+; EVL-NEXT:    [[TMP18:%.*]] = mul i64 0, [[TMP17]]
+; EVL-NEXT:    [[TMP19:%.*]] = sub i64 1, [[TMP17]]
+; EVL-NEXT:    [[TMP20:%.*]] = getelementptr i64, ptr [[TMP15]], i64 [[TMP18]]
+; EVL-NEXT:    [[TMP21:%.*]] = getelementptr i64, ptr [[TMP20]], i64 [[TMP19]]
+; EVL-NEXT:    [[VP_REVERSE:%.*]] = call <vscale x 2 x i64> @llvm.experimental.vp.reverse.nxv2i64(<vscale x 2 x i64> zeroinitializer, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer), i32 [[SUB11]])
+; EVL-NEXT:    call void @llvm.vp.store.nxv2i64.p0(<vscale x 2 x i64> [[VP_REVERSE]], ptr align 8 [[TMP21]], <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer), i32 [[SUB11]])
+; EVL-NEXT:    [[IDXPROM12:%.*]] = zext i32 [[SUB11]] to i64
+; EVL-NEXT:    [[INDEX_EVL_NEXT]] = add i64 [[IDXPROM12]], [[EVL_BASED_IV]]
+; EVL-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP10]]
+; EVL-NEXT:    [[TMP23:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; EVL-NEXT:    br i1 [[TMP23]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; EVL:       [[MIDDLE_BLOCK]]:
+; EVL-NEXT:    br i1 true, label %[[FOR_END16:.*]], label %[[SCALAR_PH]]
+; EVL:       [[SCALAR_PH]]:
+; EVL-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], %[[MIDDLE_BLOCK]] ], [ [[SPEC_SELECT]], %[[ENTRY]] ]
+; EVL-NEXT:    br label %[[FOR_BODY9:.*]]
+; EVL:       [[FOR_BODY9]]:
+; EVL-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY9]] ]
+; EVL-NEXT:    [[TMP24:%.*]] = sub nuw nsw i64 1, [[INDVARS_IV]]
+; EVL-NEXT:    [[ARRAYIDX13:%.*]] = getelementptr [3 x i64], ptr [[TMP0]], i64 0, i64 [[TMP24]]
+; EVL-NEXT:    store i64 0, ptr [[ARRAYIDX13]], align 8
+; EVL-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; EVL-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 3
+; EVL-NEXT:    br i1 [[EXITCOND_NOT]], label %[[FOR_END16]], label %[[FOR_BODY9]], !llvm.loop [[LOOP3:![0-9]+]]
+; EVL:       [[FOR_END16]]:
+; EVL-NEXT:    ret void
+;
+; Function Attrs: nofree norecurse nosync nounwind memory(argmem: write)
+entry:
+  %cmp1.peel = icmp eq i32 %n, 0
+  %spec.select = select i1 %cmp1.peel, i64 2, i64 0
+  br label %for.body9
+
+for.body9:                                        ; preds = %entry, %for.body9
+  %indvars.iv = phi i64 [ %spec.select, %entry ], [ %indvars.iv.next, %for.body9 ]
+  %1 = sub nuw nsw i64 1, %indvars.iv
+  %arrayidx13 = getelementptr [3 x i64], ptr %0, i64 0, i64 %1
+  store i64 0, ptr %arrayidx13, align 8
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond.not = icmp eq i64 %indvars.iv.next, 3
+  br i1 %exitcond.not, label %for.end16, label %for.body9
+
+for.end16:                                        ; preds = %for.body9
+  ret void
+}
+;.
+; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK: [[LOOP3]] = distinct !{[[LOOP3]], [[META2]], [[META1]]}
+;.
+; EVL: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; EVL: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; EVL: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; EVL: [[LOOP3]] = distinct !{[[LOOP3]], [[META2]], [[META1]]}
+;.

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f -S| FileCheck %s
; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f -force-tail-folding-style=data-with-evl -S| FileCheck %s --check-prefixes=EVL
; COM: From issue #109468
Contributor

What does COM stand for?

Contributor Author

ElvisWang123 Sep 23, 2024

COM stands for comment.

}
if (!Reverse)
// If the store value is a live-in scalar value which is uniform, we don't
// need to calculate the reverse cost.
Contributor

At the moment, the reverse will be done by the store recipe, which is why it is included here.

Could you elaborate on how this fixes a difference between the legacy and VPlan-based cost models? AFAICT the patch also extends the legacy cost model to include logic to skip shuffle costs for invariant store operands.

Contributor Author

This patch cannot fully fix the instruction cost difference between the VPlan-based and legacy cost models, but it makes the VF selection closer.

The costs changed after the patch as follows:

  • Scalar cost: 4.
  • Legacy cost model:
    • vscale x 1: from 1 (getMaskedMemoryOpCost) + 6 (shuffle cost) + 3 (other cost) to 1 + 3 (other cost)
    • vscale x 2: from 2 (getMaskedMemoryOpCost) + 11 (shuffle cost) + 3 (other cost) to 2 + 3 (other cost)
  • VPlan-based cost model:
    • vscale x 1: from 2 (getMemoryOpCost) + 6 (shuffle cost) + 3 (other cost) to 2 + 3 (other cost)
    • vscale x 2: from 3 (getMemoryOpCost) + 11 (shuffle cost) + 3 (other cost) to 3 + 3 (other cost)

The root cause is that the tail-folding mask is transformed into EVL by the VPlan transformations, and in the RISC-V TTI the cost returned by getMaskedMemoryOpCost() differs from the cost returned by getMemoryOpCost().
The legacy cost model cannot know whether the mask is a tail-folding mask that is valid to transform into EVL, so it queries getMaskedMemoryOpCost().
The VPlan-based cost model sees the EVL recipe with the tail mask removed, so it queries getMemoryOpCost().
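
In other words, for the same widened store the two models end up issuing different TTI queries. A minimal sketch of the contrast, assuming the standard TargetTransformInfo interface and that VectorTy, Alignment, AS, and CostKind are already set up for the store being costed:

// Legacy model: the tail-folding mask is still present, so the store is
// costed as a masked memory operation.
InstructionCost LegacyCost = TTI.getMaskedMemoryOpCost(
    Instruction::Store, VectorTy, Alignment, AS, CostKind);

// VPlan-based model: the EVL recipe has dropped the mask, so the store is
// costed as a plain memory operation. On RISC-V these two results differ.
InstructionCost VPlanCost = TTI.getMemoryOpCost(
    Instruction::Store, VectorTy, Alignment, AS, CostKind);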

Contributor

Thanks for elaborating.

IIUC the issue is that the legacy cost model doesn't know about EVL at all, but at the VPlan level we may not need a mask because we use EVL instead.

To match the legacy behavior, wouldn't it be better to implement computeCost for the EVL memory recipes and always include the mask cost, with a TODO to make this more accurate once the legacy cost model is retired?

Contributor Author

Thanks, that is a good idea!

I will implement computeCost for the load/store EVL recipes and always include the mask cost.
This fixes the difference between the legacy cost model and the VPlan-based cost model.
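
A rough sketch of what such an override could look like, assembled from the snippets quoted elsewhere in this review (helper names are assumed to match the existing VPWidenMemoryRecipe::computeCost; this is a sketch, not the final committed code):

InstructionCost VPWidenStoreEVLRecipe::computeCost(ElementCount VF,
                                                   VPCostContext &Ctx) const {
  Type *Ty = ToVectorTy(getLoadStoreType(&Ingredient), VF);
  const Align Alignment = getLoadStoreAlignment(&Ingredient);
  unsigned AS = getLoadStoreAddressSpace(&Ingredient);
  TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
  // Use getMaskedMemoryOpCost() rather than getMemoryOpCost() because the EVL
  // recipe replaces the tail mask with EVL while the legacy model still costs
  // the mask. TODO: switch to getMemoryOpCost() once the legacy cost model is
  // retired.
  InstructionCost Cost = Ctx.TTI.getMaskedMemoryOpCost(
      Ingredient.getOpcode(), Ty, Alignment, AS, CostKind);
  if (!Reverse)
    return Cost;
  // Reversed consecutive stores additionally pay for the reverse shuffle.
  return Cost + Ctx.TTI.getShuffleCost(TargetTransformInfo::SK_Reverse,
                                       cast<VectorType>(Ty), {}, CostKind, 0);
}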

if (Reverse)
const StoreInst *SI = dyn_cast<StoreInst>(I);
bool IsLoopInvariantStoreValue =
SI && Legal->isInvariant(const_cast<StoreInst *>(SI)->getValueOperand());
Contributor

isInvariant uses SCEV to determine loop invariance, while isLiveIn only returns true for values defined outside the VPlan. This may introduce additional divergences, where the operand is invariant via SCEV but defined inside the loop.

Contributor Author

You are right, this implementation may introduce extra divergences.
I think the check in the legacy cost model is what we want, but we cannot access that analysis in VPlanRecipes.cpp.

Do you have a better way to determine whether a value is loop invariant in the VPlan recipes?
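
For reference, these are the two checks being contrasted, as they appear in the current patch (shown side by side for illustration only):

// Legacy cost model (LoopVectorize.cpp): SCEV-based loop invariance, so a
// value computed inside the loop can still be treated as invariant.
const StoreInst *SI = dyn_cast<StoreInst>(I);
bool IsLoopInvariantStoreValue =
    SI && Legal->isInvariant(const_cast<StoreInst *>(SI)->getValueOperand());

// VPlan recipe (VPlanRecipes.cpp): only operands defined outside the plan
// (live-ins) are recognized, which is the stricter check noted above.
bool IsLiveInStoreValue =
    isa<StoreInst>(Ingredient) && getOperand(1)->isLiveIn();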

Currently the EVL recipes fold the tail mask into the EVL, but in the legacy
cost model the mask still exists and its instruction cost is included.
To remove this difference between the VPlan-based cost model and the legacy
cost model, we always include the instruction cost of the mask in the EVL recipes.

Note that the mask cost in the EVL recipes should be removed once we no longer
need to compare against the legacy cost model.

This patch also fixes llvm#109468.
ElvisWang123 changed the title from "[LV][VPlan] Not adding shuffle cost when store loop invariant value." to "[VPlan] Implement VPWidenStoreEVLRecipe::computeCost()." on Sep 23, 2024
Contributor

fhahn left a comment

Thanks for the update! Do we need a similar change for VPWidenLoadEVL?

Comment on lines 2477 to 2486
// TODO: Using the original IR may not be accurate.
// Currently, ARM will use the underlying IR to calculate gather/scatter
// instruction cost.
const Value *Ptr = getLoadStorePointerOperand(&Ingredient);
assert(!Reverse &&
"Inconsecutive memory access should not have the order.");
return Ctx.TTI.getAddressComputationCost(Ty) +
Ctx.TTI.getGatherScatterOpCost(Ingredient.getOpcode(), Ty, Ptr,
IsMasked, Alignment, CostKind,
&Ingredient);
Contributor

Suggested change
// TODO: Using the original IR may not be accurate.
// Currently, ARM will use the underlying IR to calculate gather/scatter
// instruction cost.
const Value *Ptr = getLoadStorePointerOperand(&Ingredient);
assert(!Reverse &&
"Inconsecutive memory access should not have the order.");
return Ctx.TTI.getAddressComputationCost(Ty) +
Ctx.TTI.getGatherScatterOpCost(Ingredient.getOpcode(), Ty, Ptr,
IsMasked, Alignment, CostKind,
&Ingredient);
return VPWidenStoreRecipe::computeCost();

Use cost from base class if possible?

If so, sink the variable assignments closer to their uses.

Contributor Author

Sure. I now reuse VPWidenMemoryRecipe::computeCost() when the load/store is not consecutive or is masked.
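
That is, the EVL overrides start with a guard along these lines (a sketch; the accessor names are assumed from the base recipe quoted earlier in this review):

// Non-consecutive or masked accesses keep using the shared cost computation
// (including the gather/scatter path) from the base class.
if (!Consecutive || isMasked())
  return VPWidenMemoryRecipe::computeCost(VF, Ctx);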

@@ -0,0 +1,25 @@
; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f
; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f -force-tail-folding-style=data-with-evl
Contributor

Do we need both run lines?

Please also check the IR output

%exitcond.not = icmp eq i64 %indvars.iv.next, 3
br i1 %exitcond.not, label %for.end16, label %for.body9

for.end16: ; preds = %for.body9
Contributor

Suggested change
for.end16: ; preds = %for.body9
exit:

Contributor Author

Removed, thanks.

br label %for.body9

for.body9: ; preds = %entry, %for.body9
%indvars.iv = phi i64 [ %spec.select, %entry ], [ %indvars.iv.next, %for.body9 ]
Contributor

Suggested change
%indvars.iv = phi i64 [ %spec.select, %entry ], [ %indvars.iv.next, %for.body9 ]
%iv = phi i64 [ %spec.select, %entry ], [ %indvars.iv.next, %for.body9 ]

Contributor Author

Renamed, thanks.

@@ -0,0 +1,25 @@
; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f
; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f -force-tail-folding-style=data-with-evl
; Generated from issue #109468.
Contributor

Could you add an explanation to the test? IIUC the important bit is that the store doesn't need a mask with EVL.

Contributor Author

Added, thanks.

for.body9: ; preds = %entry, %for.body9
%indvars.iv = phi i64 [ %spec.select, %entry ], [ %indvars.iv.next, %for.body9 ]
%1 = sub nuw nsw i64 1, %indvars.iv
%arrayidx13 = getelementptr [3 x i64], ptr %0, i64 0, i64 %1
Contributor

Suggested change
%arrayidx13 = getelementptr [3 x i64], ptr %0, i64 0, i64 %1
%arrayidx13 = getelementptr i64, ptr %0, i64 %1

Contributor Author

Removed, thanks.

%spec.select = select i1 %cmp1.peel, i64 2, i64 0
br label %for.body9

for.body9: ; preds = %entry, %for.body9
Contributor

Suggested change
for.body9: ; preds = %entry, %for.body9
loop:

Contributor Author

Removed, thanks.

target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128"
target triple = "riscv64-unknown-linux-gnu"

define void @lshift_significand(i32 %n, ptr nocapture writeonly %0) local_unnamed_addr #0 {
Contributor

Suggested change
define void @lshift_significand(i32 %n, ptr nocapture writeonly %0) local_unnamed_addr #0 {
define void @evl_store_cost(i32 %n, ptr nocapture writeonly %dst) {

Contributor Author

Renamed and removed, thanks.

Comment on lines 2489 to 2495
InstructionCost Cost = 0;
// We need to use the getMaskedMemoryOpCost() instead of getMemoryOpCost()
// here because the EVL recipes using EVL to replace the tail mask. But in the
// legacy model, it will always calculate the cost of mask.
// TODO: Using getMemoryOpCost() instead of getMaskedMemoryOpCost when we
// don't need to care the legacy cost model.
Cost += Ctx.TTI.getMaskedMemoryOpCost(Ingredient.getOpcode(), Ty, Alignment,
Contributor

Suggested change
InstructionCost Cost = 0;
// We need to use the getMaskedMemoryOpCost() instead of getMemoryOpCost()
// here because the EVL recipes using EVL to replace the tail mask. But in the
// legacy model, it will always calculate the cost of mask.
// TODO: Using getMemoryOpCost() instead of getMaskedMemoryOpCost when we
// don't need to care the legacy cost model.
Cost += Ctx.TTI.getMaskedMemoryOpCost(Ingredient.getOpcode(), Ty, Alignment,
// We need to use the getMaskedMemoryOpCost() instead of getMemoryOpCost()
// here because the EVL recipes using EVL to replace the tail mask. But in the
// legacy model, it will always calculate the cost of mask.
// TODO: Using getMemoryOpCost() instead of getMaskedMemoryOpCost when we
// don't need to care the legacy cost model.
InstructionCost Cost = Ctx.TTI.getMaskedMemoryOpCost(Ingredient.getOpcode(), Ty, Alignment,

Contributor Author

Updated, thanks.

Reuse VPWidenMemoryRecipe::computeCost().
Implement VPWidenLoadEVLRecipe::computeCost().
Rename basic blocks in the testcase.
ElvisWang123 changed the title from "[VPlan] Implement VPWidenStoreEVLRecipe::computeCost()." to "[VPlan] Implement VPWidenLoad/StoreEVLRecipe::computeCost()." on Sep 24, 2024
Contributor

fhahn left a comment

LGTM, thanks!

ElvisWang123 merged commit a068b97 into llvm:main on Sep 25, 2024
4 checks passed
ElvisWang123 deleted the fix-assert-gcc-vectorizer branch on September 25, 2024 23:14
Sterling-Augustine pushed a commit to Sterling-Augustine/llvm-project that referenced this pull request Sep 27, 2024
…9644)

Currently the EVL recipes fold the tail mask into the EVL, but in the legacy
cost model the mask still exists and its instruction cost is included.
To remove this difference between the VPlan-based cost model and the legacy
cost model, we always include the instruction cost of the mask in the EVL
recipes.

Note that the mask cost in the EVL recipes should be removed once we no longer
need to compare against the legacy cost model.

This patch also fixes llvm#109468.

Successfully merging this pull request may close these issues.

[LV][VPlan] Crash due to disagreements on the VPlan cost v.s. legacy cost model
