[VPlan] Implement VPWidenLoad/StoreEVLRecipe::computeCost(). #109644
Conversation
This patch makes VPWidenMemoryRecipe and the legacy model skip the reverse shuffle cost when the stored value is loop invariant. This patch also fixes llvm#109468.
@llvm/pr-subscribers-llvm-transforms

Author: Elvis Wang (ElvisWang123)

Changes: This patch makes VPWidenMemoryRecipe and the legacy model skip the reverse shuffle cost when the stored value is loop invariant. This patch also fixes #109468.

Full diff: https://github.com/llvm/llvm-project/pull/109644.diff

3 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 806dbd5bb7304e..ec43cafc685a9e 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5666,7 +5666,10 @@ LoopVectorizationCostModel::getConsecutiveMemOpCost(Instruction *I,
}
bool Reverse = ConsecutiveStride < 0;
- if (Reverse)
+ const StoreInst *SI = dyn_cast<StoreInst>(I);
+ bool IsLoopInvariantStoreValue =
+ SI && Legal->isInvariant(const_cast<StoreInst *>(SI)->getValueOperand());
+ if (Reverse && !IsLoopInvariantStoreValue)
Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, {},
CostKind, 0);
return Cost;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 9a2cfbc35cb84f..8bd67109403d35 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -2253,7 +2253,9 @@ InstructionCost VPWidenMemoryRecipe::computeCost(ElementCount VF,
Cost += Ctx.TTI.getMemoryOpCost(Ingredient.getOpcode(), Ty, Alignment, AS,
CostKind, OpInfo, &Ingredient);
}
- if (!Reverse)
+ // If the store value is a live-in scalar value which is uniform, we don't
+ // need to calculate the reverse cost.
+ if (!Reverse || (isa<StoreInst>(Ingredient) && getOperand(1)->isLiveIn()))
return Cost;
return Cost += Ctx.TTI.getShuffleCost(TargetTransformInfo::SK_Reverse,
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-uniform-store.ll b/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-uniform-store.ll
new file mode 100644
index 00000000000000..7fd9b50a53836d
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-uniform-store.ll
@@ -0,0 +1,160 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f -S| FileCheck %s
+; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f -force-tail-folding-style=data-with-evl -S| FileCheck %s --check-prefixes=EVL
+; COM: From issue #109468
+target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128"
+target triple = "riscv64-unknown-linux-gnu"
+
+define void @lshift_significand(i32 %n, ptr nocapture writeonly %0) local_unnamed_addr #0 {
+; CHECK-LABEL: define void @lshift_significand(
+; CHECK-SAME: i32 [[N:%.*]], ptr nocapture writeonly [[TMP0:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: [[CMP1_PEEL:%.*]] = icmp eq i32 [[N]], 0
+; CHECK-NEXT: [[SPEC_SELECT:%.*]] = select i1 [[CMP1_PEEL]], i64 2, i64 0
+; CHECK-NEXT: [[TMP1:%.*]] = sub i64 3, [[SPEC_SELECT]]
+; CHECK-NEXT: [[TMP2:%.*]] = sub i64 -1, [[TMP1]]
+; CHECK-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 2
+; CHECK-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP2]], [[TMP4]]
+; CHECK-NEXT: br i1 [[TMP5]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 2
+; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1
+; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[TMP1]], [[TMP8]]
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP7]]
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT: [[IND_END:%.*]] = add i64 [[SPEC_SELECT]], [[N_VEC]]
+; CHECK-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 2
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[OFFSET_IDX:%.*]] = add i64 [[SPEC_SELECT]], [[INDEX]]
+; CHECK-NEXT: [[TMP11:%.*]] = add i64 [[OFFSET_IDX]], 0
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[INDEX]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = call <vscale x 2 x i64> @llvm.stepvector.nxv2i64()
+; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 2 x i64> zeroinitializer, [[TMP12]]
+; CHECK-NEXT: [[VEC_IV:%.*]] = add <vscale x 2 x i64> [[BROADCAST_SPLAT]], [[TMP13]]
+; CHECK-NEXT: [[TMP14:%.*]] = extractelement <vscale x 2 x i64> [[VEC_IV]], i32 0
+; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[TMP14]], i64 [[TMP1]])
+; CHECK-NEXT: [[IDXPROM12:%.*]] = sub nuw nsw i64 1, [[TMP11]]
+; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr [3 x i64], ptr [[TMP0]], i64 0, i64 [[IDXPROM12]]
+; CHECK-NEXT: [[TMP17:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP18:%.*]] = mul i64 [[TMP17]], 2
+; CHECK-NEXT: [[TMP19:%.*]] = mul i64 0, [[TMP18]]
+; CHECK-NEXT: [[TMP20:%.*]] = sub i64 1, [[TMP18]]
+; CHECK-NEXT: [[TMP21:%.*]] = getelementptr i64, ptr [[ARRAYIDX13]], i64 [[TMP19]]
+; CHECK-NEXT: [[TMP22:%.*]] = getelementptr i64, ptr [[TMP21]], i64 [[TMP20]]
+; CHECK-NEXT: [[REVERSE:%.*]] = call <vscale x 2 x i1> @llvm.vector.reverse.nxv2i1(<vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT: [[REVERSE1:%.*]] = call <vscale x 2 x i64> @llvm.vector.reverse.nxv2i64(<vscale x 2 x i64> zeroinitializer)
+; CHECK-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[REVERSE1]], ptr [[TMP22]], i32 8, <vscale x 2 x i1> [[REVERSE]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP10]]
+; CHECK-NEXT: [[TMP23:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP23]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br i1 true, label %[[FOR_END16:.*]], label %[[SCALAR_PH]]
+; CHECK: [[SCALAR_PH]]:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], %[[MIDDLE_BLOCK]] ], [ [[SPEC_SELECT]], %[[ENTRY]] ]
+; CHECK-NEXT: br label %[[FOR_BODY9:.*]]
+; CHECK: [[FOR_BODY9]]:
+; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY9]] ]
+; CHECK-NEXT: [[TMP24:%.*]] = sub nuw nsw i64 1, [[INDVARS_IV]]
+; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr [3 x i64], ptr [[TMP0]], i64 0, i64 [[TMP24]]
+; CHECK-NEXT: store i64 0, ptr [[ARRAYIDX14]], align 8
+; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 3
+; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_END16]], label %[[FOR_BODY9]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK: [[FOR_END16]]:
+; CHECK-NEXT: ret void
+;
+; EVL-LABEL: define void @lshift_significand(
+; EVL-SAME: i32 [[N:%.*]], ptr nocapture writeonly [[TMP0:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; EVL-NEXT: [[ENTRY:.*]]:
+; EVL-NEXT: [[CMP1_PEEL:%.*]] = icmp eq i32 [[N]], 0
+; EVL-NEXT: [[SPEC_SELECT:%.*]] = select i1 [[CMP1_PEEL]], i64 2, i64 0
+; EVL-NEXT: [[TMP1:%.*]] = sub i64 3, [[SPEC_SELECT]]
+; EVL-NEXT: [[TMP2:%.*]] = sub i64 -1, [[TMP1]]
+; EVL-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; EVL-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 2
+; EVL-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP2]], [[TMP4]]
+; EVL-NEXT: br i1 [[TMP5]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; EVL: [[VECTOR_PH]]:
+; EVL-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
+; EVL-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 2
+; EVL-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1
+; EVL-NEXT: [[N_RND_UP:%.*]] = add i64 [[TMP1]], [[TMP8]]
+; EVL-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP7]]
+; EVL-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; EVL-NEXT: [[IND_END:%.*]] = add i64 [[SPEC_SELECT]], [[N_VEC]]
+; EVL-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
+; EVL-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 2
+; EVL-NEXT: br label %[[VECTOR_BODY:.*]]
+; EVL: [[VECTOR_BODY]]:
+; EVL-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; EVL-NEXT: [[EVL_BASED_IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_EVL_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; EVL-NEXT: [[TMP11:%.*]] = sub i64 [[TMP1]], [[EVL_BASED_IV]]
+; EVL-NEXT: [[SUB11:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[TMP11]], i32 2, i1 true)
+; EVL-NEXT: [[OFFSET_IDX:%.*]] = add i64 [[SPEC_SELECT]], [[EVL_BASED_IV]]
+; EVL-NEXT: [[TMP13:%.*]] = add i64 [[OFFSET_IDX]], 0
+; EVL-NEXT: [[TMP14:%.*]] = sub nuw nsw i64 1, [[TMP13]]
+; EVL-NEXT: [[TMP15:%.*]] = getelementptr [3 x i64], ptr [[TMP0]], i64 0, i64 [[TMP14]]
+; EVL-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
+; EVL-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 2
+; EVL-NEXT: [[TMP18:%.*]] = mul i64 0, [[TMP17]]
+; EVL-NEXT: [[TMP19:%.*]] = sub i64 1, [[TMP17]]
+; EVL-NEXT: [[TMP20:%.*]] = getelementptr i64, ptr [[TMP15]], i64 [[TMP18]]
+; EVL-NEXT: [[TMP21:%.*]] = getelementptr i64, ptr [[TMP20]], i64 [[TMP19]]
+; EVL-NEXT: [[VP_REVERSE:%.*]] = call <vscale x 2 x i64> @llvm.experimental.vp.reverse.nxv2i64(<vscale x 2 x i64> zeroinitializer, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer), i32 [[SUB11]])
+; EVL-NEXT: call void @llvm.vp.store.nxv2i64.p0(<vscale x 2 x i64> [[VP_REVERSE]], ptr align 8 [[TMP21]], <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer), i32 [[SUB11]])
+; EVL-NEXT: [[IDXPROM12:%.*]] = zext i32 [[SUB11]] to i64
+; EVL-NEXT: [[INDEX_EVL_NEXT]] = add i64 [[IDXPROM12]], [[EVL_BASED_IV]]
+; EVL-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP10]]
+; EVL-NEXT: [[TMP23:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; EVL-NEXT: br i1 [[TMP23]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; EVL: [[MIDDLE_BLOCK]]:
+; EVL-NEXT: br i1 true, label %[[FOR_END16:.*]], label %[[SCALAR_PH]]
+; EVL: [[SCALAR_PH]]:
+; EVL-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], %[[MIDDLE_BLOCK]] ], [ [[SPEC_SELECT]], %[[ENTRY]] ]
+; EVL-NEXT: br label %[[FOR_BODY9:.*]]
+; EVL: [[FOR_BODY9]]:
+; EVL-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY9]] ]
+; EVL-NEXT: [[TMP24:%.*]] = sub nuw nsw i64 1, [[INDVARS_IV]]
+; EVL-NEXT: [[ARRAYIDX13:%.*]] = getelementptr [3 x i64], ptr [[TMP0]], i64 0, i64 [[TMP24]]
+; EVL-NEXT: store i64 0, ptr [[ARRAYIDX13]], align 8
+; EVL-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; EVL-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 3
+; EVL-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_END16]], label %[[FOR_BODY9]], !llvm.loop [[LOOP3:![0-9]+]]
+; EVL: [[FOR_END16]]:
+; EVL-NEXT: ret void
+;
+; Function Attrs: nofree norecurse nosync nounwind memory(argmem: write)
+entry:
+ %cmp1.peel = icmp eq i32 %n, 0
+ %spec.select = select i1 %cmp1.peel, i64 2, i64 0
+ br label %for.body9
+
+for.body9: ; preds = %entry, %for.body9
+ %indvars.iv = phi i64 [ %spec.select, %entry ], [ %indvars.iv.next, %for.body9 ]
+ %1 = sub nuw nsw i64 1, %indvars.iv
+ %arrayidx13 = getelementptr [3 x i64], ptr %0, i64 0, i64 %1
+ store i64 0, ptr %arrayidx13, align 8
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %exitcond.not = icmp eq i64 %indvars.iv.next, 3
+ br i1 %exitcond.not, label %for.end16, label %for.body9
+
+for.end16: ; preds = %for.body9
+ ret void
+}
+;.
+; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK: [[LOOP3]] = distinct !{[[LOOP3]], [[META2]], [[META1]]}
+;.
+; EVL: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; EVL: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; EVL: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; EVL: [[LOOP3]] = distinct !{[[LOOP3]], [[META2]], [[META1]]}
+;.
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f -S| FileCheck %s
; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f -force-tail-folding-style=data-with-evl -S| FileCheck %s --check-prefixes=EVL
; COM: From issue #109468
What does COM stand for?
COM stands for comments.
}
if (!Reverse)
// If the store value is a live-in scalar value which is uniform, we don't
// need to calculate the reverse cost.
At the moment, the reverse will be done by the store recipe, which is why it is included here.
Could you elaborate on how this fixes a difference between the legacy and VPlan-based cost models? AFAICT the patch also extends the legacy cost model with logic to skip shuffle costs for invariant store operands.
This patch cannot fully fix the instruction cost difference between the VPlan-based and legacy cost models, but it makes the VF selection closer.
The costs change as follows after the patch:
- Scalar cost: 4.
- Legacy cost model:
  - vscale x 1: from 1 (MaskedMemoryOpCost) + 6 (shuffle cost) + 3 (other cost) to 1 + 3 (other cost)
  - vscale x 2: from 2 (MaskedMemoryOpCost) + 11 (shuffle cost) + 3 (other cost) to 2 + 3 (other cost)
- VPlan-based cost model:
  - vscale x 1: from 2 (MemoryOpCost) + 6 (shuffle cost) + 3 (other cost) to 2 + 3 (other cost)
  - vscale x 2: from 3 (MemoryOpCost) + 11 (shuffle cost) + 3 (other cost) to 3 + 3 (other cost)
The root cause is that the tail-folding mask is transformed into EVL after the VPlan transformations.
In RISCVTTI, the instruction cost returned by getMaskedMemoryOpCost() differs from the cost returned by getMemoryOpCost().
The legacy cost model cannot know whether the mask comes from tail folding and whether it is valid to transform it into EVL, so it uses getMaskedMemoryOpCost() to get the instruction cost.
The VPlan-based cost model knows the recipe is an EVL recipe and that the tail mask has been removed, so it queries getMemoryOpCost().
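For illustration, a minimal sketch of the divergence described above, assuming the standard TTI interface; VectorTy, Alignment, AS and CostKind stand in for the values already computed at the respective call sites:

// Legacy cost model: the tail-folding mask is still present, so the masked
// memory-op cost is queried.
InstructionCost LegacyCost = TTI.getMaskedMemoryOpCost(
    Instruction::Store, VectorTy, Alignment, AS, CostKind);

// VPlan-based cost model: the EVL transform has removed the mask, so the
// plain memory-op cost is queried instead. On RISC-V the two results can
// differ, which is the source of the mismatch.
InstructionCost VPlanCost = TTI.getMemoryOpCost(
    Instruction::Store, VectorTy, Alignment, AS, CostKind);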
Thanks for elaborating
IIUC the issue is that the legacy cost model doesn't know about EVL at all, but at the VPlan level we may not need a mask due to using EVL instead.
To match the legacy behavior, wouldn't it be better to implement computeCost for the EVL memory recipes and always include the mask cost, with a TODO to make this more accurate once the legacy cost model is retired?
Thanks, that is a good idea!
I will implement computeCost for the load/store EVL recipes and always calculate the mask cost.
This implementation fixes the difference between the legacy cost model and the VPlan-based cost model.
if (Reverse)
const StoreInst *SI = dyn_cast<StoreInst>(I);
bool IsLoopInvariantStoreValue =
    SI && Legal->isInvariant(const_cast<StoreInst *>(SI)->getValueOperand());
isInvariant uses SCEV to determine loop invariance, while isLiveIn only returns true for values defined outside the VPlan. This may introduce additional divergences where the operand is invariant via SCEV but defined inside the loop.
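To make the difference concrete, a small sketch drawn from the two checks in this patch; IsLiveInStoreValue is an illustrative name, the other identifiers appear in the diff above:

// Legacy model (LoopVectorize.cpp): SCEV-based invariance, which also covers
// values that are defined inside the loop but provably invariant.
bool IsLoopInvariantStoreValue =
    SI && Legal->isInvariant(const_cast<StoreInst *>(SI)->getValueOperand());

// VPlan recipe (VPlanRecipes.cpp): only true for values defined outside the
// plan, i.e. live-ins; an in-loop but invariant value is not covered.
bool IsLiveInStoreValue =
    isa<StoreInst>(Ingredient) && getOperand(1)->isLiveIn();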
You are right, using this implementation may introduce extra divergences.
I think the check in the legacy cost model is what we want, but we cannot access that analysis in VPlanRecipes.
Do you have a better way to figure out whether the value is loop invariant in VPlanRecipes?
Currently the EVL recipes transfer the tail masking to the EVL. But in the legacy cost model, the mask still exists and its instruction cost is calculated. To fix the difference between the VPlan-based cost model and the legacy cost model, we always calculate the instruction cost of the mask in the EVL recipes. Note that we should remove the mask cost from the EVL recipes once we no longer need to compare against the legacy cost model. This patch also fixes llvm#109468.
fhahn
left a comment
Thanks for the update! Do we need a similar change for VPWidenLoadEVL?
// TODO: Using the original IR may not be accurate.
// Currently, ARM will use the underlying IR to calculate gather/scatter
// instruction cost.
const Value *Ptr = getLoadStorePointerOperand(&Ingredient);
assert(!Reverse &&
       "Inconsecutive memory access should not have the order.");
return Ctx.TTI.getAddressComputationCost(Ty) +
       Ctx.TTI.getGatherScatterOpCost(Ingredient.getOpcode(), Ty, Ptr,
                                      IsMasked, Alignment, CostKind,
                                      &Ingredient);
Suggested change: replace the block above with
    return VPWidenStoreRecipe::computeCost();
Use the cost from the base class if possible?
If so, sink the variable assignments closer to their use.
Sure. Reuse VPWidenMemoryRecipe::computeCost() when the load/store is not consecutive or is masked.
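Roughly, the dispatch could look like the sketch below. This is only an outline, assuming Ty, Alignment and AS are derived from the underlying store as in the base VPWidenMemoryRecipe::computeCost(); the actual recipe code in the final patch may differ:

InstructionCost VPWidenStoreEVLRecipe::computeCost(ElementCount VF,
                                                   VPCostContext &Ctx) const {
  // Gather/scatter or masked accesses: reuse the shared logic in the base
  // VPWidenMemoryRecipe::computeCost().
  if (!Consecutive || IsMasked)
    return VPWidenMemoryRecipe::computeCost(VF, Ctx);

  // Ty, Alignment and AS are assumed to be derived as in the base class.
  TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
  // Charge the masked memory-op cost so the result stays in sync with the
  // legacy cost model (see the TODO discussed above).
  InstructionCost Cost = Ctx.TTI.getMaskedMemoryOpCost(
      Ingredient.getOpcode(), Ty, Alignment, AS, CostKind);
  if (!Reverse)
    return Cost;
  return Cost + Ctx.TTI.getShuffleCost(TargetTransformInfo::SK_Reverse,
                                       cast<VectorType>(Ty), {}, CostKind, 0);
}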
@@ -0,0 +1,25 @@
; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f
; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f -force-tail-folding-style=data-with-evl
Do we need both RUN lines?
Please also check the IR output.
%exitcond.not = icmp eq i64 %indvars.iv.next, 3
br i1 %exitcond.not, label %for.end16, label %for.body9

for.end16:                                        ; preds = %for.body9
Suggested change: rename the for.end16 label to exit.
Removed, thanks.
br label %for.body9

for.body9:                                        ; preds = %entry, %for.body9
%indvars.iv = phi i64 [ %spec.select, %entry ], [ %indvars.iv.next, %for.body9 ]
Suggested change: rename %indvars.iv to %iv.
Renamed, thanks.
@@ -0,0 +1,25 @@
; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f
; RUN: opt < %s --prefer-predicate-over-epilogue=predicate-dont-vectorize --passes=loop-vectorize -mcpu=sifive-p470 -mattr=+v,+f -force-tail-folding-style=data-with-evl
; Generated from issue #109468.
Could you add an explanation to the test? IIUC the important bit is that the store doesn't need a mask with EVL?
Added, thanks.
for.body9:                                        ; preds = %entry, %for.body9
%indvars.iv = phi i64 [ %spec.select, %entry ], [ %indvars.iv.next, %for.body9 ]
%1 = sub nuw nsw i64 1, %indvars.iv
%arrayidx13 = getelementptr [3 x i64], ptr %0, i64 0, i64 %1
Suggested change: %arrayidx13 = getelementptr i64, ptr %0, i64 %1
Removed, thanks.
%spec.select = select i1 %cmp1.peel, i64 2, i64 0
br label %for.body9

for.body9:                                        ; preds = %entry, %for.body9
Suggested change: rename the for.body9 label to loop.
Removed, thanks.
target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128"
target triple = "riscv64-unknown-linux-gnu"

define void @lshift_significand(i32 %n, ptr nocapture writeonly %0) local_unnamed_addr #0 {
Suggested change: define void @evl_store_cost(i32 %n, ptr nocapture writeonly %dst) {
Renamed and removed, thanks.
InstructionCost Cost = 0;
// We need to use the getMaskedMemoryOpCost() instead of getMemoryOpCost()
// here because the EVL recipes using EVL to replace the tail mask. But in the
// legacy model, it will always calculate the cost of mask.
// TODO: Using getMemoryOpCost() instead of getMaskedMemoryOpCost when we
// don't need to care the legacy cost model.
Cost += Ctx.TTI.getMaskedMemoryOpCost(Ingredient.getOpcode(), Ty, Alignment,
Suggested change: fold the initialization into the declaration, keeping the comment block above it:
    InstructionCost Cost = Ctx.TTI.getMaskedMemoryOpCost(Ingredient.getOpcode(), Ty, Alignment,
Updated, thanks.
Reuse VPWidenMemoryRecipe::computeCost(). Implement VPWidenLoadEVLRecipe::computeCost(). Rename basic blocks in the testcase.
fhahn
left a comment
LGTM, thanks!
Currently the EVL recipes transfer the tail masking to the EVL.
But in the legacy cost model, the mask still exists and its instruction
cost is calculated.
To fix the difference between the VPlan-based cost model and the legacy
cost model, we always calculate the instruction cost of the mask in the
EVL recipes.
Note that we should remove the mask cost from the EVL recipes once we no
longer need to compare against the legacy cost model.
This patch also fixes #109468.