llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (35 additions, 0 deletions)

@@ -1084,6 +1084,40 @@ void VPlanTransforms::simplifyRecipes(VPlan &Plan, Type &CanonicalIVTy) {
}
}

static void convertToUniformRecipes(VPlan &Plan) {
Collaborator:

Suggested change:
-static void convertToUniformRecipes(VPlan &Plan) {
+static void convertToSingleScalarRecipes(VPlan &Plan) {

as this captures both uniformity and only-first-lane-used? Also affects title of patch.

Analogous to truncateToMinimalBitwidths(), which aims to reduce each lane to fewer bits, this aims to reduce each part to the fewest lanes: one. Perhaps both should start with "narrow", as used in the now-inlined lambda.

Contributor (author):

Updated to narrowToSingleScalarRecipe, thanks

  auto TryToNarrow = [](VPBasicBlock *VPBB) {
    for (VPRecipeBase &R : make_early_inc_range(reverse(*VPBB))) {
      // Try to narrow wide and replicating recipes to uniform recipes,
      // based on VPlan analysis.
      auto *Def = dyn_cast<VPSingleDefRecipe>(&R);
      if (!Def || !isa<VPReplicateRecipe, VPWidenRecipe>(Def) ||
          !Def->getUnderlyingValue())
Collaborator:

Suggested change:
-          !Def->getUnderlyingValue())

This can be asserted instead if desired; these recipes must have underlying values in order to know what to replicate or widen.
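
Concretely, the suggestion amounts to something like the following (a sketch only; the assertion message is invented):

```cpp
// Instead of bailing out, document the invariant the reviewer describes:
assert(Def->getUnderlyingValue() &&
       "replicate/widen recipes are expected to carry an underlying value");
```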

Contributor (author):

It could be asserted for VPReplicateRecipe, but not for VPWidenRecipe, which does not require an underlying value.

Collaborator:

Hmm, if VPWidenRecipe uses the (optional) underlying value only for metadata and FMF, could it be replaced with VPInstruction?

Contributor (author):

Yep, we should be able to do that soon. I'll give it a try; I need to check whether we rely in various places on the encoded fact that certain recipes are widened.

        continue;

Collaborator:

Suggested change:
-      auto *Def = dyn_cast<VPSingleDefRecipe>(&R);
-      if (!Def || !isa<VPReplicateRecipe, VPWidenRecipe>(Def) ||
-          !Def->getUnderlyingValue())
-        continue;

Contributor (author):

Done, thanks.

      auto *RepR = dyn_cast<VPReplicateRecipe>(&R);
      if (RepR && RepR->isUniform())
        continue;
Collaborator:

Suggested change:
-      auto *RepR = dyn_cast<VPReplicateRecipe>(&R);
-      if (RepR && RepR->isUniform())
-        continue;
+      auto *RepR = dyn_cast<VPReplicateRecipe>(&R);
+      if (!RepR && !isa<VPWidenRecipe>(&R))
+        continue;
+      if (RepR && RepR->isUniform())
+        continue;

Contributor (author):

Done, thanks.


Collaborator:

Suggested change:
+      auto *SingleDef = cast<VPSingleDefRecipe>(&R);

or RepOrWidenR.

Contributor (author):

Done, thanks.

      // Skip recipes that aren't uniform and don't have only their scalar
Collaborator:

Suggested change:
-      // Skip recipes that aren't uniform and don't have only their scalar
+      // Skip recipes that aren't single scalar or don't have only their scalar

Contributor (author):
The "and" here should be accurate: it skips cases that have non-scalar uses, as handling those may require introducing broadcasts. This is something that will be generalized in the future.

Collaborator:
The code has an ||?

Contributor (author):
Ah yes, I was thinking about the recipes we process below; updated, thanks.

      // results used. In the later case, we would introduce extra broadcasts.
Collaborator:

Suggested change:
-      // results used. In the later case, we would introduce extra broadcasts.
+      // results used. In the latter case, we would introduce extra broadcasts.

Contributor (author):

Done, thanks.

      if (!vputils::isUniformAfterVectorization(Def) ||
Collaborator:
The term "UniformAfterVectorization" and VPReplicateRecipe's "uniform" field should be renamed. Being uniform (having the same value for all lanes) is independent of being before or after vectorization. The term stands for "singleLane" or "singleScalar", which is typically associated with the first lane (as in unit-stride, i.e. clearly non-uniform, GEPs whose first lane is the only one used), and with all lanes when the value is uniform.
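
To illustrate the distinction with a hypothetical sketch (these helpers are invented for exposition and are not LLVM API):

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLExtras.h"
using namespace llvm;

// "Uniform": every lane holds the same value.
static bool isUniform(ArrayRef<int> Lanes) { return all_equal(Lanes); }

// "Single scalar": one scalar per part suffices, either because the value
// is uniform, or because only lane 0 is ever demanded (as for a unit-stride
// GEP whose first lane feeds a consecutive memory access).
static bool isSingleScalar(ArrayRef<int> Lanes, ArrayRef<unsigned> UsedLanes) {
  bool OnlyLaneZeroUsed = UsedLanes.size() == 1 && UsedLanes[0] == 0;
  return OnlyLaneZeroUsed || all_equal(Lanes);
}
```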

Contributor (author):
Agreed, this is long overdue! #140134

          any_of(Def->users(),
                 [Def](VPUser *U) { return !U->usesScalars(Def); }))
        continue;

      auto *Clone = new VPReplicateRecipe(Def->getUnderlyingInstr(),
                                          Def->operands(), /*IsUniform*/ true);
      Clone->insertBefore(Def);
      Def->replaceAllUsesWith(Clone);
      Def->eraseFromParent();
    }
  };

  for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(
           vp_depth_first_shallow(Plan.getVectorLoopRegion()->getEntry())))
    TryToNarrow(VPBB);
Contributor:
Is there a particular reason we're using a lambda here?

Collaborator:
As if this pass had a class of its own, whose main runPass() called a TryToNarrow() method on each basic block, as in runOnBasicBlock(), independently (and conceptually in parallel).
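
A sketch of that structure (hypothetical scaffolding; VPlan transforms are currently plain static functions, so nothing like this class exists):

```cpp
struct NarrowToSingleScalar {
  // Per-block work: the body of the TryToNarrow lambda. Blocks are
  // processed independently of one another.
  void runOnBasicBlock(VPBasicBlock *VPBB);

  void runPass(VPlan &Plan) {
    for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(
             vp_depth_first_shallow(Plan.getVectorLoopRegion()->getEntry())))
      runOnBasicBlock(VPBB);
  }
};
```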

Contributor (author):
I have some follow-up patches that may call this on additional blocks, which was why I had the lambda originally. Inlined for now, thanks.

}

/// Normalize and simplify VPBlendRecipes. Should be run after simplifyRecipes
/// to make sure the masks are simplified.
static void simplifyBlends(VPlan &Plan) {
@@ -1778,6 +1812,7 @@ void VPlanTransforms::optimize(VPlan &Plan) {
  runPass(simplifyRecipes, Plan, *Plan.getCanonicalIV()->getScalarType());
  runPass(simplifyBlends, Plan);
  runPass(removeDeadRecipes, Plan);
  runPass(convertToUniformRecipes, Plan);
Contributor:
Do we need to apply uniform analysis if VF=1? If not, could we skip it when VF=1?

Collaborator:
Hmm, I suspect so, given that in that case all replicate recipes should already be "uniform" and widen recipes are irrelevant.

Contributor (author):
Yep, added a check isScalarVFOnly() to the transform
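
A minimal sketch of that early exit (assuming VPlan's hasScalarVFOnly() query; the exact spelling in the committed patch may differ):

```cpp
static void narrowToSingleScalarRecipes(VPlan &Plan) {
  // With VF=1, every recipe already produces a single scalar per part,
  // so there is nothing to narrow.
  if (Plan.hasScalarVFOnly())
    return;
  // ... per-block narrowing as above ...
}
```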

  runPass(legalizeAndOptimizeInductions, Plan);
  runPass(removeRedundantExpandSCEVRecipes, Plan);
  runPass(simplifyRecipes, Plan, *Plan.getCanonicalIV()->getScalarType());
llvm/test/Transforms/LoopVectorize/SystemZ/pr47665.ll (17 additions, 16 deletions)

@@ -7,86 +7,87 @@ define void @test(ptr %p, i40 %a) {
; CHECK-NEXT: entry:
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
+; CHECK-NEXT: [[TMP0:%.*]] = icmp sgt i1 true, false
Collaborator:
Is this the said degradation?

Contributor (author):

Yep, this is trivial folding that at the moment happens in IRBuilder for VPWidenRecipe, but not for replicate recipes, which clone the original instruction. Will be fixed by the pending VPlan constant folder.
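
For intuition, a minimal standalone sketch of the folding difference (assuming stock IRBuilder with its default constant folder):

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
using namespace llvm;

void demo() {
  LLVMContext Ctx;
  IRBuilder<> B(Ctx);
  // IRBuilder constant-folds comparisons of constants on creation, so a
  // widened recipe emitting through it never materializes this icmp:
  Value *Folded = B.CreateICmpSGT(B.getTrue(), B.getFalse()); // i1 false
  (void)Folded;
  // A replicate recipe instead clones the original instruction verbatim,
  // with no folding -- hence the icmp surviving in the test output below.
}
```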

; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: br i1 true, label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
; CHECK: pred.store.if:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE]]
; CHECK: pred.store.continue:
; CHECK-NEXT: br i1 true, label [[PRED_STORE_IF1:%.*]], label [[PRED_STORE_CONTINUE2:%.*]]
; CHECK: pred.store.if1:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE2]]
; CHECK: pred.store.continue2:
; CHECK-NEXT: br i1 true, label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4:%.*]]
; CHECK: pred.store.if3:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE4]]
; CHECK: pred.store.continue4:
; CHECK-NEXT: br i1 true, label [[PRED_STORE_IF5:%.*]], label [[PRED_STORE_CONTINUE6:%.*]]
; CHECK: pred.store.if5:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE6]]
; CHECK: pred.store.continue6:
; CHECK-NEXT: br i1 true, label [[PRED_STORE_IF7:%.*]], label [[PRED_STORE_CONTINUE8:%.*]]
; CHECK: pred.store.if7:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE8]]
; CHECK: pred.store.continue8:
; CHECK-NEXT: br i1 true, label [[PRED_STORE_IF9:%.*]], label [[PRED_STORE_CONTINUE10:%.*]]
; CHECK: pred.store.if9:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE10]]
; CHECK: pred.store.continue10:
; CHECK-NEXT: br i1 true, label [[PRED_STORE_IF11:%.*]], label [[PRED_STORE_CONTINUE12:%.*]]
; CHECK: pred.store.if11:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE12]]
; CHECK: pred.store.continue12:
; CHECK-NEXT: br i1 true, label [[PRED_STORE_IF13:%.*]], label [[PRED_STORE_CONTINUE14:%.*]]
; CHECK: pred.store.if13:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE14]]
; CHECK: pred.store.continue14:
; CHECK-NEXT: br i1 true, label [[PRED_STORE_IF15:%.*]], label [[PRED_STORE_CONTINUE16:%.*]]
; CHECK: pred.store.if15:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE16]]
; CHECK: pred.store.continue16:
; CHECK-NEXT: br i1 true, label [[PRED_STORE_IF17:%.*]], label [[PRED_STORE_CONTINUE18:%.*]]
; CHECK: pred.store.if17:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE18]]
; CHECK: pred.store.continue18:
; CHECK-NEXT: br i1 false, label [[PRED_STORE_IF19:%.*]], label [[PRED_STORE_CONTINUE20:%.*]]
; CHECK: pred.store.if19:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE20]]
; CHECK: pred.store.continue20:
; CHECK-NEXT: br i1 false, label [[PRED_STORE_IF21:%.*]], label [[PRED_STORE_CONTINUE22:%.*]]
; CHECK: pred.store.if21:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE22]]
; CHECK: pred.store.continue22:
; CHECK-NEXT: br i1 false, label [[PRED_STORE_IF23:%.*]], label [[PRED_STORE_CONTINUE24:%.*]]
; CHECK: pred.store.if23:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE24]]
; CHECK: pred.store.continue24:
; CHECK-NEXT: br i1 false, label [[PRED_STORE_IF25:%.*]], label [[PRED_STORE_CONTINUE26:%.*]]
; CHECK: pred.store.if25:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE26]]
; CHECK: pred.store.continue26:
; CHECK-NEXT: br i1 false, label [[PRED_STORE_IF27:%.*]], label [[PRED_STORE_CONTINUE28:%.*]]
; CHECK: pred.store.if27:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE28]]
; CHECK: pred.store.continue28:
; CHECK-NEXT: br i1 false, label [[PRED_STORE_IF29:%.*]], label [[PRED_STORE_CONTINUE30:%.*]]
; CHECK: pred.store.if29:
-; CHECK-NEXT: store i1 false, ptr [[P]], align 1
+; CHECK-NEXT: store i1 [[TMP0]], ptr [[P]], align 1
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE30]]
; CHECK: pred.store.continue30:
; CHECK-NEXT: br label [[MIDDLE_BLOCK:%.*]]
llvm/test/Transforms/LoopVectorize/X86/cost-model.ll (1 addition, 4 deletions)

@@ -890,9 +890,7 @@ define i64 @cost_assume(ptr %end, i64 %N) {
; CHECK: vector.ph:
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 8
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
-; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[N:%.*]], i64 0
-; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i64> [[BROADCAST_SPLATINSERT]], <2 x i64> poison, <2 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP3:%.*]] = icmp ne <2 x i64> [[BROADCAST_SPLAT]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne i64 [[N:%.*]], 0
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
@@ -904,7 +902,6 @@ define i64 @cost_assume(ptr %end, i64 %N) {
; CHECK-NEXT: [[TMP8]] = add <2 x i64> [[VEC_PHI2]], splat (i64 1)
; CHECK-NEXT: [[TMP9]] = add <2 x i64> [[VEC_PHI3]], splat (i64 1)
; CHECK-NEXT: [[TMP10]] = add <2 x i64> [[VEC_PHI4]], splat (i64 1)
-; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i1> [[TMP3]], i32 0
; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
@@ -159,9 +159,6 @@ define void @versioned_sext_use_in_gep(i32 %scale, ptr %dst, i64 %scale.2) {
; CHECK-NEXT: [[IDENT_CHECK:%.*]] = icmp ne i32 [[SCALE]], 1
; CHECK-NEXT: br i1 [[IDENT_CHECK]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
-; CHECK-NEXT: [[TMP8:%.*]] = getelementptr i8, ptr [[DST]], i64 [[SCALE_2]]
-; CHECK-NEXT: [[TMP81:%.*]] = getelementptr i8, ptr [[DST]], i64 [[SCALE_2]]
-; CHECK-NEXT: [[TMP82:%.*]] = getelementptr i8, ptr [[DST]], i64 [[SCALE_2]]
; CHECK-NEXT: [[TMP83:%.*]] = getelementptr i8, ptr [[DST]], i64 [[SCALE_2]]
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
@@ -174,10 +171,10 @@ define void @versioned_sext_use_in_gep(i32 %scale, ptr %dst, i64 %scale.2) {
; CHECK-NEXT: [[TMP13:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP12]]
; CHECK-NEXT: [[TMP15:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP14]]
; CHECK-NEXT: [[TMP17:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP16]]
-; CHECK-NEXT: store ptr [[TMP8]], ptr [[TMP11]], align 8
-; CHECK-NEXT: store ptr [[TMP8]], ptr [[TMP13]], align 8
-; CHECK-NEXT: store ptr [[TMP8]], ptr [[TMP15]], align 8
-; CHECK-NEXT: store ptr [[TMP8]], ptr [[TMP17]], align 8
+; CHECK-NEXT: store ptr [[TMP83]], ptr [[TMP11]], align 8
+; CHECK-NEXT: store ptr [[TMP83]], ptr [[TMP13]], align 8
+; CHECK-NEXT: store ptr [[TMP83]], ptr [[TMP15]], align 8
+; CHECK-NEXT: store ptr [[TMP83]], ptr [[TMP17]], align 8
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
; CHECK-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256
; CHECK-NEXT: br i1 [[TMP18]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]