Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -7291,6 +7291,7 @@ DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan(
// cost model is complete for better cost estimates.
VPlanTransforms::runPass(VPlanTransforms::unrollByUF, BestVPlan, BestUF,
OrigLoop->getHeader()->getContext());
VPlanTransforms::runPass(VPlanTransforms::unrollByVF, BestVPlan, BestVF);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it reasonable to unroll by VF before unrolling by UF rather than afterwards? BestVF is conceptually chosen before BestUF.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason to run it after unrolling by UF is that the current position matches the order VPReplicateRecipes currently generate code, so less test changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the general idea that by unrolling the replicate recipes we open up more folding/dce opportunities within vplan, to reduce the burden on dce/instcombine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exposing more folding opportunities is a nice benefit, although besides that I think the main benefits are simpler recipe ::execute and simpler state management (VPTransformState).

It also encourages more flexible/general recipes and adds utilities that can be useful in the future, e.g. BuildVector for SLP and potentially also allows more accurate cost estimates, by making vector <-> scalar transitions explicit.

VPlanTransforms::runPass(VPlanTransforms::materializeBroadcasts, BestVPlan);
VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);
VPlanTransforms::simplifyRecipes(BestVPlan, *Legal->getWidestInductionType());
Expand Down
8 changes: 8 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlan.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -261,6 +261,14 @@ Value *VPTransformState::get(const VPValue *Def, const VPLane &Lane) {
return Data.VPV2Scalars[Def][0];
}

// Look through BuildVector to avoid redundant extracts.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done now rather than later to reduce test diff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep without the fold we would have additional insert/extracts for uses inside replicate regions, with corresponding test changes.

// TODO: Remove once replicate regions are unrolled explicitly.
auto *BV = dyn_cast<VPInstruction>(Def);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto *BV = dyn_cast<VPInstruction>(Def);
auto *BuildV = dyn_cast<VPInstruction>(Def);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks!

if (Lane.getKind() == VPLane::Kind::First && BV &&
BV->getOpcode() == VPInstruction::BuildVector) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth matching?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matched using a matcher w/o ops, thanks.

return get(BV->getOperand(Lane.getKnownLane()), true);
}

assert(hasVectorValue(Def));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent: missing error message.

auto *VecPart = Data.VPV2Vector[Def];
if (!VecPart->getType()->isVectorTy()) {
Expand Down
6 changes: 6 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlan.h
Original file line number Diff line number Diff line change
Expand Up @@ -907,6 +907,12 @@ class VPInstruction : public VPRecipeWithIRFlags,
BranchOnCount,
BranchOnCond,
Broadcast,
/// Creates a vector containing all operands. The vector element count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work with scalable vectors? It sounds very similar to the BuildVector ISD node in codegen. If it's not expected to work for scalable element counts it's worth spelling it out explicitly in the comments I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep scalable vectors aren't supported, updated to say Creates a fixed-width vector ..., thanks

/// matches the number of operands.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Creates a vector containing all operands. The vector element count
/// matches the number of operands.
/// Creates a vector containing all operands. The number of operands
/// matches the vector element count.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks

BuildVector,
/// Creates a struct of vectors containing all operands. The vector element
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added fixed-width vectors here

/// count matches the number of operands.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Creates a struct of vectors containing all operands. The vector element
/// count matches the number of operands.
/// Creates a struct of vectors containing all operands. The number of operands
/// matches the number of fields in the struct.

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated thanks

BuildStructVector,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lex order has BuildVector after BuildStructVector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reordered, thanks

ComputeAnyOfResult,
ComputeFindLastIVResult,
ComputeReductionResult,
Expand Down
2 changes: 2 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,8 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::AnyOf:
case VPInstruction::BuildVector:
case VPInstruction::BuildStructVector:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case VPInstruction::BuildVector:
case VPInstruction::BuildStructVector:
case VPInstruction::BuildStructVector:
case VPInstruction::BuildVector:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated thanks

return SetResultTyFromOp();
case VPInstruction::FirstActiveLane:
return Type::getIntNTy(Ctx, 64);
Expand Down
61 changes: 43 additions & 18 deletions llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -493,6 +493,9 @@ Value *VPInstruction::generate(VPTransformState &State) {
}
case Instruction::ExtractElement: {
assert(State.VF.isVector() && "Only extract elements from vectors");
return State.get(getOperand(0),
VPLane(cast<ConstantInt>(getOperand(1)->getLiveInIRValue())
->getZExtValue()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return State.get(getOperand(0),
VPLane(cast<ConstantInt>(getOperand(1)->getLiveInIRValue())
->getZExtValue()));
auto ElementToExtract = cast<ConstantInt>(getOperand(1)->getLiveInIRValue())->getZExtValue());
return State.get(getOperand(0), VPLane(ElementToExtract));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thank.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If second operand of ExtractElement must be a (small, less than VF) compile-time constant, would it be better held in VPIRFlags, given that other flags are unneeded?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be possible, although in theory the VF could be > 255, which would exceed VPIRFlags' char?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed theoretically, but this deals with explicit unrolling by VF, which may be worth bounding to prevent code bloat, which may also impact performance. I.e., VF's larger than 255 could still be supported if only vectors are produced. (OTOH, legalization could also take place later...).

Value *Vec = State.get(getOperand(0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are lines 499-501 not deleted, given that we've already returned on line 496?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep removed, thanks

Value *Idx = State.get(getOperand(1), /*IsScalar=*/true);
return Builder.CreateExtractElement(Vec, Idx, Name);
Comment on lines 559 to 561
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Value *Vec = State.get(getOperand(0));
Value *Idx = State.get(getOperand(1), /*IsScalar=*/true);
return Builder.CreateExtractElement(Vec, Idx, Name);

unreachable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed thanks

Expand Down Expand Up @@ -604,6 +607,33 @@ Value *VPInstruction::generate(VPTransformState &State) {
return Builder.CreateVectorSplat(
State.VF, State.get(getOperand(0), /*IsScalar*/ true), "broadcast");
}
case VPInstruction::BuildVector: {
auto *ScalarTy = State.TypeAnalysis.inferScalarType(getOperand(0));
Value *Res = PoisonValue::get(
toVectorizedTy(ScalarTy, ElementCount::getFixed(getNumOperands())));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Value *Res = PoisonValue::get(
toVectorizedTy(ScalarTy, ElementCount::getFixed(getNumOperands())));
auto NumOfElements = ElementCount::getFixed(getNumOperands());
Value *Res = PoisonValue::get(toVectorizedTy(ScalarTy, NumOfElements));

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks

for (const auto &[Idx, Op] : enumerate(operands()))
Res = State.Builder.CreateInsertElement(Res, State.get(Op, true),
State.Builder.getInt32(Idx));
return Res;
}
case VPInstruction::BuildStructVector: {
// For struct types, we need to build a new 'wide' struct type, where each
// element is widened.
auto *STy =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto *STy =
auto *StructTy =

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks

cast<StructType>(State.TypeAnalysis.inferScalarType(getOperand(0)));
Value *Res = PoisonValue::get(
toVectorizedTy(STy, ElementCount::getFixed(getNumOperands())));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Value *Res = PoisonValue::get(
toVectorizedTy(STy, ElementCount::getFixed(getNumOperands())));
auto NumOfElements = ElementCount::getFixed(getNumOperands());
Value *Res = PoisonValue::get(toVectorizedTy(StructTy, NumOfElements));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done thanks

for (const auto &[Idx, Op] : enumerate(operands())) {
for (unsigned I = 0, E = STy->getNumElements(); I != E; I++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (unsigned I = 0, E = STy->getNumElements(); I != E; I++) {
for (unsigned I = 0; I < NumOfElements; I++) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks

Value *ScalarValue = Builder.CreateExtractValue(State.get(Op, true), I);
Value *VectorValue = Builder.CreateExtractValue(Res, I);
VectorValue =
Builder.CreateInsertElement(VectorValue, ScalarValue, Idx);
Res = Builder.CreateInsertValue(Res, VectorValue, I);
}
}
return Res;
}
case VPInstruction::ComputeAnyOfResult: {
// FIXME: The cross-recipe dependency on VPReductionPHIRecipe is temporary
// and will be removed by breaking up the recipe further.
Expand Down Expand Up @@ -872,10 +902,11 @@ void VPInstruction::execute(VPTransformState &State) {
if (!hasResult())
return;
assert(GeneratedValue && "generate must produce a value");
assert(
(GeneratedValue->getType()->isVectorTy() == !GeneratesPerFirstLaneOnly ||
State.VF.isScalar()) &&
"scalar value but not only first lane defined");
assert((((GeneratedValue->getType()->isVectorTy() ||
GeneratedValue->getType()->isStructTy()) ==
!GeneratesPerFirstLaneOnly) ||
State.VF.isScalar()) &&
"scalar value but not only first lane defined");
State.set(this, GeneratedValue,
/*IsScalar*/ GeneratesPerFirstLaneOnly);
}
Expand All @@ -889,6 +920,8 @@ bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
case Instruction::ICmp:
case Instruction::Select:
case VPInstruction::AnyOf:
case VPInstruction::BuildVector:
case VPInstruction::BuildStructVector:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case VPInstruction::BuildVector:
case VPInstruction::BuildStructVector:
case VPInstruction::BuildStructVector:
case VPInstruction::BuildVector:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks

case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::ExtractLastElement:
Expand Down Expand Up @@ -1008,6 +1041,12 @@ void VPInstruction::print(raw_ostream &O, const Twine &Indent,
case VPInstruction::Broadcast:
O << "broadcast";
break;
case VPInstruction::BuildVector:
O << "buildvector";
break;
case VPInstruction::BuildStructVector:
O << "buildstructvector";
break;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case VPInstruction::BuildVector:
O << "buildvector";
break;
case VPInstruction::BuildStructVector:
O << "buildstructvector";
break;
case VPInstruction::BuildStructVector:
O << "buildstructvector";
break;
case VPInstruction::BuildVector:
O << "buildvector";
break;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks

case VPInstruction::ExtractLastElement:
O << "extract-last-element";
break;
Expand Down Expand Up @@ -2763,20 +2802,6 @@ void VPReplicateRecipe::execute(VPTransformState &State) {
scalarizeInstruction(UI, this, VPLane(0), State);
return;
}
Comment on lines 2735 to 2737
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handle simpler single-scalar case first and early-return, assert one of above two cases applies:

  if (!State.Lane) {
    assert(IsSingleScalar && "...");
    scalarizeInstruction(UI, this, VPLane(0), State);
    return;
  }

  // Generate a single instance.
  ...

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks


// A store of a loop varying value to a uniform address only needs the last
// copy of the store.
if (isa<StoreInst>(UI) && vputils::isSingleScalar(getOperand(1))) {
auto Lane = VPLane::getLastLaneForVF(State.VF);
scalarizeInstruction(UI, this, VPLane(Lane), State);
return;
}

// Generate scalar instances for all VF lanes.
assert(!State.VF.isScalable() && "Can't scalarize a scalable vector");
const unsigned EndLane = State.VF.getKnownMinValue();
for (unsigned Lane = 0; Lane < EndLane; ++Lane)
scalarizeInstruction(UI, this, VPLane(Lane), State);
}

bool VPReplicateRecipe::shouldPack() const {
Expand Down
16 changes: 16 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1140,6 +1140,22 @@ static void simplifyRecipe(VPRecipeBase &R, VPTypeAnalysis &TypeInfo) {
return;
}

// Look through Extract(Last|Penultimate)Element (BuildVector ....).
if (match(&R,
m_VPInstruction<VPInstruction::ExtractLastElement>(m_VPValue(A))) ||
match(&R, m_VPInstruction<VPInstruction::ExtractPenultimateElement>(
m_VPValue(A)))) {
unsigned Offset = cast<VPInstruction>(&R)->getOpcode() ==
VPInstruction::ExtractLastElement
? 1
: 2;
auto *BV = dyn_cast<VPInstruction>(A);
if (BV && BV->getOpcode() == VPInstruction::BuildVector) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use match here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a m_BuildVector matcher, which for which we don't support matching the operands, which is variable. That seems to work OK with @ayalz's suggestion below.

Def->replaceAllUsesWith(BV->getOperand(BV->getNumOperands() - Offset));
return;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simpler to match each pattern separately?

Suggested change
// Look through Extract(Last|Penultimate)Element (BuildVector ....).
if (match(&R,
m_VPInstruction<VPInstruction::ExtractLastElement>(m_VPValue(A))) ||
match(&R, m_VPInstruction<VPInstruction::ExtractPenultimateElement>(
m_VPValue(A)))) {
unsigned Offset = cast<VPInstruction>(&R)->getOpcode() ==
VPInstruction::ExtractLastElement
? 1
: 2;
auto *BV = dyn_cast<VPInstruction>(A);
if (BV && BV->getOpcode() == VPInstruction::BuildVector) {
Def->replaceAllUsesWith(BV->getOperand(BV->getNumOperands() - Offset));
return;
}
}
// Look through ExtractLastElement (BuildVector ....).
if (match(&R, m_VPInstruction<VPInstruction::ExtractLastElement>(
m_VPInstruction<VPInstruction::BuildVector>(BuildV))) {
Def->replaceAllUsesWith(BV->getOperand(BuildV->getNumOperands() - 1));
return;
}
// Look through ExtractPenultimateElement (BuildVector ....).
if (match(&R, m_VPInstruction<VPInstruction::ExtractPenultimateElement>(
m_VPInstruction<VPInstruction::BuildVector>(BuildV))) {
Def->replaceAllUsesWith(BV->getOperand(BuildV->getNumOperands() - 2));
return;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks. The current version doesn't caputre the BuildVector VPInstruction though


// Some simplifications can only be applied after unrolling. Perform them
// below.
if (!Plan->isUnrolled())
Expand Down
4 changes: 4 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlanTransforms.h
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,10 @@ struct VPlanTransforms {
/// Explicitly unroll \p Plan by \p UF.
static void unrollByUF(VPlan &Plan, unsigned UF, LLVMContext &Ctx);

/// Explicitly unroll VPReplicateRecipes outside of replicate regions by \p
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps converting replicate recipes to (or replacing with) sets of clone recipes per lane would be more accurate than "unroll".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated thanks

/// VF.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If "outside of replicate regions" is temporary, better noted under TODO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done thanks

static void unrollByVF(VPlan &Plan, ElementCount VF);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps more accurate to call it replicateByVF().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adjusted thanks


/// Optimize \p Plan based on \p BestVF and \p BestUF. This may restrict the
/// resulting plan to \p BestVF and \p BestUF.
static void optimizeForVFAndUF(VPlan &Plan, ElementCount BestVF,
Expand Down
81 changes: 81 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
#include "VPlan.h"
#include "VPlanAnalysis.h"
#include "VPlanCFG.h"
#include "VPlanHelpers.h"
#include "VPlanPatternMatch.h"
#include "VPlanTransforms.h"
#include "VPlanUtils.h"
Expand Down Expand Up @@ -430,3 +431,83 @@ void VPlanTransforms::unrollByUF(VPlan &Plan, unsigned UF, LLVMContext &Ctx) {

VPlanTransforms::removeDeadRecipes(Plan);
}

/// Create a single-scalar clone of RepR for lane \p Lane.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Create a single-scalar clone of RepR for lane \p Lane.
/// Create a single-scalar clone of \p RepR for lane \p Lane.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done thanks

static VPReplicateRecipe *cloneForLane(VPlan &Plan, VPBuilder &Builder,
Type *IdxTy, VPReplicateRecipe *RepR,
VPLane Lane) {
// Collect the operands at Lane, creating extracts as needed.
SmallVector<VPValue *> NewOps;
for (VPValue *Op : RepR->operands()) {
if (vputils::isSingleScalar(Op)) {
NewOps.push_back(Op);
continue;
}
VPValue *Ext;
if (Lane.getKind() == VPLane::Kind::ScalableLast) {
Ext = Builder.createNaryOp(VPInstruction::ExtractLastElement, {Op});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Ext = Builder.createNaryOp(VPInstruction::ExtractLastElement, {Op});
Ext = Builder.createNaryOp(VPInstruction::ExtractLastElement, {Op});
NewOps.push_back(Ext);
continue;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done thanks

} else {
// Look through buildvector to avoid unnecessary extracts.
auto *BV = dyn_cast<VPInstruction>(Op);
if (BV && BV->getOpcode() == VPInstruction::BuildVector) {
NewOps.push_back(BV->getOperand(Lane.getKnownLane()));
continue;
}
VPValue *Idx =
Plan.getOrAddLiveIn(ConstantInt::get(IdxTy, Lane.getKnownLane()));
Ext = Builder.createNaryOp(Instruction::ExtractElement, {Op, Idx});
}
NewOps.push_back(Ext);
}

auto *New =
new VPReplicateRecipe(RepR->getUnderlyingInstr(), NewOps,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be a VPInstruction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually yes, but more work is needed separately to use VPInstruction for single-scalar VPReplicateRecipes: #141429, #140623

/*IsSingleScalar=*/true, /*Mask=*/nullptr, *RepR);
New->insertBefore(RepR);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That this failed to call transferFlags here, which resulted in miscompiles. I've created #147398 to fix.

return New;
}

void VPlanTransforms::unrollByVF(VPlan &Plan, ElementCount VF) {
Type *IdxTy = IntegerType::get(
Plan.getScalarHeader()->getIRBasicBlock()->getContext(), 32);
for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(
vp_depth_first_shallow(Plan.getVectorLoopRegion()->getEntry()))) {
for (VPRecipeBase &R : make_early_inc_range(*VPBB)) {
auto *RepR = dyn_cast<VPReplicateRecipe>(&R);
if (!RepR || RepR->isSingleScalar())
continue;

VPBuilder Builder(RepR);
SmallVector<VPValue *> LaneDefs;
// Stores to invariant addresses only need to store the last lane.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Stores to invariant addresses only need to store the last lane.
// Stores to invariant addresses need to store the last lane only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done thanks

if (isa<StoreInst>(RepR->getUnderlyingInstr()) &&
vputils::isSingleScalar(RepR->getOperand(1))) {
cloneForLane(Plan, Builder, IdxTy, RepR, VPLane::getLastLaneForVF(VF));
RepR->eraseFromParent();
continue;
}

/// Create single-scalar version of RepR for all lanes.
for (unsigned I = 0; I != VF.getKnownMinValue(); ++I)
LaneDefs.push_back(cloneForLane(Plan, Builder, IdxTy, RepR, VPLane(I)));

/// Users that only demand the first lane can use the definition for lane
/// 0.
RepR->replaceUsesWithIf(LaneDefs[0], [RepR](VPUser &U, unsigned) {
return U.onlyFirstLaneUsed(RepR);
});

Type *ResTy = RepR->getUnderlyingInstr()->getType();
// If needed, create a Build(Struct)Vector recipe to insert the scalar
// lane values into a vector.
Comment on lines +526 to +527
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So a pair of replicating recipes one feeding the other is replaced by VF recipes feeding a buildVector which VF other recipes extract from, where the extracts are optimized away by cloneForLane(); and the buildVector possibly by dce?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

if (!ResTy->isVoidTy()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ResTy is void (better check for empty users instead?) suffice to clone for lanes and erase from parent, w/o populating LaneDefs, handling all stores early following those to a single address.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks

VPValue *VecRes = Builder.createNaryOp(
ResTy->isStructTy() ? VPInstruction::BuildStructVector
: VPInstruction::BuildVector,
LaneDefs);
RepR->replaceAllUsesWith(VecRes);
}
RepR->eraseFromParent();
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -398,12 +398,6 @@ define void @test_for_tried_to_force_scalar(ptr noalias %A, ptr noalias %B, ptr
; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <12 x float> [[WIDE_VEC]], <12 x float> poison, <4 x i32> <i32 0, i32 3, i32 6, i32 9>
; CHECK-NEXT: [[TMP30:%.*]] = extractelement <4 x float> [[STRIDED_VEC]], i32 3
; CHECK-NEXT: store float [[TMP30]], ptr [[C:%.*]], align 4
; CHECK-NEXT: [[TMP31:%.*]] = extractelement <4 x ptr> [[TMP29]], i32 0
; CHECK-NEXT: [[TMP38:%.*]] = load float, ptr [[TMP31]], align 4
; CHECK-NEXT: [[TMP33:%.*]] = extractelement <4 x ptr> [[TMP29]], i32 1
; CHECK-NEXT: [[TMP32:%.*]] = load float, ptr [[TMP33]], align 4
; CHECK-NEXT: [[TMP35:%.*]] = extractelement <4 x ptr> [[TMP29]], i32 2
; CHECK-NEXT: [[TMP34:%.*]] = load float, ptr [[TMP35]], align 4
; CHECK-NEXT: [[TMP37:%.*]] = extractelement <4 x ptr> [[TMP29]], i32 3
; CHECK-NEXT: [[TMP36:%.*]] = load float, ptr [[TMP37]], align 4
; CHECK-NEXT: store float [[TMP36]], ptr [[B:%.*]], align 4
Expand Down
Loading
Loading