[LV] Vectorize FMax via OrderedFCmpSelect w/o fast-math flags. #146711

Open: wants to merge 19 commits into base: main (showing changes from 4 commits)

8 changes: 6 additions & 2 deletions llvm/include/llvm/Analysis/IVDescriptors.h
@@ -47,6 +47,9 @@ enum class RecurKind {
FMul, ///< Product of floats.
FMin, ///< FP min implemented in terms of select(cmp()).
FMax, ///< FP max implemented in terms of select(cmp()).
FCmpOGTSelect, ///< FP max implemented in terms of select(cmp()), but without
Collaborator

The naming issue was admittedly raised before. This is still documented and titled as dealing with "FP max", and also affects loops doing fcmp ugt and select, as in a testcase at the bottom. If RecurKind tries to capture the explicit pattern in the input IR, it may need to accommodate a variety of compare predicates. Signed and unsigned max and min, OTOH, try to abstract cmp/select pairs having a variety of lt/le/gt/ge predicates and same or reversed operands. Is this aiming to handle Ordered and/or Unordered FMax reduction of sets that may include NaNs (in terms of select(cmp()))?

Contributor Author

Yep, for now this should be restricted to selects with ordered compares, so the result is only NaN if the start value is NaN.

It is not restricted to OGT; OLT also works. It needs to be a strict predicate to use FindFirstIV. For non-strict ones we would have to use FindLastIV.

Updated to OrderedFCmpSelect, wdyt?
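
To illustrate why strictness matters for the index tracking, here is a minimal scalar sketch. It is not code from this PR; `selectMax` and its `Strict` template flag are hypothetical names chosen for illustration. With a strict compare the recorded index is the first occurrence of the maximum, which is what a FindFirstIV-style reduction can recover; with a non-strict compare it is the last occurrence.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: scalar select(fcmp) max reduction that also records the
// index at which the running maximum was last replaced.
// Strict = true models `x > max` (ogt); Strict = false models `x >= max` (oge).
// An index of SIZE_MAX means the start value was never replaced.
template <bool Strict>
std::pair<float, std::size_t> selectMax(const std::vector<float> &In,
                                        float Start) {
  float Max = Start;
  std::size_t Idx = static_cast<std::size_t>(-1);
  for (std::size_t I = 0; I < In.size(); ++I) {
    bool Take = Strict ? (In[I] > Max) : (In[I] >= Max);
    if (Take) {
      Max = In[I];
      Idx = I;
    }
  }
  return {Max, Idx};
}
```

For the input {1, 3, 3, 2} with start 0, the strict variant records index 1 and the non-strict variant records index 2, while both return the same maximum.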

/// any fast-math flags. Users need to handle NaNs and signed
/// zeros when generating code.
Collaborator

Pattern and how ("users need") to handle it should indeed be explained, but preferably elsewhere. Pattern seems to suggest how to handle FP reductions (min, max, possibly others as well?) in the presence of NaNs and/or signed zeroes (both equally challenging?), which is evaded in the presence of certain fast-math flags (namely absence of nans and signed zeroes?).

Does the following sound right:
a. If the set is NaN-free, its reduction result is as with the fast-math flag.
b. If the set contains only NaN's, its reduction is either NaN or the initial value, depending on the reduction operation being unordered or ordered, respectively.
c. If the set contains both NaN's and non-NaN's, its reduction is either NaN or the reduction of all non-NaN's, depending on the reduction operation being unordered or ordered, respectively.

The vector of partial subset reduction results of case (a) contains only non-NaNs, and is subject to standard final reduction. In case (b), this vector holds only NaNs or only the initial value, which provides the respective final value. Case (c) requires "tie breaking" based on index? What if the initial value is NaN?

Contributor Author

> Pattern and how ("users need") to handle it should indeed be explained, but preferably elsewhere. Pattern seems to suggest how to handle FP reductions (min, max, possibly others as well?) in the presence of NaNs and/or signed zeroes (both equally challenging?), which is evaded in the presence of certain fast-math flags (namely absence of nans and signed zeroes?).

Explanation is currently interleaved in handleFMaxReductionsWithoutFastMath, should it go elsewhere?

> Does the following sound right: a. If the set is NaN-free, its reduction result is as with the fast-math flag.

If it is also free of signed zeroes, yep.

> b. If the set contains only NaN's, its reduction is either NaN or the initial value, depending on the reduction operation being unordered or ordered, respectively.

Yep, updated to require ordered predicates, so it would be the start value if all-NaNs.

> c. If the set contains both NaN's and non-NaN's, its reduction is either NaN or the reduction of all non-NaN's, depending on the reduction operation being unordered or ordered, respectively.

Yep, restricted to just ordered for now.

> The vector of partial subset reduction results of case (a) contains only non-NaNs, and is subject to standard final reduction. In case (b), this vector holds only NaNs or only the initial value, which provides the respective final value. Case (c) requires "tie breaking" based on index? What if the initial value is NaN?

Yep for cases a) and b).

For case c), if there is any non-NaN value (either start or any value in the loop), the reduction result is non-NaN. If any lane is non-NaN in the partial reduction vector, it will get selected.

The tie-breaking is mainly needed for signed zeros, where we need to pick the first one. Without tie-breaking, a horizontal fmax will return +0.0 if the vector contains both -0.0 and +0.0, but if -0.0 was seen first, it must be the one selected, according to its index.

FMinimum, ///< FP min with llvm.minimum semantics
FMaximum, ///< FP max with llvm.maximum semantics
FMinimumNum, ///< FP min with llvm.minimumnum semantics
@@ -250,8 +253,9 @@ class RecurrenceDescriptor {
/// Returns true if the recurrence kind is a floating-point min/max kind.
static bool isFPMinMaxRecurrenceKind(RecurKind Kind) {
return Kind == RecurKind::FMin || Kind == RecurKind::FMax ||
Kind == RecurKind::FMinimum || Kind == RecurKind::FMaximum ||
Kind == RecurKind::FMinimumNum || Kind == RecurKind::FMaximumNum;
Kind == RecurKind::FCmpOGTSelect || Kind == RecurKind::FMinimum ||
Kind == RecurKind::FMaximum || Kind == RecurKind::FMinimumNum ||
Kind == RecurKind::FMaximumNum;
}

/// Returns true if the recurrence kind is any min/max kind.
15 changes: 11 additions & 4 deletions llvm/lib/Analysis/IVDescriptors.cpp
@@ -819,7 +819,8 @@ RecurrenceDescriptor::isMinMaxPattern(Instruction *I, RecurKind Kind,
if (match(I, m_OrdOrUnordFMin(m_Value(), m_Value())))
return InstDesc(Kind == RecurKind::FMin, I);
Collaborator

Only max is handled by OrderedFCmpSelect, not min? Can start w/ FMaxOGT only.

if (match(I, m_OrdOrUnordFMax(m_Value(), m_Value())))
return InstDesc(Kind == RecurKind::FMax, I);
return InstDesc(Kind == RecurKind::FMax || Kind == RecurKind::FCmpOGTSelect,
I);
if (match(I, m_FMinNum(m_Value(), m_Value())))
return InstDesc(Kind == RecurKind::FMin, I);
if (match(I, m_FMaxNum(m_Value(), m_Value())))
@@ -941,10 +942,15 @@ RecurrenceDescriptor::InstDesc RecurrenceDescriptor::isRecurrenceInstr(
m_Intrinsic<Intrinsic::minimumnum>(m_Value(), m_Value())) ||
match(I, m_Intrinsic<Intrinsic::maximumnum>(m_Value(), m_Value()));
};
if (isIntMinMaxRecurrenceKind(Kind) ||
(HasRequiredFMF() && isFPMinMaxRecurrenceKind(Kind)))
if (isIntMinMaxRecurrenceKind(Kind))
return isMinMaxPattern(I, Kind, Prev);
else if (isFMulAddIntrinsic(I))
if (isFPMinMaxRecurrenceKind(Kind)) {
if (HasRequiredFMF())
return isMinMaxPattern(I, Kind, Prev);
if ((Kind == RecurKind::FMax || Kind == RecurKind::FCmpOGTSelect) &&
isMinMaxPattern(I, Kind, Prev).isRecurrence())
return InstDesc(I, RecurKind::FCmpOGTSelect);
} else if (isFMulAddIntrinsic(I))
return InstDesc(Kind == RecurKind::FMulAdd, I,
I->hasAllowReassoc() ? nullptr : I);
return InstDesc(false, I);
@@ -1207,6 +1213,7 @@ unsigned RecurrenceDescriptor::getOpcode(RecurKind Kind) {
case RecurKind::UMin:
return Instruction::ICmp;
case RecurKind::FMax:
case RecurKind::FCmpOGTSelect:
case RecurKind::FMin:
case RecurKind::FMaximum:
case RecurKind::FMinimum:
3 changes: 3 additions & 0 deletions llvm/lib/Transforms/Utils/LoopUtils.cpp
@@ -937,6 +937,7 @@ constexpr Intrinsic::ID llvm::getReductionIntrinsicID(RecurKind RK) {
return Intrinsic::vector_reduce_umax;
case RecurKind::UMin:
return Intrinsic::vector_reduce_umin;
case RecurKind::FCmpOGTSelect:
case RecurKind::FMax:
return Intrinsic::vector_reduce_fmax;
case RecurKind::FMin:
@@ -1084,6 +1085,7 @@ CmpInst::Predicate llvm::getMinMaxReductionPredicate(RecurKind RK) {
return CmpInst::ICMP_SGT;
case RecurKind::FMin:
return CmpInst::FCMP_OLT;
case RecurKind::FCmpOGTSelect:
case RecurKind::FMax:
return CmpInst::FCMP_OGT;
// We do not add FMinimum/FMaximum recurrence kind here since there is no
@@ -1306,6 +1308,7 @@ Value *llvm::createSimpleReduction(IRBuilderBase &Builder, Value *Src,
case RecurKind::SMin:
case RecurKind::UMax:
case RecurKind::UMin:
case RecurKind::FCmpOGTSelect:
case RecurKind::FMax:
case RecurKind::FMin:
case RecurKind::FMinimum:
10 changes: 8 additions & 2 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4346,8 +4346,11 @@ bool LoopVectorizationPlanner::isCandidateForEpilogueVectorization(
ElementCount VF) const {
// Cross iteration phis such as reductions need special handling and are
// currently unsupported.
if (any_of(OrigLoop->getHeader()->phis(),
[&](PHINode &Phi) { return Legal->isFixedOrderRecurrence(&Phi); }))
if (any_of(OrigLoop->getHeader()->phis(), [&](PHINode &Phi) {
return Legal->isFixedOrderRecurrence(&Phi) ||
Legal->getReductionVars().lookup(&Phi).getRecurrenceKind() ==
RecurKind::FCmpOGTSelect;
}))
return false;

// Phis with uses outside of the loop require special handling and are
@@ -8808,6 +8811,9 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(

// Adjust the recipes for any inloop reductions.
adjustRecipesForReductions(Plan, RecipeBuilder, Range.Start);
if (!VPlanTransforms::runPass(
VPlanTransforms::handleFMaxReductionsWithoutFastMath, *Plan))
return nullptr;

// Transform recipes to abstract recipes if it is legal and beneficial and
// clamp the range for better cost estimation.
3 changes: 3 additions & 0 deletions llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -23136,6 +23136,7 @@ class HorizontalReduction {
case RecurKind::FindFirstIVUMin:
case RecurKind::FindLastIVSMax:
case RecurKind::FindLastIVUMax:
case RecurKind::FCmpOGTSelect:
case RecurKind::FMaximumNum:
case RecurKind::FMinimumNum:
case RecurKind::None:
@@ -23273,6 +23274,7 @@ class HorizontalReduction {
case RecurKind::FindFirstIVUMin:
case RecurKind::FindLastIVSMax:
case RecurKind::FindLastIVUMax:
case RecurKind::FCmpOGTSelect:
case RecurKind::FMaximumNum:
case RecurKind::FMinimumNum:
case RecurKind::None:
@@ -23375,6 +23377,7 @@ class HorizontalReduction {
case RecurKind::FindFirstIVUMin:
case RecurKind::FindLastIVSMax:
case RecurKind::FindLastIVUMax:
case RecurKind::FCmpOGTSelect:
case RecurKind::FMaximumNum:
case RecurKind::FMinimumNum:
case RecurKind::None:
5 changes: 4 additions & 1 deletion llvm/lib/Transforms/Vectorize/VPlan.h
@@ -980,7 +980,10 @@ class VPInstruction : public VPRecipeWithIRFlags,
ReductionStartVector,
// Creates a step vector starting from 0 to VF with a step of 1.
StepVector,

/// Extracts a single lane (first operand) from a set of vector operands.
/// The lane specifies an index into a vector formed by combining all vector
/// operands (all operands after the first one).
ExtractLane,
};

private:
3 changes: 3 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -85,6 +85,7 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
return ResTy;
}
case Instruction::ICmp:
case Instruction::FCmp:
case VPInstruction::ActiveLaneMask:
assert(inferScalarType(R->getOperand(0)) ==
inferScalarType(R->getOperand(1)) &&
@@ -110,6 +111,8 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
case VPInstruction::BuildStructVector:
case VPInstruction::BuildVector:
return SetResultTyFromOp();
case VPInstruction::ExtractLane:
return inferScalarType(R->getOperand(1));
case VPInstruction::FirstActiveLane:
return Type::getIntNTy(Ctx, 64);
case VPInstruction::ExtractLastElement:
117 changes: 116 additions & 1 deletion llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp
@@ -25,6 +25,7 @@
#define DEBUG_TYPE "vplan"

using namespace llvm;
using namespace VPlanPatternMatch;

namespace {
// Class that is used to build the plain CFG for the incoming IR.
@@ -427,7 +428,6 @@ static void createLoopRegion(VPlan &Plan, VPBlockBase *HeaderVPB) {
static void addCanonicalIVRecipes(VPlan &Plan, VPBasicBlock *HeaderVPBB,
VPBasicBlock *LatchVPBB, Type *IdxTy,
DebugLoc DL) {
using namespace VPlanPatternMatch;
Value *StartIdx = ConstantInt::get(IdxTy, 0);
auto *StartV = Plan.getOrAddLiveIn(StartIdx);

@@ -628,3 +628,118 @@ void VPlanTransforms::attachCheckBlock(VPlan &Plan, Value *Cond,
Term->addMetadata(LLVMContext::MD_prof, BranchWeights);
}
}

bool VPlanTransforms::handleFMaxReductionsWithoutFastMath(VPlan &Plan) {
VPRegionBlock *LoopRegion = Plan.getVectorLoopRegion();
VPReductionPHIRecipe *RedPhiR = nullptr;
VPRecipeWithIRFlags *MaxOp = nullptr;
VPWidenIntOrFpInductionRecipe *WideIV = nullptr;

// Check if there are any FCmpOGTSelect reductions using wide selects that we
// can fix up. To do so, we also need a wide canonical IV to keep track of
Collaborator

Suggested change
// can fix up. To do so, we also need a wide canonical IV to keep track of
// can fix up. To do so, we also need a wide canonical IV to keep track of

Contributor Author

Updated, thanks

// the indices of the max values.
for (auto &R : LoopRegion->getEntryBasicBlock()->phis()) {
// We need a wide canonical IV
if (auto *CurIV = dyn_cast<VPWidenIntOrFpInductionRecipe>(&R)) {
if (!CurIV->isCanonical())
continue;
WideIV = CurIV;
continue;
}

// And a single FCmpOGTSelect reduction phi.
// TODO: Support FMin reductions as well.
auto *CurRedPhiR = dyn_cast<VPReductionPHIRecipe>(&R);
if (!CurRedPhiR)
continue;
if (RedPhiR)
return false;
if (CurRedPhiR->getRecurrenceKind() != RecurKind::FCmpOGTSelect ||
CurRedPhiR->isInLoop() || CurRedPhiR->isOrdered())
continue;
RedPhiR = CurRedPhiR;

// MaxOp feeding the reduction phi must be a select (either wide or a
// replicate recipe), where the phi is the last operand, and the compare
// predicate is strict. This ensures NaNs won't get propagated unless the
// initial value is NaN
VPRecipeBase *Inc = RedPhiR->getBackedgeValue()->getDefiningRecipe();
auto *RepR = dyn_cast<VPReplicateRecipe>(Inc);
if (!isa<VPWidenSelectRecipe>(Inc) &&
!(RepR && (isa<SelectInst>(RepR->getUnderlyingInstr()))))
return false;

MaxOp = cast<VPRecipeWithIRFlags>(Inc);
auto *Cmp = cast<VPRecipeWithIRFlags>(MaxOp->getOperand(0));
if (MaxOp->getOperand(1) == RedPhiR ||
!CmpInst::isStrictPredicate(Cmp->getPredicate()))
return false;
}

// Nothing to do.
if (!RedPhiR)
return true;

// A wide canonical IV is currently required.
// TODO: Create an induction if no suitable existing one is available.
if (!WideIV)
return false;
Comment on lines +690 to +693
Collaborator

Note that a scalar canonical IV always exists, and is unique. But widened ones may exist (last one found is used?) or not.

Contributor Author

Yep, at this stage, all inductions will still be widened, but may not be canonical.


// Create a reduction that tracks the first indices where the latest maximum
Collaborator

Suggested change
// Create a reduction that tracks the first indices where the latest maximum
// Create a reduction that tracks the first indices where the running maximum

// value has been selected. This is later used to select the max value from
// the partial reductions in a way that correctly handles signed zeros and
// NaNs in the input.
Comment on lines +697 to +698
Collaborator

Suggested change
// the partial reductions in a way that correctly handles signed zeros and
// NaNs in the input.
// the partial reductions in a way that correctly handles a signed zero maximum.

(Neither NaNs nor non-zero numbers require the tracked indices.)

// Note that we do not need to check if the induction may hit the sentinel
// value. If the sentinel value gets hit, the final reduction value is at the
// last index or the maximum was never set and all lanes contain the start
// value. In either case, the correct value is selected.
Comment on lines +699 to +702
Collaborator

Tried to elaborate, is this clear/correct/helpful?

Suggested change
// Note that we do not need to check if the induction may hit the sentinel
// value. If the sentinel value gets hit, the final reduction value is at the
// last index or the maximum was never set and all lanes contain the start
// value. In either case, the correct value is selected.
// Note that we do not need to check if a final index holds the initial
// sentinel value (max uint). This is because of the following argument.
// The index is used to select one among several lanes holding the same
// maximum value, in order to select the correct sign in case that value is 0.
// So only indices of lanes holding the total maximum value are of interest.
// If a final index holds the sentinel value then either the maximum of the
// corresponding lane appeared first at that index, or this maximum was never
// updated and still retains the start value. The former case is clearly fine.
// In the latter case the start value is either NaN or a number greater or
// equal to all elements of that lane. If the start value is NaN all lanes
// eventually hold NaN as their maximum value, a tie to be broken arbitrarily.
// If the start value is equal to the total maximum value, then the maxima of
// all lanes are equal to the start value, including its sign, again a tie to
// be broken arbitrarily.
// If the start value differs from (is less than) the total maximum value,
// then an index holding the sentinel value corresponds to a non maximum lane
// and is thus irrelevant for tie breaking.
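
The sentinel argument can be made concrete with a small model of the in-loop part of the transform. This is a sketch under stated assumptions, not the PR's implementation: `LaneState`, `laneReduce` and `Sentinel` are hypothetical names, each lane keeps a running maximum plus the index of its last update, and the input length is assumed to be a multiple of `VF` (any tail is ignored).

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical per-lane state: running maxima and the index of the last update.
struct LaneState {
  std::vector<float> Max;
  std::vector<std::uint64_t> Idx;
};

// Models the max-uint start value of the index reduction.
constexpr std::uint64_t Sentinel = std::numeric_limits<std::uint64_t>::max();

// Processes In in chunks of VF elements; lane L sees elements L, L+VF, ...
// Elements past the last full chunk are ignored in this sketch.
LaneState laneReduce(const std::vector<float> &In, float Start,
                     std::size_t VF) {
  LaneState S{std::vector<float>(VF, Start),
              std::vector<std::uint64_t>(VF, Sentinel)};
  for (std::size_t Base = 0; Base + VF <= In.size(); Base += VF)
    for (std::size_t L = 0; L < VF; ++L) {
      float X = In[Base + L];
      bool Cmp = X > S.Max[L]; // strict, ordered compare
      S.Max[L] = Cmp ? X : S.Max[L];
      S.Idx[L] = Cmp ? static_cast<std::uint64_t>(Base + L) : S.Idx[L];
    }
  return S;
}
```

In this model a lane whose index is still the sentinel after the loop necessarily holds the unmodified start value, which is the situation the comment above reasons about.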

unsigned IVWidth =
VPTypeAnalysis(Plan).inferScalarType(WideIV)->getScalarSizeInBits();
LLVMContext &Ctx = Plan.getScalarHeader()->getIRBasicBlock()->getContext();
VPValue *UMinSentinel =
Plan.getOrAddLiveIn(ConstantInt::get(Ctx, APInt::getMaxValue(IVWidth)));
auto *IdxPhi = new VPReductionPHIRecipe(nullptr, RecurKind::FindFirstIVUMin,
*UMinSentinel, false, false, 1);
Collaborator

Worth commenting constant parameters.

IdxPhi->insertBefore(RedPhiR);
auto *MinIdxSel = new VPInstruction(Instruction::Select,
{MaxOp->getOperand(0), WideIV, IdxPhi});
Collaborator

Suggested change
{MaxOp->getOperand(0), WideIV, IdxPhi});
{Cmp, WideIV, IdxPhi});

MinIdxSel->insertAfter(MaxOp);
IdxPhi->addOperand(MinIdxSel);

// Find the first index of with the maximum value. This is used to extract the
Collaborator

Suggested change
// Find the first index of with the maximum value. This is used to extract the
// Find the first index holding the maximum value. This is used to extract the

Contributor Author

Done thanks

// lane with the final max value and is needed to handle signed zeros and NaNs
// in the input.
auto *MaxResult = find_singleton<VPSingleDefRecipe>(
RedPhiR->users(), [](VPUser *U, bool) -> VPSingleDefRecipe * {
auto *VPI = dyn_cast<VPInstruction>(U);
if (VPI && VPI->getOpcode() == VPInstruction::ComputeReductionResult)
return VPI;
return nullptr;
});
VPBuilder Builder(MaxResult->getParent(),
Collaborator

Suggested change
VPBuilder Builder(MaxResult->getParent(),
assert(MaxResult && "Missing a ComputeReductionResult user");
VPBuilder Builder(MaxResult->getParent(),

std::next(MaxResult->getIterator()));

// Create mask for lanes that have the max value and use it to mask out
// indices that don't contain maximum values.
auto *MaskFinalMaxValue = Builder.createNaryOp(
Instruction::FCmp, {MaxResult->getOperand(1), MaxResult},
VPIRFlags(CmpInst::FCMP_OEQ));
auto *IndicesWithMaxValue = Builder.createNaryOp(
Instruction::Select, {MaskFinalMaxValue, MinIdxSel, UMinSentinel});
auto *FirstMaxIdx = Builder.createNaryOp(
VPInstruction::ComputeFindIVResult,
{IdxPhi, WideIV->getStartValue(), UMinSentinel, IndicesWithMaxValue});
// Convert the index of the first max value to an index in the vector lanes of
// the partial reduction results. This ensures we select the first max value
// and acts as a tie-breaker if the partial reductions contain signed zeros.
Collaborator

The vertical computation of each partial reduction result takes care of NaNs and signed zeroes; it is only the horizontal reduction of these vector lanes that requires tie-breaking, to handle potential signed zeroes?

Contributor Author

Yes, the tie-breaking is only needed to handle signed zeroes when computing the final reduction results.

Consider a final partial reduction vector containing -0.0 and +0.0, where -0.0 was encountered before +0.0 (e.g. the max at iteration 2 is -0.0 and at iteration 3 it is +0.0). Doing a plain horizontal fmax reduction will produce +0.0 (-0.0 < +0.0).

We then compare the partial reduction values to the result of the horizontal reduction (-0.0 == +0.0 will also be true, selecting all lanes with zeros of either sign).

Out of those, we select the one encountered first using FindFirstIV. Note that this only works for strict predicates.
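
The final step described here can be sketched in scalar C++ as follows. This is an illustrative model, not the VPlan code: `horizontalMax` and `pickFirstMax` are hypothetical names, `horizontalMax` is written to prefer +0.0 over -0.0 to match the behaviour described above, and the fallback to index 0 when no index was recorded is an assumption about how the find-first result is computed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Models a plain horizontal fmax that ignores NaNs where possible and
// returns +0.0 when both zeros are present.
float horizontalMax(const std::vector<float> &V) {
  float M = V[0];
  for (float X : V) {
    bool PreferX = std::isnan(M) || X > M ||
                   (X == M && std::signbit(M) && !std::signbit(X));
    if (PreferX)
      M = X;
  }
  return M;
}

// Max[L] is the partial maximum of lane L, Idx[L] the first index at which
// that value was selected (or the max-uint sentinel if never updated).
float pickFirstMax(const std::vector<float> &Max,
                   const std::vector<std::uint64_t> &Idx) {
  assert(!Max.empty() && Max.size() == Idx.size());
  const std::uint64_t Sentinel = std::numeric_limits<std::uint64_t>::max();
  float HMax = horizontalMax(Max);
  std::uint64_t FirstIdx = Sentinel;
  for (std::size_t L = 0; L < Max.size(); ++L) {
    // Ordered equality: -0.0 == +0.0 is true, NaN == NaN is false.
    std::uint64_t Masked = (Max[L] == HMax) ? Idx[L] : Sentinel;
    FirstIdx = std::min(FirstIdx, Masked);
  }
  if (FirstIdx == Sentinel) // assumption: fall back to the IV start, 0
    FirstIdx = 0;
  // Convert the iteration index to a lane (Max.size() models VF * UF).
  return Max[FirstIdx % Max.size()];
}
```

With partial maxima {-1, -2, -0.0, +0.0} recorded at indices {0, 1, 2, 3}, the horizontal maximum is +0.0, the equality mask selects lanes 2 and 3, and the smallest index picks lane 2, so the model returns -0.0 as the scalar loop would.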

auto *FirstMaxLane =
Builder.createNaryOp(Instruction::URem, {FirstMaxIdx, &Plan.getVFxUF()});

// Extract the final max value and update the users.
auto *Res = Builder.createNaryOp(VPInstruction::ExtractLane,
{FirstMaxLane, MaxResult->getOperand(1)});
MaxResult->replaceUsesWithIf(Res, [MaskFinalMaxValue](VPUser &U, unsigned) {
return &U != MaskFinalMaxValue;
});
return true;
}
38 changes: 34 additions & 4 deletions llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -505,6 +505,7 @@ bool VPInstruction::canGenerateScalarForFirstLane() const {
return true;
switch (Opcode) {
case Instruction::Freeze:
case Instruction::FCmp:
case Instruction::ICmp:
case Instruction::PHI:
case Instruction::Select:
@@ -585,6 +586,7 @@ Value *VPInstruction::generate(VPTransformState &State) {
Value *Op = State.get(getOperand(0), vputils::onlyFirstLaneUsed(this));
return Builder.CreateFreeze(Op, Name);
}
case Instruction::FCmp:
case Instruction::ICmp: {
bool OnlyFirstLaneUsed = vputils::onlyFirstLaneUsed(this);
Value *A = State.get(getOperand(0), OnlyFirstLaneUsed);
@@ -595,7 +597,8 @@ Value *VPInstruction::generate(VPTransformState &State) {
llvm_unreachable("should be handled by VPPhi::execute");
}
case Instruction::Select: {
bool OnlyFirstLaneUsed = vputils::onlyFirstLaneUsed(this);
bool OnlyFirstLaneUsed =
State.VF.isScalar() || vputils::onlyFirstLaneUsed(this);
Value *Cond = State.get(getOperand(0), OnlyFirstLaneUsed);
Value *Op1 = State.get(getOperand(1), OnlyFirstLaneUsed);
Value *Op2 = State.get(getOperand(2), OnlyFirstLaneUsed);
@@ -858,7 +861,30 @@ Value *VPInstruction::generate(VPTransformState &State) {
Value *Res = State.get(getOperand(0));
for (VPValue *Op : drop_begin(operands()))
Res = Builder.CreateOr(Res, State.get(Op));
return Builder.CreateOrReduce(Res);
return Res->getType()->isIntegerTy(1) ? Res : Builder.CreateOrReduce(Res);
}
case VPInstruction::ExtractLane: {
Value *LaneToExtract = State.get(getOperand(0), true);
Type *IdxTy = State.TypeAnalysis.inferScalarType(getOperand(0));
Value *Res = nullptr;
Value *RuntimeVF = getRuntimeVF(State.Builder, IdxTy, State.VF);

for (unsigned Idx = 1; Idx != getNumOperands(); ++Idx) {
Value *VectorStart =
Builder.CreateMul(RuntimeVF, ConstantInt::get(IdxTy, Idx - 1));
Value *VectorIdx = Builder.CreateSub(LaneToExtract, VectorStart);
Value *Ext = State.VF.isScalar()
? State.get(getOperand(Idx))
: Builder.CreateExtractElement(
State.get(getOperand(Idx)), VectorIdx);
if (Res) {
Value *Cmp = Builder.CreateICmpUGE(LaneToExtract, VectorStart);
Res = Builder.CreateSelect(Cmp, Ext, Res);
} else {
Res = Ext;
}
}
return Res;
}
case VPInstruction::FirstActiveLane: {
if (getNumOperands() == 1) {
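
As a reading aid for the ExtractLane lowering in the hunk above, here is a scalar model of the select chain it emits. It is a sketch with hypothetical names (`extractLane`, `Parts`), assuming every vector operand has the same fixed number of elements. In the generated IR an out-of-range extractelement would yield poison that a later select discards; the model substitutes a placeholder value instead of indexing out of bounds.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Parts models the vector operands (one per unrolled part), each with VF
// elements. Lane indexes the concatenation of all parts.
float extractLane(std::uint64_t Lane,
                  const std::vector<std::vector<float>> &Parts) {
  assert(!Parts.empty());
  const std::uint64_t VF = Parts[0].size();
  assert(Lane < VF * Parts.size());
  float Res = 0.0f;
  for (std::uint64_t P = 0; P < Parts.size(); ++P) {
    assert(Parts[P].size() == VF);
    std::uint64_t VectorStart = VF * P;
    std::uint64_t VectorIdx = Lane - VectorStart; // may wrap, like the IR sub
    // Placeholder stands in for the poison an out-of-range extract would give.
    float Ext = VectorIdx < VF ? Parts[P][VectorIdx] : 0.0f;
    if (P == 0)
      Res = Ext;
    else
      Res = (Lane >= VectorStart) ? Ext : Res; // unsigned >=, then select
  }
  return Res;
}
```

For two parts {0, 1, 2, 3} and {4, 5, 6, 7}, lane 5 resolves to element 1 of the second part: the first iteration produces a placeholder, and the second overrides it because 5 >= 4.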
@@ -984,7 +1010,8 @@ bool VPInstruction::isVectorToScalar() const {
getOpcode() == VPInstruction::ComputeAnyOfResult ||
getOpcode() == VPInstruction::ComputeFindIVResult ||
getOpcode() == VPInstruction::ComputeReductionResult ||
getOpcode() == VPInstruction::AnyOf;
getOpcode() == VPInstruction::AnyOf ||
getOpcode() == VPInstruction::ExtractLane;
}

bool VPInstruction::isSingleScalar() const {
@@ -1031,6 +1058,7 @@ bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
switch (getOpcode()) {
case Instruction::ExtractElement:
case Instruction::Freeze:
case Instruction::FCmp:
case Instruction::ICmp:
case Instruction::Select:
case VPInstruction::AnyOf:
@@ -1066,6 +1094,7 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const {
return Op == getOperand(1);
case Instruction::PHI:
return true;
case Instruction::FCmp:
case Instruction::ICmp:
case Instruction::Select:
case Instruction::Or:
@@ -1098,6 +1127,7 @@ bool VPInstruction::onlyFirstPartUsed(const VPValue *Op) const {
switch (getOpcode()) {
default:
return false;
case Instruction::FCmp:
case Instruction::ICmp:
case Instruction::Select:
return vputils::onlyFirstPartUsed(this);
@@ -1782,7 +1812,7 @@ bool VPIRFlags::flagsValidForOpcode(unsigned Opcode) const {
return Opcode == Instruction::ZExt;
break;
case OperationType::Cmp:
return Opcode == Instruction::ICmp;
return Opcode == Instruction::FCmp || Opcode == Instruction::ICmp;
case OperationType::Other:
return true;
}