[Intrinsics][AArch64] Add intrinsics for masking off aliasing vector lanes #117007
Conversation
@llvm/pr-subscribers-llvm-ir @llvm/pr-subscribers-llvm-selectiondag
Author: Sam Tebbs (SamTebbs33)
Changes: It can be unsafe to load a vector from an address and write a vector to an address if those two addresses have overlapping lanes within a vectorised loop iteration. This PR adds an intrinsic designed to create a mask with lanes disabled if they overlap between the two pointer arguments, so that only safe lanes are loaded, operated on and stored. Along with the two pointer parameters, the intrinsic also takes an immediate that represents the size in bytes of the vector element types, as well as an immediate i1 that is true if there is a write-after-read hazard or false if there is a read-after-write hazard. This will be used by #100579 and replaces the existing lowering for whilewr, since that is no longer needed now that we have the intrinsic.
Patch is 93.77 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/117007.diff
11 Files Affected:
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 9f4c90ba82a419..c9589d5af8ebbe 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -23475,6 +23475,86 @@ Examples:
%active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 %elem0, i64 429)
%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %3, i32 4, <4 x i1> %active.lane.mask, <4 x i32> poison)
+.. _int_experimental_get_alias_lane_mask:
+
+'``llvm.get.alias.lane.mask.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+ declare <4 x i1> @llvm.experimental.get.alias.lane.mask.v4i1.i64(i64 %ptrA, i64 %ptrB, i32 immarg %elementSize, i1 immarg %writeAfterRead)
+ declare <8 x i1> @llvm.experimental.get.alias.lane.mask.v8i1.i64(i64 %ptrA, i64 %ptrB, i32 immarg %elementSize, i1 immarg %writeAfterRead)
+ declare <16 x i1> @llvm.experimental.get.alias.lane.mask.v16i1.i64(i64 %ptrA, i64 %ptrB, i32 immarg %elementSize, i1 immarg %writeAfterRead)
+ declare <vscale x 16 x i1> @llvm.experimental.get.alias.lane.mask.nxv16i1.i64(i64 %ptrA, i64 %ptrB, i32 immarg %elementSize, i1 immarg %writeAfterRead)
+
+
+Overview:
+"""""""""
+
+Create a mask representing lanes that do or do not overlap between two pointers across one vector loop iteration.
+
+
+Arguments:
+""""""""""
+
+The first two arguments have the same scalar integer type.
+The final two are immediates and the result is a vector with the i1 element type.
+
+Semantics:
+""""""""""
+
+In the case that ``%writeAfterRead`` is true, the '``llvm.experimental.get.alias.lane.mask.*``' intrinsics are semantically equivalent
+to:
+
+::
+
+ %diff = (%ptrB - %ptrA) / %elementSize
+ %m[i] = (icmp ult i, %diff) || (%diff <= 0)
+
+Otherwise they are semantically equivalent to:
+
+::
+
+ %diff = abs(%ptrB - %ptrA) / %elementSize
+ %m[i] = (icmp ult i, %diff) || (%diff == 0)
+
+where ``%m`` is a vector (mask) of active/inactive lanes with its elements
+indexed by ``i``, and ``%ptrA``, ``%ptrB`` are the two i64 arguments to
+``llvm.experimental.get.alias.lane.mask.*``, ``%elementSize`` is the i32 argument, ``%abs`` is the absolute difference operation, ``%icmp`` is an integer compare and ``ult``
+the unsigned less-than comparison operator. The subtraction between ``%ptrA`` and ``%ptrB`` could be negative. The ``%writeAfterRead`` argument is expected to be true if ``%ptrB`` is stored to after ``%ptrA`` is read from.
+The above is equivalent to:
+
+::
+
+ %m = @llvm.experimental.get.alias.lane.mask(%ptrA, %ptrB, %elementSize, %writeAfterRead)
+
+This can, for example, be emitted by the loop vectorizer in which case
+``%ptrA`` is a pointer that is read from within the loop, and ``%ptrB`` is a pointer that is stored to within the loop.
+If the difference between these pointers is less than the vector factor, then they overlap (alias) within a loop iteration.
+For example, if ``%ptrA`` is 20 and ``%ptrB`` is 23 with a vector factor of 8, then lanes 3, 4, 5, 6 and 7 of the vector loaded from ``%ptrA``
+share addresses with lanes 0, 1, 2, 3 and 4 of the vector stored to at ``%ptrB``.
+An alias mask of these two pointers should be <1, 1, 1, 0, 0, 0, 0, 0> so that only the non-overlapping lanes are loaded and stored.
+This operation allows many loops to be vectorised when it would otherwise be unsafe to do so.
+
+To account for the fact that only a subset of lanes have been operated on in an iteration,
+the loop's induction variable should be incremented by the popcount of the mask rather than the vector factor.
+
+This mask ``%m`` can e.g. be used in masked load/store instructions.
+
+
+Examples:
+"""""""""
+
+.. code-block:: llvm
+
+ %alias.lane.mask = call <4 x i1> @llvm.experimental.get.alias.lane.mask.v4i1.i64(i64 %ptrA, i64 %ptrB, i32 4, i1 1)
+ %vecA = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %ptrA, i32 4, <4 x i1> %alias.lane.mask, <4 x i32> poison)
+ [...]
+ call @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %vecA, <4 x i32>* %ptrB, i32 4, <4 x i1> %alias.lane.mask)
.. _int_experimental_vp_splice:
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 6a41094ff933b0..0338310fd936df 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -468,6 +468,11 @@ class TargetLoweringBase {
return true;
}
+ /// Return true if the @llvm.experimental.get.alias.lane.mask intrinsic should be expanded using generic code in SelectionDAGBuilder.
+ virtual bool shouldExpandGetAliasLaneMask(EVT VT, EVT PtrVT, unsigned EltSize) const {
+ return true;
+ }
+
virtual bool shouldExpandGetVectorLength(EVT CountVT, unsigned VF,
bool IsScalable) const {
return true;
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 1ca8c2565ab0b6..5f7073a531283e 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -2363,6 +2363,11 @@ let IntrProperties = [IntrNoMem, IntrNoSync, IntrWillReturn, ImmArg<ArgIndex<1>>
llvm_i32_ty]>;
}
+def int_experimental_get_alias_lane_mask:
+ DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+ [llvm_anyint_ty, LLVMMatchType<1>, llvm_anyint_ty, llvm_i1_ty],
+ [IntrNoMem, IntrNoSync, IntrWillReturn, ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<3>>]>;
+
def int_get_active_lane_mask:
DefaultAttrsIntrinsic<[llvm_anyvector_ty],
[llvm_anyint_ty, LLVMMatchType<1>],
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 9d729d448502d8..39e84e06a8de60 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -8284,6 +8284,50 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
visitVectorExtractLastActive(I, Intrinsic);
return;
}
+ case Intrinsic::experimental_get_alias_lane_mask: {
+ SDValue SourceValue = getValue(I.getOperand(0));
+ SDValue SinkValue = getValue(I.getOperand(1));
+ SDValue EltSize = getValue(I.getOperand(2));
+ bool IsWriteAfterRead = cast<ConstantSDNode>(getValue(I.getOperand(3)))->getZExtValue() != 0;
+ auto IntrinsicVT = EVT::getEVT(I.getType());
+ auto PtrVT = SourceValue->getValueType(0);
+
+ if (!TLI.shouldExpandGetAliasLaneMask(IntrinsicVT, PtrVT, cast<ConstantSDNode>(EltSize)->getSExtValue())) {
+ visitTargetIntrinsic(I, Intrinsic);
+ return;
+ }
+
+ SDValue Diff = DAG.getNode(ISD::SUB, sdl,
+ PtrVT, SinkValue, SourceValue);
+ if (!IsWriteAfterRead)
+ Diff = DAG.getNode(ISD::ABS, sdl, PtrVT, Diff);
+
+ Diff = DAG.getNode(ISD::SDIV, sdl, PtrVT, Diff, EltSize);
+ SDValue Zero = DAG.getTargetConstant(0, sdl, PtrVT);
+
+ // If the difference is positive then some elements may alias
+ auto CmpVT = TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(),
+ PtrVT);
+ SDValue Cmp = DAG.getSetCC(sdl, CmpVT, Diff, Zero, IsWriteAfterRead ? ISD::SETLE : ISD::SETEQ);
+
+ // Splat the compare result then OR it with a lane mask
+ SDValue Splat = DAG.getSplat(IntrinsicVT, sdl, Cmp);
+
+ SDValue DiffMask;
+ // Don't emit an active lane mask if the target doesn't support it
+ if (TLI.shouldExpandGetActiveLaneMask(IntrinsicVT, PtrVT)) {
+ EVT VecTy = EVT::getVectorVT(*DAG.getContext(), PtrVT,
+ IntrinsicVT.getVectorElementCount());
+ SDValue DiffSplat = DAG.getSplat(VecTy, sdl, Diff);
+ SDValue VectorStep = DAG.getStepVector(sdl, VecTy);
+ DiffMask = DAG.getSetCC(sdl, IntrinsicVT, VectorStep,
+ DiffSplat, ISD::CondCode::SETULT);
+ } else {
+ DiffMask = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, sdl, IntrinsicVT, DAG.getTargetConstant(Intrinsic::get_active_lane_mask, sdl, MVT::i64), Zero, Diff);
+ }
+ SDValue Or = DAG.getNode(ISD::OR, sdl, IntrinsicVT, DiffMask, Splat);
+ setValue(&I, Or);
+ }
}
}
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 7ab3fc06715ec8..66eaec0d5ae6c9 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -2033,6 +2033,24 @@ bool AArch64TargetLowering::shouldExpandGetActiveLaneMask(EVT ResVT,
return false;
}
+bool AArch64TargetLowering::shouldExpandGetAliasLaneMask(EVT VT, EVT PtrVT, unsigned EltSize) const {
+ if (!Subtarget->hasSVE2())
+ return true;
+
+ if (PtrVT != MVT::i64)
+ return true;
+
+ if (VT == MVT::v2i1 || VT == MVT::nxv2i1)
+ return EltSize != 8;
+ if( VT == MVT::v4i1 || VT == MVT::nxv4i1)
+ return EltSize != 4;
+ if (VT == MVT::v8i1 || VT == MVT::nxv8i1)
+ return EltSize != 2;
+ if (VT == MVT::v16i1 || VT == MVT::nxv16i1)
+ return EltSize != 1;
+ return true;
+}
+
bool AArch64TargetLowering::shouldExpandPartialReductionIntrinsic(
const IntrinsicInst *I) const {
if (I->getIntrinsicID() != Intrinsic::experimental_vector_partial_reduce_add)
@@ -2796,6 +2814,8 @@ const char *AArch64TargetLowering::getTargetNodeName(unsigned Opcode) const {
MAKE_CASE(AArch64ISD::LS64_BUILD)
MAKE_CASE(AArch64ISD::LS64_EXTRACT)
MAKE_CASE(AArch64ISD::TBL)
+ MAKE_CASE(AArch64ISD::WHILEWR)
+ MAKE_CASE(AArch64ISD::WHILERW)
MAKE_CASE(AArch64ISD::FADD_PRED)
MAKE_CASE(AArch64ISD::FADDA_PRED)
MAKE_CASE(AArch64ISD::FADDV_PRED)
@@ -5881,6 +5901,16 @@ SDValue AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
EVT PtrVT = getPointerTy(DAG.getDataLayout());
return DAG.getNode(AArch64ISD::THREAD_POINTER, dl, PtrVT);
}
+ case Intrinsic::aarch64_sve_whilewr_b:
+ case Intrinsic::aarch64_sve_whilewr_h:
+ case Intrinsic::aarch64_sve_whilewr_s:
+ case Intrinsic::aarch64_sve_whilewr_d:
+ return DAG.getNode(AArch64ISD::WHILEWR, dl, Op.getValueType(), Op.getOperand(1), Op.getOperand(2));
+ case Intrinsic::aarch64_sve_whilerw_b:
+ case Intrinsic::aarch64_sve_whilerw_h:
+ case Intrinsic::aarch64_sve_whilerw_s:
+ case Intrinsic::aarch64_sve_whilerw_d:
+ return DAG.getNode(AArch64ISD::WHILERW, dl, Op.getValueType(), Op.getOperand(1), Op.getOperand(2));
case Intrinsic::aarch64_neon_abs: {
EVT Ty = Op.getValueType();
if (Ty == MVT::i64) {
@@ -6340,16 +6370,39 @@ SDValue AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
return DAG.getNode(AArch64ISD::USDOT, dl, Op.getValueType(),
Op.getOperand(1), Op.getOperand(2), Op.getOperand(3));
}
+ case Intrinsic::experimental_get_alias_lane_mask:
case Intrinsic::get_active_lane_mask: {
+ unsigned IntrinsicID = Intrinsic::aarch64_sve_whilelo;
+ if (IntNo == Intrinsic::experimental_get_alias_lane_mask) {
+ uint64_t EltSize = Op.getOperand(3)->getAsZExtVal();
+ bool IsWriteAfterRead = Op.getOperand(4)->getAsZExtVal() == 1;
+ switch (EltSize) {
+ case 1:
+ IntrinsicID = IsWriteAfterRead ? Intrinsic::aarch64_sve_whilewr_b : Intrinsic::aarch64_sve_whilerw_b;
+ break;
+ case 2:
+ IntrinsicID = IsWriteAfterRead ? Intrinsic::aarch64_sve_whilewr_h : Intrinsic::aarch64_sve_whilerw_h;
+ break;
+ case 4:
+ IntrinsicID = IsWriteAfterRead ? Intrinsic::aarch64_sve_whilewr_s : Intrinsic::aarch64_sve_whilerw_s;
+ break;
+ case 8:
+ IntrinsicID = IsWriteAfterRead ? Intrinsic::aarch64_sve_whilewr_d : Intrinsic::aarch64_sve_whilerw_d;
+ break;
+ default:
+ llvm_unreachable("Unexpected element size for get.alias.lane.mask");
+ break;
+ }
+ }
SDValue ID =
- DAG.getTargetConstant(Intrinsic::aarch64_sve_whilelo, dl, MVT::i64);
+ DAG.getTargetConstant(IntrinsicID, dl, MVT::i64);
EVT VT = Op.getValueType();
if (VT.isScalableVector())
return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, VT, ID, Op.getOperand(1),
Op.getOperand(2));
- // We can use the SVE whilelo instruction to lower this intrinsic by
+ // We can use the SVE whilelo/whilewr/whilerw instruction to lower this intrinsic by
// creating the appropriate sequence of scalable vector operations and
// then extracting a fixed-width subvector from the scalable vector.
@@ -14241,128 +14294,8 @@ static SDValue tryLowerToSLI(SDNode *N, SelectionDAG &DAG) {
return ResultSLI;
}
-/// Try to lower the construction of a pointer alias mask to a WHILEWR.
-/// The mask's enabled lanes represent the elements that will not overlap across
-/// one loop iteration. This tries to match:
-/// or (splat (setcc_lt (sub ptrA, ptrB), -(element_size - 1))),
-/// (get_active_lane_mask 0, (div (sub ptrA, ptrB), element_size))
-SDValue tryWhileWRFromOR(SDValue Op, SelectionDAG &DAG,
- const AArch64Subtarget &Subtarget) {
- if (!Subtarget.hasSVE2())
- return SDValue();
- SDValue LaneMask = Op.getOperand(0);
- SDValue Splat = Op.getOperand(1);
-
- if (Splat.getOpcode() != ISD::SPLAT_VECTOR)
- std::swap(LaneMask, Splat);
-
- if (LaneMask.getOpcode() != ISD::INTRINSIC_WO_CHAIN ||
- LaneMask.getConstantOperandVal(0) != Intrinsic::get_active_lane_mask ||
- Splat.getOpcode() != ISD::SPLAT_VECTOR)
- return SDValue();
-
- SDValue Cmp = Splat.getOperand(0);
- if (Cmp.getOpcode() != ISD::SETCC)
- return SDValue();
-
- CondCodeSDNode *Cond = cast<CondCodeSDNode>(Cmp.getOperand(2));
-
- auto ComparatorConst = dyn_cast<ConstantSDNode>(Cmp.getOperand(1));
- if (!ComparatorConst || ComparatorConst->getSExtValue() > 0 ||
- Cond->get() != ISD::CondCode::SETLT)
- return SDValue();
- unsigned CompValue = std::abs(ComparatorConst->getSExtValue());
- unsigned EltSize = CompValue + 1;
- if (!isPowerOf2_64(EltSize) || EltSize > 8)
- return SDValue();
-
- SDValue Diff = Cmp.getOperand(0);
- if (Diff.getOpcode() != ISD::SUB || Diff.getValueType() != MVT::i64)
- return SDValue();
-
- if (!isNullConstant(LaneMask.getOperand(1)) ||
- (EltSize != 1 && LaneMask.getOperand(2).getOpcode() != ISD::SRA))
- return SDValue();
-
- // The number of elements that alias is calculated by dividing the positive
- // difference between the pointers by the element size. An alias mask for i8
- // elements omits the division because it would just divide by 1
- if (EltSize > 1) {
- SDValue DiffDiv = LaneMask.getOperand(2);
- auto DiffDivConst = dyn_cast<ConstantSDNode>(DiffDiv.getOperand(1));
- if (!DiffDivConst || DiffDivConst->getZExtValue() != Log2_64(EltSize))
- return SDValue();
- if (EltSize > 2) {
- // When masking i32 or i64 elements, the positive value of the
- // possibly-negative difference comes from a select of the difference if
- // it's positive, otherwise the difference plus the element size if it's
- // negative: pos_diff = diff < 0 ? (diff + 7) : diff
- SDValue Select = DiffDiv.getOperand(0);
- // Make sure the difference is being compared by the select
- if (Select.getOpcode() != ISD::SELECT_CC || Select.getOperand(3) != Diff)
- return SDValue();
- // Make sure it's checking if the difference is less than 0
- if (!isNullConstant(Select.getOperand(1)) ||
- cast<CondCodeSDNode>(Select.getOperand(4))->get() !=
- ISD::CondCode::SETLT)
- return SDValue();
- // An add creates a positive value from the negative difference
- SDValue Add = Select.getOperand(2);
- if (Add.getOpcode() != ISD::ADD || Add.getOperand(0) != Diff)
- return SDValue();
- if (auto *AddConst = dyn_cast<ConstantSDNode>(Add.getOperand(1));
- !AddConst || AddConst->getZExtValue() != EltSize - 1)
- return SDValue();
- } else {
- // When masking i16 elements, this positive value comes from adding the
- // difference's sign bit to the difference itself. This is equivalent to
- // the 32 bit and 64 bit case: pos_diff = diff + sign_bit (diff)
- SDValue Add = DiffDiv.getOperand(0);
- if (Add.getOpcode() != ISD::ADD || Add.getOperand(0) != Diff)
- return SDValue();
- // A logical right shift by 63 extracts the sign bit from the difference
- SDValue Shift = Add.getOperand(1);
- if (Shift.getOpcode() != ISD::SRL || Shift.getOperand(0) != Diff)
- return SDValue();
- if (auto *ShiftConst = dyn_cast<ConstantSDNode>(Shift.getOperand(1));
- !ShiftConst || ShiftConst->getZExtValue() != 63)
- return SDValue();
- }
- } else if (LaneMask.getOperand(2) != Diff)
- return SDValue();
-
- SDValue StorePtr = Diff.getOperand(0);
- SDValue ReadPtr = Diff.getOperand(1);
-
- unsigned IntrinsicID = 0;
- switch (EltSize) {
- case 1:
- IntrinsicID = Intrinsic::aarch64_sve_whilewr_b;
- break;
- case 2:
- IntrinsicID = Intrinsic::aarch64_sve_whilewr_h;
- break;
- case 4:
- IntrinsicID = Intrinsic::aarch64_sve_whilewr_s;
- break;
- case 8:
- IntrinsicID = Intrinsic::aarch64_sve_whilewr_d;
- break;
- default:
- return SDValue();
- }
- SDLoc DL(Op);
- SDValue ID = DAG.getConstant(IntrinsicID, DL, MVT::i32);
- return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, Op.getValueType(), ID,
- StorePtr, ReadPtr);
-}
-
SDValue AArch64TargetLowering::LowerVectorOR(SDValue Op,
SelectionDAG &DAG) const {
- if (SDValue SV =
- tryWhileWRFromOR(Op, DAG, DAG.getSubtarget<AArch64Subtarget>()))
- return SV;
-
if (useSVEForFixedLengthVectorVT(Op.getValueType(),
!Subtarget->isNeonAvailable()))
return LowerToScalableOp(Op, DAG);
@@ -19609,7 +19542,9 @@ static bool isPredicateCCSettingOp(SDValue N) {
N.getConstantOperandVal(0) == Intrinsic::aarch64_sve_whilels ||
N.getConstantOperandVal(0) == Intrinsic::aarch64_sve_whilelt ||
// get_active_lane_mask is lowered to a whilelo instruction.
- N.getConstantOperandVal(0) == Intrinsic::get_active_lane_mask)))
+ N.getConstantOperandVal(0) == Intrinsic::get_active_lane_mask ||
+ // get_alias_lane_mask is lowered to a whilewr/rw instruction.
+ N.getConstantOperandVal(0) == Intrinsic::experimental_get_alias_lane_mask)))
return true;
return false;
@@ -27175,6 +27110,7 @@ void AArch64TargetLowering::ReplaceNodeResults(
return;
}
case Intrinsic::experimental_vector_match:
+ case Intrinsic::experimental_get_alias_lane_mask:
case Intrinsic::get_active_lane_mask: {
if (!VT.isFixedLengthVector() || VT.getVectorElementType() != MVT::i1)
return;
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.h b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
index cb0b9e965277aa..b2f766b22911ff 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -297,6 +297,10 @@ enum NodeType : unsigned {
SMAXV,
UMAXV,
+ // Alias lane masks
+ WHILEWR,
+ WHILERW,
+
SADDV_PRED,
UADDV_PRED,
SMAXV_PRED,
@@ -980,6 +984,8 @@ class AArch64TargetLowering : public TargetLowering {
bool shouldExpandGetActiveLaneMask(EVT VT, EVT OpVT) const override;
+ bool shouldExpandGetAliasLaneMask(EVT VT, EVT PtrVT, unsigned EltSize) const override;
+
bool
shouldExpandPartialReductionIntrinsic(const IntrinsicInst *I) const override;
diff --git a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
index 564fb33758ad57..99b1e0618ab34b 100644
--- a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
@@ -140,6 +140,11 @@ def AArch64st1q_scatter : SDNode<"AArch64ISD::SST1Q_PRED", SDT_AArch64_SCATTER_V
// AArch64 SVE/SVE2 - the remaining node definitions
//
+// Alias masks
+def SDT_AArch64Mask : SDTypeProfile<1, 2, [SDTCisVec<0>, SDTCisInt<1>, SDTCisSameAs<2, 1>, SDTCVecEltisVT<0,i1>]>;
+def AArch64whilewr : SDNode<"AArch64ISD::WHILEWR", SDT_AArch64Mask>;
+def AArch64whilerw : SDNode<"AArch64ISD::WHILERW", SDT_AArch64Mask>;
+
// SVE CNT/INC/RDVL
def sve_rdvl_imm : Co...
[truncated]
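To make the loop-vectorisation use case in the description concrete, here is a minimal C++ sketch of a loop that enables only the non-aliasing lanes and advances by the popcount of the mask, following the write-after-read semantics in the LangRef hunk above. The function name, the vector factor of 8 and the i8 element type are illustrative assumptions, not taken from the patch.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative scalar emulation (not part of the patch) of a loop vectorised
// with a write-after-read alias lane mask. dst and src may partially overlap;
// up to VF lanes are attempted per iteration, but only lanes that cannot
// alias within the iteration are processed, and the induction variable is
// advanced by the popcount of the mask rather than by VF.
void addOneWithAliasMask(uint8_t *dst, const uint8_t *src, size_t n) {
  constexpr size_t VF = 8;                    // assumed vector factor
  constexpr int64_t EltSize = 1;              // i8 elements
  const int64_t Diff = (dst - src) / EltSize; // (%ptrB - %ptrA) / %elementSize
  for (size_t i = 0; i < n;) {
    size_t Active = 0; // popcount of the alias lane mask
    for (size_t Lane = 0; Lane < VF && i + Lane < n; ++Lane) {
      // A lane is enabled if no store in this iteration can clobber a load:
      // Diff <= 0, or the lane index is below the scaled pointer difference.
      bool Enabled = Diff <= 0 || static_cast<int64_t>(Lane) < Diff;
      if (Enabled) {
        dst[i + Lane] = static_cast<uint8_t>(src[i + Lane] + 1);
        ++Active;
      }
    }
    i += Active ? Active : 1; // advance by the popcount of the mask
  }
}
```

For instance, if dst is 3 bytes above src (the LangRef example with pointers 20 and 23), Diff is 3, only lanes 0, 1 and 2 are enabled in each iteration, and the loop advances three elements at a time, matching the <1, 1, 1, 0, 0, 0, 0, 0> mask described above.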
@llvm/pr-subscribers-backend-aarch64 Author: Sam Tebbs (SamTebbs33) — same description and truncated patch as above.
✅ With the latest revision this PR passed the C/C++ code formatter.
if (!IsWriteAfterRead)
  Diff = DAG.getNode(ISD::ABS, sdl, PtrVT, Diff);

Diff = DAG.getNode(ISD::SDIV, sdl, PtrVT, Diff, EltSize);

Sorry if I'm wrong, but wouldn't this line always be executed, even if the !IsWriteAfterRead condition is met? So if the if statement above is entered, Diff is set to the ABS node and is then overwritten and set to the SDIV node? Maybe you forgot a return in the if statement?

From my understanding: !IsWriteAfterRead would imply that Diff is likely to be negative, so this inserts the ISD::ABS immediately before using Diff as an operand to the SDIV node, ensuring that it is positive. Arguably the if isn't needed there, as ABS-ing a positive number just returns the same number, but then we're adding nodes that we know don't do anything.

Oh yes, sorry, I didn't notice Diff as an argument in the second assignment.
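A scalar restatement of the dataflow discussed in this thread may help (a sketch of the expansion's arithmetic under the assumptions above, not the actual SelectionDAG code): the conditional ABS node is not discarded, because its result is the value that feeds the division.

```cpp
#include <cstdint>
#include <cstdlib>

// Scalar model of the expansion steps under discussion: SUB, then an optional
// ABS for the read-after-write case, then SDIV by the element size. The ABS
// result is consumed by the division rather than being overwritten.
int64_t computeScaledDiff(int64_t Source, int64_t Sink, int64_t EltSize,
                          bool IsWriteAfterRead) {
  int64_t Diff = Sink - Source; // ISD::SUB
  if (!IsWriteAfterRead)
    Diff = std::abs(Diff);      // ISD::ABS feeds the next node
  return Diff / EltSize;        // ISD::SDIV consumes the (possibly ABS'd) Diff
}
```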
@@ -0,0 +1,82 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=aarch64 -mattr=+sve2 %s -o - | FileCheck %s

Is it worth having a +sve and a +sve2 run line?

Done.
llvm/docs/LangRef.rst
Outdated
immediate argument, ``%abs`` is the absolute difference operation, ``%icmp`` is
an integer compare and ``ult`` the unsigned less-than comparison operator. The

Remove the explanation of abs, icmp and ult.

Done.
llvm/docs/LangRef.rst
Outdated
``llvm.experimental.get.alias.lane.mask.*``, ``%elementSize`` is the first
immediate argument, ``%abs`` is the absolute difference operation, ``%icmp`` is
an integer compare and ``ult`` the unsigned less-than comparison operator. The
subtraction between ``%ptrA`` and ``%ptrB`` could be negative. The

I would remove this line about the result being negative too.

Done.

Done.
llvm/docs/LangRef.rst
Outdated
The intrinsic will return poison if ``%ptrA`` and ``%ptrB`` are within
VF * ``%elementSize`` of each other and ``%ptrA`` + VF * ``%elementSize`` wraps.

Move this up above the other explanation to make it more prominent, and explain how the (%ptrB - %ptrA) / %elementSize doesn't always apply.

Done. Let me know if it needs changing more.

Thanks, I'm not sure about these shouldExpand functions but I can see that is used elsewhere, and in general this LGTM. It would be good to use these to generate runtime alias checks using the last lane.
I've made some changes to relocate the default lowering for the intrinsic so that SelectionDAGBuilder.cpp doesn't call any TTI hooks. The AArch64 WHILEWR/RW instructions accept scalable vector output types so I mimicked the methods used for active lane mask lowering when the output is fixed instead. This originally produced an extra
If you want to upgrade the whilewr intrinsics (which I think sounds OK to me), then it will need auto-upgrade code something like in https://github.com/llvm/llvm-project/pull/120363/files#diff-0c0305d510a076cef711c006c1d9fd78c95cade1f597d21ee46fd753e6982316.
It might be good to separate that out into a separate patch too, to keep things manageable.
@@ -567,6 +567,9 @@ std::string SDNode::getOperationName(const SelectionDAG *G) const {
  case ISD::EXPERIMENTAL_VECTOR_HISTOGRAM:
    return "histogram";

  case ISD::EXPERIMENTAL_ALIAS_LANE_MASK:
    return "alias_mask";

alias_lane_mask

Done.
@@ -2033,6 +2041,25 @@ bool AArch64TargetLowering::shouldExpandGetActiveLaneMask(EVT ResVT,
   return false;
 }

bool AArch64TargetLowering::shouldExpandGetAliasLaneMask(

Can this be removed now?

It certainly can. Done.
Thanks for that. I've removed them and am no longer seeing the extra
Force-pushed d093339 to 7124e2c.
} // End HasSVE2_or_SME
defm WHILEWR_PXX : sve2_int_while_rr<0b0, "whilewr", AArch64whilewr>;
defm WHILERW_PXX : sve2_int_while_rr<0b1, "whilerw", AArch64whilerw>;
} // End HasSVE2orSME

Undo this. (It looks like a merge-conflict went the wrong way.)

Thanks for spotting that, fixed.
@@ -19861,7 +19946,8 @@ static SDValue getPTest(SelectionDAG &DAG, EVT VT, SDValue Pg, SDValue Op,
                        AArch64CC::CondCode Cond);

static bool isPredicateCCSettingOp(SDValue N) {
  if ((N.getOpcode() == ISD::SETCC) ||
  if ((N.getOpcode() == ISD::SETCC ||
      N.getOpcode() == ISD::EXPERIMENTAL_ALIAS_LANE_MASK) ||

Does adding this mean we need to always lower this to a while?

Perhaps it does. Do you think it's a problem that not all vector types are marked as legal?

Does it do anything at the moment, and do you have a test for it? I think this is only used in performFirstTrueTestVectorCombine, and that has a test for !isBeforeLegalize so protects against the wrong types. Maybe it is good to separate that part out into a new commit and make sure it has plenty of tests if it is not needed already.

I was just following what had been implemented for get.active.lane.mask, so I can remove this and revisit it if needed later.

Thanks. LGTM
Force-pushed b856ecf to 5402e27.
LGTM with nit addressed

@@ -0,0 +1,48 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5

nit: maybe merge this file with alias_mask_nosve2.ll and have two RUN lines?
(in that case, would expand_alias_mask.ll be a better name?)

I could be missing something, but if we merge them, then the sve-less run line will fail since it can't handle the scalable vector test.
case Intrinsic::loop_dependence_raw_mask:
case Intrinsic::loop_dependence_war_mask: {

These cases can be removed now.

Done.
Could you update the description to reflect the current state of the PR?
From the current description it is also not clear why we need different intrinsics for write-after-read and read-after-write. It would be great if you could add a brief explanation to the description as well.

I think the title/description still needs updating.
* (ptrB - ptrA) <= 0 (guarantees that all lanes are loaded before any stores are
  committed), or
* (ptrB - ptrA) >= elementSize * lane (guarantees that this lane is loaded
  before the store to the same address is committed)

I don't think committed is a term that is defined/used in LangRef. Would be good to reframe this in general terms as well.

Done.
Given a scalar store to %ptrA, followed by a scalar load from %ptrB, this
instruction generates a mask where an active lane indicates that there is no
read-after-write hazard for this lane and that this lane does not introduce any
new store-to-load forwarding hazard.

store-to-load forwarding hazard is not defined. Do we need this wording here?

Removed.

The wording for the store-to-load forwarding (hazard) behaviour cannot be removed, because it is the only distinction between this intrinsic and the .war intrinsic, i.e. the "safe" requirement is not the only behaviour that this intrinsic implements.

I've re-added the hazard wording, thanks.
A read-after-write hazard occurs when a read-after-write sequence for a given
lane in a vector ends up being executed as a write-after-read sequence due to
the aliasing of pointers.

Can we just explain this generally (hazard language is not used in LangRef)? Does this simply say that instead of first reading and then storing a lane, it is stored first instead?

Done.

The reason I specifically suggested using the "introduces a hazard" terminology (along with a subsequent definition of what a hazard is) is because "safe" and "no alias" do not cover the semantics.
'no alias' can't be used because both intrinsics still return an all-active mask when their pointers fully alias.
'safe' doesn't cover the .raw intrinsic because it sets lanes to inactive when they are safe but would otherwise introduce a store-to-load forwarding hazard.

Since the latest review and commit conflict with some of @sdesmalen-arm's previously proposed changes, I'd like his opinion on the new LangRef text.
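As a reference point for this terminology discussion, here is a hedged per-lane C++ sketch of the two documented behaviours (the helper names are illustrative, not LLVM API): both flavours stay all-active when the pointers fully alias, and only the read-after-write (.raw) flavour takes the absolute difference.

```cpp
#include <cstdint>
#include <cstdlib>

// Per-lane sketch of the two mask flavours as documented in this PR
// (illustrative helper names, not LLVM API):
//   war: a load from PtrA followed by a store to PtrB in the same iteration.
//   raw: a store to PtrA followed by a load from PtrB; uses the absolute
//        difference and keeps the fully-aliasing case (Diff == 0) all-active.
bool warLaneActive(int64_t PtrA, int64_t PtrB, int64_t EltSize, uint64_t Lane) {
  int64_t Diff = (PtrB - PtrA) / EltSize;
  return Diff <= 0 || Lane < static_cast<uint64_t>(Diff);
}

bool rawLaneActive(int64_t PtrA, int64_t PtrB, int64_t EltSize, uint64_t Lane) {
  int64_t Diff = std::abs(PtrB - PtrA) / EltSize;
  return Diff == 0 || Lane < static_cast<uint64_t>(Diff);
}
```

For identical pointers both predicates return true for every lane, which is why 'no alias' on its own does not describe the semantics.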
Thanks for spotting that. The description, title and LangRef have been updated. Please let me know if anything else needs changing.
|
||
abs(ptrB - ptrA) >= elementSize * lane (guarantees that the store of this lane
is committed before loading from this address)
occurs before loading from this address)
The case for ptrA == ptrB needs to be explicitly called out as 'safe' here, because it doesn't introduce any new hazards.
Done.
A read-after-write hazard occurs when a read-after-write sequence for a given
lane in a vector ends up being executed as a write-after-read sequence due to
the aliasing of pointers.
The reason I specifically suggested using the "introduces a hazard" terminology (along with a subsequent definition of what a hazard is) is because "safe" and "no alias" do not cover the semantics.
'no alias' can't be used because both intrinsics still return an all-active mask when their pointers fully alias.
'safe' doesn't cover the .raw intrinsic because it sets lanes to inactive when they are safe but would otherwise introduce a store-to-load forwarding hazard.
; CHECK-NEXT: csinc w0, w8, wzr, ne
; CHECK-NEXT: ret
entry:
%0 = call <1 x i1> @llvm.loop.dependence.war.mask.v16i1(ptr %a, ptr %b, i64 1)
%0 = call <1 x i1> @llvm.loop.dependence.war.mask.v16i1(ptr %a, ptr %b, i64 1)
%0 = call <1 x i1> @llvm.loop.dependence.war.mask.v1i1(ptr %a, ptr %b, i64 1)
(same for the tests below)
Done.
@@ -784,3 +784,115 @@ entry:
%0 = call <16 x i1> @llvm.loop.dependence.war.mask.v16i1(ptr %a, ptr %b, i64 3)
ret <16 x i1> %0
}

define <1 x i1> @whilewr_8_scalarize(ptr %a, ptr %b) {
nit: can you add a section header saying these tests are about scalarising <1 x i1> types?
Done.
@@ -765,3 +765,193 @@ entry:
%0 = call <vscale x 16 x i1> @llvm.loop.dependence.war.mask.nxv16i1(ptr %a, ptr %b, i64 3)
ret <vscale x 16 x i1> %0
}

define <vscale x 1 x i1> @whilewr_8_scalarize(ptr %a, ptr %b) {
This is not scalarizing a <vscale x 1 x i1> type; instead it seems to default to expand. I'm not sure how useful these tests are tbh, because this functionality should already be tested elsewhere, so maybe just remove them.
Yeah I can remove them 👍
llvm/docs/LangRef.rst
read-after-write sequence can be performed safely for that lane, without a
read-after-write hazard occurring or a a new store-to-load forwarding hazard
being introduced.
read-after-write sequence can be performed safely for that lane, without a
read-after-write hazard occurring or a a new store-to-load forwarding hazard
being introduced.
read-after-write sequence can be performed safely for that lane, without a
read-after-write hazard or a store-to-load forwarding hazard being introduced.
Done.
llvm/docs/LangRef.rst
A store-to-load forwarding hazard occurs when a vector store writes to an
address that partially overlaps with the address of a subsequent vector load.
Only the overlapping addresses can be forwarded to the load if the data hasn't
been written to memory yet.
The issue is that the load can't be performed until the write has completed, resulting in a stall that did not exist when executing as scalars. So perhaps you can write instead:
A store-to-load forwarding hazard occurs when a vector store writes to an
address that partially overlaps with the address of a subsequent vector load.
Only the overlapping addresses can be forwarded to the load if the data hasn't
been written to memory yet.
A store-to-load forwarding hazard occurs when a vector store writes to an
address that partially overlaps with the address of a subsequent vector load,
meaning that the vector load can't be performed until the vector store has completed.
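A small, hypothetical IR illustration of the partial-overlap case being described (not part of this patch): the 16-byte store and the 16-byte load share only half of their bytes, so the load can't be satisfied purely from the store buffer.

```llvm
define <4 x i32> @partial_overlap(ptr %p) {
  ; The store covers [%p, %p+16); the load covers [%p+8, %p+24).
  ; Only the first half of the load overlaps the store, so the load has
  ; to wait for the store to complete instead of being forwarded.
  store <4 x i32> <i32 0, i32 1, i32 2, i32 3>, ptr %p
  %q = getelementptr i8, ptr %p, i64 8
  %v = load <4 x i32>, ptr %q
  ret <4 x i32> %v
}
```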
Cheers.
occurs before loading from this address)
* ptrA == ptrB doesn't introduce any new hazards and is safe
occurs before loading from this address), or
* ptrA == ptrB, doesn't introduce any new hazards
* ptrA == ptrB, doesn't introduce any new hazards
* ptrA == ptrB (doesn't introduce any new hazards that weren't present in scalar code)
Done.
is smaller than ``VF * %elementsize`` and either ``%ptrA + VF * %elementSize``
or ``%ptrB + VF * %elementSize`` wrap.
The element of the result mask is active when loading from %ptrA then storing to
%ptrB is safe and doesn't result in a write-after-read hazard:
nit:
%ptrB is safe and doesn't result in a write-after-read hazard:
%ptrB is safe and doesn't result in a write-after-read hazard, meaning that:
Done, thank you.
LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/3/builds/21371. Here is the relevant piece of the build log for reference.
It can be unsafe to load a vector from an address and write a vector to an address if those two addresses have overlapping lanes within a vectorised loop iteration.
This PR adds intrinsics designed to create a mask with lanes disabled if they overlap between the two pointer arguments, so that only safe lanes are loaded, operated on and stored. The loop.dependence.war.mask intrinsic represents cases where the store occurs after the load, and the opposite for loop.dependence.raw.mask. The distinction between write-after-read and read-after-write is important, since the ordering of the read and write operations affects whether the chain of those instructions can be done safely.

Along with the two pointer parameters, the intrinsics also take an immediate that represents the size in bytes of the vector element types.
This will be used by #100579.
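As a rough sketch of the intended use (hypothetical IR; the function name and surrounding loop structure are invented, and the actual emission will come from the vectoriser changes in the PR referenced above), the mask gates the load, the arithmetic and the store of a vectorised loop body:

```llvm
declare <4 x i1> @llvm.loop.dependence.war.mask.v4i1(ptr, ptr, i64)
declare <4 x i32> @llvm.masked.load.v4i32.p0(ptr, i32, <4 x i1>, <4 x i32>)
declare void @llvm.masked.store.v4i32.p0(<4 x i32>, ptr, i32, <4 x i1>)

define void @vector_body(ptr %src, ptr %dst) {
  ; Lanes whose scalar load-then-store order would be broken by the
  ; vectorised all-loads-then-all-stores schedule are switched off.
  %mask = call <4 x i1> @llvm.loop.dependence.war.mask.v4i1(ptr %src, ptr %dst, i64 4)
  %v = call <4 x i32> @llvm.masked.load.v4i32.p0(ptr %src, i32 4, <4 x i1> %mask, <4 x i32> poison)
  %r = add <4 x i32> %v, <i32 1, i32 1, i32 1, i32 1>
  call void @llvm.masked.store.v4i32.p0(<4 x i32> %r, ptr %dst, i32 4, <4 x i1> %mask)
  ret void
}
```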