
Commit 569d738

[Intrinsics][AArch64] Add intrinsics for masking off aliasing vector lanes (#117007)
It can be unsafe to load a vector from one address and write a vector to another address if the two accesses have overlapping lanes within a vectorised loop iteration. This PR adds intrinsics designed to create a mask with lanes disabled where the two pointer arguments overlap, so that only the safe lanes are loaded, operated on and stored. The `loop.dependence.war.mask` intrinsic covers the case where the store occurs after the load, and `loop.dependence.raw.mask` the opposite. The distinction between write-after-read and read-after-write matters because the ordering of the read and write operations determines whether the chain of those instructions can be executed safely. Along with the two pointer parameters, the intrinsics also take an immediate representing the size in bytes of the vector element type. This will be used by #100579.
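As a rough sketch of the intended use (not code from this patch; the value names, the add, and the fixed <4 x i32> shape are assumed for illustration), a vectorised write-after-read sequence would be guarded like this:

    ; Mask off any lanes where the later store to %ptrB would alias the
    ; earlier load from %ptrA, assuming 4-byte (i32) elements.
    %alias.mask = call <4 x i1> @llvm.loop.dependence.war.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 4)
    %vecA = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(ptr %ptrA, i32 4, <4 x i1> %alias.mask, <4 x i32> poison)
    %sum = add <4 x i32> %vecA, %vecA
    call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %sum, ptr %ptrB, i32 4, <4 x i1> %alias.mask)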
1 parent 0196d7e commit 569d738

19 files changed (+2173, -7 lines)

llvm/docs/LangRef.rst

Lines changed: 124 additions & 0 deletions
@@ -24019,6 +24019,130 @@ Examples:

      %wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %3, i32 4, <4 x i1> %active.lane.mask, <4 x i32> poison)


.. _int_loop_dependence_war_mask:

'``llvm.loop.dependence.war.mask.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""
This is an overloaded intrinsic.

::

      declare <4 x i1> @llvm.loop.dependence.war.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
      declare <8 x i1> @llvm.loop.dependence.war.mask.v8i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
      declare <16 x i1> @llvm.loop.dependence.war.mask.v16i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
      declare <vscale x 16 x i1> @llvm.loop.dependence.war.mask.nxv16i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)

Overview:
"""""""""

Given a vector load from %ptrA followed by a vector store to %ptrB, this
instruction generates a mask where an active lane indicates that the
write-after-read sequence can be performed safely for that lane, without the
danger of a write-after-read hazard occurring.

A write-after-read hazard occurs when a write-after-read sequence for a given
lane in a vector ends up being executed as a read-after-write sequence due to
the aliasing of pointers.

Arguments:
""""""""""

The first two arguments are pointers and the last argument is an immediate.
The result is a vector with the i1 element type.

Semantics:
""""""""""

``%elementSize`` is the size of the accessed elements in bytes.
The intrinsic returns ``poison`` if the distance between ``%ptrA`` and ``%ptrB``
is smaller than ``VF * %elementSize`` and either ``%ptrA + VF * %elementSize``
or ``%ptrB + VF * %elementSize`` wraps.
An element of the result mask is active when loading from %ptrA and then storing
to %ptrB is safe for that lane and doesn't result in a write-after-read hazard,
meaning that:

* (ptrB - ptrA) <= 0 (guarantees that all lanes are loaded before any stores), or
* (ptrB - ptrA) >= elementSize * lane (guarantees that this lane is loaded
  before the store to the same address)

Examples:
"""""""""

.. code-block:: llvm

      %loop.dependence.mask = call <4 x i1> @llvm.loop.dependence.war.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 4)
      %vecA = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(ptr %ptrA, i32 4, <4 x i1> %loop.dependence.mask, <4 x i32> poison)
      [...]
      call @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %vecA, ptr %ptrB, i32 4, <4 x i1> %loop.dependence.mask)
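
As a worked instance of the two conditions listed under Semantics (hypothetical pointer values, not part of the documented example): if %ptrB is 8 bytes past %ptrA and the element size is 4, lanes 0-2 satisfy (ptrB - ptrA) >= 4 * lane while lane 3 does not, so lane 3 is masked off.

      ; Assume %ptrB == %ptrA + 8 and 4-byte elements.
      %mask = call <4 x i1> @llvm.loop.dependence.war.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 4)
      ; Lane l is active when (ptrB - ptrA) <= 0 or (ptrB - ptrA) >= 4 * l:
      ;   lanes 0, 1, 2: 8 >= 0, 4, 8  -> active
      ;   lane 3:        8 <  12       -> inactive
      ; %mask == <i1 true, i1 true, i1 true, i1 false>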

.. _int_loop_dependence_raw_mask:

'``llvm.loop.dependence.raw.mask.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""
This is an overloaded intrinsic.

::

      declare <4 x i1> @llvm.loop.dependence.raw.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
      declare <8 x i1> @llvm.loop.dependence.raw.mask.v8i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
      declare <16 x i1> @llvm.loop.dependence.raw.mask.v16i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
      declare <vscale x 16 x i1> @llvm.loop.dependence.raw.mask.nxv16i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)

Overview:
"""""""""

Given a vector store to %ptrA followed by a vector load from %ptrB, this
instruction generates a mask where an active lane indicates that the
read-after-write sequence can be performed safely for that lane, without a
read-after-write hazard or a store-to-load forwarding hazard being introduced.

A read-after-write hazard occurs when a read-after-write sequence for a given
lane in a vector ends up being executed as a write-after-read sequence due to
the aliasing of pointers.

A store-to-load forwarding hazard occurs when a vector store writes to an
address that partially overlaps with the address of a subsequent vector load,
meaning that the vector load can't be performed until the vector store is
complete.

Arguments:
""""""""""

The first two arguments are pointers and the last argument is an immediate.
The result is a vector with the i1 element type.

Semantics:
""""""""""

``%elementSize`` is the size of the accessed elements in bytes.
The intrinsic returns ``poison`` if the distance between ``%ptrA`` and ``%ptrB``
is smaller than ``VF * %elementSize`` and either ``%ptrA + VF * %elementSize``
or ``%ptrB + VF * %elementSize`` wraps.
An element of the result mask is active when storing to %ptrA and then loading
from %ptrB is safe for that lane and doesn't result in aliasing, meaning that:

* abs(ptrB - ptrA) >= elementSize * lane (guarantees that the store of this lane
  occurs before the load from the same address), or
* ptrA == ptrB (doesn't introduce any new hazards that weren't in the scalar
  code)

Examples:
"""""""""

.. code-block:: llvm

      %loop.dependence.mask = call <4 x i1> @llvm.loop.dependence.raw.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 4)
      call @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %vecA, ptr %ptrA, i32 4, <4 x i1> %loop.dependence.mask)
      [...]
      %vecB = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(ptr %ptrB, i32 4, <4 x i1> %loop.dependence.mask, <4 x i32> poison)
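
As a worked instance of the conditions above (again with hypothetical pointer values): with an 8-byte distance and 4-byte elements, abs(ptrB - ptrA) >= 4 * lane holds for lanes 0-2 only, while the all-active case requires %ptrA == %ptrB exactly.

      ; Assume %ptrB == %ptrA + 8 (or %ptrA - 8) and 4-byte elements.
      %mask = call <4 x i1> @llvm.loop.dependence.raw.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 4)
      ; Lane l is active when abs(ptrB - ptrA) >= 4 * l or ptrA == ptrB:
      ;   lanes 0, 1, 2: 8 >= 0, 4, 8  -> active
      ;   lane 3:        8 <  12       -> inactive
      ; %mask == <i1 true, i1 true, i1 true, i1 false>
      ; With %ptrA == %ptrB every lane is active, since no new hazard is introduced.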

.. _int_experimental_vp_splice:

'``llvm.experimental.vp.splice``' Intrinsic

llvm/include/llvm/CodeGen/ISDOpcodes.h

Lines changed: 6 additions & 0 deletions
@@ -1558,6 +1558,12 @@ enum NodeType {
  // bits conform to getBooleanContents similar to the SETCC operator.
  GET_ACTIVE_LANE_MASK,

  // The `llvm.loop.dependence.{war, raw}.mask` intrinsics
  // Operands: Load pointer, Store pointer, Element size
  // Output: Mask
  LOOP_DEPENDENCE_WAR_MASK,
  LOOP_DEPENDENCE_RAW_MASK,

  // llvm.clear_cache intrinsic
  // Operands: Input Chain, Start Addres, End Address
  // Outputs: Output Chain

llvm/include/llvm/IR/Intrinsics.td

Lines changed: 10 additions & 0 deletions
@@ -2420,6 +2420,16 @@ let IntrProperties = [IntrNoMem, ImmArg<ArgIndex<1>>] in {
                             llvm_i32_ty]>;
}

def int_loop_dependence_raw_mask:
    DefaultAttrsIntrinsic<[llvm_anyvector_ty],
                          [llvm_ptr_ty, llvm_ptr_ty, llvm_i64_ty],
                          [IntrNoMem, IntrNoSync, IntrWillReturn, ImmArg<ArgIndex<2>>]>;

def int_loop_dependence_war_mask:
    DefaultAttrsIntrinsic<[llvm_anyvector_ty],
                          [llvm_ptr_ty, llvm_ptr_ty, llvm_i64_ty],
                          [IntrNoMem, IntrNoSync, IntrWillReturn, ImmArg<ArgIndex<2>>]>;

def int_get_active_lane_mask:
    DefaultAttrsIntrinsic<[llvm_anyvector_ty],
                          [llvm_anyint_ty, LLVMMatchType<1>],

llvm/include/llvm/Target/TargetSelectionDAG.td

Lines changed: 8 additions & 0 deletions
@@ -833,6 +833,14 @@ def step_vector : SDNode<"ISD::STEP_VECTOR", SDTypeProfile<1, 1,
def scalar_to_vector : SDNode<"ISD::SCALAR_TO_VECTOR", SDTypeProfile<1, 1, []>,
                              []>;

def SDTLoopDepMask : SDTypeProfile<1, 3, [SDTCisVec<0>, SDTCisInt<1>,
                                          SDTCisSameAs<2, 1>, SDTCisInt<3>,
                                          SDTCVecEltisVT<0,i1>]>;
def loop_dependence_war_mask : SDNode<"ISD::LOOP_DEPENDENCE_WAR_MASK",
                                      SDTLoopDepMask, []>;
def loop_dependence_raw_mask : SDNode<"ISD::LOOP_DEPENDENCE_RAW_MASK",
                                      SDTLoopDepMask, []>;

// vector_extract/vector_insert are similar to extractelt/insertelt but allow
// types that require promotion (a 16i8 extract where i8 is not a legal type so
// uses i32 for example). extractelt/insertelt are preferred where the element

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp

Lines changed: 11 additions & 0 deletions
@@ -324,6 +324,11 @@ void DAGTypeLegalizer::PromoteIntegerResult(SDNode *N, unsigned ResNo) {
    Res = PromoteIntRes_VP_REDUCE(N);
    break;

  case ISD::LOOP_DEPENDENCE_WAR_MASK:
  case ISD::LOOP_DEPENDENCE_RAW_MASK:
    Res = PromoteIntRes_LOOP_DEPENDENCE_MASK(N);
    break;

  case ISD::FREEZE:
    Res = PromoteIntRes_FREEZE(N);
    break;
@@ -374,6 +379,12 @@ SDValue DAGTypeLegalizer::PromoteIntRes_MERGE_VALUES(SDNode *N,
  return GetPromotedInteger(Op);
}

SDValue DAGTypeLegalizer::PromoteIntRes_LOOP_DEPENDENCE_MASK(SDNode *N) {
  EVT VT = N->getValueType(0);
  EVT NewVT = TLI.getTypeToTransformTo(*DAG.getContext(), VT);
  return DAG.getNode(N->getOpcode(), SDLoc(N), NewVT, N->ops());
}

SDValue DAGTypeLegalizer::PromoteIntRes_AssertSext(SDNode *N) {
  // Sign-extend the new bits, and continue the assertion.
  SDValue Op = SExtPromotedInteger(N->getOperand(0));

llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h

Lines changed: 5 additions & 0 deletions
@@ -382,6 +382,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
  SDValue PromoteIntRes_VECTOR_FIND_LAST_ACTIVE(SDNode *N);
  SDValue PromoteIntRes_GET_ACTIVE_LANE_MASK(SDNode *N);
  SDValue PromoteIntRes_PARTIAL_REDUCE_MLA(SDNode *N);
  SDValue PromoteIntRes_LOOP_DEPENDENCE_MASK(SDNode *N);

  // Integer Operand Promotion.
  bool PromoteIntegerOperand(SDNode *N, unsigned OpNo);
@@ -436,6 +437,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
  SDValue PromoteIntOp_VECTOR_FIND_LAST_ACTIVE(SDNode *N, unsigned OpNo);
  SDValue PromoteIntOp_GET_ACTIVE_LANE_MASK(SDNode *N);
  SDValue PromoteIntOp_PARTIAL_REDUCE_MLA(SDNode *N);
  SDValue PromoteIntOp_LOOP_DEPENDENCE_MASK(SDNode *N, unsigned OpNo);

  void SExtOrZExtPromotedOperands(SDValue &LHS, SDValue &RHS);
  void PromoteSetCCOperands(SDValue &LHS,SDValue &RHS, ISD::CondCode Code);
@@ -868,6 +870,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
  // Vector Result Scalarization: <1 x ty> -> ty.
  void ScalarizeVectorResult(SDNode *N, unsigned ResNo);
  SDValue ScalarizeVecRes_MERGE_VALUES(SDNode *N, unsigned ResNo);
  SDValue ScalarizeVecRes_LOOP_DEPENDENCE_MASK(SDNode *N);
  SDValue ScalarizeVecRes_BinOp(SDNode *N);
  SDValue ScalarizeVecRes_CMP(SDNode *N);
  SDValue ScalarizeVecRes_TernaryOp(SDNode *N);
@@ -964,6 +967,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
  void SplitVecRes_FIX(SDNode *N, SDValue &Lo, SDValue &Hi);

  void SplitVecRes_BITCAST(SDNode *N, SDValue &Lo, SDValue &Hi);
  void SplitVecRes_LOOP_DEPENDENCE_MASK(SDNode *N, SDValue &Lo, SDValue &Hi);
  void SplitVecRes_BUILD_VECTOR(SDNode *N, SDValue &Lo, SDValue &Hi);
  void SplitVecRes_CONCAT_VECTORS(SDNode *N, SDValue &Lo, SDValue &Hi);
  void SplitVecRes_EXTRACT_SUBVECTOR(SDNode *N, SDValue &Lo, SDValue &Hi);
@@ -1070,6 +1074,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
  SDValue WidenVecRes_ADDRSPACECAST(SDNode *N);
  SDValue WidenVecRes_AssertZext(SDNode* N);
  SDValue WidenVecRes_BITCAST(SDNode* N);
  SDValue WidenVecRes_LOOP_DEPENDENCE_MASK(SDNode *N);
  SDValue WidenVecRes_BUILD_VECTOR(SDNode* N);
  SDValue WidenVecRes_CONCAT_VECTORS(SDNode* N);
  SDValue WidenVecRes_EXTEND_VECTOR_INREG(SDNode* N);

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp

Lines changed: 51 additions & 0 deletions
@@ -138,6 +138,7 @@ class VectorLegalizer {
  SDValue ExpandVP_FNEG(SDNode *Node);
  SDValue ExpandVP_FABS(SDNode *Node);
  SDValue ExpandVP_FCOPYSIGN(SDNode *Node);
  SDValue ExpandLOOP_DEPENDENCE_MASK(SDNode *N);
  SDValue ExpandSELECT(SDNode *Node);
  std::pair<SDValue, SDValue> ExpandLoad(SDNode *N);
  SDValue ExpandStore(SDNode *N);
@@ -475,6 +476,8 @@ SDValue VectorLegalizer::LegalizeOp(SDValue Op) {
  case ISD::VECTOR_COMPRESS:
  case ISD::SCMP:
  case ISD::UCMP:
  case ISD::LOOP_DEPENDENCE_WAR_MASK:
  case ISD::LOOP_DEPENDENCE_RAW_MASK:
    Action = TLI.getOperationAction(Node->getOpcode(), Node->getValueType(0));
    break;
  case ISD::SMULFIX:
@@ -1291,6 +1294,10 @@ void VectorLegalizer::Expand(SDNode *Node, SmallVectorImpl<SDValue> &Results) {
  case ISD::UCMP:
    Results.push_back(TLI.expandCMP(Node, DAG));
    return;
  case ISD::LOOP_DEPENDENCE_WAR_MASK:
  case ISD::LOOP_DEPENDENCE_RAW_MASK:
    Results.push_back(ExpandLOOP_DEPENDENCE_MASK(Node));
    return;

  case ISD::FADD:
  case ISD::FMUL:
@@ -1796,6 +1803,50 @@ SDValue VectorLegalizer::ExpandVP_FCOPYSIGN(SDNode *Node) {
  return DAG.getNode(ISD::BITCAST, DL, VT, CopiedSign);
}

SDValue VectorLegalizer::ExpandLOOP_DEPENDENCE_MASK(SDNode *N) {
  SDLoc DL(N);
  SDValue SourceValue = N->getOperand(0);
  SDValue SinkValue = N->getOperand(1);
  SDValue EltSize = N->getOperand(2);

  bool IsReadAfterWrite = N->getOpcode() == ISD::LOOP_DEPENDENCE_RAW_MASK;
  EVT VT = N->getValueType(0);
  EVT PtrVT = SourceValue->getValueType(0);

  SDValue Diff = DAG.getNode(ISD::SUB, DL, PtrVT, SinkValue, SourceValue);
  if (IsReadAfterWrite)
    Diff = DAG.getNode(ISD::ABS, DL, PtrVT, Diff);

  Diff = DAG.getNode(ISD::SDIV, DL, PtrVT, Diff, EltSize);

  // If the difference is positive then some elements may alias
  EVT CmpVT = TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(),
                                     Diff.getValueType());
  SDValue Zero = DAG.getTargetConstant(0, DL, PtrVT);
  SDValue Cmp = DAG.getSetCC(DL, CmpVT, Diff, Zero,
                             IsReadAfterWrite ? ISD::SETEQ : ISD::SETLE);

  // Create the lane mask
  EVT SplatVT = VT.changeElementType(PtrVT);
  SDValue DiffSplat = DAG.getSplat(SplatVT, DL, Diff);
  SDValue VectorStep = DAG.getStepVector(DL, SplatVT);
  EVT MaskVT = VT.changeElementType(MVT::i1);
  SDValue DiffMask =
      DAG.getSetCC(DL, MaskVT, VectorStep, DiffSplat, ISD::CondCode::SETULT);

  EVT EltVT = VT.getVectorElementType();
  // Extend the diff setcc in case the intrinsic has been promoted to a vector
  // type with elements larger than i1
  if (EltVT.getScalarSizeInBits() > MaskVT.getScalarSizeInBits())
    DiffMask = DAG.getNode(ISD::ANY_EXTEND, DL, VT, DiffMask);

  // Splat the compare result then OR it with the lane mask
  if (CmpVT.getScalarSizeInBits() < EltVT.getScalarSizeInBits())
    Cmp = DAG.getNode(ISD::ZERO_EXTEND, DL, EltVT, Cmp);
  SDValue Splat = DAG.getSplat(VT, DL, Cmp);
  return DAG.getNode(ISD::OR, DL, VT, DiffMask, Splat);
}

void VectorLegalizer::ExpandFP_TO_UINT(SDNode *Node,
                                       SmallVectorImpl<SDValue> &Results) {
  // Attempt to expand using TargetLowering.
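
For a fixed-width write-after-read mask with 4-byte elements, the node sequence built by ExpandLOOP_DEPENDENCE_MASK above corresponds roughly to the IR below. This is a hand-written sketch under those assumptions (value names are hypothetical), not output produced by the patch:

    %ptrA.int = ptrtoint ptr %ptrA to i64
    %ptrB.int = ptrtoint ptr %ptrB to i64
    ; diff = (ptrB - ptrA) / elementSize; the RAW form takes the absolute value
    ; of the subtraction before dividing.
    %sub = sub i64 %ptrB.int, %ptrA.int
    %diff = sdiv i64 %sub, 4
    ; If diff <= 0 (== 0 for the RAW form), every lane is safe.
    %all.safe = icmp sle i64 %diff, 0
    ; A lane l is safe when l u< diff, i.e. step_vector u< splat(diff).
    %diff.ins = insertelement <4 x i64> poison, i64 %diff, i64 0
    %diff.splat = shufflevector <4 x i64> %diff.ins, <4 x i64> poison, <4 x i32> zeroinitializer
    %lane.mask = icmp ult <4 x i64> <i64 0, i64 1, i64 2, i64 3>, %diff.splat
    ; OR in the splatted "all lanes safe" bit to form the final mask.
    %all.ins = insertelement <4 x i1> poison, i1 %all.safe, i64 0
    %all.splat = shufflevector <4 x i1> %all.ins, <4 x i1> poison, <4 x i32> zeroinitializer
    %mask = or <4 x i1> %lane.mask, %all.splat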
