
Conversation

@lukel97 (Contributor) commented Aug 19, 2025

This PR adds support for lowering @llvm.vp.sdiv and @llvm.vp.udiv to sdiv and udiv respectively on SVE.

The motivation comes from an improvement we're trying to make in the loop vectorizer for RISC-V, but it may be beneficial for AArch64 too: #154076.

Given the loop below, where the division may trap (e.g. on a zero divisor):

void foo(int* __restrict__ dst, int* __restrict__ src, int N) {
    #pragma nounroll
    for (int i = 0; i < 12; i++)
        dst[i] = 42 / src[i];
}

When tail folded, the loop vectorizer today needs to "mask off" the excess lanes with a safe divisor of 1:

	mov	z0.s, #1
.LBB0_1:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
	ld1w	{ z2.s }, p1/z, [x1, x9, lsl #2]
	sel	z2.s, p1, z2.s, z0.s
	sdivr	z2.s, p0/m, z2.s, z1.s
	st1w	{ z2.s }, p1, [x0, x9, lsl #2]
	incw	x9
	whilelo	p1.s, x9, x8
	b.mi	.LBB0_1     

It does this by applying a select to the divisor before passing it into a regular sdiv instruction:

  %7 = select <vscale x 4 x i1> %active.lane.mask, <vscale x 4 x i32> %wide.masked.load, <vscale x 4 x i32> splat (i32 1)
  %8 = sdiv <vscale x 4 x i32> splat (i32 42), %7

However, we can remove the need for the sel and the broadcast safe divisor by using the @llvm.vp.sdiv intrinsic and passing in the active lane mask:

.LBB0_1:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
	ld1w	{ z1.s }, p0/z, [x1, x9, lsl #2]
	sdivr	z1.s, p0/m, z1.s, z0.s
	st1w	{ z1.s }, p0, [x0, x9, lsl #2]
	incw	x9
	whilelo	p0.s, x9, x8
	b.mi	.LBB0_1
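
For comparison, this is roughly the IR the vectorizer would emit in place of the select + sdiv above. This is only a sketch: the value names mirror the earlier snippet, and %evl is assumed to hold the number of remaining elements (or the full vector length when no EVL is needed).

  %7 = call <vscale x 4 x i32> @llvm.vp.sdiv(<vscale x 4 x i32> splat (i32 42), <vscale x 4 x i32> %wide.masked.load, <vscale x 4 x i1> %active.lane.mask, i32 %evl)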

This patch adds codegen support for AArch64 so that the loop vectorizer can emit the VP intrinsics in a later patch.

The main thing is that llvm.vp.sdiv doesn't have UB on the masked-off lanes at the IR level. IIUC sdiv/udiv on SVE don't trap on a zero divisor anyway, so we could just lower these VP intrinsics with an all-true predicate. However, given that division is probably slow, some microarchitectures may optimise for the case where some lanes are disabled, so it seems worthwhile to lower the VP mask down to the SVE predicate.

Note that the "EVL" operand of the VP intrinsic gets moved into the mask by ExpandVectorPredication, and when it equals the full vector length it is ignored completely.
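
For the nxv2i64 case this expansion looks roughly like the following. This is a sketch of what ExpandVectorPredication produces, not IR taken from the patch, and the value names are illustrative:

  ; before expansion: mask %m, element count %evl
  %z = call <vscale x 2 x i64> @llvm.vp.sdiv(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m, i32 %evl)

  ; after expansion: the EVL is folded into the mask via an active lane mask,
  ; and the EVL operand itself becomes the full vector length (vscale x 2),
  ; which the backend can then ignore
  %evl.mask = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i32(i32 0, i32 %evl)
  %new.mask = and <vscale x 2 x i1> %m, %evl.mask
  %vscale = call i32 @llvm.vscale.i32()
  %vl = shl i32 %vscale, 1
  %z.expanded = call <vscale x 2 x i64> @llvm.vp.sdiv(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %new.mask, i32 %vl)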

This only adds support for scalable vectors for now. I guess fixed-length vectors could be supported in the future, but we would need to make v2i1/v4i1/v8i1/v16i1 legal types first.

I don't think we need to support any other VP intrinsics on AArch64. The div/rem intrinsics are unique in that they specifically allow us to prevent UB.

@llvmbot (Member) commented Aug 19, 2025

@llvm/pr-subscribers-llvm-selectiondag

@llvm/pr-subscribers-backend-aarch64

Author: Luke Lau (lukel97)


Full diff: https://github.com/llvm/llvm-project/pull/154327.diff

7 Files Affected:

  • (modified) llvm/include/llvm/Target/TargetSelectionDAG.td (+5)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+47)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.h (+1)
  • (modified) llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td (+2-2)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h (+11)
  • (modified) llvm/lib/Target/AArch64/SVEInstrFormats.td (+6-1)
  • (added) llvm/test/CodeGen/AArch64/sve-vp-div.ll (+144)
diff --git a/llvm/include/llvm/Target/TargetSelectionDAG.td b/llvm/include/llvm/Target/TargetSelectionDAG.td
index a4ed62bb5715c..7a615e3a80622 100644
--- a/llvm/include/llvm/Target/TargetSelectionDAG.td
+++ b/llvm/include/llvm/Target/TargetSelectionDAG.td
@@ -115,6 +115,9 @@ def SDTPtrAddOp : SDTypeProfile<1, 2, [     // ptradd
 def SDTIntBinOp : SDTypeProfile<1, 2, [     // add, and, or, xor, udiv, etc.
   SDTCisSameAs<0, 1>, SDTCisSameAs<0, 2>, SDTCisInt<0>
 ]>;
+def SDTIntBinVPOp : SDTypeProfile<1, 4, [
+  SDTCisSameAs<0, 1>, SDTCisSameAs<0, 2>, SDTCisInt<0>, SDTCisVec<0>, SDTCVecEltisVT<3, i1>, SDTCisSameNumEltsAs<0, 3>, SDTCisInt<4>
+]>;
 def SDTIntShiftOp : SDTypeProfile<1, 2, [   // shl, sra, srl
   SDTCisSameAs<0, 1>, SDTCisInt<0>, SDTCisInt<2>
 ]>;
@@ -423,6 +426,8 @@ def smullohi   : SDNode<"ISD::SMUL_LOHI" , SDTIntBinHiLoOp, [SDNPCommutative]>;
 def umullohi   : SDNode<"ISD::UMUL_LOHI" , SDTIntBinHiLoOp, [SDNPCommutative]>;
 def sdiv       : SDNode<"ISD::SDIV"      , SDTIntBinOp>;
 def udiv       : SDNode<"ISD::UDIV"      , SDTIntBinOp>;
+def vp_sdiv    : SDNode<"ISD::VP_SDIV"   , SDTIntBinVPOp>;
+def vp_udiv    : SDNode<"ISD::VP_UDIV"   , SDTIntBinVPOp>;
 def srem       : SDNode<"ISD::SREM"      , SDTIntBinOp>;
 def urem       : SDNode<"ISD::UREM"      , SDTIntBinOp>;
 def sdivrem    : SDNode<"ISD::SDIVREM"   , SDTIntBinHiLoOp>;
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index aefbbe2534be2..1edbab78952ae 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -1594,6 +1594,17 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
         setOperationAction(ISD::OR, VT, Custom);
     }
 
+    for (auto VT : {MVT::nxv4i32, MVT::nxv2i64}) {
+      setOperationAction(ISD::VP_SDIV, VT, Legal);
+      setOperationAction(ISD::VP_UDIV, VT, Legal);
+    }
+    // SVE doesn't have i8 and i16 DIV operations, so custom lower them to
+    // 32-bit operations.
+    for (auto VT : {MVT::nxv16i8, MVT::nxv8i16}) {
+      setOperationAction(ISD::VP_SDIV, VT, Custom);
+      setOperationAction(ISD::VP_UDIV, VT, Custom);
+    }
+
     // Illegal unpacked integer vector types.
     for (auto VT : {MVT::nxv8i8, MVT::nxv4i16, MVT::nxv2i32}) {
       setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Custom);
@@ -7462,6 +7473,9 @@ SDValue AArch64TargetLowering::LowerOperation(SDValue Op,
   case ISD::SDIV:
   case ISD::UDIV:
     return LowerDIV(Op, DAG);
+  case ISD::VP_SDIV:
+  case ISD::VP_UDIV:
+    return LowerVP_DIV(Op, DAG);
   case ISD::SMIN:
   case ISD::UMIN:
   case ISD::SMAX:
@@ -15870,6 +15884,39 @@ SDValue AArch64TargetLowering::LowerDIV(SDValue Op, SelectionDAG &DAG) const {
   return DAG.getNode(AArch64ISD::UZP1, DL, VT, ResultLoCast, ResultHiCast);
 }
 
+SDValue AArch64TargetLowering::LowerVP_DIV(SDValue Op,
+                                           SelectionDAG &DAG) const {
+  EVT VT = Op.getValueType();
+  SDLoc DL(Op);
+  bool Signed = Op.getOpcode() == ISD::VP_SDIV;
+
+  // SVE doesn't have i8 and i16 DIV operations; widen them to 32-bit
+  // operations, and truncate the result.
+  EVT WidenedVT;
+  if (VT == MVT::nxv16i8)
+    WidenedVT = MVT::nxv8i16;
+  else if (VT == MVT::nxv8i16)
+    WidenedVT = MVT::nxv4i32;
+  else
+    llvm_unreachable("Unexpected Custom DIV operation");
+
+  auto [MaskLo, MaskHi] = DAG.SplitVector(Op.getOperand(2), DL);
+  auto [EVLLo, EVLHi] = DAG.SplitEVL(Op.getOperand(3), WidenedVT, DL);
+  unsigned UnpkLo = Signed ? AArch64ISD::SUNPKLO : AArch64ISD::UUNPKLO;
+  unsigned UnpkHi = Signed ? AArch64ISD::SUNPKHI : AArch64ISD::UUNPKHI;
+  SDValue Op0Lo = DAG.getNode(UnpkLo, DL, WidenedVT, Op.getOperand(0));
+  SDValue Op1Lo = DAG.getNode(UnpkLo, DL, WidenedVT, Op.getOperand(1));
+  SDValue Op0Hi = DAG.getNode(UnpkHi, DL, WidenedVT, Op.getOperand(0));
+  SDValue Op1Hi = DAG.getNode(UnpkHi, DL, WidenedVT, Op.getOperand(1));
+  SDValue ResultLo =
+      DAG.getNode(Op.getOpcode(), DL, WidenedVT, Op0Lo, Op1Lo, MaskLo, EVLLo);
+  SDValue ResultHi =
+      DAG.getNode(Op.getOpcode(), DL, WidenedVT, Op0Hi, Op1Hi, MaskHi, EVLHi);
+  SDValue ResultLoCast = DAG.getNode(AArch64ISD::NVCAST, DL, VT, ResultLo);
+  SDValue ResultHiCast = DAG.getNode(AArch64ISD::NVCAST, DL, VT, ResultHi);
+  return DAG.getNode(AArch64ISD::UZP1, DL, VT, ResultLoCast, ResultHiCast);
+}
+
 bool AArch64TargetLowering::shouldExpandBuildVectorWithShuffles(
     EVT VT, unsigned DefinedValues) const {
   if (!Subtarget->isNeonAvailable())
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.h b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
index 78d6a507b80d3..0251ab594e488 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -706,6 +706,7 @@ class AArch64TargetLowering : public TargetLowering {
   SDValue LowerPARTIAL_REDUCE_MLA(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerGET_ACTIVE_LANE_MASK(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerDIV(SDValue Op, SelectionDAG &DAG) const;
+  SDValue LowerVP_DIV(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerMUL(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerVectorSRA_SRL_SHL(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerShiftParts(SDValue Op, SelectionDAG &DAG) const;
diff --git a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
index 509dd8b73a017..ee17fef3a3b3a 100644
--- a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
@@ -700,8 +700,8 @@ let Predicates = [HasSVE_or_SME] in {
   defm SDIVR_ZPmZ : sve_int_bin_pred_arit_2_div<0b110, "sdivr", "SDIVR_ZPZZ", int_aarch64_sve_sdivr, DestructiveBinaryCommWithRev, "SDIV_ZPmZ", /*isReverseInstr*/ 1>;
   defm UDIVR_ZPmZ : sve_int_bin_pred_arit_2_div<0b111, "udivr", "UDIVR_ZPZZ", int_aarch64_sve_udivr, DestructiveBinaryCommWithRev, "UDIV_ZPmZ", /*isReverseInstr*/ 1>;
 
-  defm SDIV_ZPZZ  : sve_int_bin_pred_sd<AArch64sdiv_p>;
-  defm UDIV_ZPZZ  : sve_int_bin_pred_sd<AArch64udiv_p>;
+  defm SDIV_ZPZZ  : sve_int_bin_pred_sd<AArch64sdiv_p, vp_sdiv>;
+  defm UDIV_ZPZZ  : sve_int_bin_pred_sd<AArch64udiv_p, vp_udiv>;
 
   defm SDOT_ZZZ : sve_intx_dot<0b0, "sdot", AArch64sdot>;
   defm UDOT_ZZZ : sve_intx_dot<0b1, "udot", AArch64udot>;
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index 9c96fdd427814..ea6cf1a7e21d1 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -156,6 +156,17 @@ class AArch64TTIImpl final : public BasicTTIImplBase<AArch64TTIImpl> {
 
   bool isVScaleKnownToBeAPowerOfTwo() const override { return true; }
 
+  TargetTransformInfo::VPLegalization
+  getVPLegalizationStrategy(const VPIntrinsic &PI) const override {
+    using VPLegalization = TargetTransformInfo::VPLegalization;
+    switch (PI.getIntrinsicID()) {
+    case Intrinsic::vp_sdiv:
+    case Intrinsic::vp_udiv:
+      return VPLegalization(VPLegalization::Discard, VPLegalization::Legal);
+    }
+    return BaseT::getVPLegalizationStrategy(PI);
+  }
+
   bool shouldMaximizeVectorBandwidth(
       TargetTransformInfo::RegisterKind K) const override;
 
diff --git a/llvm/lib/Target/AArch64/SVEInstrFormats.td b/llvm/lib/Target/AArch64/SVEInstrFormats.td
index a3a7d0f74e1bc..ada6a47590ed2 100644
--- a/llvm/lib/Target/AArch64/SVEInstrFormats.td
+++ b/llvm/lib/Target/AArch64/SVEInstrFormats.td
@@ -9788,12 +9788,17 @@ multiclass sve_int_bin_pred_bhsd<SDPatternOperator op> {
 }
 
 // As sve_int_bin_pred but when only i32 and i64 vector types are required.
-multiclass sve_int_bin_pred_sd<SDPatternOperator op> {
+multiclass sve_int_bin_pred_sd<SDPatternOperator op, SDPatternOperator vp_op> {
   def _S_UNDEF : PredTwoOpPseudo<NAME # _S, ZPR32, FalseLanesUndef>;
   def _D_UNDEF : PredTwoOpPseudo<NAME # _D, ZPR64, FalseLanesUndef>;
 
   def : SVE_3_Op_Pat<nxv4i32, op, nxv4i1, nxv4i32, nxv4i32, !cast<Pseudo>(NAME # _S_UNDEF)>;
   def : SVE_3_Op_Pat<nxv2i64, op, nxv2i1, nxv2i64, nxv2i64, !cast<Pseudo>(NAME # _D_UNDEF)>;
+
+  def : Pat<(nxv4i32 (vp_op nxv4i32:$lhs, nxv4i32:$rhs, nxv4i1:$pred, (i32 srcvalue))),
+            (!cast<Pseudo>(NAME # _S_UNDEF) $pred, $lhs, $rhs)>;
+  def : Pat<(nxv2i64 (vp_op nxv2i64:$lhs, nxv2i64:$rhs, nxv2i1:$pred, (i32 srcvalue))),
+            (!cast<Pseudo>(NAME # _D_UNDEF) $pred, $lhs, $rhs)>;
 }
 
 // Predicated pseudo integer two operand instructions. Second operand is an
diff --git a/llvm/test/CodeGen/AArch64/sve-vp-div.ll b/llvm/test/CodeGen/AArch64/sve-vp-div.ll
new file mode 100644
index 0000000000000..920b4308ecfdd
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sve-vp-div.ll
@@ -0,0 +1,144 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64 -mattr=+sve < %s | FileCheck %s
+
+define <vscale x 2 x i64> @sdiv_evl_max(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %mask) {
+; CHECK-LABEL: sdiv_evl_max:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    sdiv z0.d, p0/m, z0.d, z1.d
+; CHECK-NEXT:    ret
+  %vscale = call i32 @llvm.vscale()
+  %evl = mul i32 %vscale, 2
+  %z = call <vscale x 2 x i64> @llvm.vp.sdiv(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %mask, i32 %evl)
+  ret <vscale x 2 x i64> %z
+}
+
+define <vscale x 2 x i64> @sdiv_nxv2i64(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %mask, i32 %evl) {
+; CHECK-LABEL: sdiv_nxv2i64:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    whilelo p1.d, wzr, w0
+; CHECK-NEXT:    and p0.b, p1/z, p1.b, p0.b
+; CHECK-NEXT:    sdiv z0.d, p0/m, z0.d, z1.d
+; CHECK-NEXT:    ret
+  %z = call <vscale x 2 x i64> @llvm.vp.sdiv(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %mask, i32 %evl)
+  ret <vscale x 2 x i64> %z
+}
+
+define <vscale x 4 x i32> @sdiv_nxv4i32(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %mask, i32 %evl) {
+; CHECK-LABEL: sdiv_nxv4i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    whilelo p1.s, wzr, w0
+; CHECK-NEXT:    and p0.b, p1/z, p1.b, p0.b
+; CHECK-NEXT:    sdiv z0.s, p0/m, z0.s, z1.s
+; CHECK-NEXT:    ret
+  %z = call <vscale x 4 x i32> @llvm.vp.sdiv(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %mask, i32 %evl)
+  ret <vscale x 4 x i32> %z
+}
+
+define <vscale x 8 x i16> @sdiv_nxv8i16(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %mask, i32 %evl) {
+; CHECK-LABEL: sdiv_nxv8i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    whilelo p1.h, wzr, w0
+; CHECK-NEXT:    sunpkhi z2.s, z1.h
+; CHECK-NEXT:    sunpkhi z3.s, z0.h
+; CHECK-NEXT:    sunpklo z1.s, z1.h
+; CHECK-NEXT:    sunpklo z0.s, z0.h
+; CHECK-NEXT:    and p0.b, p1/z, p1.b, p0.b
+; CHECK-NEXT:    punpkhi p1.h, p0.b
+; CHECK-NEXT:    punpklo p0.h, p0.b
+; CHECK-NEXT:    sdivr z2.s, p1/m, z2.s, z3.s
+; CHECK-NEXT:    sdiv z0.s, p0/m, z0.s, z1.s
+; CHECK-NEXT:    uzp1 z0.h, z0.h, z2.h
+; CHECK-NEXT:    ret
+  %z = call <vscale x 8 x i16> @llvm.vp.sdiv(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %mask, i32 %evl)
+  ret <vscale x 8 x i16> %z
+}
+
+define <vscale x 8 x i16> @sdiv_nxv16i8(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %mask, i32 %evl) {
+; CHECK-LABEL: sdiv_nxv16i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    whilelo p1.h, wzr, w0
+; CHECK-NEXT:    sunpkhi z2.s, z1.h
+; CHECK-NEXT:    sunpkhi z3.s, z0.h
+; CHECK-NEXT:    sunpklo z1.s, z1.h
+; CHECK-NEXT:    sunpklo z0.s, z0.h
+; CHECK-NEXT:    and p0.b, p1/z, p1.b, p0.b
+; CHECK-NEXT:    punpkhi p1.h, p0.b
+; CHECK-NEXT:    punpklo p0.h, p0.b
+; CHECK-NEXT:    sdivr z2.s, p1/m, z2.s, z3.s
+; CHECK-NEXT:    sdiv z0.s, p0/m, z0.s, z1.s
+; CHECK-NEXT:    uzp1 z0.h, z0.h, z2.h
+; CHECK-NEXT:    ret
+  %z = call <vscale x 8 x i16> @llvm.vp.sdiv(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %mask, i32 %evl)
+  ret <vscale x 8 x i16> %z
+}
+
+define <vscale x 2 x i64> @udiv_evl_max(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %mask) {
+; CHECK-LABEL: udiv_evl_max:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    udiv z0.d, p0/m, z0.d, z1.d
+; CHECK-NEXT:    ret
+  %vscale = call i32 @llvm.vscale()
+  %evl = mul i32 %vscale, 2
+  %z = call <vscale x 2 x i64> @llvm.vp.udiv(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %mask, i32 %evl)
+  ret <vscale x 2 x i64> %z
+}
+
+define <vscale x 2 x i64> @udiv_nxv2i64(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %mask, i32 %evl) {
+; CHECK-LABEL: udiv_nxv2i64:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    whilelo p1.d, wzr, w0
+; CHECK-NEXT:    and p0.b, p1/z, p1.b, p0.b
+; CHECK-NEXT:    udiv z0.d, p0/m, z0.d, z1.d
+; CHECK-NEXT:    ret
+  %z = call <vscale x 2 x i64> @llvm.vp.udiv(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %mask, i32 %evl)
+  ret <vscale x 2 x i64> %z
+}
+
+define <vscale x 4 x i32> @udiv_nxv4i32(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %mask, i32 %evl) {
+; CHECK-LABEL: udiv_nxv4i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    whilelo p1.s, wzr, w0
+; CHECK-NEXT:    and p0.b, p1/z, p1.b, p0.b
+; CHECK-NEXT:    udiv z0.s, p0/m, z0.s, z1.s
+; CHECK-NEXT:    ret
+  %z = call <vscale x 4 x i32> @llvm.vp.udiv(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %mask, i32 %evl)
+  ret <vscale x 4 x i32> %z
+}
+
+define <vscale x 8 x i16> @udiv_nxv8i16(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %mask, i32 %evl) {
+; CHECK-LABEL: udiv_nxv8i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    whilelo p1.h, wzr, w0
+; CHECK-NEXT:    uunpkhi z2.s, z1.h
+; CHECK-NEXT:    uunpkhi z3.s, z0.h
+; CHECK-NEXT:    uunpklo z1.s, z1.h
+; CHECK-NEXT:    uunpklo z0.s, z0.h
+; CHECK-NEXT:    and p0.b, p1/z, p1.b, p0.b
+; CHECK-NEXT:    punpkhi p1.h, p0.b
+; CHECK-NEXT:    punpklo p0.h, p0.b
+; CHECK-NEXT:    udivr z2.s, p1/m, z2.s, z3.s
+; CHECK-NEXT:    udiv z0.s, p0/m, z0.s, z1.s
+; CHECK-NEXT:    uzp1 z0.h, z0.h, z2.h
+; CHECK-NEXT:    ret
+  %z = call <vscale x 8 x i16> @llvm.vp.udiv(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %mask, i32 %evl)
+  ret <vscale x 8 x i16> %z
+}
+
+define <vscale x 8 x i16> @udiv_nxv16i8(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %mask, i32 %evl) {
+; CHECK-LABEL: udiv_nxv16i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    whilelo p1.h, wzr, w0
+; CHECK-NEXT:    uunpkhi z2.s, z1.h
+; CHECK-NEXT:    uunpkhi z3.s, z0.h
+; CHECK-NEXT:    uunpklo z1.s, z1.h
+; CHECK-NEXT:    uunpklo z0.s, z0.h
+; CHECK-NEXT:    and p0.b, p1/z, p1.b, p0.b
+; CHECK-NEXT:    punpkhi p1.h, p0.b
+; CHECK-NEXT:    punpklo p0.h, p0.b
+; CHECK-NEXT:    udivr z2.s, p1/m, z2.s, z3.s
+; CHECK-NEXT:    udiv z0.s, p0/m, z0.s, z1.s
+; CHECK-NEXT:    uzp1 z0.h, z0.h, z2.h
+; CHECK-NEXT:    ret
+  %z = call <vscale x 8 x i16> @llvm.vp.udiv(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %mask, i32 %evl)
+  ret <vscale x 8 x i16> %z
+}

switch (PI.getIntrinsicID()) {
case Intrinsic::vp_sdiv:
case Intrinsic::vp_udiv:
return VPLegalization(VPLegalization::Discard, VPLegalization::Legal);
@lukel97 (Contributor, Author) commented on the hunk above:

This is just saying that the EVL argument doesn't actually make a difference on SVE because udiv/sdiv don't trap anyway, so feel free to throw it away.

But since sdiv + udiv can have UB on those lanes, ExpandVectorPredication ends up moving the EVL into the mask operand, hence the

whilelo p1.d, wzr, w0
and p0.b, p1/z, p1.b, p0.b

in the tests

@paulwalker-arm (Collaborator) left a comment:

There needs to be wider discussion before going down this path. We've currently avoided adding AArch64 code generation support for the VP intrinsics because it is unclear what agreement is in place for them to be "the" solution to supporting masked operations within LLVM IR.

It's not a deal breaker, but I've never much liked the EVL parameter, or at least the way it is defined. This is because, as you say, for SVE it is meaningless, and I don't like having meaningless properties within the IR.

Looking at the current code, I think AArch64 isel is just missing PatFrags for select-based divides, much like the ones we have for select-based adds, subs, etc.
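
For context, such a PatFrag would presumably match merging IR along these lines and fold the pair into a single predicated divide (an illustrative sketch only, not code from this patch):

  %div = sdiv <vscale x 4 x i32> %a, %b
  %res = select <vscale x 4 x i1> %pg, <vscale x 4 x i32> %div, <vscale x 4 x i32> %a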

@lukel97 (Contributor, Author) commented Aug 19, 2025

There needs to be wider discussion before going down this path. We've currently avoided adding AArch64 code generation support for the VP intrinsics because it is unclear what agreement is in place for them to be "the" solution to supporting masked operations within LLVM IR.

To clarify, this PR only intends to add support for the division intrinsics and nothing else. I agree that VP intrinsics don't make much sense for AArch64 and I'm not proposing to support the entire suite of VP intrinsics.

For what it's worth, on RISC-V we have also been moving away from VP intrinsics, e.g. see #127180. We now just use regular instructions for the most part and perform the EVL optimisation at the MIR level. We only use VP intrinsics for things like loads/stores and permutations.

Looking at the current code I think AArch64 isel is just missing PatFrags for select based divides, much like we have for select based add, subs etc.

Unlike the other VP intrinsics, I think the division VP intrinsics are special in this regard because AFAIK there isn't a simple way to represent their semantics in LLVM IR.

We want to represent a division predicated by some mask, where:

  1. The masked-off lanes are poison in the result (this avoids the need for the sel)
  2. The masked-off lanes don't cause UB

To the best of my knowledge, we can't achieve both with a single select + div today. We would have to do something like this:

%s = select <4 x i1> %m, <4 x i32> %x, <4 x i32> splat (i32 1)
%div = sdiv <4 x i32> %y, %s
%res = select <4 x i1> %m, <4 x i32> %div, <4 x i32> poison

But this seems fragile if the selects get moved around, and it makes costing in the loop vectorizer less accurate (we end up with two selects that are actually free).

So in my head I'm viewing llvm.vp.sdiv not as a regular VP intrinsic, but as something more like llvm.masked.sdiv.

@paulwalker-arm (Collaborator) commented:

Thanks for the clarification.

So in my head I'm somewhat viewing llvm.vp.sdiv as not a regular VP intrinsic, but something more like llvm.masked.sdiv.

In which case I'd rather see the addition of llvm.masked.sdiv because it's more in keeping with the existing masked operations rather than mixing between the two styles.

@lukel97 (Contributor, Author) commented Aug 19, 2025

In which case I'd rather see the addition of llvm.masked.sdiv because it's more in keeping with the existing masked operations rather than mixing between the two styles.

Unfortunately we still need the EVL operand for RISC-V, which is why I was hoping to emit the VP intrinsic from the loop vectorizer.

If it's of any consolation, the ExpandVectorPredication pass will "discard" the EVL operand by the time it hits the AArch64 backend, so it can just be ignored during isel.

@lukel97 (Contributor, Author) commented Aug 19, 2025

I should also mention that the AArch64 backend is not my neck of the woods at all, and the original PR for RISC-V in #154076 isn't blocked by this.

Happy to go with whatever the actual AArch64 people here think is the right direction; I just thought I'd post this patch as a proof of concept in case others find it interesting!

@lukel97 lukel97 closed this Sep 4, 2025

Labels

backend:AArch64, llvm:SelectionDAG
