Conversation

@mskamp
Contributor

@mskamp mskamp commented Aug 22, 2025

For integer types twice as large as a legal type, we previously
generated a library call if no other splitting technique was
applicable.

With this change, we use an adaptation of the Magic algorithm, which is
also used for UDIV/UREM by constants on legal types. The implementation
introduced here is a straightforward port of the existing implementation
to types twice the size of a legal type. The core idea of the algorithm
is to replace (udiv x c) for a constant c with the bits at or above the
s-th bit of the product of x and (2^s + o)/c, for suitable s and o. More
details are available in Henry S. Warren, Jr.:
"Hacker's Delight", chapter 10.

Efficient handling of UDIV/UREM by constants on types twice as large
as a legal type is mostly relevant on 32-bit platforms, but some
projects also benefit on 64-bit platforms. For example, the fmt
library for C++ uses 128-bit unsigned divisions by 100 and 10000, which
were not covered by the previously existing optimizations.

Closes #137514.

mskamp added 2 commits August 22, 2025 04:45
Add new test prefixes to some tests. These prefixes are currently
unused, but a subsequent commit will change the test results such that
they become necessary.

Furthermore, rename tests that will be folded after a subsequent commit.
@llvmbot
Member

llvmbot commented Aug 22, 2025

@llvm/pr-subscribers-backend-arm

@llvm/pr-subscribers-backend-x86

Author: Marius Kamp (mskamp)


Patch is 138.16 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/154968.diff

15 Files Affected:

  • (modified) llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp (+159-33)
  • (modified) llvm/test/CodeGen/AArch64/rem-by-const.ll (+210-70)
  • (modified) llvm/test/CodeGen/ARM/funnel-shift.ll (+130-110)
  • (modified) llvm/test/CodeGen/Mips/funnel-shift.ll (+188-185)
  • (modified) llvm/test/CodeGen/PowerPC/funnel-shift.ll (+134-178)
  • (modified) llvm/test/CodeGen/PowerPC/urem-lkk.ll (+82-18)
  • (modified) llvm/test/CodeGen/RISCV/div-by-constant.ll (+42-7)
  • (modified) llvm/test/CodeGen/RISCV/split-udiv-by-constant.ll (+144-28)
  • (modified) llvm/test/CodeGen/RISCV/split-urem-by-constant.ll (+192-28)
  • (modified) llvm/test/CodeGen/RISCV/urem-lkk.ll (+49-13)
  • (modified) llvm/test/CodeGen/RISCV/urem-vector-lkk.ll (+158-46)
  • (modified) llvm/test/CodeGen/X86/divide-by-constant.ll (+458-32)
  • (modified) llvm/test/CodeGen/X86/divmod128.ll (+625-15)
  • (modified) llvm/test/CodeGen/X86/funnel-shift.ll (+73-39)
  • (modified) llvm/test/CodeGen/X86/i128-udiv.ll (+37-9)
diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index 402a012e8e555..9e1a51e291ddb 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -8011,25 +8011,12 @@ bool TargetLowering::expandMUL(SDNode *N, SDValue &Lo, SDValue &Hi, EVT HiLoVT,
 // dividend and multiply by the multiplicative inverse of the shifted divisor.
 // If we want the remainder, we shift the value left by the number of trailing
 // zeros and add the bits that were shifted out of the dividend.
-bool TargetLowering::expandDIVREMByConstant(SDNode *N,
-                                            SmallVectorImpl<SDValue> &Result,
-                                            EVT HiLoVT, SelectionDAG &DAG,
-                                            SDValue LL, SDValue LH) const {
+static bool expandUDIVREMByConstantViaUREMDecomposition(
+    SDNode *N, APInt Divisor, SmallVectorImpl<SDValue> &Result, EVT HiLoVT,
+    SelectionDAG &DAG, SDValue LL, SDValue LH, const TargetLowering &TLI) {
   unsigned Opcode = N->getOpcode();
   EVT VT = N->getValueType(0);
 
-  // TODO: Support signed division/remainder.
-  if (Opcode == ISD::SREM || Opcode == ISD::SDIV || Opcode == ISD::SDIVREM)
-    return false;
-  assert(
-      (Opcode == ISD::UREM || Opcode == ISD::UDIV || Opcode == ISD::UDIVREM) &&
-      "Unexpected opcode");
-
-  auto *CN = dyn_cast<ConstantSDNode>(N->getOperand(1));
-  if (!CN)
-    return false;
-
-  APInt Divisor = CN->getAPIntValue();
   unsigned BitWidth = Divisor.getBitWidth();
   unsigned HBitWidth = BitWidth / 2;
   assert(VT.getScalarSizeInBits() == BitWidth &&
@@ -8040,20 +8027,6 @@ bool TargetLowering::expandDIVREMByConstant(SDNode *N,
   if (Divisor.uge(HalfMaxPlus1))
     return false;
 
-  // We depend on the UREM by constant optimization in DAGCombiner that requires
-  // high multiply.
-  if (!isOperationLegalOrCustom(ISD::MULHU, HiLoVT) &&
-      !isOperationLegalOrCustom(ISD::UMUL_LOHI, HiLoVT))
-    return false;
-
-  // Don't expand if optimizing for size.
-  if (DAG.shouldOptForSize())
-    return false;
-
-  // Early out for 0 or 1 divisors.
-  if (Divisor.ule(1))
-    return false;
-
   // If the divisor is even, shift it until it becomes odd.
   unsigned TrailingZeros = 0;
   if (!Divisor[0]) {
@@ -8097,8 +8070,8 @@ bool TargetLowering::expandDIVREMByConstant(SDNode *N,
 
     // Use uaddo_carry if we can, otherwise use a compare to detect overflow.
     EVT SetCCType =
-        getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), HiLoVT);
-    if (isOperationLegalOrCustom(ISD::UADDO_CARRY, HiLoVT)) {
+        TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), HiLoVT);
+    if (TLI.isOperationLegalOrCustom(ISD::UADDO_CARRY, HiLoVT)) {
       SDVTList VTList = DAG.getVTList(HiLoVT, SetCCType);
       Sum = DAG.getNode(ISD::UADDO, dl, VTList, LL, LH);
       Sum = DAG.getNode(ISD::UADDO_CARRY, dl, VTList, Sum,
@@ -8108,7 +8081,7 @@ bool TargetLowering::expandDIVREMByConstant(SDNode *N,
       SDValue Carry = DAG.getSetCC(dl, SetCCType, Sum, LL, ISD::SETULT);
       // If the boolean for the target is 0 or 1, we can add the setcc result
       // directly.
-      if (getBooleanContents(HiLoVT) ==
+      if (TLI.getBooleanContents(HiLoVT) ==
           TargetLoweringBase::ZeroOrOneBooleanContent)
         Carry = DAG.getZExtOrTrunc(Carry, dl, HiLoVT);
       else
@@ -8164,6 +8137,159 @@ bool TargetLowering::expandDIVREMByConstant(SDNode *N,
   return true;
 }
 
+static bool
+expandUDIVREMByConstantViaUMulHiMagic(SDNode *N, const APInt &Divisor,
+                                      SmallVectorImpl<SDValue> &Result,
+                                      EVT HiLoVT, SelectionDAG &DAG, SDValue LL,
+                                      SDValue LH, const TargetLowering &TLI) {
+
+  SDValue N0 = N->getOperand(0);
+  EVT VT = N0->getValueType(0);
+  SDLoc DL{N};
+
+  assert(!Divisor.isOne() && "Magic algorithm does not work for division by 1");
+
+  // This helper creates a MUL_LOHI of the pair (LL, LH) by a constant.
+  auto MakeMUL_LOHIByConst = [&](unsigned Opc, SDValue LL, SDValue LH,
+                                 const APInt &Const,
+                                 SmallVectorImpl<SDValue> &Result) {
+    SDValue LHS = DAG.getNode(ISD::BUILD_PAIR, DL, VT, LL, LH);
+    SDValue RHS = DAG.getConstant(Const, DL, VT);
+    auto [RL, RH] = DAG.SplitScalar(RHS, DL, HiLoVT, HiLoVT);
+    return TLI.expandMUL_LOHI(
+        Opc, VT, DL, LHS, RHS, Result, HiLoVT, DAG,
+        TargetLowering::MulExpansionKind::OnlyLegalOrCustom, LL, LH, RL, RH);
+  };
+
+  // This helper creates an ADD/SUB of the pairs (LL, LH) and (RL, RH).
+  auto MakeAddSubLong = [&](unsigned Opc, SDValue LL, SDValue LH, SDValue RL,
+                            SDValue RH) {
+    SDValue AddSubNode =
+        DAG.getNode(Opc == ISD::ADD ? ISD::UADDO : ISD::USUBO, DL,
+                    DAG.getVTList(HiLoVT, MVT::i1), LL, RL);
+    SDValue OutL, OutH, Overflow;
+    TLI.expandUADDSUBO(AddSubNode.getNode(), OutL, Overflow, DAG);
+    SDValue WithOverflow = DAG.getNode(
+        Opc, DL, HiLoVT, LH, DAG.getZExtOrTrunc(Overflow, DL, HiLoVT));
+    OutH = DAG.getNode(Opc, DL, HiLoVT, WithOverflow, RH);
+    return std::make_pair(OutL, OutH);
+  };
+
+  // This helper creates a SRL of the pair (LL, LH) by Shift.
+  auto MakeSRLLong = [&](SDValue LL, SDValue LH, unsigned Shift) {
+    unsigned HBitWidth = HiLoVT.getScalarSizeInBits();
+    if (Shift < HBitWidth) {
+      SDValue ShAmt = DAG.getConstant(Shift, DL, HiLoVT);
+      SDValue ResL = DAG.getNode(ISD::FSHR, DL, HiLoVT, LH, LL, ShAmt);
+      SDValue ResH = DAG.getNode(ISD::SRL, DL, HiLoVT, LH, ShAmt);
+      return std::make_pair(ResL, ResH);
+    }
+    SDValue Zero = DAG.getConstant(0, DL, HiLoVT);
+    if (Shift == HBitWidth)
+      return std::make_pair(LH, Zero);
+    assert(Shift - HBitWidth < HBitWidth &&
+           "We shouldn't generate an undefined shift");
+    SDValue ShAmt = DAG.getConstant(Shift - HBitWidth, DL, HiLoVT);
+    return std::make_pair(DAG.getNode(ISD::SRL, DL, HiLoVT, LH, ShAmt), Zero);
+  };
+
+  // Knowledge of leading zeros may help to reduce the multiplier.
+  unsigned KnownLeadingZeros = DAG.computeKnownBits(N0).countMinLeadingZeros();
+
+  UnsignedDivisionByConstantInfo Magics = UnsignedDivisionByConstantInfo::get(
+      Divisor, std::min(KnownLeadingZeros, Divisor.countl_zero()));
+
+  assert(!LL == !LH && "Expected both input halves or no input halves!");
+  if (!LL)
+    std::tie(LL, LH) = DAG.SplitScalar(N0, DL, HiLoVT, HiLoVT);
+  SDValue QL = LL;
+  SDValue QH = LH;
+  if (Magics.PreShift != 0)
+    std::tie(QL, QH) = MakeSRLLong(QL, QH, Magics.PreShift);
+
+  SmallVector<SDValue, 2> UMulResult;
+  if (!MakeMUL_LOHIByConst(ISD::UMUL_LOHI, QL, QH, Magics.Magic, UMulResult))
+    return false;
+
+  QL = UMulResult[2];
+  QH = UMulResult[3];
+
+  if (Magics.IsAdd) {
+    auto [NPQL, NPQH] = MakeAddSubLong(ISD::SUB, LL, LH, QL, QH);
+    std::tie(NPQL, NPQH) = MakeSRLLong(NPQL, NPQH, 1);
+    std::tie(QL, QH) = MakeAddSubLong(ISD::ADD, NPQL, NPQH, QL, QH);
+  }
+
+  if (Magics.PostShift != 0)
+    std::tie(QL, QH) = MakeSRLLong(QL, QH, Magics.PostShift);
+
+  unsigned Opcode = N->getOpcode();
+  if (Opcode != ISD::UREM) {
+    Result.push_back(QL);
+    Result.push_back(QH);
+  }
+
+  if (Opcode != ISD::UDIV) {
+    SmallVector<SDValue, 2> MulResult;
+    if (!MakeMUL_LOHIByConst(ISD::MUL, QL, QH, Divisor, MulResult))
+      return false;
+
+    assert(MulResult.size() == 2);
+
+    auto [RemL, RemH] =
+        MakeAddSubLong(ISD::SUB, LL, LH, MulResult[0], MulResult[1]);
+
+    Result.push_back(RemL);
+    Result.push_back(RemH);
+  }
+
+  return true;
+}
+
+bool TargetLowering::expandDIVREMByConstant(SDNode *N,
+                                            SmallVectorImpl<SDValue> &Result,
+                                            EVT HiLoVT, SelectionDAG &DAG,
+                                            SDValue LL, SDValue LH) const {
+  unsigned Opcode = N->getOpcode();
+
+  // TODO: Support signed division/remainder.
+  if (Opcode == ISD::SREM || Opcode == ISD::SDIV || Opcode == ISD::SDIVREM)
+    return false;
+  assert(
+      (Opcode == ISD::UREM || Opcode == ISD::UDIV || Opcode == ISD::UDIVREM) &&
+      "Unexpected opcode");
+
+  auto *CN = dyn_cast<ConstantSDNode>(N->getOperand(1));
+  if (!CN)
+    return false;
+
+  APInt Divisor = CN->getAPIntValue();
+
+  // We depend on the UREM by constant optimization in DAGCombiner that requires
+  // high multiply.
+  if (!isOperationLegalOrCustom(ISD::MULHU, HiLoVT) &&
+      !isOperationLegalOrCustom(ISD::UMUL_LOHI, HiLoVT))
+    return false;
+
+  // Don't expand if optimizing for size.
+  if (DAG.shouldOptForSize())
+    return false;
+
+  // Early out for 0 or 1 divisors.
+  if (Divisor.ule(1))
+    return false;
+
+  if (expandUDIVREMByConstantViaUREMDecomposition(N, Divisor, Result, HiLoVT,
+                                                  DAG, LL, LH, *this))
+    return true;
+
+  if (expandUDIVREMByConstantViaUMulHiMagic(N, Divisor, Result, HiLoVT, DAG, LL,
+                                            LH, *this))
+    return true;
+
+  return false;
+}
+
 // Check that (every element of) Z is undef or not an exact multiple of BW.
 static bool isNonZeroModBitWidthOrUndef(SDValue Z, unsigned BW) {
   return ISD::matchUnaryPredicate(
diff --git a/llvm/test/CodeGen/AArch64/rem-by-const.ll b/llvm/test/CodeGen/AArch64/rem-by-const.ll
index c57383ad9b1e7..0554b2e66a0be 100644
--- a/llvm/test/CodeGen/AArch64/rem-by-const.ll
+++ b/llvm/test/CodeGen/AArch64/rem-by-const.ll
@@ -513,13 +513,50 @@ entry:
 define i128 @ui128_7(i128 %a, i128 %b) {
 ; CHECK-SD-LABEL: ui128_7:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    str x30, [sp, #-16]! // 8-byte Folded Spill
-; CHECK-SD-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-SD-NEXT:    .cfi_offset w30, -16
-; CHECK-SD-NEXT:    mov w2, #7 // =0x7
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
+; CHECK-SD-NEXT:    mov x8, #9362 // =0x2492
+; CHECK-SD-NEXT:    mov x11, #18725 // =0x4925
+; CHECK-SD-NEXT:    movk x8, #37449, lsl #16
+; CHECK-SD-NEXT:    movk x11, #9362, lsl #16
+; CHECK-SD-NEXT:    movk x8, #18724, lsl #32
+; CHECK-SD-NEXT:    movk x11, #37449, lsl #32
+; CHECK-SD-NEXT:    movk x8, #9362, lsl #48
+; CHECK-SD-NEXT:    movk x11, #18724, lsl #48
+; CHECK-SD-NEXT:    mul x10, x0, x8
+; CHECK-SD-NEXT:    umulh x12, x0, x11
+; CHECK-SD-NEXT:    umulh x9, x0, x8
+; CHECK-SD-NEXT:    umulh x14, x1, x11
+; CHECK-SD-NEXT:    adds x10, x12, x10
+; CHECK-SD-NEXT:    mul x11, x1, x11
+; CHECK-SD-NEXT:    cinc x9, x9, hs
+; CHECK-SD-NEXT:    umulh x13, x1, x8
+; CHECK-SD-NEXT:    mul x8, x1, x8
+; CHECK-SD-NEXT:    cmn x10, x11
+; CHECK-SD-NEXT:    adcs x9, x9, x14
+; CHECK-SD-NEXT:    cinc x10, x13, hs
+; CHECK-SD-NEXT:    adds x11, x9, x8
+; CHECK-SD-NEXT:    cinc x12, x10, hs
+; CHECK-SD-NEXT:    subs x13, x0, x11
+; CHECK-SD-NEXT:    cset w14, lo
+; CHECK-SD-NEXT:    sub x14, x1, x14
+; CHECK-SD-NEXT:    sub x12, x14, x12
+; CHECK-SD-NEXT:    extr x13, x12, x13, #1
+; CHECK-SD-NEXT:    lsr x12, x12, #1
+; CHECK-SD-NEXT:    adds x11, x13, x11
+; CHECK-SD-NEXT:    cinc x12, x12, hs
+; CHECK-SD-NEXT:    cmn x9, x8
+; CHECK-SD-NEXT:    adc x8, x12, x10
+; CHECK-SD-NEXT:    mov w10, #7 // =0x7
+; CHECK-SD-NEXT:    extr x9, x8, x11, #2
+; CHECK-SD-NEXT:    lsr x8, x8, #2
+; CHECK-SD-NEXT:    umulh x10, x9, x10
+; CHECK-SD-NEXT:    lsl x11, x9, #3
+; CHECK-SD-NEXT:    sub x9, x11, x9
+; CHECK-SD-NEXT:    subs x0, x0, x9
+; CHECK-SD-NEXT:    cset w9, lo
+; CHECK-SD-NEXT:    sub x10, x10, x8
+; CHECK-SD-NEXT:    sub x9, x1, x9
+; CHECK-SD-NEXT:    add x8, x10, x8, lsl #3
+; CHECK-SD-NEXT:    sub x1, x9, x8
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: ui128_7:
@@ -596,13 +633,38 @@ entry:
 define i128 @ui128_100(i128 %a, i128 %b) {
 ; CHECK-SD-LABEL: ui128_100:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    str x30, [sp, #-16]! // 8-byte Folded Spill
-; CHECK-SD-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-SD-NEXT:    .cfi_offset w30, -16
-; CHECK-SD-NEXT:    mov w2, #100 // =0x64
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
+; CHECK-SD-NEXT:    mov x8, #62914 // =0xf5c2
+; CHECK-SD-NEXT:    mov x11, #23593 // =0x5c29
+; CHECK-SD-NEXT:    movk x8, #23592, lsl #16
+; CHECK-SD-NEXT:    movk x11, #49807, lsl #16
+; CHECK-SD-NEXT:    movk x8, #49807, lsl #32
+; CHECK-SD-NEXT:    movk x11, #10485, lsl #32
+; CHECK-SD-NEXT:    movk x8, #10485, lsl #48
+; CHECK-SD-NEXT:    movk x11, #36700, lsl #48
+; CHECK-SD-NEXT:    mul x10, x0, x8
+; CHECK-SD-NEXT:    umulh x12, x0, x11
+; CHECK-SD-NEXT:    umulh x9, x0, x8
+; CHECK-SD-NEXT:    umulh x14, x1, x11
+; CHECK-SD-NEXT:    adds x10, x12, x10
+; CHECK-SD-NEXT:    mul x11, x1, x11
+; CHECK-SD-NEXT:    cinc x9, x9, hs
+; CHECK-SD-NEXT:    umulh x13, x1, x8
+; CHECK-SD-NEXT:    mul x8, x1, x8
+; CHECK-SD-NEXT:    cmn x10, x11
+; CHECK-SD-NEXT:    adcs x9, x9, x14
+; CHECK-SD-NEXT:    cinc x10, x13, hs
+; CHECK-SD-NEXT:    adds x8, x9, x8
+; CHECK-SD-NEXT:    cinc x9, x10, hs
+; CHECK-SD-NEXT:    mov w10, #100 // =0x64
+; CHECK-SD-NEXT:    extr x8, x9, x8, #4
+; CHECK-SD-NEXT:    lsr x9, x9, #4
+; CHECK-SD-NEXT:    umulh x11, x8, x10
+; CHECK-SD-NEXT:    mul x8, x8, x10
+; CHECK-SD-NEXT:    madd x9, x9, x10, x11
+; CHECK-SD-NEXT:    subs x0, x0, x8
+; CHECK-SD-NEXT:    cset w8, lo
+; CHECK-SD-NEXT:    sub x8, x1, x8
+; CHECK-SD-NEXT:    sub x1, x8, x9
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: ui128_100:
@@ -3204,34 +3266,85 @@ entry:
 define <2 x i128> @uv2i128_7(<2 x i128> %d, <2 x i128> %e) {
 ; CHECK-SD-LABEL: uv2i128_7:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    str x30, [sp, #-48]! // 8-byte Folded Spill
-; CHECK-SD-NEXT:    stp x22, x21, [sp, #16] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    stp x20, x19, [sp, #32] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    .cfi_def_cfa_offset 48
-; CHECK-SD-NEXT:    .cfi_offset w19, -8
-; CHECK-SD-NEXT:    .cfi_offset w20, -16
-; CHECK-SD-NEXT:    .cfi_offset w21, -24
-; CHECK-SD-NEXT:    .cfi_offset w22, -32
-; CHECK-SD-NEXT:    .cfi_offset w30, -48
-; CHECK-SD-NEXT:    mov x19, x3
-; CHECK-SD-NEXT:    mov x20, x2
-; CHECK-SD-NEXT:    mov w2, #7 // =0x7
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x21, x0
-; CHECK-SD-NEXT:    mov x22, x1
-; CHECK-SD-NEXT:    mov x0, x20
-; CHECK-SD-NEXT:    mov x1, x19
-; CHECK-SD-NEXT:    mov w2, #7 // =0x7
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x2, x0
-; CHECK-SD-NEXT:    mov x3, x1
-; CHECK-SD-NEXT:    mov x0, x21
-; CHECK-SD-NEXT:    mov x1, x22
-; CHECK-SD-NEXT:    ldp x20, x19, [sp, #32] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldp x22, x21, [sp, #16] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldr x30, [sp], #48 // 8-byte Folded Reload
+; CHECK-SD-NEXT:    mov x8, #9362 // =0x2492
+; CHECK-SD-NEXT:    mov x11, #18725 // =0x4925
+; CHECK-SD-NEXT:    movk x8, #37449, lsl #16
+; CHECK-SD-NEXT:    movk x11, #9362, lsl #16
+; CHECK-SD-NEXT:    movk x8, #18724, lsl #32
+; CHECK-SD-NEXT:    movk x11, #37449, lsl #32
+; CHECK-SD-NEXT:    movk x8, #9362, lsl #48
+; CHECK-SD-NEXT:    movk x11, #18724, lsl #48
+; CHECK-SD-NEXT:    mul x10, x0, x8
+; CHECK-SD-NEXT:    umulh x12, x0, x11
+; CHECK-SD-NEXT:    umulh x9, x0, x8
+; CHECK-SD-NEXT:    mul x15, x1, x11
+; CHECK-SD-NEXT:    adds x10, x12, x10
+; CHECK-SD-NEXT:    umulh x14, x1, x11
+; CHECK-SD-NEXT:    cinc x9, x9, hs
+; CHECK-SD-NEXT:    umulh x13, x1, x8
+; CHECK-SD-NEXT:    cmn x10, x15
+; CHECK-SD-NEXT:    mul x16, x1, x8
+; CHECK-SD-NEXT:    adcs x9, x9, x14
+; CHECK-SD-NEXT:    mul x12, x2, x8
+; CHECK-SD-NEXT:    cinc x13, x13, hs
+; CHECK-SD-NEXT:    umulh x10, x2, x11
+; CHECK-SD-NEXT:    adds x14, x9, x16
+; CHECK-SD-NEXT:    cinc x15, x13, hs
+; CHECK-SD-NEXT:    subs x18, x0, x14
+; CHECK-SD-NEXT:    umulh x17, x2, x8
+; CHECK-SD-NEXT:    cset w5, lo
+; CHECK-SD-NEXT:    sub x5, x1, x5
+; CHECK-SD-NEXT:    umulh x6, x3, x11
+; CHECK-SD-NEXT:    sub x15, x5, x15
+; CHECK-SD-NEXT:    extr x18, x15, x18, #1
+; CHECK-SD-NEXT:    mul x11, x3, x11
+; CHECK-SD-NEXT:    lsr x15, x15, #1
+; CHECK-SD-NEXT:    umulh x4, x3, x8
+; CHECK-SD-NEXT:    adds x14, x18, x14
+; CHECK-SD-NEXT:    cinc x15, x15, hs
+; CHECK-SD-NEXT:    cmn x9, x16
+; CHECK-SD-NEXT:    mul x8, x3, x8
+; CHECK-SD-NEXT:    adc x9, x15, x13
+; CHECK-SD-NEXT:    adds x10, x10, x12
+; CHECK-SD-NEXT:    cinc x12, x17, hs
+; CHECK-SD-NEXT:    cmn x10, x11
+; CHECK-SD-NEXT:    adcs x10, x12, x6
+; CHECK-SD-NEXT:    cinc x11, x4, hs
+; CHECK-SD-NEXT:    adds x12, x10, x8
+; CHECK-SD-NEXT:    cinc x13, x11, hs
+; CHECK-SD-NEXT:    subs x15, x2, x12
+; CHECK-SD-NEXT:    cset w16, lo
+; CHECK-SD-NEXT:    sub x16, x3, x16
+; CHECK-SD-NEXT:    sub x13, x16, x13
+; CHECK-SD-NEXT:    extr x15, x13, x15, #1
+; CHECK-SD-NEXT:    lsr x13, x13, #1
+; CHECK-SD-NEXT:    adds x12, x15, x12
+; CHECK-SD-NEXT:    cinc x13, x13, hs
+; CHECK-SD-NEXT:    cmn x10, x8
+; CHECK-SD-NEXT:    extr x8, x9, x14, #2
+; CHECK-SD-NEXT:    adc x10, x13, x11
+; CHECK-SD-NEXT:    mov w11, #7 // =0x7
+; CHECK-SD-NEXT:    lsr x9, x9, #2
+; CHECK-SD-NEXT:    extr x12, x10, x12, #2
+; CHECK-SD-NEXT:    umulh x13, x8, x11
+; CHECK-SD-NEXT:    lsl x14, x8, #3
+; CHECK-SD-NEXT:    lsr x10, x10, #2
+; CHECK-SD-NEXT:    umulh x11, x12, x11
+; CHECK-SD-NEXT:    lsl x15, x12, #3
+; CHECK-SD-NEXT:    sub x8, x14, x8
+; CHECK-SD-NEXT:    subs x0, x0, x8
+; CHECK-SD-NEXT:    sub x8, x15, x12
+; CHECK-SD-NEXT:    cset w12, lo
+; CHECK-SD-NEXT:    sub x13, x13, x9
+; CHECK-SD-NEXT:    subs x2, x2, x8
+; CHECK-SD-NEXT:    add x8, x13, x9, lsl #3
+; CHECK-SD-NEXT:    sub x11, x11, x10
+; CHECK-SD-NEXT:    add x9, x11, x10, lsl #3
+; CHECK-SD-NEXT:    cset w10, lo
+; CHECK-SD-NEXT:    sub x11, x1, x12
+; CHECK-SD-NEXT:    sub x10, x3, x10
+; CHECK-SD-NEXT:    sub x1, x11, x8
+; CHECK-SD-NEXT:    sub x3, x10, x9
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: uv2i128_7:
@@ -3361,34 +3474,61 @@ entry:
 define <2 x i128> @uv2i128_100(<2 x i128> %d, <2 x i128> %e) {
 ; CHECK-SD-LABEL: uv2i128_100:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    str x30, [sp, #-48]! // 8-byte Folded Spill
-; CHECK-SD-NEXT:    stp x22, x21, [sp, #16] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    stp x20, x19, [sp, #32] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    .cfi_def_cfa_offset 48
-; CHECK-SD-NEXT:    .cfi_offset w19, -8
-; CHECK-SD-NEXT:    .cfi_offset w20, -16
-; CHECK-SD-NEXT:    .cfi_offset w21, -24
-; CHECK-SD-NEXT:    .cfi_offset w22, -32
-; CHECK-SD-NEXT:    .cfi_offset w30, -48
-; CHECK-SD-NEXT:    mov x19, x3
-; CHECK-SD-NEXT:    mov x20, x2
-; CHECK-SD-NEXT:    mov w2, #100 // =0x64
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x21, x0
-; CHECK-SD-NEXT:    mov x22, x1
-; CHECK-SD-NEXT:    mov x0, x20
-; CHECK-SD-NEXT:    mov x1, x19
-; CHECK-SD-NEXT:    mov w2, #100 // =0x64
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x2, x0
-; CHECK-SD-NEXT:    mov x3, x1
-; CHECK-SD-NEXT:    mov x0, x21
-; CHECK-SD-NEXT:    mov x1, x22
-; CHECK-SD-NEXT:    ldp x20, x19, [sp, #32] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldp x22, x21, [sp, #16] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldr x30, [sp], #48 // 8-byte Folded Reload
+; CHECK-SD-NEXT:    mov x8, #62914 // =0xf5c2
+; CHECK-SD-NEXT:    mov x11, #23593 // =0x5c29
+; CHECK-SD-NEXT:    movk x8, #23592, lsl #16
+; CHECK-SD-NEXT:    movk x11, #49807, lsl #16
+; CHECK-SD-NEXT:    movk x8, #49807, lsl #32
+; CHECK-SD-NEXT:    movk x11, #10485, lsl #32
+; CHECK-SD-NEXT:    movk x8, #10485, lsl #48
+; CHECK-SD-NEXT:    movk x11, #36700, lsl #48
+; CHECK-SD-NEXT:    mul x10, x0, x8
+; CHECK-SD-NEXT:    umulh x12, x0, x11
+; CHECK-SD-NEXT:    umulh x9, x0, x8
+; CHECK-SD-NEXT:    mul x15, x1, x11
+; CHECK-SD-NEXT:    adds x10, x12, x10
+; CHECK-SD-NEXT:    mov w12, #100 // =0x64
+; CHECK-SD-NEXT:    umulh x14, x1, x11
+; CHECK-SD-NEXT:    cinc x9, x9, hs
+; CHECK-SD-NEXT:    umulh x13, x1, x8
+; CHECK-SD-NEXT:    cmn x10, x15
+; CHECK-SD-NEXT:    mul x16, x1, x8
+; CHECK-SD-...
[truncated]
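
To make the emitted sequences above easier to follow, here is a hypothetical scalar model (plain C++, not LLVM code) of the two double-width helpers the patch builds as DAG nodes, with an i128 value modeled as two 64-bit halves in the spirit of MakeSRLLong and MakeAddSubLong; names and types are illustrative assumptions:

#include <cstdint>

struct U128 { uint64_t Lo, Hi; }; // the (LL, LH) pair

// Model of MakeSRLLong: a funnel shift forms the low half and a plain
// SRL forms the high half; shifts >= 64 just move the high half down.
U128 srl128(U128 A, unsigned Shift) {
  if (Shift == 0) return A; // the patch guards this at the call sites
  if (Shift < 64)
    return {(A.Lo >> Shift) | (A.Hi << (64 - Shift)), A.Hi >> Shift};
  if (Shift == 64) return {A.Hi, 0};
  return {A.Hi >> (Shift - 64), 0};
}

// Model of MakeAddSubLong for SUB: USUBO on the low halves, then the
// borrow and the subtrahend's high half come off the high half.
U128 sub128(U128 A, U128 B) {
  uint64_t Borrow = A.Lo < B.Lo ? 1 : 0; // the USUBO overflow bit
  return {A.Lo - B.Lo, A.Hi - Borrow - B.Hi};
}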

@llvmbot
Member

llvmbot commented Aug 22, 2025

@llvm/pr-subscribers-llvm-selectiondag


@llvmbot
Member

llvmbot commented Aug 22, 2025

@llvm/pr-subscribers-backend-mips

+; CHECK-SD-NEXT:    cset w14, lo
+; CHECK-SD-NEXT:    sub x14, x1, x14
+; CHECK-SD-NEXT:    sub x12, x14, x12
+; CHECK-SD-NEXT:    extr x13, x12, x13, #1
+; CHECK-SD-NEXT:    lsr x12, x12, #1
+; CHECK-SD-NEXT:    adds x11, x13, x11
+; CHECK-SD-NEXT:    cinc x12, x12, hs
+; CHECK-SD-NEXT:    cmn x9, x8
+; CHECK-SD-NEXT:    adc x8, x12, x10
+; CHECK-SD-NEXT:    mov w10, #7 // =0x7
+; CHECK-SD-NEXT:    extr x9, x8, x11, #2
+; CHECK-SD-NEXT:    lsr x8, x8, #2
+; CHECK-SD-NEXT:    umulh x10, x9, x10
+; CHECK-SD-NEXT:    lsl x11, x9, #3
+; CHECK-SD-NEXT:    sub x9, x11, x9
+; CHECK-SD-NEXT:    subs x0, x0, x9
+; CHECK-SD-NEXT:    cset w9, lo
+; CHECK-SD-NEXT:    sub x10, x10, x8
+; CHECK-SD-NEXT:    sub x9, x1, x9
+; CHECK-SD-NEXT:    add x8, x10, x8, lsl #3
+; CHECK-SD-NEXT:    sub x1, x9, x8
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: ui128_7:
@@ -596,13 +633,38 @@ entry:
 define i128 @ui128_100(i128 %a, i128 %b) {
 ; CHECK-SD-LABEL: ui128_100:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    str x30, [sp, #-16]! // 8-byte Folded Spill
-; CHECK-SD-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-SD-NEXT:    .cfi_offset w30, -16
-; CHECK-SD-NEXT:    mov w2, #100 // =0x64
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
+; CHECK-SD-NEXT:    mov x8, #62914 // =0xf5c2
+; CHECK-SD-NEXT:    mov x11, #23593 // =0x5c29
+; CHECK-SD-NEXT:    movk x8, #23592, lsl #16
+; CHECK-SD-NEXT:    movk x11, #49807, lsl #16
+; CHECK-SD-NEXT:    movk x8, #49807, lsl #32
+; CHECK-SD-NEXT:    movk x11, #10485, lsl #32
+; CHECK-SD-NEXT:    movk x8, #10485, lsl #48
+; CHECK-SD-NEXT:    movk x11, #36700, lsl #48
+; CHECK-SD-NEXT:    mul x10, x0, x8
+; CHECK-SD-NEXT:    umulh x12, x0, x11
+; CHECK-SD-NEXT:    umulh x9, x0, x8
+; CHECK-SD-NEXT:    umulh x14, x1, x11
+; CHECK-SD-NEXT:    adds x10, x12, x10
+; CHECK-SD-NEXT:    mul x11, x1, x11
+; CHECK-SD-NEXT:    cinc x9, x9, hs
+; CHECK-SD-NEXT:    umulh x13, x1, x8
+; CHECK-SD-NEXT:    mul x8, x1, x8
+; CHECK-SD-NEXT:    cmn x10, x11
+; CHECK-SD-NEXT:    adcs x9, x9, x14
+; CHECK-SD-NEXT:    cinc x10, x13, hs
+; CHECK-SD-NEXT:    adds x8, x9, x8
+; CHECK-SD-NEXT:    cinc x9, x10, hs
+; CHECK-SD-NEXT:    mov w10, #100 // =0x64
+; CHECK-SD-NEXT:    extr x8, x9, x8, #4
+; CHECK-SD-NEXT:    lsr x9, x9, #4
+; CHECK-SD-NEXT:    umulh x11, x8, x10
+; CHECK-SD-NEXT:    mul x8, x8, x10
+; CHECK-SD-NEXT:    madd x9, x9, x10, x11
+; CHECK-SD-NEXT:    subs x0, x0, x8
+; CHECK-SD-NEXT:    cset w8, lo
+; CHECK-SD-NEXT:    sub x8, x1, x8
+; CHECK-SD-NEXT:    sub x1, x8, x9
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: ui128_100:
@@ -3204,34 +3266,85 @@ entry:
 define <2 x i128> @uv2i128_7(<2 x i128> %d, <2 x i128> %e) {
 ; CHECK-SD-LABEL: uv2i128_7:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    str x30, [sp, #-48]! // 8-byte Folded Spill
-; CHECK-SD-NEXT:    stp x22, x21, [sp, #16] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    stp x20, x19, [sp, #32] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    .cfi_def_cfa_offset 48
-; CHECK-SD-NEXT:    .cfi_offset w19, -8
-; CHECK-SD-NEXT:    .cfi_offset w20, -16
-; CHECK-SD-NEXT:    .cfi_offset w21, -24
-; CHECK-SD-NEXT:    .cfi_offset w22, -32
-; CHECK-SD-NEXT:    .cfi_offset w30, -48
-; CHECK-SD-NEXT:    mov x19, x3
-; CHECK-SD-NEXT:    mov x20, x2
-; CHECK-SD-NEXT:    mov w2, #7 // =0x7
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x21, x0
-; CHECK-SD-NEXT:    mov x22, x1
-; CHECK-SD-NEXT:    mov x0, x20
-; CHECK-SD-NEXT:    mov x1, x19
-; CHECK-SD-NEXT:    mov w2, #7 // =0x7
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x2, x0
-; CHECK-SD-NEXT:    mov x3, x1
-; CHECK-SD-NEXT:    mov x0, x21
-; CHECK-SD-NEXT:    mov x1, x22
-; CHECK-SD-NEXT:    ldp x20, x19, [sp, #32] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldp x22, x21, [sp, #16] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldr x30, [sp], #48 // 8-byte Folded Reload
+; CHECK-SD-NEXT:    mov x8, #9362 // =0x2492
+; CHECK-SD-NEXT:    mov x11, #18725 // =0x4925
+; CHECK-SD-NEXT:    movk x8, #37449, lsl #16
+; CHECK-SD-NEXT:    movk x11, #9362, lsl #16
+; CHECK-SD-NEXT:    movk x8, #18724, lsl #32
+; CHECK-SD-NEXT:    movk x11, #37449, lsl #32
+; CHECK-SD-NEXT:    movk x8, #9362, lsl #48
+; CHECK-SD-NEXT:    movk x11, #18724, lsl #48
+; CHECK-SD-NEXT:    mul x10, x0, x8
+; CHECK-SD-NEXT:    umulh x12, x0, x11
+; CHECK-SD-NEXT:    umulh x9, x0, x8
+; CHECK-SD-NEXT:    mul x15, x1, x11
+; CHECK-SD-NEXT:    adds x10, x12, x10
+; CHECK-SD-NEXT:    umulh x14, x1, x11
+; CHECK-SD-NEXT:    cinc x9, x9, hs
+; CHECK-SD-NEXT:    umulh x13, x1, x8
+; CHECK-SD-NEXT:    cmn x10, x15
+; CHECK-SD-NEXT:    mul x16, x1, x8
+; CHECK-SD-NEXT:    adcs x9, x9, x14
+; CHECK-SD-NEXT:    mul x12, x2, x8
+; CHECK-SD-NEXT:    cinc x13, x13, hs
+; CHECK-SD-NEXT:    umulh x10, x2, x11
+; CHECK-SD-NEXT:    adds x14, x9, x16
+; CHECK-SD-NEXT:    cinc x15, x13, hs
+; CHECK-SD-NEXT:    subs x18, x0, x14
+; CHECK-SD-NEXT:    umulh x17, x2, x8
+; CHECK-SD-NEXT:    cset w5, lo
+; CHECK-SD-NEXT:    sub x5, x1, x5
+; CHECK-SD-NEXT:    umulh x6, x3, x11
+; CHECK-SD-NEXT:    sub x15, x5, x15
+; CHECK-SD-NEXT:    extr x18, x15, x18, #1
+; CHECK-SD-NEXT:    mul x11, x3, x11
+; CHECK-SD-NEXT:    lsr x15, x15, #1
+; CHECK-SD-NEXT:    umulh x4, x3, x8
+; CHECK-SD-NEXT:    adds x14, x18, x14
+; CHECK-SD-NEXT:    cinc x15, x15, hs
+; CHECK-SD-NEXT:    cmn x9, x16
+; CHECK-SD-NEXT:    mul x8, x3, x8
+; CHECK-SD-NEXT:    adc x9, x15, x13
+; CHECK-SD-NEXT:    adds x10, x10, x12
+; CHECK-SD-NEXT:    cinc x12, x17, hs
+; CHECK-SD-NEXT:    cmn x10, x11
+; CHECK-SD-NEXT:    adcs x10, x12, x6
+; CHECK-SD-NEXT:    cinc x11, x4, hs
+; CHECK-SD-NEXT:    adds x12, x10, x8
+; CHECK-SD-NEXT:    cinc x13, x11, hs
+; CHECK-SD-NEXT:    subs x15, x2, x12
+; CHECK-SD-NEXT:    cset w16, lo
+; CHECK-SD-NEXT:    sub x16, x3, x16
+; CHECK-SD-NEXT:    sub x13, x16, x13
+; CHECK-SD-NEXT:    extr x15, x13, x15, #1
+; CHECK-SD-NEXT:    lsr x13, x13, #1
+; CHECK-SD-NEXT:    adds x12, x15, x12
+; CHECK-SD-NEXT:    cinc x13, x13, hs
+; CHECK-SD-NEXT:    cmn x10, x8
+; CHECK-SD-NEXT:    extr x8, x9, x14, #2
+; CHECK-SD-NEXT:    adc x10, x13, x11
+; CHECK-SD-NEXT:    mov w11, #7 // =0x7
+; CHECK-SD-NEXT:    lsr x9, x9, #2
+; CHECK-SD-NEXT:    extr x12, x10, x12, #2
+; CHECK-SD-NEXT:    umulh x13, x8, x11
+; CHECK-SD-NEXT:    lsl x14, x8, #3
+; CHECK-SD-NEXT:    lsr x10, x10, #2
+; CHECK-SD-NEXT:    umulh x11, x12, x11
+; CHECK-SD-NEXT:    lsl x15, x12, #3
+; CHECK-SD-NEXT:    sub x8, x14, x8
+; CHECK-SD-NEXT:    subs x0, x0, x8
+; CHECK-SD-NEXT:    sub x8, x15, x12
+; CHECK-SD-NEXT:    cset w12, lo
+; CHECK-SD-NEXT:    sub x13, x13, x9
+; CHECK-SD-NEXT:    subs x2, x2, x8
+; CHECK-SD-NEXT:    add x8, x13, x9, lsl #3
+; CHECK-SD-NEXT:    sub x11, x11, x10
+; CHECK-SD-NEXT:    add x9, x11, x10, lsl #3
+; CHECK-SD-NEXT:    cset w10, lo
+; CHECK-SD-NEXT:    sub x11, x1, x12
+; CHECK-SD-NEXT:    sub x10, x3, x10
+; CHECK-SD-NEXT:    sub x1, x11, x8
+; CHECK-SD-NEXT:    sub x3, x10, x9
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: uv2i128_7:
@@ -3361,34 +3474,61 @@ entry:
 define <2 x i128> @uv2i128_100(<2 x i128> %d, <2 x i128> %e) {
 ; CHECK-SD-LABEL: uv2i128_100:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    str x30, [sp, #-48]! // 8-byte Folded Spill
-; CHECK-SD-NEXT:    stp x22, x21, [sp, #16] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    stp x20, x19, [sp, #32] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    .cfi_def_cfa_offset 48
-; CHECK-SD-NEXT:    .cfi_offset w19, -8
-; CHECK-SD-NEXT:    .cfi_offset w20, -16
-; CHECK-SD-NEXT:    .cfi_offset w21, -24
-; CHECK-SD-NEXT:    .cfi_offset w22, -32
-; CHECK-SD-NEXT:    .cfi_offset w30, -48
-; CHECK-SD-NEXT:    mov x19, x3
-; CHECK-SD-NEXT:    mov x20, x2
-; CHECK-SD-NEXT:    mov w2, #100 // =0x64
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x21, x0
-; CHECK-SD-NEXT:    mov x22, x1
-; CHECK-SD-NEXT:    mov x0, x20
-; CHECK-SD-NEXT:    mov x1, x19
-; CHECK-SD-NEXT:    mov w2, #100 // =0x64
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x2, x0
-; CHECK-SD-NEXT:    mov x3, x1
-; CHECK-SD-NEXT:    mov x0, x21
-; CHECK-SD-NEXT:    mov x1, x22
-; CHECK-SD-NEXT:    ldp x20, x19, [sp, #32] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldp x22, x21, [sp, #16] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldr x30, [sp], #48 // 8-byte Folded Reload
+; CHECK-SD-NEXT:    mov x8, #62914 // =0xf5c2
+; CHECK-SD-NEXT:    mov x11, #23593 // =0x5c29
+; CHECK-SD-NEXT:    movk x8, #23592, lsl #16
+; CHECK-SD-NEXT:    movk x11, #49807, lsl #16
+; CHECK-SD-NEXT:    movk x8, #49807, lsl #32
+; CHECK-SD-NEXT:    movk x11, #10485, lsl #32
+; CHECK-SD-NEXT:    movk x8, #10485, lsl #48
+; CHECK-SD-NEXT:    movk x11, #36700, lsl #48
+; CHECK-SD-NEXT:    mul x10, x0, x8
+; CHECK-SD-NEXT:    umulh x12, x0, x11
+; CHECK-SD-NEXT:    umulh x9, x0, x8
+; CHECK-SD-NEXT:    mul x15, x1, x11
+; CHECK-SD-NEXT:    adds x10, x12, x10
+; CHECK-SD-NEXT:    mov w12, #100 // =0x64
+; CHECK-SD-NEXT:    umulh x14, x1, x11
+; CHECK-SD-NEXT:    cinc x9, x9, hs
+; CHECK-SD-NEXT:    umulh x13, x1, x8
+; CHECK-SD-NEXT:    cmn x10, x15
+; CHECK-SD-NEXT:    mul x16, x1, x8
+; CHECK-SD-...
[truncated]

@topperc
Collaborator

topperc commented Aug 22, 2025

There was another patch solving this problem that was not correctly linked to the issue: #146238

@mskamp
Contributor Author

mskamp commented Aug 22, 2025

There was another patch solving this problem that was not correctly linked to the issue: #146238

Oh, I completely missed that patch. Thank you for the hint.

It seems, though, that the linked patch follows another approach that may not handle some constants (e.g., 67 if I'm not mistaken). Hence, the approaches in the two patches may complement each other. But an integration of both approaches would most likely require measurements and benchmarks to determine when to select which approach to get the best results.

What is the policy on how to proceed with this kind of conflict?

@sharkautarch

sharkautarch commented Sep 5, 2025

@mskamp

But an integration of both approaches would most likely require measurements and benchmarks to determine when to select which approach to get the best results.

I ran an x86-64 benchmark for just the @umod128(i128 %x) function in llvm/test/CodeGen/X86/divmod128.ll, comparing the two PRs' latency & inverse throughput on my Alder Lake Intel CPU using llvm-exegesis.

I will refer to this PR as the "magic test" and to the other PR as the "chunk test".

Permalinks to the tested functions, for your convenience:

magic test throughput asm file: https://gist.github.com/sharkautarch/99adfce423c3952bad11b8db0a8ba150#file-128bit_urem_magic_test-s

magic test latency asm file: https://gist.github.com/sharkautarch/ad17b4045ad0cba89ae80126fb0f5f3c#file-128bit_urem_magic_test_latency-s

chunk test throughput asm file: https://gist.github.com/sharkautarch/6385ddf31e241156e6e49d901141b694#file-128bit_urem_chunk_test-s

chunk test latency asm file: https://gist.github.com/sharkautarch/783a8ffb970c7db4d2ed6d38c629e113#file-128bit_urem_chunk_test_latency-s

Disclaimer: I didn't disable hyperthreading before running the benchmarks, and the computer I'm running these tests on has both P- and E-cores. I did pin the tests to one P-core, though, which seems to significantly reduce the noise from hyperthreading and from the P/E-core split.


magic throughput test: llvm-exegesis --mode=inverse_throughput --snippets-file=128bit_urem_magic_test.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1
full output (for one run): https://gist.github.com/sharkautarch/99a3aa70b6b04ba8dbd7b95c8b0b4a6f
per_snippet_value is about 9.0-9.2 cycles

chunk throughput test: llvm-exegesis --mode=inverse_throughput --snippets-file=128bit_urem_chunk_test.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1
full output (for one run): https://gist.github.com/sharkautarch/0931a87e622bd1dc811b0b75bd1dab13
per_snippet_value is about 6.52-6.58 cycles


magic latency test: llvm-exegesis --mode=latency --snippets-file=128bit_urem_magic_test_latency.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1
full output (for one run): https://gist.github.com/sharkautarch/76a33c941d095854065aa821a7fa9090
per_snippet_value is about 26-27.7 cycles

chunk latency test: llvm-exegesis --mode=latency --snippets-file=128bit_urem_chunk_test_latency.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1
full output (for one run): https://gist.github.com/sharkautarch/4126ca66c3bc1414fbeec3eff1b28548
per_snippet_value is about 19.60-19.71 cycles


This only compares codegen from one test function, but I don't have time to benchmark the other functions.

@mskamp
Contributor Author

mskamp commented Sep 7, 2025

@sharkautarch Thank you very much for your time and work. These are some valuable and helpful numbers.

I've started with a comparison of the approaches based on the output of llvm-mca. The rationale is that this way I can get a rough picture of the relative performance of both approaches for a wider array of hardware than is currently available to me. I haven't finished this analysis, and the results are only preliminary, but it appears that the chunk-based approach is better for urem, whereas the magic algorithm yields better code for udiv in most cases (i.e., when Magics.IsAdd is false).

Based on your work, I've run some benchmarks for all cases listed in issue #137514 on my Icelake Intel CPU. As input for the experiments, I've used the following files:

I've turned all the usual performance features of the CPU off (e.g., turbo boost, hyperthreading) and performed 1000 runs for each algorithm and constant. The results are summarized here:

The results indicate that using the chunk-based approach for urem, and the magic algorithm for udiv as well as for the cases that the chunk-based approach does not handle (3 of the 11 constants in the test), is probably not a bad heuristic. There are some cases, though, where the magic algorithm achieves slightly higher throughput than the chunk-based approach even for urem (e.g., for the constant 19). Still, this heuristic is probably a good starting point for deciding which approach to use.

I plan to perform similar tests on AArch64 next since the initial investigation with llvm-mca suggests that the results of both approaches might be closer to each other and less clear-cut there.
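For reference, the IsAdd distinction can be inspected directly with the UnsignedDivisionByConstantInfo helper that this patch reuses. The following standalone probe is only an illustration (it is not part of the patch and assumes linking against LLVMSupport); it prints the magic parameters for two divisors from this thread. For 7 the add path is taken (matching the longer ui128_7 expansion in the AArch64 test diff above), while 100 avoids it.

#include "llvm/ADT/APInt.h"
#include "llvm/ADT/SmallString.h"
#include "llvm/Support/DivisionByConstantInfo.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  // Divisors discussed in this PR: 7 needs the add fixup, 100 does not.
  for (unsigned D : {7u, 100u}) {
    llvm::UnsignedDivisionByConstantInfo Magics =
        llvm::UnsignedDivisionByConstantInfo::get(llvm::APInt(128, D));
    llvm::SmallString<40> MagicHex;
    Magics.Magic.toString(MagicHex, 16, /*Signed=*/false);
    llvm::outs() << "d=" << D << " Magic=0x" << MagicHex
                 << " IsAdd=" << (Magics.IsAdd ? "true" : "false")
                 << " PreShift=" << Magics.PreShift
                 << " PostShift=" << Magics.PostShift << "\n";
  }
  return 0;
}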

@mskamp mskamp force-pushed the feature_enhance_div_rem_double_bit_size branch from 5bdd489 to 8f3b0a3 Compare September 13, 2025 17:38
Comment on lines 8141 to 8144
Contributor


Move TLI to first argument (or better yet, just make this a TargetLowering member)

Contributor Author


I've made this a member of TargetLowering.

@mskamp mskamp force-pushed the feature_enhance_div_rem_double_bit_size branch from 8f3b0a3 to fc79e27 Compare September 16, 2025 13:44
@mskamp
Contributor Author

mskamp commented Sep 26, 2025

I've finally done some benchmarks for AArch64 on a Cortex A76. Here are my results:

Latency
Throughput

The numbers for the throughput look a bit strange to me. I'm not sure I've done everything correctly, though, so take it with a grain of salt.

With respect to latency, the chunk-based algorithm yields better results in cases where the magic algorithm was superior in the earlier Icelake benchmark. I think this is because the Cortex A76 does not support an operation that yields both the upper and lower word of an unsigned multiplication (such as MUL(X) on Icelake).

If we wanted to optimize for latency, we could use the following check list to determine the best algorithm based on the available measurements (a rough code sketch follows the list):

  • If the chunk-based algorithm does not support the given divisor: Use the magic algorithm.
  • If the operation to compute is UREM: Use the chunk-based algorithm.
  • If IsAdd = false in the result of UnsignedDivisionByConstantInfo::get() for the given divisor: Use the magic algorithm.
  • If the target does not have a legal UMUL_LOHI instruction for the required bit width: Use the chunk-based algorithm.
  • Otherwise: Use the magic algorithm.
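
To make the check list concrete, here is a minimal sketch (not part of either patch; ChunkSupportsDivisor and HasLegalWideUMulLoHi are hypothetical stand-ins for target queries an integrated heuristic would need):

#include "llvm/ADT/APInt.h"
#include "llvm/Support/DivisionByConstantInfo.h"

enum class WideDivAlgo { Chunk, Magic };

static WideDivAlgo chooseWideDivAlgo(bool IsUREM, const llvm::APInt &Divisor,
                                     bool ChunkSupportsDivisor,
                                     bool HasLegalWideUMulLoHi) {
  // 1. The chunk-based expansion handles only some divisors.
  if (!ChunkSupportsDivisor)
    return WideDivAlgo::Magic;
  // 2. For UREM, the chunk-based expansion measured better overall.
  if (IsUREM)
    return WideDivAlgo::Chunk;
  // 3. Without the add fixup, the magic expansion is short and fast.
  if (!llvm::UnsignedDivisionByConstantInfo::get(Divisor).IsAdd)
    return WideDivAlgo::Magic;
  // 4./5. The add fixup pays off only with a cheap wide UMUL_LOHI.
  return HasLegalWideUMulLoHi ? WideDivAlgo::Magic : WideDivAlgo::Chunk;
}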

Of course, this only fits (and likely overfits) the given measurements. The check previously suggested (essentially the first two and the last line in the check list) would probably still yield good results in most cases (in the benchmark, it would only yield suboptimal results for 4 data points). Hence, this is probably still a good start if we wanted to avoid overfitting.

@sharkautarch

@mskamp wait, are the throughput measurements showing reciprocal throughput (cycles / instruction), or just "normal" throughput (instructions/cycle)?

@mskamp
Contributor Author

mskamp commented Sep 29, 2025

@mskamp wait, are the throughput measurements showing reciprocal throughput (cycles / instruction), or just "normal" throughput (instructions/cycle)?

@sharkautarch The throughput measurements just show the output of the llvm-exegesis mode inverse_throughput. As far as I understand, that should be reciprocal throughput.

