Conversation

@mskamp
Contributor

@mskamp mskamp commented Aug 22, 2025

For integer types twice as large as a legal type, we previously
generated a library call if no other splitting technique was
applicable.

With this change, we use an adaptation of the Magic algorithm, which is
also used for UDIV/UREM by constants on legal types. The implementation
introduced here is a straightforward port of the existing implementation
to types twice the size of a legal type. The core idea of the algorithm
is to replace (udiv x c) for a constant c with the bits at or above the
s-th bit of the product of x and (2^s + o)/c, for suitable s and o. More
details are available in Henry S. Warren, Jr.:
"Hacker's Delight", chapter 10.

Efficient handling of UDIV/UREM by constants on types twice as large
as a legal type is mostly relevant on 32-bit platforms, but some
projects also benefit on 64-bit platforms. For example, the fmt
library for C++ uses 128-bit unsigned divisions by 100 and 10000, which
were not covered by the previously existing optimizations.

Closes #137514.

mskamp added 2 commits August 22, 2025 04:45
Add new test prefixes to some tests. These prefixes are currently
unused, but a subsequent commit will change the test results such that
they become necessary.

Furthermore, rename tests that will be folded after a subsequent commit.
@llvmbot
Member

llvmbot commented Aug 22, 2025

@llvm/pr-subscribers-backend-arm

@llvm/pr-subscribers-backend-x86

Author: Marius Kamp (mskamp)


Patch is 138.16 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/154968.diff

15 Files Affected:

  • (modified) llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp (+159-33)
  • (modified) llvm/test/CodeGen/AArch64/rem-by-const.ll (+210-70)
  • (modified) llvm/test/CodeGen/ARM/funnel-shift.ll (+130-110)
  • (modified) llvm/test/CodeGen/Mips/funnel-shift.ll (+188-185)
  • (modified) llvm/test/CodeGen/PowerPC/funnel-shift.ll (+134-178)
  • (modified) llvm/test/CodeGen/PowerPC/urem-lkk.ll (+82-18)
  • (modified) llvm/test/CodeGen/RISCV/div-by-constant.ll (+42-7)
  • (modified) llvm/test/CodeGen/RISCV/split-udiv-by-constant.ll (+144-28)
  • (modified) llvm/test/CodeGen/RISCV/split-urem-by-constant.ll (+192-28)
  • (modified) llvm/test/CodeGen/RISCV/urem-lkk.ll (+49-13)
  • (modified) llvm/test/CodeGen/RISCV/urem-vector-lkk.ll (+158-46)
  • (modified) llvm/test/CodeGen/X86/divide-by-constant.ll (+458-32)
  • (modified) llvm/test/CodeGen/X86/divmod128.ll (+625-15)
  • (modified) llvm/test/CodeGen/X86/funnel-shift.ll (+73-39)
  • (modified) llvm/test/CodeGen/X86/i128-udiv.ll (+37-9)
diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index 402a012e8e555..9e1a51e291ddb 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -8011,25 +8011,12 @@ bool TargetLowering::expandMUL(SDNode *N, SDValue &Lo, SDValue &Hi, EVT HiLoVT,
 // dividend and multiply by the multiplicative inverse of the shifted divisor.
 // If we want the remainder, we shift the value left by the number of trailing
 // zeros and add the bits that were shifted out of the dividend.
-bool TargetLowering::expandDIVREMByConstant(SDNode *N,
-                                            SmallVectorImpl<SDValue> &Result,
-                                            EVT HiLoVT, SelectionDAG &DAG,
-                                            SDValue LL, SDValue LH) const {
+static bool expandUDIVREMByConstantViaUREMDecomposition(
+    SDNode *N, APInt Divisor, SmallVectorImpl<SDValue> &Result, EVT HiLoVT,
+    SelectionDAG &DAG, SDValue LL, SDValue LH, const TargetLowering &TLI) {
   unsigned Opcode = N->getOpcode();
   EVT VT = N->getValueType(0);
 
-  // TODO: Support signed division/remainder.
-  if (Opcode == ISD::SREM || Opcode == ISD::SDIV || Opcode == ISD::SDIVREM)
-    return false;
-  assert(
-      (Opcode == ISD::UREM || Opcode == ISD::UDIV || Opcode == ISD::UDIVREM) &&
-      "Unexpected opcode");
-
-  auto *CN = dyn_cast<ConstantSDNode>(N->getOperand(1));
-  if (!CN)
-    return false;
-
-  APInt Divisor = CN->getAPIntValue();
   unsigned BitWidth = Divisor.getBitWidth();
   unsigned HBitWidth = BitWidth / 2;
   assert(VT.getScalarSizeInBits() == BitWidth &&
@@ -8040,20 +8027,6 @@ bool TargetLowering::expandDIVREMByConstant(SDNode *N,
   if (Divisor.uge(HalfMaxPlus1))
     return false;
 
-  // We depend on the UREM by constant optimization in DAGCombiner that requires
-  // high multiply.
-  if (!isOperationLegalOrCustom(ISD::MULHU, HiLoVT) &&
-      !isOperationLegalOrCustom(ISD::UMUL_LOHI, HiLoVT))
-    return false;
-
-  // Don't expand if optimizing for size.
-  if (DAG.shouldOptForSize())
-    return false;
-
-  // Early out for 0 or 1 divisors.
-  if (Divisor.ule(1))
-    return false;
-
   // If the divisor is even, shift it until it becomes odd.
   unsigned TrailingZeros = 0;
   if (!Divisor[0]) {
@@ -8097,8 +8070,8 @@ bool TargetLowering::expandDIVREMByConstant(SDNode *N,
 
     // Use uaddo_carry if we can, otherwise use a compare to detect overflow.
     EVT SetCCType =
-        getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), HiLoVT);
-    if (isOperationLegalOrCustom(ISD::UADDO_CARRY, HiLoVT)) {
+        TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), HiLoVT);
+    if (TLI.isOperationLegalOrCustom(ISD::UADDO_CARRY, HiLoVT)) {
       SDVTList VTList = DAG.getVTList(HiLoVT, SetCCType);
       Sum = DAG.getNode(ISD::UADDO, dl, VTList, LL, LH);
       Sum = DAG.getNode(ISD::UADDO_CARRY, dl, VTList, Sum,
@@ -8108,7 +8081,7 @@ bool TargetLowering::expandDIVREMByConstant(SDNode *N,
       SDValue Carry = DAG.getSetCC(dl, SetCCType, Sum, LL, ISD::SETULT);
       // If the boolean for the target is 0 or 1, we can add the setcc result
       // directly.
-      if (getBooleanContents(HiLoVT) ==
+      if (TLI.getBooleanContents(HiLoVT) ==
           TargetLoweringBase::ZeroOrOneBooleanContent)
         Carry = DAG.getZExtOrTrunc(Carry, dl, HiLoVT);
       else
@@ -8164,6 +8137,159 @@ bool TargetLowering::expandDIVREMByConstant(SDNode *N,
   return true;
 }
 
+static bool
+expandUDIVREMByConstantViaUMulHiMagic(SDNode *N, const APInt &Divisor,
+                                      SmallVectorImpl<SDValue> &Result,
+                                      EVT HiLoVT, SelectionDAG &DAG, SDValue LL,
+                                      SDValue LH, const TargetLowering &TLI) {
+
+  SDValue N0 = N->getOperand(0);
+  EVT VT = N0->getValueType(0);
+  SDLoc DL{N};
+
+  assert(!Divisor.isOne() && "Magic algorithm does not work for division by 1");
+
+  // This helper creates a MUL_LOHI of the pair (LL, LH) by a constant.
+  auto MakeMUL_LOHIByConst = [&](unsigned Opc, SDValue LL, SDValue LH,
+                                 const APInt &Const,
+                                 SmallVectorImpl<SDValue> &Result) {
+    SDValue LHS = DAG.getNode(ISD::BUILD_PAIR, DL, VT, LL, LH);
+    SDValue RHS = DAG.getConstant(Const, DL, VT);
+    auto [RL, RH] = DAG.SplitScalar(RHS, DL, HiLoVT, HiLoVT);
+    return TLI.expandMUL_LOHI(
+        Opc, VT, DL, LHS, RHS, Result, HiLoVT, DAG,
+        TargetLowering::MulExpansionKind::OnlyLegalOrCustom, LL, LH, RL, RH);
+  };
+
+  // This helper creates an ADD/SUB of the pairs (LL, LH) and (RL, RH).
+  auto MakeAddSubLong = [&](unsigned Opc, SDValue LL, SDValue LH, SDValue RL,
+                            SDValue RH) {
+    SDValue AddSubNode =
+        DAG.getNode(Opc == ISD::ADD ? ISD::UADDO : ISD::USUBO, DL,
+                    DAG.getVTList(HiLoVT, MVT::i1), LL, RL);
+    SDValue OutL, OutH, Overflow;
+    TLI.expandUADDSUBO(AddSubNode.getNode(), OutL, Overflow, DAG);
+    SDValue WithOverflow = DAG.getNode(
+        Opc, DL, HiLoVT, LH, DAG.getZExtOrTrunc(Overflow, DL, HiLoVT));
+    OutH = DAG.getNode(Opc, DL, HiLoVT, WithOverflow, RH);
+    return std::make_pair(OutL, OutH);
+  };
+
+  // This helper creates a SRL of the pair (LL, LH) by Shift.
+  auto MakeSRLLong = [&](SDValue LL, SDValue LH, unsigned Shift) {
+    unsigned HBitWidth = HiLoVT.getScalarSizeInBits();
+    if (Shift < HBitWidth) {
+      SDValue ShAmt = DAG.getConstant(Shift, DL, HiLoVT);
+      SDValue ResL = DAG.getNode(ISD::FSHR, DL, HiLoVT, LH, LL, ShAmt);
+      SDValue ResH = DAG.getNode(ISD::SRL, DL, HiLoVT, LH, ShAmt);
+      return std::make_pair(ResL, ResH);
+    }
+    SDValue Zero = DAG.getConstant(0, DL, HiLoVT);
+    if (Shift == HBitWidth)
+      return std::make_pair(LH, Zero);
+    assert(Shift - HBitWidth < HBitWidth &&
+           "We shouldn't generate an undefined shift");
+    SDValue ShAmt = DAG.getConstant(Shift - HBitWidth, DL, HiLoVT);
+    return std::make_pair(DAG.getNode(ISD::SRL, DL, HiLoVT, LH, ShAmt), Zero);
+  };
+
+  // Knowledge of leading zeros may help to reduce the multiplier.
+  unsigned KnownLeadingZeros = DAG.computeKnownBits(N0).countMinLeadingZeros();
+
+  UnsignedDivisionByConstantInfo Magics = UnsignedDivisionByConstantInfo::get(
+      Divisor, std::min(KnownLeadingZeros, Divisor.countl_zero()));
+
+  assert(!LL == !LH && "Expected both input halves or no input halves!");
+  if (!LL)
+    std::tie(LL, LH) = DAG.SplitScalar(N0, DL, HiLoVT, HiLoVT);
+  SDValue QL = LL;
+  SDValue QH = LH;
+  if (Magics.PreShift != 0)
+    std::tie(QL, QH) = MakeSRLLong(QL, QH, Magics.PreShift);
+
+  SmallVector<SDValue, 2> UMulResult;
+  if (!MakeMUL_LOHIByConst(ISD::UMUL_LOHI, QL, QH, Magics.Magic, UMulResult))
+    return false;
+
+  QL = UMulResult[2];
+  QH = UMulResult[3];
+
+  if (Magics.IsAdd) {
+    auto [NPQL, NPQH] = MakeAddSubLong(ISD::SUB, LL, LH, QL, QH);
+    std::tie(NPQL, NPQH) = MakeSRLLong(NPQL, NPQH, 1);
+    std::tie(QL, QH) = MakeAddSubLong(ISD::ADD, NPQL, NPQH, QL, QH);
+  }
+
+  if (Magics.PostShift != 0)
+    std::tie(QL, QH) = MakeSRLLong(QL, QH, Magics.PostShift);
+
+  unsigned Opcode = N->getOpcode();
+  if (Opcode != ISD::UREM) {
+    Result.push_back(QL);
+    Result.push_back(QH);
+  }
+
+  if (Opcode != ISD::UDIV) {
+    SmallVector<SDValue, 2> MulResult;
+    if (!MakeMUL_LOHIByConst(ISD::MUL, QL, QH, Divisor, MulResult))
+      return false;
+
+    assert(MulResult.size() == 2);
+
+    auto [RemL, RemH] =
+        MakeAddSubLong(ISD::SUB, LL, LH, MulResult[0], MulResult[1]);
+
+    Result.push_back(RemL);
+    Result.push_back(RemH);
+  }
+
+  return true;
+}
+
+bool TargetLowering::expandDIVREMByConstant(SDNode *N,
+                                            SmallVectorImpl<SDValue> &Result,
+                                            EVT HiLoVT, SelectionDAG &DAG,
+                                            SDValue LL, SDValue LH) const {
+  unsigned Opcode = N->getOpcode();
+
+  // TODO: Support signed division/remainder.
+  if (Opcode == ISD::SREM || Opcode == ISD::SDIV || Opcode == ISD::SDIVREM)
+    return false;
+  assert(
+      (Opcode == ISD::UREM || Opcode == ISD::UDIV || Opcode == ISD::UDIVREM) &&
+      "Unexpected opcode");
+
+  auto *CN = dyn_cast<ConstantSDNode>(N->getOperand(1));
+  if (!CN)
+    return false;
+
+  APInt Divisor = CN->getAPIntValue();
+
+  // We depend on the UREM by constant optimization in DAGCombiner that requires
+  // high multiply.
+  if (!isOperationLegalOrCustom(ISD::MULHU, HiLoVT) &&
+      !isOperationLegalOrCustom(ISD::UMUL_LOHI, HiLoVT))
+    return false;
+
+  // Don't expand if optimizing for size.
+  if (DAG.shouldOptForSize())
+    return false;
+
+  // Early out for 0 or 1 divisors.
+  if (Divisor.ule(1))
+    return false;
+
+  if (expandUDIVREMByConstantViaUREMDecomposition(N, Divisor, Result, HiLoVT,
+                                                  DAG, LL, LH, *this))
+    return true;
+
+  if (expandUDIVREMByConstantViaUMulHiMagic(N, Divisor, Result, HiLoVT, DAG, LL,
+                                            LH, *this))
+    return true;
+
+  return false;
+}
+
 // Check that (every element of) Z is undef or not an exact multiple of BW.
 static bool isNonZeroModBitWidthOrUndef(SDValue Z, unsigned BW) {
   return ISD::matchUnaryPredicate(
diff --git a/llvm/test/CodeGen/AArch64/rem-by-const.ll b/llvm/test/CodeGen/AArch64/rem-by-const.ll
index c57383ad9b1e7..0554b2e66a0be 100644
--- a/llvm/test/CodeGen/AArch64/rem-by-const.ll
+++ b/llvm/test/CodeGen/AArch64/rem-by-const.ll
@@ -513,13 +513,50 @@ entry:
 define i128 @ui128_7(i128 %a, i128 %b) {
 ; CHECK-SD-LABEL: ui128_7:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    str x30, [sp, #-16]! // 8-byte Folded Spill
-; CHECK-SD-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-SD-NEXT:    .cfi_offset w30, -16
-; CHECK-SD-NEXT:    mov w2, #7 // =0x7
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
+; CHECK-SD-NEXT:    mov x8, #9362 // =0x2492
+; CHECK-SD-NEXT:    mov x11, #18725 // =0x4925
+; CHECK-SD-NEXT:    movk x8, #37449, lsl #16
+; CHECK-SD-NEXT:    movk x11, #9362, lsl #16
+; CHECK-SD-NEXT:    movk x8, #18724, lsl #32
+; CHECK-SD-NEXT:    movk x11, #37449, lsl #32
+; CHECK-SD-NEXT:    movk x8, #9362, lsl #48
+; CHECK-SD-NEXT:    movk x11, #18724, lsl #48
+; CHECK-SD-NEXT:    mul x10, x0, x8
+; CHECK-SD-NEXT:    umulh x12, x0, x11
+; CHECK-SD-NEXT:    umulh x9, x0, x8
+; CHECK-SD-NEXT:    umulh x14, x1, x11
+; CHECK-SD-NEXT:    adds x10, x12, x10
+; CHECK-SD-NEXT:    mul x11, x1, x11
+; CHECK-SD-NEXT:    cinc x9, x9, hs
+; CHECK-SD-NEXT:    umulh x13, x1, x8
+; CHECK-SD-NEXT:    mul x8, x1, x8
+; CHECK-SD-NEXT:    cmn x10, x11
+; CHECK-SD-NEXT:    adcs x9, x9, x14
+; CHECK-SD-NEXT:    cinc x10, x13, hs
+; CHECK-SD-NEXT:    adds x11, x9, x8
+; CHECK-SD-NEXT:    cinc x12, x10, hs
+; CHECK-SD-NEXT:    subs x13, x0, x11
+; CHECK-SD-NEXT:    cset w14, lo
+; CHECK-SD-NEXT:    sub x14, x1, x14
+; CHECK-SD-NEXT:    sub x12, x14, x12
+; CHECK-SD-NEXT:    extr x13, x12, x13, #1
+; CHECK-SD-NEXT:    lsr x12, x12, #1
+; CHECK-SD-NEXT:    adds x11, x13, x11
+; CHECK-SD-NEXT:    cinc x12, x12, hs
+; CHECK-SD-NEXT:    cmn x9, x8
+; CHECK-SD-NEXT:    adc x8, x12, x10
+; CHECK-SD-NEXT:    mov w10, #7 // =0x7
+; CHECK-SD-NEXT:    extr x9, x8, x11, #2
+; CHECK-SD-NEXT:    lsr x8, x8, #2
+; CHECK-SD-NEXT:    umulh x10, x9, x10
+; CHECK-SD-NEXT:    lsl x11, x9, #3
+; CHECK-SD-NEXT:    sub x9, x11, x9
+; CHECK-SD-NEXT:    subs x0, x0, x9
+; CHECK-SD-NEXT:    cset w9, lo
+; CHECK-SD-NEXT:    sub x10, x10, x8
+; CHECK-SD-NEXT:    sub x9, x1, x9
+; CHECK-SD-NEXT:    add x8, x10, x8, lsl #3
+; CHECK-SD-NEXT:    sub x1, x9, x8
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: ui128_7:
@@ -596,13 +633,38 @@ entry:
 define i128 @ui128_100(i128 %a, i128 %b) {
 ; CHECK-SD-LABEL: ui128_100:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    str x30, [sp, #-16]! // 8-byte Folded Spill
-; CHECK-SD-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-SD-NEXT:    .cfi_offset w30, -16
-; CHECK-SD-NEXT:    mov w2, #100 // =0x64
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
+; CHECK-SD-NEXT:    mov x8, #62914 // =0xf5c2
+; CHECK-SD-NEXT:    mov x11, #23593 // =0x5c29
+; CHECK-SD-NEXT:    movk x8, #23592, lsl #16
+; CHECK-SD-NEXT:    movk x11, #49807, lsl #16
+; CHECK-SD-NEXT:    movk x8, #49807, lsl #32
+; CHECK-SD-NEXT:    movk x11, #10485, lsl #32
+; CHECK-SD-NEXT:    movk x8, #10485, lsl #48
+; CHECK-SD-NEXT:    movk x11, #36700, lsl #48
+; CHECK-SD-NEXT:    mul x10, x0, x8
+; CHECK-SD-NEXT:    umulh x12, x0, x11
+; CHECK-SD-NEXT:    umulh x9, x0, x8
+; CHECK-SD-NEXT:    umulh x14, x1, x11
+; CHECK-SD-NEXT:    adds x10, x12, x10
+; CHECK-SD-NEXT:    mul x11, x1, x11
+; CHECK-SD-NEXT:    cinc x9, x9, hs
+; CHECK-SD-NEXT:    umulh x13, x1, x8
+; CHECK-SD-NEXT:    mul x8, x1, x8
+; CHECK-SD-NEXT:    cmn x10, x11
+; CHECK-SD-NEXT:    adcs x9, x9, x14
+; CHECK-SD-NEXT:    cinc x10, x13, hs
+; CHECK-SD-NEXT:    adds x8, x9, x8
+; CHECK-SD-NEXT:    cinc x9, x10, hs
+; CHECK-SD-NEXT:    mov w10, #100 // =0x64
+; CHECK-SD-NEXT:    extr x8, x9, x8, #4
+; CHECK-SD-NEXT:    lsr x9, x9, #4
+; CHECK-SD-NEXT:    umulh x11, x8, x10
+; CHECK-SD-NEXT:    mul x8, x8, x10
+; CHECK-SD-NEXT:    madd x9, x9, x10, x11
+; CHECK-SD-NEXT:    subs x0, x0, x8
+; CHECK-SD-NEXT:    cset w8, lo
+; CHECK-SD-NEXT:    sub x8, x1, x8
+; CHECK-SD-NEXT:    sub x1, x8, x9
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: ui128_100:
@@ -3204,34 +3266,85 @@ entry:
 define <2 x i128> @uv2i128_7(<2 x i128> %d, <2 x i128> %e) {
 ; CHECK-SD-LABEL: uv2i128_7:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    str x30, [sp, #-48]! // 8-byte Folded Spill
-; CHECK-SD-NEXT:    stp x22, x21, [sp, #16] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    stp x20, x19, [sp, #32] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    .cfi_def_cfa_offset 48
-; CHECK-SD-NEXT:    .cfi_offset w19, -8
-; CHECK-SD-NEXT:    .cfi_offset w20, -16
-; CHECK-SD-NEXT:    .cfi_offset w21, -24
-; CHECK-SD-NEXT:    .cfi_offset w22, -32
-; CHECK-SD-NEXT:    .cfi_offset w30, -48
-; CHECK-SD-NEXT:    mov x19, x3
-; CHECK-SD-NEXT:    mov x20, x2
-; CHECK-SD-NEXT:    mov w2, #7 // =0x7
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x21, x0
-; CHECK-SD-NEXT:    mov x22, x1
-; CHECK-SD-NEXT:    mov x0, x20
-; CHECK-SD-NEXT:    mov x1, x19
-; CHECK-SD-NEXT:    mov w2, #7 // =0x7
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x2, x0
-; CHECK-SD-NEXT:    mov x3, x1
-; CHECK-SD-NEXT:    mov x0, x21
-; CHECK-SD-NEXT:    mov x1, x22
-; CHECK-SD-NEXT:    ldp x20, x19, [sp, #32] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldp x22, x21, [sp, #16] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldr x30, [sp], #48 // 8-byte Folded Reload
+; CHECK-SD-NEXT:    mov x8, #9362 // =0x2492
+; CHECK-SD-NEXT:    mov x11, #18725 // =0x4925
+; CHECK-SD-NEXT:    movk x8, #37449, lsl #16
+; CHECK-SD-NEXT:    movk x11, #9362, lsl #16
+; CHECK-SD-NEXT:    movk x8, #18724, lsl #32
+; CHECK-SD-NEXT:    movk x11, #37449, lsl #32
+; CHECK-SD-NEXT:    movk x8, #9362, lsl #48
+; CHECK-SD-NEXT:    movk x11, #18724, lsl #48
+; CHECK-SD-NEXT:    mul x10, x0, x8
+; CHECK-SD-NEXT:    umulh x12, x0, x11
+; CHECK-SD-NEXT:    umulh x9, x0, x8
+; CHECK-SD-NEXT:    mul x15, x1, x11
+; CHECK-SD-NEXT:    adds x10, x12, x10
+; CHECK-SD-NEXT:    umulh x14, x1, x11
+; CHECK-SD-NEXT:    cinc x9, x9, hs
+; CHECK-SD-NEXT:    umulh x13, x1, x8
+; CHECK-SD-NEXT:    cmn x10, x15
+; CHECK-SD-NEXT:    mul x16, x1, x8
+; CHECK-SD-NEXT:    adcs x9, x9, x14
+; CHECK-SD-NEXT:    mul x12, x2, x8
+; CHECK-SD-NEXT:    cinc x13, x13, hs
+; CHECK-SD-NEXT:    umulh x10, x2, x11
+; CHECK-SD-NEXT:    adds x14, x9, x16
+; CHECK-SD-NEXT:    cinc x15, x13, hs
+; CHECK-SD-NEXT:    subs x18, x0, x14
+; CHECK-SD-NEXT:    umulh x17, x2, x8
+; CHECK-SD-NEXT:    cset w5, lo
+; CHECK-SD-NEXT:    sub x5, x1, x5
+; CHECK-SD-NEXT:    umulh x6, x3, x11
+; CHECK-SD-NEXT:    sub x15, x5, x15
+; CHECK-SD-NEXT:    extr x18, x15, x18, #1
+; CHECK-SD-NEXT:    mul x11, x3, x11
+; CHECK-SD-NEXT:    lsr x15, x15, #1
+; CHECK-SD-NEXT:    umulh x4, x3, x8
+; CHECK-SD-NEXT:    adds x14, x18, x14
+; CHECK-SD-NEXT:    cinc x15, x15, hs
+; CHECK-SD-NEXT:    cmn x9, x16
+; CHECK-SD-NEXT:    mul x8, x3, x8
+; CHECK-SD-NEXT:    adc x9, x15, x13
+; CHECK-SD-NEXT:    adds x10, x10, x12
+; CHECK-SD-NEXT:    cinc x12, x17, hs
+; CHECK-SD-NEXT:    cmn x10, x11
+; CHECK-SD-NEXT:    adcs x10, x12, x6
+; CHECK-SD-NEXT:    cinc x11, x4, hs
+; CHECK-SD-NEXT:    adds x12, x10, x8
+; CHECK-SD-NEXT:    cinc x13, x11, hs
+; CHECK-SD-NEXT:    subs x15, x2, x12
+; CHECK-SD-NEXT:    cset w16, lo
+; CHECK-SD-NEXT:    sub x16, x3, x16
+; CHECK-SD-NEXT:    sub x13, x16, x13
+; CHECK-SD-NEXT:    extr x15, x13, x15, #1
+; CHECK-SD-NEXT:    lsr x13, x13, #1
+; CHECK-SD-NEXT:    adds x12, x15, x12
+; CHECK-SD-NEXT:    cinc x13, x13, hs
+; CHECK-SD-NEXT:    cmn x10, x8
+; CHECK-SD-NEXT:    extr x8, x9, x14, #2
+; CHECK-SD-NEXT:    adc x10, x13, x11
+; CHECK-SD-NEXT:    mov w11, #7 // =0x7
+; CHECK-SD-NEXT:    lsr x9, x9, #2
+; CHECK-SD-NEXT:    extr x12, x10, x12, #2
+; CHECK-SD-NEXT:    umulh x13, x8, x11
+; CHECK-SD-NEXT:    lsl x14, x8, #3
+; CHECK-SD-NEXT:    lsr x10, x10, #2
+; CHECK-SD-NEXT:    umulh x11, x12, x11
+; CHECK-SD-NEXT:    lsl x15, x12, #3
+; CHECK-SD-NEXT:    sub x8, x14, x8
+; CHECK-SD-NEXT:    subs x0, x0, x8
+; CHECK-SD-NEXT:    sub x8, x15, x12
+; CHECK-SD-NEXT:    cset w12, lo
+; CHECK-SD-NEXT:    sub x13, x13, x9
+; CHECK-SD-NEXT:    subs x2, x2, x8
+; CHECK-SD-NEXT:    add x8, x13, x9, lsl #3
+; CHECK-SD-NEXT:    sub x11, x11, x10
+; CHECK-SD-NEXT:    add x9, x11, x10, lsl #3
+; CHECK-SD-NEXT:    cset w10, lo
+; CHECK-SD-NEXT:    sub x11, x1, x12
+; CHECK-SD-NEXT:    sub x10, x3, x10
+; CHECK-SD-NEXT:    sub x1, x11, x8
+; CHECK-SD-NEXT:    sub x3, x10, x9
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: uv2i128_7:
@@ -3361,34 +3474,61 @@ entry:
 define <2 x i128> @uv2i128_100(<2 x i128> %d, <2 x i128> %e) {
 ; CHECK-SD-LABEL: uv2i128_100:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    str x30, [sp, #-48]! // 8-byte Folded Spill
-; CHECK-SD-NEXT:    stp x22, x21, [sp, #16] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    stp x20, x19, [sp, #32] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    .cfi_def_cfa_offset 48
-; CHECK-SD-NEXT:    .cfi_offset w19, -8
-; CHECK-SD-NEXT:    .cfi_offset w20, -16
-; CHECK-SD-NEXT:    .cfi_offset w21, -24
-; CHECK-SD-NEXT:    .cfi_offset w22, -32
-; CHECK-SD-NEXT:    .cfi_offset w30, -48
-; CHECK-SD-NEXT:    mov x19, x3
-; CHECK-SD-NEXT:    mov x20, x2
-; CHECK-SD-NEXT:    mov w2, #100 // =0x64
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x21, x0
-; CHECK-SD-NEXT:    mov x22, x1
-; CHECK-SD-NEXT:    mov x0, x20
-; CHECK-SD-NEXT:    mov x1, x19
-; CHECK-SD-NEXT:    mov w2, #100 // =0x64
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x2, x0
-; CHECK-SD-NEXT:    mov x3, x1
-; CHECK-SD-NEXT:    mov x0, x21
-; CHECK-SD-NEXT:    mov x1, x22
-; CHECK-SD-NEXT:    ldp x20, x19, [sp, #32] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldp x22, x21, [sp, #16] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldr x30, [sp], #48 // 8-byte Folded Reload
+; CHECK-SD-NEXT:    mov x8, #62914 // =0xf5c2
+; CHECK-SD-NEXT:    mov x11, #23593 // =0x5c29
+; CHECK-SD-NEXT:    movk x8, #23592, lsl #16
+; CHECK-SD-NEXT:    movk x11, #49807, lsl #16
+; CHECK-SD-NEXT:    movk x8, #49807, lsl #32
+; CHECK-SD-NEXT:    movk x11, #10485, lsl #32
+; CHECK-SD-NEXT:    movk x8, #10485, lsl #48
+; CHECK-SD-NEXT:    movk x11, #36700, lsl #48
+; CHECK-SD-NEXT:    mul x10, x0, x8
+; CHECK-SD-NEXT:    umulh x12, x0, x11
+; CHECK-SD-NEXT:    umulh x9, x0, x8
+; CHECK-SD-NEXT:    mul x15, x1, x11
+; CHECK-SD-NEXT:    adds x10, x12, x10
+; CHECK-SD-NEXT:    mov w12, #100 // =0x64
+; CHECK-SD-NEXT:    umulh x14, x1, x11
+; CHECK-SD-NEXT:    cinc x9, x9, hs
+; CHECK-SD-NEXT:    umulh x13, x1, x8
+; CHECK-SD-NEXT:    cmn x10, x15
+; CHECK-SD-NEXT:    mul x16, x1, x8
+; CHECK-SD-...
[truncated]
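
To make the emitted sequences above easier to follow, here is a hypothetical scalar model (plain C++, not LLVM code) of the two double-width helpers the patch builds as DAG nodes, with an i128 value modeled as two 64-bit halves in the spirit of MakeSRLLong and MakeAddSubLong; names and types are illustrative assumptions:

#include <cstdint>

struct U128 { uint64_t Lo, Hi; }; // the (LL, LH) pair

// Model of MakeSRLLong: a funnel shift forms the low half and a plain
// SRL forms the high half; shifts >= 64 just move the high half down.
U128 srl128(U128 A, unsigned Shift) {
  if (Shift == 0) return A; // the patch guards this at the call sites
  if (Shift < 64)
    return {(A.Lo >> Shift) | (A.Hi << (64 - Shift)), A.Hi >> Shift};
  if (Shift == 64) return {A.Hi, 0};
  return {A.Hi >> (Shift - 64), 0};
}

// Model of MakeAddSubLong for SUB: USUBO on the low halves, then the
// borrow and the subtrahend's high half come off the high half.
U128 sub128(U128 A, U128 B) {
  uint64_t Borrow = A.Lo < B.Lo ? 1 : 0; // the USUBO overflow bit
  return {A.Lo - B.Lo, A.Hi - Borrow - B.Hi};
}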

@llvmbot
Member

llvmbot commented Aug 22, 2025

@llvm/pr-subscribers-llvm-selectiondag


@llvmbot
Member

llvmbot commented Aug 22, 2025

@llvm/pr-subscribers-backend-mips

+; CHECK-SD-NEXT:    cset w14, lo
+; CHECK-SD-NEXT:    sub x14, x1, x14
+; CHECK-SD-NEXT:    sub x12, x14, x12
+; CHECK-SD-NEXT:    extr x13, x12, x13, #1
+; CHECK-SD-NEXT:    lsr x12, x12, #1
+; CHECK-SD-NEXT:    adds x11, x13, x11
+; CHECK-SD-NEXT:    cinc x12, x12, hs
+; CHECK-SD-NEXT:    cmn x9, x8
+; CHECK-SD-NEXT:    adc x8, x12, x10
+; CHECK-SD-NEXT:    mov w10, #7 // =0x7
+; CHECK-SD-NEXT:    extr x9, x8, x11, #2
+; CHECK-SD-NEXT:    lsr x8, x8, #2
+; CHECK-SD-NEXT:    umulh x10, x9, x10
+; CHECK-SD-NEXT:    lsl x11, x9, #3
+; CHECK-SD-NEXT:    sub x9, x11, x9
+; CHECK-SD-NEXT:    subs x0, x0, x9
+; CHECK-SD-NEXT:    cset w9, lo
+; CHECK-SD-NEXT:    sub x10, x10, x8
+; CHECK-SD-NEXT:    sub x9, x1, x9
+; CHECK-SD-NEXT:    add x8, x10, x8, lsl #3
+; CHECK-SD-NEXT:    sub x1, x9, x8
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: ui128_7:
@@ -596,13 +633,38 @@ entry:
 define i128 @ui128_100(i128 %a, i128 %b) {
 ; CHECK-SD-LABEL: ui128_100:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    str x30, [sp, #-16]! // 8-byte Folded Spill
-; CHECK-SD-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-SD-NEXT:    .cfi_offset w30, -16
-; CHECK-SD-NEXT:    mov w2, #100 // =0x64
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
+; CHECK-SD-NEXT:    mov x8, #62914 // =0xf5c2
+; CHECK-SD-NEXT:    mov x11, #23593 // =0x5c29
+; CHECK-SD-NEXT:    movk x8, #23592, lsl #16
+; CHECK-SD-NEXT:    movk x11, #49807, lsl #16
+; CHECK-SD-NEXT:    movk x8, #49807, lsl #32
+; CHECK-SD-NEXT:    movk x11, #10485, lsl #32
+; CHECK-SD-NEXT:    movk x8, #10485, lsl #48
+; CHECK-SD-NEXT:    movk x11, #36700, lsl #48
+; CHECK-SD-NEXT:    mul x10, x0, x8
+; CHECK-SD-NEXT:    umulh x12, x0, x11
+; CHECK-SD-NEXT:    umulh x9, x0, x8
+; CHECK-SD-NEXT:    umulh x14, x1, x11
+; CHECK-SD-NEXT:    adds x10, x12, x10
+; CHECK-SD-NEXT:    mul x11, x1, x11
+; CHECK-SD-NEXT:    cinc x9, x9, hs
+; CHECK-SD-NEXT:    umulh x13, x1, x8
+; CHECK-SD-NEXT:    mul x8, x1, x8
+; CHECK-SD-NEXT:    cmn x10, x11
+; CHECK-SD-NEXT:    adcs x9, x9, x14
+; CHECK-SD-NEXT:    cinc x10, x13, hs
+; CHECK-SD-NEXT:    adds x8, x9, x8
+; CHECK-SD-NEXT:    cinc x9, x10, hs
+; CHECK-SD-NEXT:    mov w10, #100 // =0x64
+; CHECK-SD-NEXT:    extr x8, x9, x8, #4
+; CHECK-SD-NEXT:    lsr x9, x9, #4
+; CHECK-SD-NEXT:    umulh x11, x8, x10
+; CHECK-SD-NEXT:    mul x8, x8, x10
+; CHECK-SD-NEXT:    madd x9, x9, x10, x11
+; CHECK-SD-NEXT:    subs x0, x0, x8
+; CHECK-SD-NEXT:    cset w8, lo
+; CHECK-SD-NEXT:    sub x8, x1, x8
+; CHECK-SD-NEXT:    sub x1, x8, x9
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: ui128_100:
@@ -3204,34 +3266,85 @@ entry:
 define <2 x i128> @uv2i128_7(<2 x i128> %d, <2 x i128> %e) {
 ; CHECK-SD-LABEL: uv2i128_7:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    str x30, [sp, #-48]! // 8-byte Folded Spill
-; CHECK-SD-NEXT:    stp x22, x21, [sp, #16] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    stp x20, x19, [sp, #32] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    .cfi_def_cfa_offset 48
-; CHECK-SD-NEXT:    .cfi_offset w19, -8
-; CHECK-SD-NEXT:    .cfi_offset w20, -16
-; CHECK-SD-NEXT:    .cfi_offset w21, -24
-; CHECK-SD-NEXT:    .cfi_offset w22, -32
-; CHECK-SD-NEXT:    .cfi_offset w30, -48
-; CHECK-SD-NEXT:    mov x19, x3
-; CHECK-SD-NEXT:    mov x20, x2
-; CHECK-SD-NEXT:    mov w2, #7 // =0x7
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x21, x0
-; CHECK-SD-NEXT:    mov x22, x1
-; CHECK-SD-NEXT:    mov x0, x20
-; CHECK-SD-NEXT:    mov x1, x19
-; CHECK-SD-NEXT:    mov w2, #7 // =0x7
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x2, x0
-; CHECK-SD-NEXT:    mov x3, x1
-; CHECK-SD-NEXT:    mov x0, x21
-; CHECK-SD-NEXT:    mov x1, x22
-; CHECK-SD-NEXT:    ldp x20, x19, [sp, #32] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldp x22, x21, [sp, #16] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldr x30, [sp], #48 // 8-byte Folded Reload
+; CHECK-SD-NEXT:    mov x8, #9362 // =0x2492
+; CHECK-SD-NEXT:    mov x11, #18725 // =0x4925
+; CHECK-SD-NEXT:    movk x8, #37449, lsl #16
+; CHECK-SD-NEXT:    movk x11, #9362, lsl #16
+; CHECK-SD-NEXT:    movk x8, #18724, lsl #32
+; CHECK-SD-NEXT:    movk x11, #37449, lsl #32
+; CHECK-SD-NEXT:    movk x8, #9362, lsl #48
+; CHECK-SD-NEXT:    movk x11, #18724, lsl #48
+; CHECK-SD-NEXT:    mul x10, x0, x8
+; CHECK-SD-NEXT:    umulh x12, x0, x11
+; CHECK-SD-NEXT:    umulh x9, x0, x8
+; CHECK-SD-NEXT:    mul x15, x1, x11
+; CHECK-SD-NEXT:    adds x10, x12, x10
+; CHECK-SD-NEXT:    umulh x14, x1, x11
+; CHECK-SD-NEXT:    cinc x9, x9, hs
+; CHECK-SD-NEXT:    umulh x13, x1, x8
+; CHECK-SD-NEXT:    cmn x10, x15
+; CHECK-SD-NEXT:    mul x16, x1, x8
+; CHECK-SD-NEXT:    adcs x9, x9, x14
+; CHECK-SD-NEXT:    mul x12, x2, x8
+; CHECK-SD-NEXT:    cinc x13, x13, hs
+; CHECK-SD-NEXT:    umulh x10, x2, x11
+; CHECK-SD-NEXT:    adds x14, x9, x16
+; CHECK-SD-NEXT:    cinc x15, x13, hs
+; CHECK-SD-NEXT:    subs x18, x0, x14
+; CHECK-SD-NEXT:    umulh x17, x2, x8
+; CHECK-SD-NEXT:    cset w5, lo
+; CHECK-SD-NEXT:    sub x5, x1, x5
+; CHECK-SD-NEXT:    umulh x6, x3, x11
+; CHECK-SD-NEXT:    sub x15, x5, x15
+; CHECK-SD-NEXT:    extr x18, x15, x18, #1
+; CHECK-SD-NEXT:    mul x11, x3, x11
+; CHECK-SD-NEXT:    lsr x15, x15, #1
+; CHECK-SD-NEXT:    umulh x4, x3, x8
+; CHECK-SD-NEXT:    adds x14, x18, x14
+; CHECK-SD-NEXT:    cinc x15, x15, hs
+; CHECK-SD-NEXT:    cmn x9, x16
+; CHECK-SD-NEXT:    mul x8, x3, x8
+; CHECK-SD-NEXT:    adc x9, x15, x13
+; CHECK-SD-NEXT:    adds x10, x10, x12
+; CHECK-SD-NEXT:    cinc x12, x17, hs
+; CHECK-SD-NEXT:    cmn x10, x11
+; CHECK-SD-NEXT:    adcs x10, x12, x6
+; CHECK-SD-NEXT:    cinc x11, x4, hs
+; CHECK-SD-NEXT:    adds x12, x10, x8
+; CHECK-SD-NEXT:    cinc x13, x11, hs
+; CHECK-SD-NEXT:    subs x15, x2, x12
+; CHECK-SD-NEXT:    cset w16, lo
+; CHECK-SD-NEXT:    sub x16, x3, x16
+; CHECK-SD-NEXT:    sub x13, x16, x13
+; CHECK-SD-NEXT:    extr x15, x13, x15, #1
+; CHECK-SD-NEXT:    lsr x13, x13, #1
+; CHECK-SD-NEXT:    adds x12, x15, x12
+; CHECK-SD-NEXT:    cinc x13, x13, hs
+; CHECK-SD-NEXT:    cmn x10, x8
+; CHECK-SD-NEXT:    extr x8, x9, x14, #2
+; CHECK-SD-NEXT:    adc x10, x13, x11
+; CHECK-SD-NEXT:    mov w11, #7 // =0x7
+; CHECK-SD-NEXT:    lsr x9, x9, #2
+; CHECK-SD-NEXT:    extr x12, x10, x12, #2
+; CHECK-SD-NEXT:    umulh x13, x8, x11
+; CHECK-SD-NEXT:    lsl x14, x8, #3
+; CHECK-SD-NEXT:    lsr x10, x10, #2
+; CHECK-SD-NEXT:    umulh x11, x12, x11
+; CHECK-SD-NEXT:    lsl x15, x12, #3
+; CHECK-SD-NEXT:    sub x8, x14, x8
+; CHECK-SD-NEXT:    subs x0, x0, x8
+; CHECK-SD-NEXT:    sub x8, x15, x12
+; CHECK-SD-NEXT:    cset w12, lo
+; CHECK-SD-NEXT:    sub x13, x13, x9
+; CHECK-SD-NEXT:    subs x2, x2, x8
+; CHECK-SD-NEXT:    add x8, x13, x9, lsl #3
+; CHECK-SD-NEXT:    sub x11, x11, x10
+; CHECK-SD-NEXT:    add x9, x11, x10, lsl #3
+; CHECK-SD-NEXT:    cset w10, lo
+; CHECK-SD-NEXT:    sub x11, x1, x12
+; CHECK-SD-NEXT:    sub x10, x3, x10
+; CHECK-SD-NEXT:    sub x1, x11, x8
+; CHECK-SD-NEXT:    sub x3, x10, x9
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: uv2i128_7:
@@ -3361,34 +3474,61 @@ entry:
 define <2 x i128> @uv2i128_100(<2 x i128> %d, <2 x i128> %e) {
 ; CHECK-SD-LABEL: uv2i128_100:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    str x30, [sp, #-48]! // 8-byte Folded Spill
-; CHECK-SD-NEXT:    stp x22, x21, [sp, #16] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    stp x20, x19, [sp, #32] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    .cfi_def_cfa_offset 48
-; CHECK-SD-NEXT:    .cfi_offset w19, -8
-; CHECK-SD-NEXT:    .cfi_offset w20, -16
-; CHECK-SD-NEXT:    .cfi_offset w21, -24
-; CHECK-SD-NEXT:    .cfi_offset w22, -32
-; CHECK-SD-NEXT:    .cfi_offset w30, -48
-; CHECK-SD-NEXT:    mov x19, x3
-; CHECK-SD-NEXT:    mov x20, x2
-; CHECK-SD-NEXT:    mov w2, #100 // =0x64
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x21, x0
-; CHECK-SD-NEXT:    mov x22, x1
-; CHECK-SD-NEXT:    mov x0, x20
-; CHECK-SD-NEXT:    mov x1, x19
-; CHECK-SD-NEXT:    mov w2, #100 // =0x64
-; CHECK-SD-NEXT:    mov x3, xzr
-; CHECK-SD-NEXT:    bl __umodti3
-; CHECK-SD-NEXT:    mov x2, x0
-; CHECK-SD-NEXT:    mov x3, x1
-; CHECK-SD-NEXT:    mov x0, x21
-; CHECK-SD-NEXT:    mov x1, x22
-; CHECK-SD-NEXT:    ldp x20, x19, [sp, #32] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldp x22, x21, [sp, #16] // 16-byte Folded Reload
-; CHECK-SD-NEXT:    ldr x30, [sp], #48 // 8-byte Folded Reload
+; CHECK-SD-NEXT:    mov x8, #62914 // =0xf5c2
+; CHECK-SD-NEXT:    mov x11, #23593 // =0x5c29
+; CHECK-SD-NEXT:    movk x8, #23592, lsl #16
+; CHECK-SD-NEXT:    movk x11, #49807, lsl #16
+; CHECK-SD-NEXT:    movk x8, #49807, lsl #32
+; CHECK-SD-NEXT:    movk x11, #10485, lsl #32
+; CHECK-SD-NEXT:    movk x8, #10485, lsl #48
+; CHECK-SD-NEXT:    movk x11, #36700, lsl #48
+; CHECK-SD-NEXT:    mul x10, x0, x8
+; CHECK-SD-NEXT:    umulh x12, x0, x11
+; CHECK-SD-NEXT:    umulh x9, x0, x8
+; CHECK-SD-NEXT:    mul x15, x1, x11
+; CHECK-SD-NEXT:    adds x10, x12, x10
+; CHECK-SD-NEXT:    mov w12, #100 // =0x64
+; CHECK-SD-NEXT:    umulh x14, x1, x11
+; CHECK-SD-NEXT:    cinc x9, x9, hs
+; CHECK-SD-NEXT:    umulh x13, x1, x8
+; CHECK-SD-NEXT:    cmn x10, x15
+; CHECK-SD-NEXT:    mul x16, x1, x8
+; CHECK-SD-...
[truncated]

@topperc
Collaborator

topperc commented Aug 22, 2025

There was another patch solving this problem that was not correctly linked to the issue: #146238

@mskamp
Contributor Author

mskamp commented Aug 22, 2025

There was another patch solving this problem that was not correctly linked to the issue: #146238

Oh, I completely missed that patch. Thank you for the hint.

It seems, though, that the linked patch follows another approach that may not handle some constants (e.g., 67 if I'm not mistaken). Hence, the approaches in the two patches may complement each other. But an integration of both approaches would most likely require measurements and benchmarks to determine when to select which approach to get the best results.

What is the policy on how to proceed with this kind of conflict?

@sharkautarch

sharkautarch commented Sep 5, 2025

@mskamp

But an integration of both approaches would most likely require measurements and benchmarks to determine when to select which approach to get the best results.

I ran an x86-64 benchmark for just the @umod128(i128 %x) function in llvm/test/CodeGen/X86/divmod128.ll, comparing the two PRs' latency & inverse throughput on my Alder Lake Intel CPU using llvm-exegesis.

I will refer to this PR as the "magic test" and to the other PR as the "chunk test".

Permalinks to the tested functions, for your convenience:

magic test throughput asm file: https://gist.github.com/sharkautarch/99adfce423c3952bad11b8db0a8ba150#file-128bit_urem_magic_test-s

magic test latency asm file: https://gist.github.com/sharkautarch/ad17b4045ad0cba89ae80126fb0f5f3c#file-128bit_urem_magic_test_latency-s

chunk test throughput asm file: https://gist.github.com/sharkautarch/6385ddf31e241156e6e49d901141b694#file-128bit_urem_chunk_test-s

chunk test latency asm file: https://gist.github.com/sharkautarch/783a8ffb970c7db4d2ed6d38c629e113#file-128bit_urem_chunk_test_latency-s

Disclaimer: I didn't disable hyperthreading before running the benchmarks, and the computer I'm running these tests on has both P- and E-cores. I did pin the tests to one P-core, though, which seems to significantly reduce the noise from hyperthreading and from the P/E-core split.


magic throughput test: llvm-exegesis --mode=inverse_throughput --snippets-file=128bit_urem_magic_test.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1
full output (for one run): https://gist.github.com/sharkautarch/99a3aa70b6b04ba8dbd7b95c8b0b4a6f
per_snippet_value is about 9.0-9.2 cycles

chunk throughput test: llvm-exegesis --mode=inverse_throughput --snippets-file=128bit_urem_chunk_test.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1
full output (for one run): https://gist.github.com/sharkautarch/0931a87e622bd1dc811b0b75bd1dab13
per_snippet_value is about 6.52-6.58 cycles


magic latency test: llvm-exegesis --mode=latency --snippets-file=128bit_urem_magic_test_latency.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1
full output (for one run): https://gist.github.com/sharkautarch/76a33c941d095854065aa821a7fa9090
per_snippet_value is about 26-27.7 cycles

chunk latency test: llvm-exegesis --mode=latency --snippets-file=128bit_urem_chunk_test_latency.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1
full output (for one run): https://gist.github.com/sharkautarch/4126ca66c3bc1414fbeec3eff1b28548
per_snippet_value is about 19.60-19.71 cycles


This only compares codegen from one test function, but I don't have time to benchmark the other functions.

@mskamp
Contributor Author

mskamp commented Sep 7, 2025

@sharkautarch Thank you very much for your time and work. These are some valuable and helpful numbers.

I've started with a comparison of the approaches based on the output of llvm-mca. The rationale is that this way I can get a rough picture of the relative performance of both approaches for a wider array of hardware than is currently available to me. I haven't finished this analysis, and the results are only preliminary, but it appears that the chunk-based approach is better for urem, whereas the magic algorithm yields better code for udiv in most cases (i.e., when Magics.IsAdd is false).

Based on your work, I've run some benchmarks for all cases listed in issue #137514 on my Icelake Intel CPU. As input for the experiments, I've used the following files:

I've turned all the usual performance features of the CPU off (e.g., turbo boost, hyperthreading) and performed 1000 runs for each algorithm and constant. The results are summarized here:

The results indicate that using the chunk-based approach for urem, and the magic algorithm for udiv as well as for the cases that the chunk-based approach does not handle (3 of the 11 constants in the test), is probably not a bad heuristic. There are some cases, though, where the magic algorithm achieves slightly higher throughput than the chunk-based approach even for urem (e.g., for the constant 19). Still, this heuristic is probably a good starting point for deciding which approach to use.

I plan to perform similar tests on AArch64 next since the initial investigation with llvm-mca suggests that the results of both approaches might be closer to each other and less clear-cut there.
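For reference, the IsAdd distinction can be inspected directly with the UnsignedDivisionByConstantInfo helper that this patch reuses. The following standalone probe is only an illustration (it is not part of the patch and assumes linking against LLVMSupport); it prints the magic parameters for two divisors from this thread. For 7 the add path is taken (matching the longer ui128_7 expansion in the AArch64 test diff above), while 100 avoids it.

#include "llvm/ADT/APInt.h"
#include "llvm/ADT/SmallString.h"
#include "llvm/Support/DivisionByConstantInfo.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  // Divisors discussed in this PR: 7 needs the add fixup, 100 does not.
  for (unsigned D : {7u, 100u}) {
    llvm::UnsignedDivisionByConstantInfo Magics =
        llvm::UnsignedDivisionByConstantInfo::get(llvm::APInt(128, D));
    llvm::SmallString<40> MagicHex;
    Magics.Magic.toString(MagicHex, 16, /*Signed=*/false);
    llvm::outs() << "d=" << D << " Magic=0x" << MagicHex
                 << " IsAdd=" << (Magics.IsAdd ? "true" : "false")
                 << " PreShift=" << Magics.PreShift
                 << " PostShift=" << Magics.PostShift << "\n";
  }
  return 0;
}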

@mskamp mskamp force-pushed the feature_enhance_div_rem_double_bit_size branch from 5bdd489 to 8f3b0a3 Compare September 13, 2025 17:38
Comment on lines 8141 to 8144
Contributor


Move TLI to first argument (or better yet, just make this a TargetLowering member)

Contributor Author


I've made this a member of TargetLowering.

@mskamp mskamp force-pushed the feature_enhance_div_rem_double_bit_size branch from 8f3b0a3 to fc79e27 Compare September 16, 2025 13:44
@mskamp
Contributor Author

mskamp commented Sep 26, 2025

I've finally done some benchmarks for AArch64 on a Cortex A76. Here are my results:

Latency
Throughput

The numbers for the throughput look a bit strange to me. I'm not sure I've done everything correctly, though, so take it with a grain of salt.

With respect to latency, the chunk-based algorithm yields better results in cases where the magic algorithm was superior in the earlier Icelake benchmark. I think this is because the Cortex A76 does not support an operation that yields both the upper and lower word of an unsigned multiplication (such as MUL(X) on Icelake).

If we wanted to optimize for latency, we could use the following check list to determine the best algorithm based on the available measurements (a rough code sketch follows the list):

  • If the chunk-based algorithm does not support the given divisor: Use the magic algorithm.
  • If the operation to compute is UREM: Use the chunk-based algorithm.
  • If IsAdd = false in the result of UnsignedDivisionByConstantInfo::get() for the given divisor: Use the magic algorithm.
  • If the target does not have a legal UMUL_LOHI instruction for the required bit width: Use the chunk-based algorithm.
  • Otherwise: Use the magic algorithm.
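
To make the check list concrete, here is a minimal sketch (not part of either patch; ChunkSupportsDivisor and HasLegalWideUMulLoHi are hypothetical stand-ins for target queries an integrated heuristic would need):

#include "llvm/ADT/APInt.h"
#include "llvm/Support/DivisionByConstantInfo.h"

enum class WideDivAlgo { Chunk, Magic };

static WideDivAlgo chooseWideDivAlgo(bool IsUREM, const llvm::APInt &Divisor,
                                     bool ChunkSupportsDivisor,
                                     bool HasLegalWideUMulLoHi) {
  // 1. The chunk-based expansion handles only some divisors.
  if (!ChunkSupportsDivisor)
    return WideDivAlgo::Magic;
  // 2. For UREM, the chunk-based expansion measured better overall.
  if (IsUREM)
    return WideDivAlgo::Chunk;
  // 3. Without the add fixup, the magic expansion is short and fast.
  if (!llvm::UnsignedDivisionByConstantInfo::get(Divisor).IsAdd)
    return WideDivAlgo::Magic;
  // 4./5. The add fixup pays off only with a cheap wide UMUL_LOHI.
  return HasLegalWideUMulLoHi ? WideDivAlgo::Magic : WideDivAlgo::Chunk;
}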

Of course, this only fits (and likely overfits) the given measurements. The check previously suggested (essentially the first two and the last line in the check list) would probably still yield good results in most cases (in the benchmark, it would only yield suboptimal results for 4 data points). Hence, this is probably still a good start if we wanted to avoid overfitting.

@sharkautarch

@mskamp wait, are the throughput measurements showing reciprocal throughput (cycles / instruction), or just "normal" throughput (instructions/cycle)?

@mskamp
Contributor Author

mskamp commented Sep 29, 2025

@mskamp wait, are the throughput measurements showing reciprocal throughput (cycles / instruction), or just "normal" throughput (instructions/cycle)?

@sharkautarch The throughput measurements just show the output of the llvm-exegesis mode inverse_throughput. As far as I understand, that should be reciprocal throughput.

