[AArch64][CostModel] Alter sdiv/srem cost where the divisor is constant #123552

sushgokh · 2025-01-20T07:06:48Z

This patch draws its inspiration from the udiv/urem patch #122236

For sdiv, typical sequence of instructions as per the type and divisor property is as follows:

Scalar power-of-2: cmp + csel + asr
Vector power-of-2(Neon): usra + sshr

Scalar non-power-2: smulh/smull + asr/lsr + add/sub + asr + add 
Vector non-power-2(Neon):
  a) <2 x i64>: 2 * (smulh + asr + add)    --> This yeilds scalarized form.
  b) <4 x i32>: smull2 + smull + uzp2 + add + sshr + usra

SVE versions should have more or less the same cost because sometimes they yeild native sdiv instructions, which should have less cost or the same sequence of neon instructions.

For srem, typical sequence of instructions as per the type and divisor property is as follows:

Scalar version: <set of sdiv instructions> + msub
Vector version: <set of sdiv instructions> + 2-msub/1-mls

llvmbot · 2025-01-20T07:07:24Z

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-aarch64

Author: Sushant Gokhale (sushgokh)

Changes

This patch draws its inspiration from the udiv/urem patch #122236

For sdiv, typical sequence of instructions as per the type and divisor property is as follows:

Scalar power-of-2: cmp + csel + asr
Vector power-of-2(Neon): usra + sshr

Scalar non-power-2: smulh/smull + asr/lsr + add/sub + asr + add 
Vector non-power-2(Neon):
  a) &lt;2 x i64&gt;: 2 * (smulh + asr + add)    --&gt; This yeilds scalarized form.
  b) &lt;4 x i32&gt;: smull2 + smull + uzp2 + add + sshr + usra

SVE versions should have more or less the same cost because sometimes they yeild native sdiv instructions, which should have less cost or the same sequence of neon instructions.

For srem, typical sequence of instructions as per the type and divisor property is as follows:

Scalar version: &lt;set of sdiv instructions&gt; + msub
Vector version: &lt;set of sdiv instructions&gt; + 2-msub/1-mls

Patch is 171.43 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123552.diff

9 Files Affected:

(modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+46-16)
(modified) llvm/test/Analysis/CostModel/AArch64/div.ll (+138-138)
(modified) llvm/test/Analysis/CostModel/AArch64/div_cte.ll (+3-3)
(modified) llvm/test/Analysis/CostModel/AArch64/rem.ll (+157-157)
(modified) llvm/test/Analysis/CostModel/AArch64/sve-div.ll (+48-48)
(modified) llvm/test/Analysis/CostModel/AArch64/sve-rem.ll (+138-138)
(modified) llvm/test/Analysis/CostModel/AArch64/sve-remainder.ll (+12-12)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/blend-costs.ll (+225-5)
(modified) llvm/test/Transforms/SLPVectorizer/AArch64/div.ll (+16-70)

diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 932a6f9ce23fd2..9815cadb756b52 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -3491,23 +3491,53 @@ InstructionCost AArch64TTIImpl::getArithmeticInstrCost(
   default:
     return BaseT::getArithmeticInstrCost(Opcode, Ty, CostKind, Op1Info,
                                          Op2Info);
+  case ISD::SREM:
   case ISD::SDIV:
-    if (Op2Info.isConstant() && Op2Info.isUniform() && Op2Info.isPowerOf2()) {
-      // On AArch64, scalar signed division by constants power-of-two are
-      // normally expanded to the sequence ADD + CMP + SELECT + SRA.
-      // The OperandValue properties many not be same as that of previous
-      // operation; conservatively assume OP_None.
-      InstructionCost Cost = getArithmeticInstrCost(
-          Instruction::Add, Ty, CostKind,
-          Op1Info.getNoProps(), Op2Info.getNoProps());
-      Cost += getArithmeticInstrCost(Instruction::Sub, Ty, CostKind,
-                                     Op1Info.getNoProps(), Op2Info.getNoProps());
-      Cost += getArithmeticInstrCost(
-          Instruction::Select, Ty, CostKind,
-          Op1Info.getNoProps(), Op2Info.getNoProps());
-      Cost += getArithmeticInstrCost(Instruction::AShr, Ty, CostKind,
-                                     Op1Info.getNoProps(), Op2Info.getNoProps());
-      return Cost;
+    /*
+    For sdiv, typical sequence of instructions as per the type and divisor
+    property is as follows:
+    Scalar power-of-2: cmp + csel + asr
+    Vector power-of-2: usra + sshr
+
+    Scalar non-power-2: smulh/smull + asr/lsr + add/sub + asr + add
+    Vector non-power-2:
+      a) <2 x i64>: 2 * (smulh + asr + add)   --> This yeilds scalarized form.
+      b) <4 x i32>: smull2 + smull + uzp2 + add + sshr + usra
+
+    SVE versions should have more or less the same cost because sometimes they
+    yeild native sdiv instructions, which should have less cost or the same
+    sequence of neon instructions.
+
+    For srem, typical sequence of instructions as per the type and divisor
+    property is as follows:
+    Scalar version: <set of sdiv instructions> + msub
+    Vector version: <set of sdiv instructions> + 2-msub/mls
+    */
+    if (Op2Info.isConstant()) {
+      InstructionCost AsrCost =
+          getArithmeticInstrCost(Instruction::AShr, Ty, CostKind,
+                                 Op1Info.getNoProps(), Op2Info.getNoProps());
+      InstructionCost AddCost =
+          getArithmeticInstrCost(Instruction::Add, Ty, CostKind,
+                                 Op1Info.getNoProps(), Op2Info.getNoProps());
+      InstructionCost MulCost =
+          getArithmeticInstrCost(Instruction::Mul, Ty, CostKind,
+                                 Op1Info.getNoProps(), Op2Info.getNoProps());
+
+      bool HasSMUL = !Op2Info.isPowerOf2();
+      unsigned NumOfSMUL = HasSMUL ? (LT.second.isVector() ? 2 : 1) : 0;
+      bool HasExtraAsr =
+          (LT.second.isVector() || LT.second == MVT::i32) && HasSMUL;
+
+      InstructionCost CommonCost = AsrCost + AddCost;
+      // We typicall get 1 msub for scalar and 2-msub/1-mls for the vector form.
+      // Typically, the cost of msub is same and mls is twice as costly as
+      // add/sub/mul.
+      InstructionCost MlsOrMSubCost = (LT.second.isVector() ? 2 : 1) * MulCost;
+      InstructionCost DivCost =
+          CommonCost + (MulCost * NumOfSMUL) /* SMULH/SMULH */ +
+          (AsrCost * HasExtraAsr); // Coming with second SMULH
+      return DivCost + (ISD == ISD::SREM ? MlsOrMSubCost : 0);
     }
     [[fallthrough]];
   case ISD::UDIV: {
diff --git a/llvm/test/Analysis/CostModel/AArch64/div.ll b/llvm/test/Analysis/CostModel/AArch64/div.ll
index ef52d0db01eefd..0881aa39810408 100644
--- a/llvm/test/Analysis/CostModel/AArch64/div.ll
+++ b/llvm/test/Analysis/CostModel/AArch64/div.ll
@@ -121,29 +121,29 @@ define i32 @udiv() {
 
 define i32 @sdiv_const() {
 ; CHECK-LABEL: 'sdiv_const'
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %I128 = sdiv i128 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %I64 = sdiv i64 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V2i64 = sdiv <2 x i64> undef, <i64 6, i64 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %V4i64 = sdiv <4 x i64> undef, <i64 4, i64 5, i64 6, i64 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 112 for instruction: %V8i64 = sdiv <8 x i64> undef, <i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I32 = sdiv i32 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V2i32 = sdiv <2 x i32> undef, <i32 4, i32 5>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 48 for instruction: %V4i32 = sdiv <4 x i32> undef, <i32 4, i32 5, i32 6, i32 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 96 for instruction: %V8i32 = sdiv <8 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 192 for instruction: %V16i32 = sdiv <16 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = sdiv i16 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V2i16 = sdiv <2 x i16> undef, <i16 4, i16 5>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 48 for instruction: %V4i16 = sdiv <4 x i16> undef, <i16 4, i16 5, i16 6, i16 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 88 for instruction: %V8i16 = sdiv <8 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 176 for instruction: %V16i16 = sdiv <16 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 352 for instruction: %V32i16 = sdiv <32 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19, i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = sdiv i8 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V2i8 = sdiv <2 x i8> undef, <i8 4, i8 5>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 48 for instruction: %V4i8 = sdiv <4 x i8> undef, <i8 4, i8 5, i8 6, i8 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 88 for instruction: %V8i8 = sdiv <8 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 168 for instruction: %V16i8 = sdiv <16 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 336 for instruction: %V32i8 = sdiv <32 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 672 for instruction: %V64i8 = sdiv <64 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %I128 = sdiv i128 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %I64 = sdiv i64 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 31 for instruction: %V2i64 = sdiv <2 x i64> undef, <i64 6, i64 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 62 for instruction: %V4i64 = sdiv <4 x i64> undef, <i64 4, i64 5, i64 6, i64 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 124 for instruction: %V8i64 = sdiv <8 x i64> undef, <i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %I32 = sdiv i32 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V2i32 = sdiv <2 x i32> undef, <i32 4, i32 5>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4i32 = sdiv <4 x i32> undef, <i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V8i32 = sdiv <8 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V16i32 = sdiv <16 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %I16 = sdiv i16 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V2i16 = sdiv <2 x i16> undef, <i16 4, i16 5>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4i16 = sdiv <4 x i16> undef, <i16 4, i16 5, i16 6, i16 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V8i16 = sdiv <8 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V16i16 = sdiv <16 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V32i16 = sdiv <32 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19, i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %I8 = sdiv i8 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V2i8 = sdiv <2 x i8> undef, <i8 4, i8 5>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4i8 = sdiv <4 x i8> undef, <i8 4, i8 5, i8 6, i8 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V8i8 = sdiv <8 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V16i8 = sdiv <16 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V32i8 = sdiv <32 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V64i8 = sdiv <64 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
 ;
   %I128 = sdiv i128 undef, 7
@@ -238,29 +238,29 @@ define i32 @udiv_const() {
 
 define i32 @sdiv_uniformconst() {
 ; CHECK-LABEL: 'sdiv_uniformconst'
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %I128 = sdiv i128 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %I64 = sdiv i64 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2i64 = sdiv <2 x i64> undef, splat (i64 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V4i64 = sdiv <4 x i64> undef, splat (i64 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %V8i64 = sdiv <8 x i64> undef, splat (i64 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I32 = sdiv i32 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2i32 = sdiv <2 x i32> undef, splat (i32 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V4i32 = sdiv <4 x i32> undef, splat (i32 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %V8i32 = sdiv <8 x i32> undef, splat (i32 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %V16i32 = sdiv <16 x i32> undef, splat (i32 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = sdiv i16 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2i16 = sdiv <2 x i16> undef, splat (i16 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V4i16 = sdiv <4 x i16> undef, splat (i16 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8i16 = sdiv <8 x i16> undef, splat (i16 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %V16i16 = sdiv <16 x i16> undef, splat (i16 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 160 for instruction: %V32i16 = sdiv <32 x i16> undef, splat (i16 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = sdiv i8 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2i8 = sdiv <2 x i8> undef, splat (i8 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V4i8 = sdiv <4 x i8> undef, splat (i8 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %V8i8 = sdiv <8 x i8> undef, splat (i8 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V16i8 = sdiv <16 x i8> undef, splat (i8 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 160 for instruction: %V32i8 = sdiv <32 x i8> undef, splat (i8 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 320 for instruction: %V64i8 = sdiv <64 x i8> undef, splat (i8 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %I128 = sdiv i128 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %I64 = sdiv i64 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 31 for instruction: %V2i64 = sdiv <2 x i64> undef, splat (i64 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 62 for instruction: %V4i64 = sdiv <4 x i64> undef, splat (i64 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 124 for instruction: %V8i64 = sdiv <8 x i64> undef, splat (i64 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %I32 = sdiv i32 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V2i32 = sdiv <2 x i32> undef, splat (i32 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4i32 = sdiv <4 x i32> undef, splat (i32 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V8i32 = sdiv <8 x i32> undef, splat (i32 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V16i32 = sdiv <16 x i32> undef, splat (i32 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %I16 = sdiv i16 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V2i16 = sdiv <2 x i16> undef, splat (i16 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4i16 = sdiv <4 x i16> undef, splat (i16 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V8i16 = sdiv <8 x i16> undef, splat (i16 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V16i16 = sdiv <16 x i16> undef, splat (i16 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V32i16 = sdiv <32 x i16> undef, splat (i16 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %I8 = sdiv i8 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V2i8 = sdiv <2 x i8> undef, splat (i8 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4i8 = sdiv <4 x i8> undef, splat (i8 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V8i8 = sdiv <8 x i8> undef, splat (i8 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V16i8 = sdiv <16 x i8> undef, splat (i8 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V32i8 = sdiv <32 x i8> undef, splat (i8 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V64i8 = sdiv <64 x i8> undef, splat (i8 7)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
 ;
   %I128 = sdiv i128 undef, 7
@@ -354,29 +354,29 @@ define i32 @udiv_uniformconst() {
 
 define i32 @sdiv_constpow2() {
 ; CHECK-LABEL: 'sdiv_constpow2'
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %I128 = sdiv i128 undef, 16
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %I64 = sdiv i64 undef, 16
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V2i64 = sdiv <2 x i64> undef, <i64 8, i64 16>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %V4i64 = sdiv <4 x i64> undef, <i64 2, i64 4, i64 8, i64 16>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 112 for instruction: %V8i64 = sdiv <8 x i64> undef, <i64 2, i64 4, i64 8, i64 16, i64 32, i64 64, i64 128, i64 256>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %I32 = sdiv i32 undef, 16
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V2i32 = sdiv <2 x i32> undef, <i32 2, i32 4>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 48 for instruction: %V4i32 = sdiv <4 x i32> undef, <i32 2, i32 4, i32 8, i32 16>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 96 for instruction: %V8i32 = sdiv <8 x i32> undef, <i32 2, i32 4, i32 8, i32 16, i32 32, i32 64, i32 128, i32 256>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 192 for instruction: %V16i32 = sdiv <16 x i32> undef, <i32 2, i32 4, i32 8, i32 16, i32 32, i32 64, i32 128, i32 256, i32 2, i32 4, i32 8, i32 16, i32 32, i32 64, i32 128, i32 256>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %I16 = sdiv i16 undef, 16
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V2i16 = sdiv <2 x i16> undef, <i16 2, i16 4>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 48 for instruction: %V4i16 = sdiv <4 x i16> undef, <i16 2, i16 4, i16 8, i16 16>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 88 for instruction: %V8i16 = sdiv <8 x i16> undef, <i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 176 for instruction: %V16i16 = sdiv <16 x i16> undef, <i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256, i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 352 for instruction: %V32i16 = sdiv <32 x i16> undef, <i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256, i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256, i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256, i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256>
-; ...
[truncated]

davemgreen

Thanks for doing this. The general idea looks good to me, but I haven't had a chance to look into all the numbers yet (or get back to #122236 or #122469 yet). We will need to make sure that scalable sizes work too, AFAIU they will not currently detect constants correctly.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

davemgreen · 2025-01-20T22:31:27Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

+    /*
+    For sdiv, typical sequence of instructions as per the type and divisor
+    property is as follows:
+    Scalar power-of-2: cmp + csel + asr


asr -> ashr or sra would be more common.

I was trying to mention the assembly instruction and hence, asr.

davemgreen · 2025-01-20T22:33:38Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

+
+      InstructionCost CommonCost = AsrCost + AddCost;
+      // We typicall get 1 msub for scalar and 2-msub/1-mls for the vector form.
+      // Typically, the cost of msub is same and mls is twice as costly as


Should I read this as (the cost of msub is same), in which case what is same? And (mls is twice as costly as add/sub/mul), in which case the mls/mla is often the same cost as a mul.

ok, will change this wording

llvm/test/Analysis/CostModel/AArch64/rem.ll

davemgreen · 2025-01-24T08:13:19Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

+          getArithmeticInstrCost(Instruction::Mul, Ty, CostKind,
+                                 Op1Info.getNoProps(), Op2Info.getNoProps());
+
+      bool HasSMUL = !Op2Info.isPowerOf2();


I remember from few weeks ago that negative power-2 should be roughly the same as power-2 for sdiv/srem?

yes, I tried multiple scenarios where using negative power-2 affects the codegen but it does not dramatically alter the codegen.

davemgreen · 2025-02-21T11:05:24Z

@sushgokh Now that 122469 is in, this should be able to test sve costs correctly if you can give it a rebase. We can hopefully move this forward.

sushgokh · 2025-02-21T11:25:11Z

@sushgokh Now that 122469 is in, this should be able to test sve costs correctly if you can give it a rebase. We can hopefully move this forward.

Thanks @davemgreen for the 122469 patch.

For your udiv/urem patch, I could see change in codegen when the constants are negative. I anticipate similar change here as well. I will soon get back the data and concerned changes as well.

davemgreen · 2025-02-27T18:42:20Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

+          // instruction when SVE is available.
+          // e.g. %1 = sdiv <vscale x 4 x i32> %a, splat (i32 8)
+          if (Ty->isScalableTy() && ST->hasSVE())
+            return 2 * AsrCost;


What is the reason to use 2x for asrd?

asrd is almost twice costly as compared to asr(by costly, I mean I am referring to latency here).

The default CostKind is TCK_RecipThroughput, not TCK_Latency. (This function currently only handles TCK_RecipThroughput, I was hoping to add at least codesize soon and it would be good to cover others). Unless we have a strong reason to discourage SVE generation here (which I don't think we do?), we should favour the throughput costs.

I think just having 1x cost wouldnt be prudent because we are not just comparing the cost against similar instructions but other instructions as well i.e. there is no grouping of instructions where instruction cost is compared against instruction cost from the same group.
But if you think having 1x helps, will do that.

The A725 has a throughput of 1 for these, as opposed to 2 for most vector operations. So there is precedence for it.

I'm not sure I understood what you meant though. What do you mean by the groups?

The A725 has a throughput of 1 for these, as opposed to 2 for most vector operations. So there is precedence for it.

I assume fdiv as one of the examples here. This is OK here. But for cpu like Neoverse-v2 where the throughput>=1 for most of the instructions, recip_tput becomes approx equal to 1 for all. There is no way to differentiate how costly the instruction is wrt some other instruction.

Ideally, we would always like to know the no. of cycles consumed and this is the thing that we refer to when using tools like llvm-mca. We never go on calculating recip_tput. Also, in articles like this, the unit of recip_tput is cycles/instr which is nothing but latency(under certain conditions though).

Having cost=1 (with recip_tput as the cost metric)for most of the instructions is problematic I think for the same reason e.g. a load from constant pool would be costed same as a normal mul/add etc.

What do you mean by the groups?

I mean some sort of equivalence groups.
e.g. group of MemoryOps consisting of load/store where the instruction in this group is compared only within this group and then assigned a cost relative to others in this group. If there is comparison between two diff groups, groups can be coalesced to have a revised costing.
Now, this is my thinking but there maybe flaws with this.

I'm not sure I am understanding what you are saying. I think this should still be 1 considering the instruction that is produced, but it seems OK either way as the cost of the SVE instruction will still be lower than the scalar / vector version. Lets go with this for the moment and we can adjust it in the future if we need to.

If you mean that you can't take two disparate recip-throughput costs, add them together and expect to come up with a sensible "reciprical-throughput", then yes I agree that doesn't always work very well. It would be better to have a cost-model that understood that some throughput costs are separate (loads/stores vs vector ops vs integer ops vs m-ops, etc) and was able to measure throughput bottlenecks better.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

davemgreen · 2025-02-27T18:51:07Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

+          // When SVE is available, we get:
+          // smulh + lsr + add/sub + asr + add/sub.
+          if (Ty->isScalableTy() && ST->hasSVE())
+            return 2 * MulCost /*smulh cost*/ + 2 * AddCost + 2 * AsrCost;


Would 1 * MulCost be better for SVE?

again, smulh is twice as costly as normal mul(in terms of latency). So, 2x should be the case I think.

I would expect SVE to be cheaper as it can use smulh. They are currently both the same cost, which is likely OK for now.

ok sure, will change this

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

llvm/test/Analysis/CostModel/AArch64/div.ll

llvm/test/Transforms/LoopVectorize/AArch64/blend-costs.ll

davemgreen

Thanks for doing this. It looks like a good improvement. LGTM.

davemgreen · 2025-03-08T20:50:25Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

+          // instruction when SVE is available.
+          // e.g. %1 = sdiv <vscale x 4 x i32> %a, splat (i32 8)
+          if (Ty->isScalableTy() && ST->hasSVE())
+            return 2 * AsrCost;


I'm not sure I am understanding what you are saying. I think this should still be 1 considering the instruction that is produced, but it seems OK either way as the cost of the SVE instruction will still be lower than the scalar / vector version. Lets go with this for the moment and we can adjust it in the future if we need to.

If you mean that you can't take two disparate recip-throughput costs, add them together and expect to come up with a sensible "reciprical-throughput", then yes I agree that doesn't always work very well. It would be better to have a cost-model that understood that some throughput costs are separate (loads/stores vs vector ops vs integer ops vs m-ops, etc) and was able to measure throughput bottlenecks better.

davemgreen · 2025-03-08T20:53:13Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

Avoid TLI->getValueType() by using Ty's number of elements?

yup, will remove this. Its duplicate. Thanks

davemgreen · 2025-03-08T20:57:03Z

llvm/test/Analysis/CostModel/AArch64/div.ll

These seem a little low - they should be higher than or equal to uniformconst, which is 8 in this case.

I will update this. I need to add extractelement/insertelement cost as you previously pointed out.

davemgreen · 2025-03-08T22:04:35Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

+          // When SVE is available, we get:
+          // smulh + lsr + add/sub + asr + add/sub.
+          if (Ty->isScalableTy() && ST->hasSVE())
+            return 2 * MulCost /*smulh cost*/ + 2 * AddCost + 2 * AsrCost;


I would expect SVE to be cheaper as it can use smulh. They are currently both the same cost, which is likely OK for now.

This patch draws its inspiration from the udiv/urem patch llvm#122236 For sdiv, typical sequence of instructions as per the type and divisor property is as follows: Scalar power-of-2: cmp + csel + asr Neon power-of-2: usra + sshr Scalar non-power-2: smulh/smull + asr/lsr + add/sub + asr + add Vector non-power-2: a) <2 x i64>: 2 * (smulh + asr + add) . This yeilds scalarized form. b) <4 x i32>: smull2 + smull + uzp2 + add + sshr + usra SVE versions should have more or less the same cost because sometimes they yeild native sdiv instructions, which should have less cost or the same sequence of neon instructions. For srem, typical sequence of instructions as per the type and divisor property is as follows: Scalar version: <set of sdiv instructions> + msub Vector version: <set of sdiv instructions> + 2-msub/1-mls

llvm-ci · 2025-03-10T05:32:38Z

LLVM Buildbot has detected a new failure on builder clang-cmake-x86_64-avx512-win running on avx512-intel64-win while building llvm at step 4 "cmake stage 1".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/81/builds/5060

Here is the relevant piece of the build log for the reference

Step 4 (cmake stage 1) failure: 'cmake -G ...' (failure)
'cmake' is not recognized as an internal or external command,
operable program or batch file.

sushgokh requested review from davemgreen, david-arm, madhur13490, sdesmalen-arm and sjoerdmeijer January 20, 2025 07:06

llvmbot added backend:AArch64 llvm:analysis Includes value tracking, cost tables and constant folding llvm:transforms labels Jan 20, 2025

davemgreen reviewed Jan 24, 2025

View reviewed changes

sushgokh force-pushed the sdiv-srem-cost branch from 4e2c6bd to 3be10c0 Compare February 27, 2025 14:53

sushgokh requested a review from davemgreen February 27, 2025 15:10

davemgreen reviewed Feb 27, 2025

View reviewed changes

davemgreen approved these changes Mar 8, 2025

View reviewed changes

sushgokh added 4 commits March 9, 2025 22:18

Rebase and improve the previous patch

074f53f

Address comments

f5a11ac

Address comments

0829f8c

sushgokh force-pushed the sdiv-srem-cost branch from ea1bf1e to 0829f8c Compare March 10, 2025 05:18

sushgokh merged commit c480874 into llvm:main Mar 10, 2025
6 of 9 checks passed

shiltian mentioned this pull request Mar 10, 2025

[AMDGPU] Fix test failures when expensive checks are enabled #130644

Merged

[AArch64][CostModel] Alter sdiv/srem cost where the divisor is constant #123552

[AArch64][CostModel] Alter sdiv/srem cost where the divisor is constant #123552

Uh oh!

Conversation

sushgokh commented Jan 20, 2025

Uh oh!

llvmbot commented Jan 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

davemgreen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davemgreen commented Feb 21, 2025

Uh oh!

sushgokh commented Feb 21, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

davemgreen left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

llvm-ci commented Mar 10, 2025

llvmbot commented Jan 20, 2025 •

edited

Loading