[AArch64][CostModel] Alter sdiv/srem cost where the divisor is constant #123552
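@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-backend-aarch64
Author: Sushant Gokhale (sushgokh)
Changes
This patch draws its inspiration from the udiv/urem patch #122236.
For sdiv, the typical sequence of instructions, by type and divisor property, is as follows:
Scalar power-of-2: cmp + csel + asr
Neon power-of-2: usra + sshr
Scalar non-power-2: smulh/smull + asr/lsr + add/sub + asr + add
Vector non-power-2:
a) <2 x i64>: 2 * (smulh + asr + add). This yields the scalarized form.
b) <4 x i32>: smull2 + smull + uzp2 + add + sshr + usra
SVE versions should have more or less the same cost, because they sometimes yield native sdiv instructions, which should cost less, or the same sequence of Neon instructions.
For srem, the typical sequence of instructions, by type and divisor property, is as follows:
Scalar version: <set of sdiv instructions> + msub
Vector version: <set of sdiv instructions> + 2-msub/1-mls
Patch is 171.43 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/123552.diff
9 files affected.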
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 932a6f9ce23fd2..9815cadb756b52 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -3491,23 +3491,53 @@ InstructionCost AArch64TTIImpl::getArithmeticInstrCost(
default:
return BaseT::getArithmeticInstrCost(Opcode, Ty, CostKind, Op1Info,
Op2Info);
+ case ISD::SREM:
case ISD::SDIV:
- if (Op2Info.isConstant() && Op2Info.isUniform() && Op2Info.isPowerOf2()) {
- // On AArch64, scalar signed division by constants power-of-two are
- // normally expanded to the sequence ADD + CMP + SELECT + SRA.
- // The OperandValue properties many not be same as that of previous
- // operation; conservatively assume OP_None.
- InstructionCost Cost = getArithmeticInstrCost(
- Instruction::Add, Ty, CostKind,
- Op1Info.getNoProps(), Op2Info.getNoProps());
- Cost += getArithmeticInstrCost(Instruction::Sub, Ty, CostKind,
- Op1Info.getNoProps(), Op2Info.getNoProps());
- Cost += getArithmeticInstrCost(
- Instruction::Select, Ty, CostKind,
- Op1Info.getNoProps(), Op2Info.getNoProps());
- Cost += getArithmeticInstrCost(Instruction::AShr, Ty, CostKind,
- Op1Info.getNoProps(), Op2Info.getNoProps());
- return Cost;
+ /*
+ For sdiv, typical sequence of instructions as per the type and divisor
+ property is as follows:
+ Scalar power-of-2: cmp + csel + asr
+ Vector power-of-2: usra + sshr
+
+ Scalar non-power-2: smulh/smull + asr/lsr + add/sub + asr + add
+ Vector non-power-2:
+ a) <2 x i64>: 2 * (smulh + asr + add) --> This yields scalarized form.
+ b) <4 x i32>: smull2 + smull + uzp2 + add + sshr + usra
+
+ SVE versions should have more or less the same cost because sometimes they
+ yield native sdiv instructions, which should have less cost or the same
+ sequence of neon instructions.
+
+ For srem, typical sequence of instructions as per the type and divisor
+ property is as follows:
+ Scalar version: <set of sdiv instructions> + msub
+ Vector version: <set of sdiv instructions> + 2-msub/mls
+ */
+ if (Op2Info.isConstant()) {
+ InstructionCost AsrCost =
+ getArithmeticInstrCost(Instruction::AShr, Ty, CostKind,
+ Op1Info.getNoProps(), Op2Info.getNoProps());
+ InstructionCost AddCost =
+ getArithmeticInstrCost(Instruction::Add, Ty, CostKind,
+ Op1Info.getNoProps(), Op2Info.getNoProps());
+ InstructionCost MulCost =
+ getArithmeticInstrCost(Instruction::Mul, Ty, CostKind,
+ Op1Info.getNoProps(), Op2Info.getNoProps());
+
+ bool HasSMUL = !Op2Info.isPowerOf2();
+ unsigned NumOfSMUL = HasSMUL ? (LT.second.isVector() ? 2 : 1) : 0;
+ bool HasExtraAsr =
+ (LT.second.isVector() || LT.second == MVT::i32) && HasSMUL;
+
+ InstructionCost CommonCost = AsrCost + AddCost;
+ // We typically get 1 msub for scalar and 2-msub/1-mls for the vector form.
+ // Typically, the cost of msub is same and mls is twice as costly as
+ // add/sub/mul.
+ InstructionCost MlsOrMSubCost = (LT.second.isVector() ? 2 : 1) * MulCost;
+ InstructionCost DivCost =
+ CommonCost + (MulCost * NumOfSMUL) /* SMULH/SMULH */ +
+ (AsrCost * HasExtraAsr); // Coming with second SMULH
+ return DivCost + (ISD == ISD::SREM ? MlsOrMSubCost : 0);
}
[[fallthrough]];
case ISD::UDIV: {
diff --git a/llvm/test/Analysis/CostModel/AArch64/div.ll b/llvm/test/Analysis/CostModel/AArch64/div.ll
index ef52d0db01eefd..0881aa39810408 100644
--- a/llvm/test/Analysis/CostModel/AArch64/div.ll
+++ b/llvm/test/Analysis/CostModel/AArch64/div.ll
@@ -121,29 +121,29 @@ define i32 @udiv() {
define i32 @sdiv_const() {
; CHECK-LABEL: 'sdiv_const'
-; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %I128 = sdiv i128 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %I64 = sdiv i64 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %V2i64 = sdiv <2 x i64> undef, <i64 6, i64 7>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 56 for instruction: %V4i64 = sdiv <4 x i64> undef, <i64 4, i64 5, i64 6, i64 7>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 112 for instruction: %V8i64 = sdiv <8 x i64> undef, <i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %I32 = sdiv i32 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %V2i32 = sdiv <2 x i32> undef, <i32 4, i32 5>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 48 for instruction: %V4i32 = sdiv <4 x i32> undef, <i32 4, i32 5, i32 6, i32 7>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 96 for instruction: %V8i32 = sdiv <8 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 192 for instruction: %V16i32 = sdiv <16 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %I16 = sdiv i16 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %V2i16 = sdiv <2 x i16> undef, <i16 4, i16 5>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 48 for instruction: %V4i16 = sdiv <4 x i16> undef, <i16 4, i16 5, i16 6, i16 7>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 88 for instruction: %V8i16 = sdiv <8 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 176 for instruction: %V16i16 = sdiv <16 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 352 for instruction: %V32i16 = sdiv <32 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19, i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %I8 = sdiv i8 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %V2i8 = sdiv <2 x i8> undef, <i8 4, i8 5>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 48 for instruction: %V4i8 = sdiv <4 x i8> undef, <i8 4, i8 5, i8 6, i8 7>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 88 for instruction: %V8i8 = sdiv <8 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 168 for instruction: %V16i8 = sdiv <16 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 336 for instruction: %V32i8 = sdiv <32 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 672 for instruction: %V64i8 = sdiv <64 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %I128 = sdiv i128 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %I64 = sdiv i64 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 31 for instruction: %V2i64 = sdiv <2 x i64> undef, <i64 6, i64 7>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 62 for instruction: %V4i64 = sdiv <4 x i64> undef, <i64 4, i64 5, i64 6, i64 7>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 124 for instruction: %V8i64 = sdiv <8 x i64> undef, <i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %I32 = sdiv i32 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2i32 = sdiv <2 x i32> undef, <i32 4, i32 5>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V4i32 = sdiv <4 x i32> undef, <i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V8i32 = sdiv <8 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V16i32 = sdiv <16 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %I16 = sdiv i16 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2i16 = sdiv <2 x i16> undef, <i16 4, i16 5>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V4i16 = sdiv <4 x i16> undef, <i16 4, i16 5, i16 6, i16 7>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8i16 = sdiv <8 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V16i16 = sdiv <16 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V32i16 = sdiv <32 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19, i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %I8 = sdiv i8 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2i8 = sdiv <2 x i8> undef, <i8 4, i8 5>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V4i8 = sdiv <4 x i8> undef, <i8 4, i8 5, i8 6, i8 7>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8i8 = sdiv <8 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V16i8 = sdiv <16 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V32i8 = sdiv <32 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V64i8 = sdiv <64 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
;
%I128 = sdiv i128 undef, 7
@@ -238,29 +238,29 @@ define i32 @udiv_const() {
define i32 @sdiv_uniformconst() {
; CHECK-LABEL: 'sdiv_uniformconst'
-; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %I128 = sdiv i128 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %I64 = sdiv i64 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V2i64 = sdiv <2 x i64> undef, splat (i64 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V4i64 = sdiv <4 x i64> undef, splat (i64 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 40 for instruction: %V8i64 = sdiv <8 x i64> undef, splat (i64 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %I32 = sdiv i32 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V2i32 = sdiv <2 x i32> undef, splat (i32 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V4i32 = sdiv <4 x i32> undef, splat (i32 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 40 for instruction: %V8i32 = sdiv <8 x i32> undef, splat (i32 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 80 for instruction: %V16i32 = sdiv <16 x i32> undef, splat (i32 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %I16 = sdiv i16 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V2i16 = sdiv <2 x i16> undef, splat (i16 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V4i16 = sdiv <4 x i16> undef, splat (i16 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V8i16 = sdiv <8 x i16> undef, splat (i16 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 80 for instruction: %V16i16 = sdiv <16 x i16> undef, splat (i16 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 160 for instruction: %V32i16 = sdiv <32 x i16> undef, splat (i16 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %I8 = sdiv i8 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V2i8 = sdiv <2 x i8> undef, splat (i8 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V4i8 = sdiv <4 x i8> undef, splat (i8 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 40 for instruction: %V8i8 = sdiv <8 x i8> undef, splat (i8 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V16i8 = sdiv <16 x i8> undef, splat (i8 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 160 for instruction: %V32i8 = sdiv <32 x i8> undef, splat (i8 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 320 for instruction: %V64i8 = sdiv <64 x i8> undef, splat (i8 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %I128 = sdiv i128 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %I64 = sdiv i64 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 31 for instruction: %V2i64 = sdiv <2 x i64> undef, splat (i64 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 62 for instruction: %V4i64 = sdiv <4 x i64> undef, splat (i64 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 124 for instruction: %V8i64 = sdiv <8 x i64> undef, splat (i64 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %I32 = sdiv i32 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2i32 = sdiv <2 x i32> undef, splat (i32 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V4i32 = sdiv <4 x i32> undef, splat (i32 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V8i32 = sdiv <8 x i32> undef, splat (i32 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V16i32 = sdiv <16 x i32> undef, splat (i32 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %I16 = sdiv i16 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2i16 = sdiv <2 x i16> undef, splat (i16 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V4i16 = sdiv <4 x i16> undef, splat (i16 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8i16 = sdiv <8 x i16> undef, splat (i16 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V16i16 = sdiv <16 x i16> undef, splat (i16 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V32i16 = sdiv <32 x i16> undef, splat (i16 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %I8 = sdiv i8 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2i8 = sdiv <2 x i8> undef, splat (i8 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V4i8 = sdiv <4 x i8> undef, splat (i8 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8i8 = sdiv <8 x i8> undef, splat (i8 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V16i8 = sdiv <16 x i8> undef, splat (i8 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V32i8 = sdiv <32 x i8> undef, splat (i8 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V64i8 = sdiv <64 x i8> undef, splat (i8 7)
; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
;
%I128 = sdiv i128 undef, 7
@@ -354,29 +354,29 @@ define i32 @udiv_uniformconst() {
define i32 @sdiv_constpow2() {
; CHECK-LABEL: 'sdiv_constpow2'
-; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %I128 = sdiv i128 undef, 16
-; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %I64 = sdiv i64 undef, 16
-; CHECK-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %V2i64 = sdiv <2 x i64> undef, <i64 8, i64 16>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 56 for instruction: %V4i64 = sdiv <4 x i64> undef, <i64 2, i64 4, i64 8, i64 16>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 112 for instruction: %V8i64 = sdiv <8 x i64> undef, <i64 2, i64 4, i64 8, i64 16, i64 32, i64 64, i64 128, i64 256>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %I32 = sdiv i32 undef, 16
-; CHECK-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %V2i32 = sdiv <2 x i32> undef, <i32 2, i32 4>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 48 for instruction: %V4i32 = sdiv <4 x i32> undef, <i32 2, i32 4, i32 8, i32 16>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 96 for instruction: %V8i32 = sdiv <8 x i32> undef, <i32 2, i32 4, i32 8, i32 16, i32 32, i32 64, i32 128, i32 256>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 192 for instruction: %V16i32 = sdiv <16 x i32> undef, <i32 2, i32 4, i32 8, i32 16, i32 32, i32 64, i32 128, i32 256, i32 2, i32 4, i32 8, i32 16, i32 32, i32 64, i32 128, i32 256>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %I16 = sdiv i16 undef, 16
-; CHECK-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %V2i16 = sdiv <2 x i16> undef, <i16 2, i16 4>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 48 for instruction: %V4i16 = sdiv <4 x i16> undef, <i16 2, i16 4, i16 8, i16 16>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 88 for instruction: %V8i16 = sdiv <8 x i16> undef, <i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 176 for instruction: %V16i16 = sdiv <16 x i16> undef, <i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256, i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 352 for instruction: %V32i16 = sdiv <32 x i16> undef, <i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256, i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256, i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256, i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256>
-; ...
[truncated]
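The arithmetic in the new `case ISD::SDIV` branch can be replayed outside LLVM to see where the updated CHECK values come from. What follows is a minimal standalone sketch, not the LLVM TTI interface: `UnitCosts` and `constDivRemCost` are hypothetical names, and the unit costs passed in are assumptions (1 per add, shift, and mul on a legal scalar or 128-bit vector type) chosen to match the expectations in div.ll above.

```cpp
// Standalone replay of the cost formula from the patch hunk above.
// All names here are hypothetical; this is not the LLVM TTI interface.
struct UnitCosts {
  unsigned Asr; // assumed cost of an arithmetic shift right for the type
  unsigned Add; // assumed cost of an add for the type
  unsigned Mul; // assumed cost of a mul for the type
};

// IsVector and IsI32 describe the legalized type (LT.second in the patch).
unsigned constDivRemCost(const UnitCosts &C, bool IsVector, bool IsI32,
                         bool DivisorIsPow2, bool IsSRem) {
  bool HasSMUL = !DivisorIsPow2;
  unsigned NumOfSMUL = HasSMUL ? (IsVector ? 2 : 1) : 0;
  bool HasExtraAsr = (IsVector || IsI32) && HasSMUL;

  unsigned CommonCost = C.Asr + C.Add;
  // One msub for scalars; the vector form is costed as two multiplies.
  unsigned MlsOrMSubCost = (IsVector ? 2 : 1) * C.Mul;
  unsigned DivCost =
      CommonCost + C.Mul * NumOfSMUL + (HasExtraAsr ? C.Asr : 0);
  return DivCost + (IsSRem ? MlsOrMSubCost : 0);
}
```

Under those assumed unit costs, a non-power-of-2 divisor gives 3 for a scalar i64, 4 for a scalar i32, and 5 for a legal vector type, which lines up with the new expectations for `%I64`, `%I32`, and `%V4i32` in `sdiv_const`.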
davemgreen left a comment:
/*
For sdiv, typical sequence of instructions as per the type and divisor
property is as follows:
Scalar power-of-2: cmp + csel + asr
asr -> ashr or sra would be more common.
I was trying to name the assembly instruction, hence asr.
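For readers unfamiliar with the scalar power-of-2 sequence named in the quoted comment (cmp + csel + asr), the arithmetic it performs can be sketched in C++ as below. This is an illustration only: `sdivPow2` is a hypothetical helper, and the mapping of each statement to an AArch64 instruction is approximate.

```cpp
#include <cstdint>

// Signed division by 2^K that truncates toward zero, like sdiv.
// Assumes K < 63 and that >> on a negative int64_t is an arithmetic
// shift (guaranteed since C++20, implementation-defined before that).
int64_t sdivPow2(int64_t X, unsigned K) {
  // Adding 2^K - 1 to a negative value makes the shift round toward zero.
  int64_t Bias = (int64_t{1} << K) - 1;
  // Select between X and X + Bias based on the sign (the cmp + csel step).
  int64_t Adjusted = X < 0 ? X + Bias : X;
  // Arithmetic shift right (the asr step).
  return Adjusted >> K;
}
```

Without the bias, a plain arithmetic shift would round negative inputs toward negative infinity, which is why the sequence needs more than a single asr.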
InstructionCost CommonCost = AsrCost + AddCost;
// We typicall get 1 msub for scalar and 2-msub/1-mls for the vector form.
// Typically, the cost of msub is same and mls is twice as costly as
Should I read this as (the cost of msub is the same), in which case, the same as what? And (mls is twice as costly as add/sub/mul)? If so, note that mls/mla often costs the same as a mul.
ok, will change this wording
getArithmeticInstrCost(Instruction::Mul, Ty, CostKind,
                       Op1Info.getNoProps(), Op2Info.getNoProps());
bool HasSMUL = !Op2Info.isPowerOf2();
I remember from a few weeks ago that a negative power-of-2 should cost roughly the same as a power-of-2 for sdiv/srem?
Yes, I tried multiple scenarios with negative power-of-2 divisors, and they do not dramatically alter the codegen.
@sushgokh Now that 122469 is in, this should be able to test SVE costs correctly if you can give it a rebase. We can hopefully move this forward.
Thanks @davemgreen for the 122469 patch. For your udiv/urem patch, I could see a change in codegen when the constants are negative. I anticipate a similar change here as well. I will get back soon with the data and the corresponding changes.
Force-pushed from 4e2c6bd to 3be10c0.
// instruction when SVE is available.
// e.g. %1 = sdiv <vscale x 4 x i32> %a, splat (i32 8)
if (Ty->isScalableTy() && ST->hasSVE())
  return 2 * AsrCost;
What is the reason to use 2x for asrd?
asrd is almost twice as costly as asr (by costly, I am referring to latency here).
The default CostKind is TCK_RecipThroughput, not TCK_Latency. (This function currently only handles TCK_RecipThroughput; I was hoping to add at least code size soon, and it would be good to cover the others.) Unless we have a strong reason to discourage SVE generation here (which I don't think we do?), we should favour the throughput costs.
I think just having a 1x cost wouldn't be prudent, because we are comparing the cost not only against similar instructions but against other instructions as well, i.e. there is no grouping of instructions in which an instruction's cost is compared only against costs from the same group.
But if you think having 1x helps, I will do that.
The A725 has a throughput of 1 for these, as opposed to 2 for most vector operations. So there is precedent for it.
I'm not sure I understood what you meant, though. What do you mean by the groups?
> The A725 has a throughput of 1 for these, as opposed to 2 for most vector operations. So there is precedent for it.
I assume fdiv is one of the examples here, and that is OK. But for a CPU like Neoverse-v2, where the throughput is >= 1 for most instructions, recip_tput becomes approximately 1 for all of them. There is no way to differentiate how costly one instruction is relative to another.
Ideally, we would always like to know the number of cycles consumed, and this is what we look at when using tools like llvm-mca. We never calculate recip_tput. Also, in articles like this, the unit of recip_tput is cycles/instr, which is nothing but latency (under certain conditions, though).
Having cost=1 (with recip_tput as the cost metric) for most instructions is problematic, I think, for the same reason: e.g. a load from the constant pool would be costed the same as a normal mul/add, etc.
> What do you mean by the groups?
I mean some sort of equivalence groups.
E.g. a MemoryOps group consisting of load/store, where an instruction in the group is compared only within the group and then assigned a cost relative to the others in it. If two different groups need to be compared, the groups can be coalesced to produce a revised costing.
Now, this is my thinking, but there may be flaws in it.
I'm not sure I understand what you are saying. I think this should still be 1 considering the instruction that is produced, but it seems OK either way, as the cost of the SVE instruction will still be lower than the scalar / vector version. Let's go with this for the moment and we can adjust it in the future if we need to.
If you mean that you can't take two disparate recip-throughput costs, add them together and expect to come up with a sensible "reciprocal throughput", then yes, I agree that doesn't always work very well. It would be better to have a cost model that understood that some throughput costs are separate (loads/stores vs vector ops vs integer ops vs m-ops, etc.) and was able to measure throughput bottlenecks better.
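The point about disparate reciprocal-throughput costs can be made concrete with a small sketch. Everything in it is hypothetical: the `Op` struct, the unit names, and the two functions are illustrative only, and neither reflects how LLVM's cost model is structured. It contrasts a purely additive cost with one that tracks each pipeline class separately and reports the busiest one.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical op description: the pipeline class it occupies and its
// reciprocal throughput (cycles per instruction) on that class.
struct Op {
  std::string Unit;
  double RecipThroughput;
};

// What a simple additive cost model does: sum every op's cost.
double summedCost(const std::vector<Op> &Ops) {
  double Total = 0;
  for (const Op &O : Ops)
    Total += O.RecipThroughput;
  return Total;
}

// Alternative: accumulate per pipeline class and report the busiest one,
// on the assumption that different classes can issue in parallel.
double bottleneckCost(const std::vector<Op> &Ops) {
  std::map<std::string, double> PerUnit;
  for (const Op &O : Ops)
    PerUnit[O.Unit] += O.RecipThroughput;
  double Worst = 0;
  for (const auto &Entry : PerUnit)
    Worst = std::max(Worst, Entry.second);
  return Worst;
}
```

For a mix of two loads and three vector ops at a made-up 0.5 cycles each, the additive model reports 2.5 while the per-unit model reports 1.5, since only the vector pipeline is the bottleneck. The numbers are illustrative and not measurements of any core.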
// When SVE is available, we get:
// smulh + lsr + add/sub + asr + add/sub.
if (Ty->isScalableTy() && ST->hasSVE())
  return 2 * MulCost /*smulh cost*/ + 2 * AddCost + 2 * AsrCost;
Would 1 * MulCost be better for SVE?
Again, smulh is twice as costly as a normal mul (in terms of latency). So I think 2x should be the case.
I would expect SVE to be cheaper as it can use smulh. They are currently both the same cost, which is likely OK for now.
ok sure, will change this
done
davemgreen left a comment:
Thanks for doing this. It looks like a good improvement. LGTM.
Avoid TLI->getValueType() by using Ty's number of elements?
Yup, will remove this. It's a duplicate. Thanks.
done
These seem a little low - they should be higher than or equal to uniformconst, which is 8 in this case.
I will update this. I need to add extractelement/insertelement cost as you previously pointed out.
done.
This patch draws its inspiration from the udiv/urem patch llvm#122236.
For sdiv, the typical sequence of instructions, by type and divisor property, is as follows:
Scalar power-of-2: cmp + csel + asr
Neon power-of-2: usra + sshr
Scalar non-power-2: smulh/smull + asr/lsr + add/sub + asr + add
Vector non-power-2:
a) <2 x i64>: 2 * (smulh + asr + add). This yields the scalarized form.
b) <4 x i32>: smull2 + smull + uzp2 + add + sshr + usra
SVE versions should have more or less the same cost, because they sometimes yield native sdiv instructions, which should cost less, or the same sequence of Neon instructions.
For srem, the typical sequence of instructions, by type and divisor property, is as follows:
Scalar version: <set of sdiv instructions> + msub
Vector version: <set of sdiv instructions> + 2-msub/1-mls
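To make the scalar non-power-2 and srem sequences in the commit message concrete, the following is a minimal sketch of signed 32-bit division and remainder by the constant 7 using a multiply-high instead of a divide. `sdiv7` and `srem7` are hypothetical helpers, the multiplier and shift are the well-known values for this one divisor rather than anything taken from the patch, and the instruction named in each comment is only an approximate correspondence.

```cpp
#include <cstdint>

// Signed 32-bit division by 7 without a divide instruction.
// Assumes >> on negative signed values is an arithmetic shift
// (guaranteed since C++20, implementation-defined before that).
int32_t sdiv7(int32_t N) {
  // 0x92492493 interpreted as a signed 32-bit value; specific to divisor 7.
  const int64_t Magic = -1840700269;
  int64_t Product = Magic * N;                 // widening multiply (smull)
  int64_t High = Product >> 32;                // keep the high half
  // The multiplier is negative, so the numerator is added back (add).
  int32_t Q = static_cast<int32_t>(High + N);
  Q >>= 2;                                     // arithmetic shift (asr)
  // Add the sign bit so that negative quotients round toward zero (add).
  Q += static_cast<int32_t>(static_cast<uint32_t>(Q) >> 31);
  return Q;
}

// Remainder reuses the quotient and finishes with a multiply-subtract.
int32_t srem7(int32_t N) {
  return N - sdiv7(N) * 7;                     // msub
}
```

The remainder costs exactly one extra multiply-subtract on top of the division, which is the `MlsOrMSubCost` term the patch adds for `ISD::SREM`.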
Force-pushed from ea1bf1e to 0829f8c.
LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/81/builds/5060 Here is the relevant piece of the build log for the reference