[LV] Add on extra cost for scalarising math calls in vector loops #158611
When vectorising loops using -ffast-math with math calls such as
this:
  for (int i = 0; i < n; i++) {
    dst[i] += expf(src[i] * src[i]);
  }
if there is no available vector variant of expf then we ask for
the cost of scalarising the vector math intrinsic, e.g.
  llvm.exp.v4f32
For AArch64, this turns out to be extremely expensive in many
cases because the surrounding vector code causes a lot of
spilling and filling of vector registers due to the particular
ABI of the math routine. In addition, the more vector work
performed in the loop the more registers we are likely to
spill, meaning that the cost can scale up with the size of
the loop.
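To make "scalarising" concrete, here is a rough C++ sketch of what a scalarised 4-lane call amounts to: each lane is extracted, passed to the scalar routine, and the result is inserted back. The `Vec4` type and the `scalarisedExp` name are hypothetical stand-ins for what the vectoriser emits in IR; this is not code from the patch.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for a <4 x float> vector value.
using Vec4 = std::array<float, 4>;

// Scalarised form of a 4-lane exp: one scalar libm call per lane.
// Per the description above, vector registers that are live across each
// call may need to be spilled and refilled, which this model does not show.
Vec4 scalarisedExp(const Vec4 &In) {
  Vec4 Out{};
  for (std::size_t Lane = 0; Lane < In.size(); ++Lane) {
    float X = In[Lane];    // extract element
    float R = std::exp(X); // scalar call
    Out[Lane] = R;         // insert element
  }
  return Out;
}
```

The per-lane call count grows with the vectorisation factor, and the spill traffic around each call is what the extra cost is meant to account for.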
This PR attempts to solve the problem described above by
introducing a new getCallScalarizationOverhead TTI hook that
returns a very large cost for AArch64, which can be controlled
by a new backend flag -call-scalarization-cost-multiplier. This
patch is also required for follow-on work that will reduce the
cost of 128-bit masked loads and stores when SVE is available,
since lowering those costs leads to us making poor vectorisation
choices when loops contain math calls.
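A minimal sketch of the extra cost arithmetic, assuming the fixed-width path shown in the AArch64 part of the diff (number of elements times the multiplier, default 10). The function name `extraCallScalarizationCost` is hypothetical, and the invalid cost returned for scalable vectorisation factors is modelled here with `std::nullopt` rather than `InstructionCost::getInvalid()`.

```cpp
#include <optional>

// Hypothetical model of the added overhead: fixed-width vectors pay
// NumElts * Multiplier; scalable vectors have no valid scalarised cost.
std::optional<unsigned> extraCallScalarizationCost(unsigned NumElts,
                                                   bool Scalable,
                                                   unsigned Multiplier = 10) {
  if (Scalable)
    return std::nullopt;
  return NumElts * Multiplier;
}
```

This matches the updated expectations in `sincos.ll`: the `<4 x float>` sincos cost moves from 52 to 52 + 4 * 10 = 92 with the default multiplier, and to 52 + 4 * 1 = 56 when the test passes `-call-scalarization-cost-multiplier=1`.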
@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-llvm-transforms
Author: David Sherwood (david-arm)
Patch is 43.11 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/158611.diff
13 Files Affected:
 diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index a6f4e51e258ab..e1adf36940ac6 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -968,6 +968,9 @@ class TargetTransformInfo {
   LLVM_ABI InstructionCost getOperandsScalarizationOverhead(
       ArrayRef<Type *> Tys, TTI::TargetCostKind CostKind) const;
 
+  LLVM_ABI InstructionCost getCallScalarizationOverhead(CallInst *CI,
+                                                        ElementCount VF) const;
+
   /// If target has efficient vector element load/store instructions, it can
   /// return true here so that insertion/extraction costs are not added to
   /// the scalarization cost of a load/store.
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 566e1cf51631a..f7c5080d49266 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -464,6 +464,11 @@ class TargetTransformInfoImplBase {
     return 0;
   }
 
+  virtual InstructionCost getCallScalarizationOverhead(CallInst *CI,
+                                                       ElementCount VF) const {
+    return 0;
+  }
+
   virtual bool supportsEfficientVectorElementLoadStore() const { return false; }
 
   virtual bool supportsTailCalls() const { return true; }
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index dce423fc1b18b..3dd9fa5f97995 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -304,30 +304,14 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
       const IntrinsicCostAttributes &ICA, TTI::TargetCostKind CostKind,
       RTLIB::Libcall LC,
       std::optional<unsigned> CallRetElementIndex = {}) const {
-    Type *RetTy = ICA.getReturnType();
-    // Vector variants of the intrinsic can be mapped to a vector library call.
-    auto const *LibInfo = ICA.getLibInfo();
-    if (!LibInfo || !isa<StructType>(RetTy) ||
-        !isVectorizedStructTy(cast<StructType>(RetTy)))
-      return std::nullopt;
-
-    // Find associated libcall.
-    const char *LCName = getTLI()->getLibcallName(LC);
-    if (!LCName)
-      return std::nullopt;
-
-    // Search for a corresponding vector variant.
-    LLVMContext &Ctx = RetTy->getContext();
-    ElementCount VF = getVectorizedTypeVF(RetTy);
-    VecDesc const *VD = nullptr;
-    for (bool Masked : {false, true}) {
-      if ((VD = LibInfo->getVectorMappingInfo(LCName, VF, Masked)))
-        break;
-    }
+    VecDesc const *VD = getMultipleResultIntrinsicVectorLibCallDesc(ICA, LC);
     if (!VD)
       return std::nullopt;
 
     // Cost the call + mask.
+    Type *RetTy = ICA.getReturnType();
+    ElementCount VF = getVectorizedTypeVF(RetTy);
+    LLVMContext &Ctx = RetTy->getContext();
     auto Cost =
         thisT()->getCallInstrCost(nullptr, RetTy, ICA.getArgTypes(), CostKind);
     if (VD->isMasked()) {
@@ -371,6 +355,30 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
   using TargetTransformInfoImplBase::DL;
 
 public:
+  VecDesc const *getMultipleResultIntrinsicVectorLibCallDesc(
+      const IntrinsicCostAttributes &ICA, RTLIB::Libcall LC) const {
+    Type *RetTy = ICA.getReturnType();
+    // Vector variants of the intrinsic can be mapped to a vector library call.
+    auto const *LibInfo = ICA.getLibInfo();
+    if (!LibInfo || !isa<StructType>(RetTy) ||
+        !isVectorizedStructTy(cast<StructType>(RetTy)))
+      return nullptr;
+
+    // Find associated libcall.
+    const char *LCName = getTLI()->getLibcallName(LC);
+    if (!LCName)
+      return nullptr;
+
+    // Search for a corresponding vector variant.
+    ElementCount VF = getVectorizedTypeVF(RetTy);
+    VecDesc const *VD = nullptr;
+    for (bool Masked : {false, true}) {
+      if ((VD = LibInfo->getVectorMappingInfo(LCName, VF, Masked)))
+        break;
+    }
+    return VD;
+  }
+
   /// \name Scalar TTI Implementations
   /// @{
   bool allowsMisalignedMemoryAccesses(LLVMContext &Context, unsigned BitWidth,
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 09b50c5270e57..045616c8839e8 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -641,6 +641,12 @@ InstructionCost TargetTransformInfo::getOperandsScalarizationOverhead(
   return TTIImpl->getOperandsScalarizationOverhead(Tys, CostKind);
 }
 
+InstructionCost
+TargetTransformInfo::getCallScalarizationOverhead(CallInst *CI,
+                                                  ElementCount VF) const {
+  return TTIImpl->getCallScalarizationOverhead(CI, VF);
+}
+
 bool TargetTransformInfo::supportsEfficientVectorElementLoadStore() const {
   return TTIImpl->supportsEfficientVectorElementLoadStore();
 }
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 92321a76dbd80..befaa1b68d4b7 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -46,8 +46,13 @@ static cl::opt<unsigned> SVEGatherOverhead("sve-gather-overhead", cl::init(10),
 static cl::opt<unsigned> SVEScatterOverhead("sve-scatter-overhead",
                                             cl::init(10), cl::Hidden);
 
-static cl::opt<unsigned> SVETailFoldInsnThreshold("sve-tail-folding-insn-threshold",
-                                                  cl::init(15), cl::Hidden);
+static cl::opt<unsigned>
+    SVETailFoldInsnThreshold("sve-tail-folding-insn-threshold", cl::init(15),
+                             cl::Hidden);
+
+static cl::opt<unsigned>
+    CallScalarizationCostMultiplier("call-scalarization-cost-multiplier",
+                                    cl::init(10), cl::Hidden);
 
 static cl::opt<unsigned>
     NeonNonConstStrideOverhead("neon-nonconst-stride-overhead", cl::init(10),
@@ -594,6 +599,12 @@ static InstructionCost getHistogramCost(const AArch64Subtarget *ST,
   return InstructionCost::getInvalid();
 }
 
+static InstructionCost getCallScalarizationCost(ElementCount VF) {
+  if (VF.isScalable())
+    return InstructionCost::getInvalid();
+  return VF.getFixedValue() * CallScalarizationCostMultiplier;
+}
+
 InstructionCost
 AArch64TTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
                                       TTI::TargetCostKind CostKind) const {
@@ -606,6 +617,7 @@ AArch64TTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
     if (VTy->getElementCount() == ElementCount::getScalable(1))
       return InstructionCost::getInvalid();
 
+  InstructionCost BaseCost = 0;
   switch (ICA.getID()) {
   case Intrinsic::experimental_vector_histogram_add: {
     InstructionCost HistCost = getHistogramCost(ST, ICA);
@@ -1004,10 +1016,44 @@ AArch64TTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
     }
     break;
   }
+  case Intrinsic::asin:
+  case Intrinsic::acos:
+  case Intrinsic::atan:
+  case Intrinsic::atan2:
+  case Intrinsic::sin:
+  case Intrinsic::cos:
+  case Intrinsic::tan:
+  case Intrinsic::sinh:
+  case Intrinsic::cosh:
+  case Intrinsic::tanh:
+  case Intrinsic::pow:
+  case Intrinsic::exp:
+  case Intrinsic::exp10:
+  case Intrinsic::exp2:
+  case Intrinsic::log:
+  case Intrinsic::log10:
+  case Intrinsic::log2: {
+    if (auto *FixedVTy = dyn_cast<FixedVectorType>(RetTy))
+      BaseCost = getCallScalarizationCost(FixedVTy->getElementCount());
+    break;
+  }
+  case Intrinsic::sincos:
+  case Intrinsic::sincospi: {
+    Type *FirstRetTy = getContainedTypes(RetTy).front();
+    if (auto *FixedVTy = dyn_cast<FixedVectorType>(FirstRetTy)) {
+      EVT ScalarVT = getTLI()->getValueType(DL, FirstRetTy).getScalarType();
+      RTLIB::Libcall LC = ICA.getID() == Intrinsic::sincos
+                              ? RTLIB::getSINCOS(ScalarVT)
+                              : RTLIB::getSINCOSPI(ScalarVT);
+      if (!getMultipleResultIntrinsicVectorLibCallDesc(ICA, LC))
+        BaseCost = getCallScalarizationCost(FixedVTy->getElementCount());
+    }
+    break;
+  }
   default:
     break;
   }
-  return BaseT::getIntrinsicInstrCost(ICA, CostKind);
+  return BaseCost + BaseT::getIntrinsicInstrCost(ICA, CostKind);
 }
 
 /// The function will remove redundant reinterprets casting in the presence
@@ -4045,6 +4091,12 @@ InstructionCost AArch64TTIImpl::getScalarizationOverhead(
   return DemandedElts.popcount() * (Insert + Extract) * VecInstCost;
 }
 
+InstructionCost
+AArch64TTIImpl::getCallScalarizationOverhead(CallInst *CI,
+                                             ElementCount VF) const {
+  return getCallScalarizationCost(VF);
+}
+
 std::optional<InstructionCost> AArch64TTIImpl::getFP16BF16PromoteCost(
     Type *Ty, TTI::TargetCostKind CostKind, TTI::OperandValueInfo Op1Info,
     TTI::OperandValueInfo Op2Info, bool IncludeTrunc,
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index fe2e849258e3f..aadd3c28d7b65 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -479,6 +479,9 @@ class AArch64TTIImpl final : public BasicTTIImplBase<AArch64TTIImpl> {
       TTI::TargetCostKind CostKind, bool ForPoisonSrc = true,
       ArrayRef<Value *> VL = {}) const override;
 
+  InstructionCost getCallScalarizationOverhead(CallInst *CI,
+                                               ElementCount VF) const override;
+
   /// Return the cost of the scaling factor used in the addressing
   /// mode represented by AM for this target, for a load/store
   /// of the specified type.
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index c04b5cb10eac2..7d4e98b3be746 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5775,6 +5775,7 @@ void LoopVectorizationCostModel::setVectorizedCallDecision(ElementCount VF) {
         // Compute costs of unpacking argument values for the scalar calls and
         // packing the return values to a vector.
         InstructionCost ScalarizationCost = getScalarizationOverhead(CI, VF);
+        ScalarizationCost += TTI.getCallScalarizationOverhead(CI, VF);
         ScalarCost = ScalarCallCost * VF.getKnownMinValue() + ScalarizationCost;
       } else {
         // There is no point attempting to calculate the scalar cost for a
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index bf51489543098..93069536416bb 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -3159,6 +3159,8 @@ InstructionCost VPReplicateRecipe::computeCost(ElementCount VF,
               /*Extract=*/false, Ctx.CostKind);
         }
       }
+      ScalarizationCost +=
+          Ctx.TTI.getCallScalarizationOverhead(cast<CallInst>(UI), VF);
       // Skip operands that do not require extraction/scalarization and do not
       // incur any overhead.
       SmallPtrSet<const VPValue *, 4> UniqueOperands;
diff --git a/llvm/test/Analysis/CostModel/AArch64/sincos.ll b/llvm/test/Analysis/CostModel/AArch64/sincos.ll
index 32408acb582d0..f11c4c84eeb45 100644
--- a/llvm/test/Analysis/CostModel/AArch64/sincos.ll
+++ b/llvm/test/Analysis/CostModel/AArch64/sincos.ll
@@ -1,5 +1,7 @@
 ; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --filter "sincos"
 ; RUN: opt < %s -mtriple=aarch64-gnu-linux -mattr=+neon,+sve -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s
+; RUN: opt < %s -mtriple=aarch64-gnu-linux -mattr=+neon,+sve -call-scalarization-cost-multiplier=1 -passes="print<cost-model>" \
+; RUN:   -cost-kind=throughput 2>&1 -disable-output | FileCheck --check-prefix=CHECK-LOW-SCALARIZATION-COST %s
 ; RUN: opt < %s -mtriple=aarch64-gnu-linux -mattr=+neon,+sve -vector-library=ArmPL -passes="print<cost-model>" -intrinsic-cost-strategy=intrinsic-cost -cost-kind=throughput 2>&1 -disable-output | FileCheck %s -check-prefix=CHECK-VECLIB
 
 define void @sincos() {
@@ -8,31 +10,43 @@ define void @sincos() {
 ; CHECK:  Cost Model: Found an estimated cost of 10 for instruction: %f32 = call { float, float } @llvm.sincos.f32(float poison)
 ; CHECK:  Cost Model: Found an estimated cost of 10 for instruction: %f64 = call { double, double } @llvm.sincos.f64(double poison)
 ; CHECK:  Cost Model: Found an estimated cost of 10 for instruction: %f128 = call { fp128, fp128 } @llvm.sincos.f128(fp128 poison)
-;
-; CHECK:  Cost Model: Found an estimated cost of 36 for instruction: %v8f16 = call { <8 x half>, <8 x half> } @llvm.sincos.v8f16(<8 x half> poison)
-; CHECK:  Cost Model: Found an estimated cost of 52 for instruction: %v4f32 = call { <4 x float>, <4 x float> } @llvm.sincos.v4f32(<4 x float> poison)
-; CHECK:  Cost Model: Found an estimated cost of 24 for instruction: %v2f64 = call { <2 x double>, <2 x double> } @llvm.sincos.v2f64(<2 x double> poison)
-; CHECK:  Cost Model: Found an estimated cost of 10 for instruction: %v1f128 = call { <1 x fp128>, <1 x fp128> } @llvm.sincos.v1f128(<1 x fp128> poison)
-; CHECK:  Cost Model: Found an estimated cost of 104 for instruction: %v8f32 = call { <8 x float>, <8 x float> } @llvm.sincos.v8f32(<8 x float> poison)
-;
+; CHECK:  Cost Model: Found an estimated cost of 116 for instruction: %v8f16 = call { <8 x half>, <8 x half> } @llvm.sincos.v8f16(<8 x half> poison)
+; CHECK:  Cost Model: Found an estimated cost of 92 for instruction: %v4f32 = call { <4 x float>, <4 x float> } @llvm.sincos.v4f32(<4 x float> poison)
+; CHECK:  Cost Model: Found an estimated cost of 44 for instruction: %v2f64 = call { <2 x double>, <2 x double> } @llvm.sincos.v2f64(<2 x double> poison)
+; CHECK:  Cost Model: Found an estimated cost of 20 for instruction: %v1f128 = call { <1 x fp128>, <1 x fp128> } @llvm.sincos.v1f128(<1 x fp128> poison)
+; CHECK:  Cost Model: Found an estimated cost of 184 for instruction: %v8f32 = call { <8 x float>, <8 x float> } @llvm.sincos.v8f32(<8 x float> poison)
 ; CHECK:  Cost Model: Invalid cost for instruction: %nxv8f16 = call { <vscale x 8 x half>, <vscale x 8 x half> } @llvm.sincos.nxv8f16(<vscale x 8 x half> poison)
 ; CHECK:  Cost Model: Invalid cost for instruction: %nxv4f32 = call { <vscale x 4 x float>, <vscale x 4 x float> } @llvm.sincos.nxv4f32(<vscale x 4 x float> poison)
 ; CHECK:  Cost Model: Invalid cost for instruction: %nxv2f64 = call { <vscale x 2 x double>, <vscale x 2 x double> } @llvm.sincos.nxv2f64(<vscale x 2 x double> poison)
 ; CHECK:  Cost Model: Invalid cost for instruction: %nxv1f128 = call { <vscale x 1 x fp128>, <vscale x 1 x fp128> } @llvm.sincos.nxv1f128(<vscale x 1 x fp128> poison)
 ; CHECK:  Cost Model: Invalid cost for instruction: %nxv8f32 = call { <vscale x 8 x float>, <vscale x 8 x float> } @llvm.sincos.nxv8f32(<vscale x 8 x float> poison)
 ;
+; CHECK-LOW-SCALARIZATION-COST-LABEL: 'sincos'
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 1 for instruction: %f16 = call { half, half } @llvm.sincos.f16(half poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 10 for instruction: %f32 = call { float, float } @llvm.sincos.f32(float poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 10 for instruction: %f64 = call { double, double } @llvm.sincos.f64(double poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 10 for instruction: %f128 = call { fp128, fp128 } @llvm.sincos.f128(fp128 poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 44 for instruction: %v8f16 = call { <8 x half>, <8 x half> } @llvm.sincos.v8f16(<8 x half> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 56 for instruction: %v4f32 = call { <4 x float>, <4 x float> } @llvm.sincos.v4f32(<4 x float> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 26 for instruction: %v2f64 = call { <2 x double>, <2 x double> } @llvm.sincos.v2f64(<2 x double> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 11 for instruction: %v1f128 = call { <1 x fp128>, <1 x fp128> } @llvm.sincos.v1f128(<1 x fp128> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 112 for instruction: %v8f32 = call { <8 x float>, <8 x float> } @llvm.sincos.v8f32(<8 x float> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Invalid cost for instruction: %nxv8f16 = call { <vscale x 8 x half>, <vscale x 8 x half> } @llvm.sincos.nxv8f16(<vscale x 8 x half> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Invalid cost for instruction: %nxv4f32 = call { <vscale x 4 x float>, <vscale x 4 x float> } @llvm.sincos.nxv4f32(<vscale x 4 x float> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Invalid cost for instruction: %nxv2f64 = call { <vscale x 2 x double>, <vscale x 2 x double> } @llvm.sincos.nxv2f64(<vscale x 2 x double> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Invalid cost for instruction: %nxv1f128 = call { <vscale x 1 x fp128>, <vscale x 1 x fp128> } @llvm.sincos.nxv1f128(<vscale x 1 x fp128> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Invalid cost for instruction: %nxv8f32 = call { <vscale x 8 x float>, <vscale x 8 x float> } @llvm.sincos.nxv8f32(<vscale x 8 x float> poison)
+;
 ; CHECK-VECLIB-LABEL: 'sincos'
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 1 for instruction: %f16 = call { half, half } @llvm.sincos.f16(half poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 10 for instruction: %f32 = call { float, float } @llvm.sincos.f32(float poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 10 for instruction: %f64 = call { double, double } @llvm.sincos.f64(double poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 10 for instruction: %f128 = call { fp128, fp128 } @llvm.sincos.f128(fp128 poison)
-;
-; CHECK-VECLIB:  Cost Model: Found an estimated cost of 36 for instruction: %v8f16 = call { <8 x half>, <8 x half> } @llvm.sincos.v8f16(<8 x half> poison)
+; CHECK-VECLIB:  Cost Model: Found an estimated cost of 116 for instruction: %v8f16 = call { <8 x half>, <8 x half> } @llvm.sincos.v8f16(<8 x half> poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 12 for instruction: %v4f32 = call { <4 x float>, <4 x float> } @llvm.sincos.v4f32(<4 x float> poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 12 for instruction: %v2f64 = call { <2 x double>, <2 x double> } @llvm.sincos.v2f64(<2 x double> poison)
-; CHECK-VECLIB:  Cost Model: Found an estimated cost of 10 for instruction: %v1f128 = call { <1 x fp128>, <1 x fp128> } @llvm.sincos.v1f128(<1 x fp128> poison)
-; CHECK-VECLIB:  Cost Model: Found an estimated cost of 104 for instruction: %v8f32 = call { <8 x float>, <8 x float> } @llvm.sincos.v8f32(<8 x float> poison)
-;
+; CHECK-VECLIB:  Cost Model: Found an estimated cost of 20 for instruction: %v1f128 = call { <1 x fp128>, <1 x fp128> } @llvm.sincos.v1f128(<1 x fp128> poison)
+; CHECK-VECLIB:  Cost Model: Found an estimated cost of 184 for instruction: %v8f32 = call { <8 x float>, <8 x float> } @llvm.sincos.v8f32(<8 x float> poison)
 ; CHECK-VECLIB:  Cost Model: Invalid cost for instruction: %nxv8f16 = call { <vscale x 8 x half>, <vscale x 8 x half> } @llvm.sincos.nxv8f16(<vscale x 8 x half> poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 13 for instruction: %nxv4f32 = call { <vscale x 4 x float>, <vscale x 4 x float> } @llvm.sincos.nxv4f32(<vscale x 4 x float> poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 13 for instruction: %nxv2f64 = call { <vscale x 2 x double>, <vscale x 2 x double> } @llvm.sincos.nxv2f64(<vscale x 2 x double> poison)
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/multiple-result-intrinsics.ll b/llvm/test/Transforms/LoopVectorize/AArch64/multiple-result-intrinsics.ll
index 544ef5c82c7ac..1dda7c2826b67 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/multiple-result-intrinsics.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64...
[truncated]
index fe2e849258e3f..aadd3c28d7b65 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -479,6 +479,9 @@ class AArch64TTIImpl final : public BasicTTIImplBase<AArch64TTIImpl> {
       TTI::TargetCostKind CostKind, bool ForPoisonSrc = true,
       ArrayRef<Value *> VL = {}) const override;
 
+  InstructionCost getCallScalarizationOverhead(CallInst *CI,
+                                               ElementCount VF) const override;
+
   /// Return the cost of the scaling factor used in the addressing
   /// mode represented by AM for this target, for a load/store
   /// of the specified type.
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index c04b5cb10eac2..7d4e98b3be746 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5775,6 +5775,7 @@ void LoopVectorizationCostModel::setVectorizedCallDecision(ElementCount VF) {
         // Compute costs of unpacking argument values for the scalar calls and
         // packing the return values to a vector.
         InstructionCost ScalarizationCost = getScalarizationOverhead(CI, VF);
+        ScalarizationCost += TTI.getCallScalarizationOverhead(CI, VF);
         ScalarCost = ScalarCallCost * VF.getKnownMinValue() + ScalarizationCost;
       } else {
         // There is no point attempting to calculate the scalar cost for a
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index bf51489543098..93069536416bb 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -3159,6 +3159,8 @@ InstructionCost VPReplicateRecipe::computeCost(ElementCount VF,
               /*Extract=*/false, Ctx.CostKind);
         }
       }
+      ScalarizationCost +=
+          Ctx.TTI.getCallScalarizationOverhead(cast<CallInst>(UI), VF);
       // Skip operands that do not require extraction/scalarization and do not
       // incur any overhead.
       SmallPtrSet<const VPValue *, 4> UniqueOperands;
diff --git a/llvm/test/Analysis/CostModel/AArch64/sincos.ll b/llvm/test/Analysis/CostModel/AArch64/sincos.ll
index 32408acb582d0..f11c4c84eeb45 100644
--- a/llvm/test/Analysis/CostModel/AArch64/sincos.ll
+++ b/llvm/test/Analysis/CostModel/AArch64/sincos.ll
@@ -1,5 +1,7 @@
 ; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --filter "sincos"
 ; RUN: opt < %s -mtriple=aarch64-gnu-linux -mattr=+neon,+sve -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s
+; RUN: opt < %s -mtriple=aarch64-gnu-linux -mattr=+neon,+sve -call-scalarization-cost-multiplier=1 -passes="print<cost-model>" \
+; RUN:   -cost-kind=throughput 2>&1 -disable-output | FileCheck --check-prefix=CHECK-LOW-SCALARIZATION-COST %s
 ; RUN: opt < %s -mtriple=aarch64-gnu-linux -mattr=+neon,+sve -vector-library=ArmPL -passes="print<cost-model>" -intrinsic-cost-strategy=intrinsic-cost -cost-kind=throughput 2>&1 -disable-output | FileCheck %s -check-prefix=CHECK-VECLIB
 
 define void @sincos() {
@@ -8,31 +10,43 @@ define void @sincos() {
 ; CHECK:  Cost Model: Found an estimated cost of 10 for instruction: %f32 = call { float, float } @llvm.sincos.f32(float poison)
 ; CHECK:  Cost Model: Found an estimated cost of 10 for instruction: %f64 = call { double, double } @llvm.sincos.f64(double poison)
 ; CHECK:  Cost Model: Found an estimated cost of 10 for instruction: %f128 = call { fp128, fp128 } @llvm.sincos.f128(fp128 poison)
-;
-; CHECK:  Cost Model: Found an estimated cost of 36 for instruction: %v8f16 = call { <8 x half>, <8 x half> } @llvm.sincos.v8f16(<8 x half> poison)
-; CHECK:  Cost Model: Found an estimated cost of 52 for instruction: %v4f32 = call { <4 x float>, <4 x float> } @llvm.sincos.v4f32(<4 x float> poison)
-; CHECK:  Cost Model: Found an estimated cost of 24 for instruction: %v2f64 = call { <2 x double>, <2 x double> } @llvm.sincos.v2f64(<2 x double> poison)
-; CHECK:  Cost Model: Found an estimated cost of 10 for instruction: %v1f128 = call { <1 x fp128>, <1 x fp128> } @llvm.sincos.v1f128(<1 x fp128> poison)
-; CHECK:  Cost Model: Found an estimated cost of 104 for instruction: %v8f32 = call { <8 x float>, <8 x float> } @llvm.sincos.v8f32(<8 x float> poison)
-;
+; CHECK:  Cost Model: Found an estimated cost of 116 for instruction: %v8f16 = call { <8 x half>, <8 x half> } @llvm.sincos.v8f16(<8 x half> poison)
+; CHECK:  Cost Model: Found an estimated cost of 92 for instruction: %v4f32 = call { <4 x float>, <4 x float> } @llvm.sincos.v4f32(<4 x float> poison)
+; CHECK:  Cost Model: Found an estimated cost of 44 for instruction: %v2f64 = call { <2 x double>, <2 x double> } @llvm.sincos.v2f64(<2 x double> poison)
+; CHECK:  Cost Model: Found an estimated cost of 20 for instruction: %v1f128 = call { <1 x fp128>, <1 x fp128> } @llvm.sincos.v1f128(<1 x fp128> poison)
+; CHECK:  Cost Model: Found an estimated cost of 184 for instruction: %v8f32 = call { <8 x float>, <8 x float> } @llvm.sincos.v8f32(<8 x float> poison)
 ; CHECK:  Cost Model: Invalid cost for instruction: %nxv8f16 = call { <vscale x 8 x half>, <vscale x 8 x half> } @llvm.sincos.nxv8f16(<vscale x 8 x half> poison)
 ; CHECK:  Cost Model: Invalid cost for instruction: %nxv4f32 = call { <vscale x 4 x float>, <vscale x 4 x float> } @llvm.sincos.nxv4f32(<vscale x 4 x float> poison)
 ; CHECK:  Cost Model: Invalid cost for instruction: %nxv2f64 = call { <vscale x 2 x double>, <vscale x 2 x double> } @llvm.sincos.nxv2f64(<vscale x 2 x double> poison)
 ; CHECK:  Cost Model: Invalid cost for instruction: %nxv1f128 = call { <vscale x 1 x fp128>, <vscale x 1 x fp128> } @llvm.sincos.nxv1f128(<vscale x 1 x fp128> poison)
 ; CHECK:  Cost Model: Invalid cost for instruction: %nxv8f32 = call { <vscale x 8 x float>, <vscale x 8 x float> } @llvm.sincos.nxv8f32(<vscale x 8 x float> poison)
 ;
+; CHECK-LOW-SCALARIZATION-COST-LABEL: 'sincos'
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 1 for instruction: %f16 = call { half, half } @llvm.sincos.f16(half poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 10 for instruction: %f32 = call { float, float } @llvm.sincos.f32(float poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 10 for instruction: %f64 = call { double, double } @llvm.sincos.f64(double poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 10 for instruction: %f128 = call { fp128, fp128 } @llvm.sincos.f128(fp128 poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 44 for instruction: %v8f16 = call { <8 x half>, <8 x half> } @llvm.sincos.v8f16(<8 x half> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 56 for instruction: %v4f32 = call { <4 x float>, <4 x float> } @llvm.sincos.v4f32(<4 x float> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 26 for instruction: %v2f64 = call { <2 x double>, <2 x double> } @llvm.sincos.v2f64(<2 x double> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 11 for instruction: %v1f128 = call { <1 x fp128>, <1 x fp128> } @llvm.sincos.v1f128(<1 x fp128> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Found an estimated cost of 112 for instruction: %v8f32 = call { <8 x float>, <8 x float> } @llvm.sincos.v8f32(<8 x float> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Invalid cost for instruction: %nxv8f16 = call { <vscale x 8 x half>, <vscale x 8 x half> } @llvm.sincos.nxv8f16(<vscale x 8 x half> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Invalid cost for instruction: %nxv4f32 = call { <vscale x 4 x float>, <vscale x 4 x float> } @llvm.sincos.nxv4f32(<vscale x 4 x float> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Invalid cost for instruction: %nxv2f64 = call { <vscale x 2 x double>, <vscale x 2 x double> } @llvm.sincos.nxv2f64(<vscale x 2 x double> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Invalid cost for instruction: %nxv1f128 = call { <vscale x 1 x fp128>, <vscale x 1 x fp128> } @llvm.sincos.nxv1f128(<vscale x 1 x fp128> poison)
+; CHECK-LOW-SCALARIZATION-COST:  Cost Model: Invalid cost for instruction: %nxv8f32 = call { <vscale x 8 x float>, <vscale x 8 x float> } @llvm.sincos.nxv8f32(<vscale x 8 x float> poison)
+;
 ; CHECK-VECLIB-LABEL: 'sincos'
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 1 for instruction: %f16 = call { half, half } @llvm.sincos.f16(half poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 10 for instruction: %f32 = call { float, float } @llvm.sincos.f32(float poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 10 for instruction: %f64 = call { double, double } @llvm.sincos.f64(double poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 10 for instruction: %f128 = call { fp128, fp128 } @llvm.sincos.f128(fp128 poison)
-;
-; CHECK-VECLIB:  Cost Model: Found an estimated cost of 36 for instruction: %v8f16 = call { <8 x half>, <8 x half> } @llvm.sincos.v8f16(<8 x half> poison)
+; CHECK-VECLIB:  Cost Model: Found an estimated cost of 116 for instruction: %v8f16 = call { <8 x half>, <8 x half> } @llvm.sincos.v8f16(<8 x half> poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 12 for instruction: %v4f32 = call { <4 x float>, <4 x float> } @llvm.sincos.v4f32(<4 x float> poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 12 for instruction: %v2f64 = call { <2 x double>, <2 x double> } @llvm.sincos.v2f64(<2 x double> poison)
-; CHECK-VECLIB:  Cost Model: Found an estimated cost of 10 for instruction: %v1f128 = call { <1 x fp128>, <1 x fp128> } @llvm.sincos.v1f128(<1 x fp128> poison)
-; CHECK-VECLIB:  Cost Model: Found an estimated cost of 104 for instruction: %v8f32 = call { <8 x float>, <8 x float> } @llvm.sincos.v8f32(<8 x float> poison)
-;
+; CHECK-VECLIB:  Cost Model: Found an estimated cost of 20 for instruction: %v1f128 = call { <1 x fp128>, <1 x fp128> } @llvm.sincos.v1f128(<1 x fp128> poison)
+; CHECK-VECLIB:  Cost Model: Found an estimated cost of 184 for instruction: %v8f32 = call { <8 x float>, <8 x float> } @llvm.sincos.v8f32(<8 x float> poison)
 ; CHECK-VECLIB:  Cost Model: Invalid cost for instruction: %nxv8f16 = call { <vscale x 8 x half>, <vscale x 8 x half> } @llvm.sincos.nxv8f16(<vscale x 8 x half> poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 13 for instruction: %nxv4f32 = call { <vscale x 4 x float>, <vscale x 4 x float> } @llvm.sincos.nxv4f32(<vscale x 4 x float> poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 13 for instruction: %nxv2f64 = call { <vscale x 2 x double>, <vscale x 2 x double> } @llvm.sincos.nxv2f64(<vscale x 2 x double> poison)
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/multiple-result-intrinsics.ll b/llvm/test/Transforms/LoopVectorize/AArch64/multiple-result-intrinsics.ll
index 544ef5c82c7ac..1dda7c2826b67 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/multiple-result-intrinsics.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64...
[truncated]
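
For reference, the updated `sincos.ll` expectations above are consistent with the new overhead being `VF * call-scalarization-cost-multiplier` added to the previous cost whenever no vector library mapping exists. The sketch below is a minimal standalone model of that arithmetic, not the LLVM code itself: `callScalarizationCost`, `costWithOverhead` and `matchesSincosChecks` are hypothetical names, `std::optional` stands in for an invalid `InstructionCost`, and the "old" costs are taken from the removed CHECK lines.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the patch's static getCallScalarizationCost().
// std::nullopt models InstructionCost::getInvalid() for scalable VFs.
std::optional<unsigned> callScalarizationCost(unsigned NumElts, bool Scalable,
                                              unsigned Multiplier = 10) {
  if (Scalable)
    return std::nullopt;
  return NumElts * Multiplier;
}

// Models `BaseCost + BaseT::getIntrinsicInstrCost(...)` for a fixed-width
// vector with no vector library mapping. OldCost is the pre-patch cost.
unsigned costWithOverhead(unsigned OldCost, unsigned NumElts,
                          unsigned Multiplier = 10) {
  return OldCost +
         *callScalarizationCost(NumElts, /*Scalable=*/false, Multiplier);
}

// Returns true if the model reproduces the updated sincos.ll CHECK lines
// for both the default multiplier (10) and the low multiplier (1).
bool matchesSincosChecks() {
  struct Row {
    unsigned OldCost, NumElts, DefaultCost, LowCost;
  };
  const Row Rows[] = {
      {36, 8, 116, 44},   // v8f16
      {52, 4, 92, 56},    // v4f32
      {24, 2, 44, 26},    // v2f64
      {10, 1, 20, 11},    // v1f128
      {104, 8, 184, 112}, // v8f32
  };
  for (const Row &R : Rows) {
    if (costWithOverhead(R.OldCost, R.NumElts) != R.DefaultCost)
      return false;
    if (costWithOverhead(R.OldCost, R.NumElts, /*Multiplier=*/1) != R.LowCost)
      return false;
  }
  return true;
}
```

The CHECK-VECLIB lines for `v4f32` and `v2f64` stay at 12 because a vector library mapping is found for them, so no overhead is added; `v8f16`, `v1f128` and `v8f32` have no mapping and pick up the extra cost.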
 ; CHECK-VECLIB:  Cost Model: Invalid cost for instruction: %nxv8f16 = call { <vscale x 8 x half>, <vscale x 8 x half> } @llvm.sincos.nxv8f16(<vscale x 8 x half> poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 13 for instruction: %nxv4f32 = call { <vscale x 4 x float>, <vscale x 4 x float> } @llvm.sincos.nxv4f32(<vscale x 4 x float> poison)
 ; CHECK-VECLIB:  Cost Model: Found an estimated cost of 13 for instruction: %nxv2f64 = call { <vscale x 2 x double>, <vscale x 2 x double> } @llvm.sincos.nxv2f64(<vscale x 2 x double> poison)
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/multiple-result-intrinsics.ll b/llvm/test/Transforms/LoopVectorize/AArch64/multiple-result-intrinsics.ll
index 544ef5c82c7ac..1dda7c2826b67 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/multiple-result-intrinsics.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64...
[truncated]
case Intrinsic::sincos:
case Intrinsic::sincospi: {
Should this include Intrinsic::modf too? (which is also supported)
if (auto *FixedVTy = dyn_cast<FixedVectorType>(RetTy))
  BaseCost = getCallScalarizationCost(FixedVTy->getElementCount());
break;
Could we more directly just consider the cost of scalarizing the operands and vectorizing the result types or is that already included somewhere else?
We could add it to the cost of scalarisation of the return type, but I didn't do this for two reasons:
- Is the return type guaranteed to represent the VF of the loop in all cases? For math calls this is probably true, as we'll always return a value.
- We'd also have to pass in an extra Instruction pointer to provide context so that we only increase the cost for calls. It felt a bit awkward expanding the interface for this one case.
This seems like quite a high cost to add. Could we consider just fixing the code generation?
There is an example in https://godbolt.org/z/j7xrPnx1K. If you take out the noise from the last functions it should be something like
        mov     s8, v0.s[1]
        mov     s9, v0.s[2]
        mov     s10, v0.s[3]
        bl      expf
        fmov    s11, s0
        fmov    s0, s8
        bl      expf
        fmov    s8, s0
        fmov    s0, s9
        bl      expf
        fmov    s9, s0
        fmov    s0, s10
        bl      expf
        mov     v11.s[1], v8.s[0]
        mov     v11.s[2], v9.s[0]
        mov     v11.s[3], v0.s[0]
The cost of calls is always a bit difficult to get right. The fmovs are not exactly free but sometimes close to it, and the rest is not far away from the existing 4*call+scalarization overhead.
I guess the general problem is that any call without a vector calling convention will cause spilling of v/z vector registers if they need to be live across it.
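As a rough illustration of the comparison above, the sketch below contrasts a "4*call+scalarization overhead" style estimate with one that also charges for vector registers kept live across each scalar call. The struct, the function names, and every numeric cost are hypothetical; none of them come from the LLVM cost tables or from this patch.

```cpp
#include <cstdint>

// Hypothetical per-unit costs, chosen only for illustration.
struct ScalarizedCallCosts {
  uint64_t ScalarCall;  // one scalar libm call, e.g. expf
  uint64_t ExtractLane; // moving one lane out of a vector register
  uint64_t InsertLane;  // moving one scalar result back into a lane
};

// Baseline: VF scalar calls plus one extract and one insert per lane.
constexpr uint64_t baselineScalarizedCost(unsigned VF,
                                          const ScalarizedCallCosts &C) {
  return uint64_t(VF) * (C.ScalarCall + C.ExtractLane + C.InsertLane);
}

// Adds an assumed spill+fill charge for each vector register that is
// live across each of the VF scalar calls. This term grows with the
// amount of surrounding vector work, so it is not amortised.
constexpr uint64_t costWithSpills(unsigned VF, const ScalarizedCallCosts &C,
                                  unsigned LiveVectorRegs,
                                  uint64_t SpillFillCost) {
  return baselineScalarizedCost(VF, C) +
         uint64_t(VF) * LiveVectorRegs * SpillFillCost;
}
```

With the made-up values ScalarCall=10, ExtractLane=1, InsertLane=1 and VF=4, the baseline is 4*(10+1+1) = 48. With 5 live vector registers and a spill+fill cost of 2 per register, the estimate becomes 48 + 4*5*2 = 88.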
Yeah you raise a good point - the costs do seem very high, but I set the cost this high mainly because it was the only way to avoid significant regressions in benchmarks like wrf in future, when I plan to lower the cost of the 128-bit masked loads and stores (currently a cost of 18 for VF=4). It's due to exactly the problem you described about spilling and filling (or rematerialising) SVE predicate and v/z registers. It wasn't obvious to me that we could ever improve the generated code without changing the ABI for libm math routines.
In many loops the only thing currently preventing a terrible choice of vectorisation is the very high cost of the fixed-width masked loads and stores. The generated code in such loops (such as wrf) is often so bad that we end up spilling (or rematerialising) 4 or 5 SVE predicate or NEON/SVE vector registers around every function call (and there are several in a single loop). What I see is that the more work done in the loop, the worse it gets, so it's not like the cost gets amortised. I suppose a more accurate way of costing would be to guess how the backend intends to schedule and allocate registers in code before and after the call, but I assume that would be fragile and computationally expensive.
When trying out some of my hand-written micro-benchmarks (as well as wrf, etc.) I couldn't see any performance benefits to vectorisation when scalarising math calls, and it also significantly increased the code size. I do want to improve the cost model for fixed-width masked loads and stores because they simply don't represent the generated code, but I'm currently held hostage by this math call scalarisation issue and it's difficult to know how else to proceed. I'm open to suggestions about other ideas on how to progress!
One of the other problems I've noticed arises when a math call is predicated in the C code, i.e. the call to expf sits inside an if-block in the loop body. When building with -ffast-math the scalar version remains in an if-block, but the loop vectoriser if-converts the loop (due to expf being safe to speculatively execute). Then in the backend we scalarise the vector intrinsic form of expf into individual scalar calls. If the condition is only triggered half of the time then the scalar loop is always going to be faster. The trouble here is that without PGO data we don't know whether we should flatten the loop or not. In such cases I could increase the call scalarisation cost further to prevent vectorisation, which would allow me to drop the cost for loops with unconditional math calls.
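For reference, a predicated math call of the kind described might look like the following. This is a hypothetical loop written for illustration, not the example from the original comment; the function name and the `> 0` condition are assumptions.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical loop with a predicated math call. In scalar code the
// exp call only runs when the condition holds. An if-converted vector
// loop would instead evaluate it for every lane and select the result,
// and a scalarised vector call then costs one scalar call per lane
// regardless of how often the condition is true.
void addExpWherePositive(float *dst, const float *src, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    if (src[i] > 0.0f)
      dst[i] += std::exp(src[i]);
  }
}
```

If the condition holds for roughly half of the elements, the scalar loop makes about n/2 calls, while the if-converted and scalarised loop makes n.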