Conversation

sushgokh (Contributor) commented Oct 8, 2024

…er possible

For Neon, if there exists an extractelement from lane != 0 such that

  1. the extractelement does not necessitate a move from vector_reg -> GPR,
  2. the extractelement result feeds into an fmul, and
  3. the other operand of the fmul is a scalar, or an extractelement from lane 0 (or a lane equivalent to 0),

then the extractelement can be merged with the fmul in the backend and incurs no cost.

e.g.

define double @foo(<2 x double> %a) { 
  %1 = extractelement <2 x double> %a, i32 0 
  %2 = extractelement <2 x double> %a, i32 1
  %res = fmul double %1, %2    
  ret double %res
}

%2 and %res can be merged in the backend to generate:
fmul d0, d0, v0.d[1]
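
Without the fusion, the backend would need a separate lane-to-scalar move before the multiply; roughly (a sketch, exact register assignment may differ):

mov  d1, v0.d[1]   // move lane 1 into a scalar FP register
fmul d0, d0, d1

Folding the extract into the by-element form of fmul eliminates the mov, which is why such an extract can be costed as free.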

The change was tested with SPEC FP (C/C++) on Neoverse V2.
Compile-time impact: none.
Performance impact: a 1.3-1.7% uplift on the lbm benchmark with -flto, depending on the configuration.
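
As a hypothetical illustration of "lane equivalent to 0" (assuming 128-bit NEON registers): <4 x double> legalizes to two 128-bit registers, so lane 2 (2 * 64 = 128 bits, a multiple of the register width) is lane 0 of the second register. In the sketch below (@bar is an illustrative name), the lane-3 extract still qualifies because the other fmul operand is extracted from such a lane:

define double @bar(<4 x double> %a) {
  %x = extractelement <4 x double> %a, i32 2   ; lane 0 of the second 128-bit register
  %y = extractelement <4 x double> %a, i32 3
  %res = fmul double %x, %y
  ret double %res
}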

llvmbot (Member) commented Oct 8, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Sushant Gokhale (sushgokh)

Patch is 31.42 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/111479.diff

9 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+32)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+10)
  • (modified) llvm/include/llvm/CodeGen/BasicTTIImpl.h (+14-3)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+17)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+159-4)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h (+19-2)
  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+14-2)
  • (modified) llvm/test/Analysis/CostModel/AArch64/extract_float.ll (+16-13)
  • (modified) llvm/test/Transforms/SLPVectorizer/consecutive-access.ll (+22-48)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 89a85bc8a90864..7a84c31679a888 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1392,6 +1392,19 @@ class TargetTransformInfo {
                                      unsigned Index = -1, Value *Op0 = nullptr,
                                      Value *Op1 = nullptr) const;
 
+  /// \return The expected cost of vector Insert and Extract.
+  /// Use -1 to indicate that there is no information on the index value.
+  /// This is used when the instruction is not available; a typical use
+  /// case is to provision the cost of vectorization/scalarization in
+  /// vectorizer passes.
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) const;
+
   /// \return The expected cost of vector Insert and Extract.
   /// This is used when instruction is available, and implementation
   /// asserts 'I' is not nullptr.
@@ -2062,6 +2075,14 @@ class TargetTransformInfo::Concept {
                                              TTI::TargetCostKind CostKind,
                                              unsigned Index, Value *Op0,
                                              Value *Op1) = 0;
+
+  virtual InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) = 0;
   virtual InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                              TTI::TargetCostKind CostKind,
                                              unsigned Index) = 0;
@@ -2726,6 +2747,17 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
                                      Value *Op1) override {
     return Impl.getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1);
   }
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) override {
+    return Impl.getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1,
+                                   Scalar, ScalarAndIdxToUser,
+                                   UserToScalarAndIdx);
+  }
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index) override {
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 50040dc8f6165b..deeecc462e19c7 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -683,6 +683,16 @@ class TargetTransformInfoImplBase {
     return 1;
   }
 
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) const {
+    return 1;
+  }
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index) const {
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index c36a346c1b2e05..48bd91c5d8c6e5 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -1277,12 +1277,23 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
     return 1;
   }
 
-  InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val,
-                                     TTI::TargetCostKind CostKind,
-                                     unsigned Index, Value *Op0, Value *Op1) {
+  virtual InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val,
+                                             TTI::TargetCostKind CostKind,
+                                             unsigned Index, Value *Op0,
+                                             Value *Op1) {
     return getRegUsageForType(Val->getScalarType());
   }
 
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) {
+    return getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1);
+  }
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index) {
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index b5195f764cbd1c..7c588e717c4bba 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1037,6 +1037,23 @@ InstructionCost TargetTransformInfo::getVectorInstrCost(
   return Cost;
 }
 
+InstructionCost TargetTransformInfo::getVectorInstrCost(
+    unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+    Value *Op0, Value *Op1, Value *Scalar,
+    const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+        &ScalarAndIdxToUser,
+    const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+        &UserToScalarAndIdx) const {
+  // FIXME: Assert that Opcode is either InsertElement or ExtractElement.
+  // This is mentioned in the interface description and respected by all
+  // callers, but never asserted upon.
+  InstructionCost Cost = TTIImpl->getVectorInstrCost(
+      Opcode, Val, CostKind, Index, Op0, Op1, Scalar, ScalarAndIdxToUser,
+      UserToScalarAndIdx);
+  assert(Cost >= 0 && "TTI should not produce negative costs!");
+  return Cost;
+}
+
 InstructionCost
 TargetTransformInfo::getVectorInstrCost(const Instruction &I, Type *Val,
                                         TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 80d5168ae961ab..f86e7a3715b4e4 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -16,14 +16,20 @@
 #include "llvm/CodeGen/BasicTTIImpl.h"
 #include "llvm/CodeGen/CostTable.h"
 #include "llvm/CodeGen/TargetLowering.h"
+#include "llvm/IR/DerivedTypes.h"
+#include "llvm/IR/InstrTypes.h"
+#include "llvm/IR/Instruction.h"
+#include "llvm/IR/Instructions.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/Intrinsics.h"
 #include "llvm/IR/IntrinsicsAArch64.h"
 #include "llvm/IR/PatternMatch.h"
+#include "llvm/Support/Casting.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Transforms/InstCombine/InstCombiner.h"
 #include "llvm/Transforms/Vectorize/LoopVectorizationLegality.h"
 #include <algorithm>
+#include <cassert>
 #include <optional>
 using namespace llvm;
 using namespace llvm::PatternMatch;
@@ -3145,12 +3151,20 @@ InstructionCost AArch64TTIImpl::getCFInstrCost(unsigned Opcode,
   return 0;
 }
 
-InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(const Instruction *I,
-                                                         Type *Val,
-                                                         unsigned Index,
-                                                         bool HasRealUse) {
+InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(
+    std::variant<const Instruction *, const unsigned> InstOrOpcode, Type *Val,
+    unsigned Index, bool HasRealUse, Value *Scalar,
+    const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+        &ScalarAndIdxToUser,
+    const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+        &UserToScalarAndIdx) {
   assert(Val->isVectorTy() && "This must be a vector type");
 
+  const Instruction *I =
+      (std::holds_alternative<const Instruction *>(InstOrOpcode)
+           ? get<const Instruction *>(InstOrOpcode)
+           : nullptr);
+
   if (Index != -1U) {
     // Legalize the type.
     std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Val);
@@ -3194,6 +3208,134 @@ InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(const Instruction *I,
     // compile-time considerations.
   }
 
+  // In case of Neon, if there exists extractelement from lane != 0 such that
+  // 1. extractelement does not necessitate a move from vector_reg -> GPR.
+  // 2. extractelement result feeds into fmul.
+  // 3. Other operand of fmul is a scalar or extractelement from lane 0 or lane
+  // equivalent to 0.
+  // then the extractelement can be merged with fmul in the backend and it
+  // incurs no cost.
+  // e.g.
+  // define double @foo(<2 x double> %a) {
+  //   %1 = extractelement <2 x double> %a, i32 0
+  //   %2 = extractelement <2 x double> %a, i32 1
+  //   %res = fmul double %1, %2
+  //   ret double %res
+  // }
+  // %2 and %res can be merged in the backend to generate fmul d0, d0, v0.d[1]
+  auto ExtractCanFuseWithFmul = [&]() {
+    // We bail out if the extract is from lane 0.
+    if (Index == 0)
+      return false;
+
+    // Check if the scalar element type of the vector operand of ExtractElement
+    // instruction is one of the allowed types.
+    auto IsAllowedScalarTy = [&](const Type *T) {
+      return T->isFloatTy() || T->isDoubleTy() ||
+             (T->isHalfTy() && ST->hasFullFP16());
+    };
+
+    // Check if the extractelement user is scalar fmul.
+    auto IsUserFMulScalarTy = [](const Value *EEUser) {
+      // Check if the user is scalar fmul.
+      const BinaryOperator *BO = dyn_cast_if_present<BinaryOperator>(EEUser);
+      return BO && BO->getOpcode() == BinaryOperator::FMul &&
+             !BO->getType()->isVectorTy();
+    };
+
+    // InstCombine combines fmul with fadd/fsub. Hence, extractelement fusion
+    // with fmul does not happen.
+    auto IsFMulUserFAddFSub = [](const Value *FMul) {
+      return any_of(FMul->users(), [](const User *U) {
+        const BinaryOperator *BO = dyn_cast_if_present<BinaryOperator>(U);
+        return (BO && (BO->getOpcode() == BinaryOperator::FAdd ||
+                       BO->getOpcode() == BinaryOperator::FSub));
+      });
+    };
+
+    // Check if the type constraints on input vector type and result scalar type
+    // of extractelement instruction are satisfied.
+    auto TypeConstraintsOnEESatisfied =
+        [&IsAllowedScalarTy](const Type *VectorTy, const Type *ScalarTy) {
+          return isa<FixedVectorType>(VectorTy) && IsAllowedScalarTy(ScalarTy);
+        };
+
+    // Check if the extract index is from lane 0 or lane equivalent to 0 for a
+    // certain scalar type and a certain vector register width.
+    auto IsExtractLaneEquivalentToZero = [&](const unsigned &Idx,
+                                             const unsigned &EltSz) {
+      auto RegWidth =
+          getRegisterBitWidth(TargetTransformInfo::RGK_FixedWidthVector)
+              .getFixedValue();
+      return (Idx == 0 || (Idx * EltSz) % RegWidth == 0);
+    };
+
+    if (std::holds_alternative<const unsigned>(InstOrOpcode)) {
+      if (!TypeConstraintsOnEESatisfied(Val, Val->getScalarType()))
+        return false;
+      const auto &ScalarIdxPair = std::make_pair(Scalar, Index);
+      return ScalarAndIdxToUser.find(ScalarIdxPair) !=
+                 ScalarAndIdxToUser.end() &&
+             all_of(ScalarAndIdxToUser.at(ScalarIdxPair), [&](Value *U) {
+               if (!IsUserFMulScalarTy(U) || IsFMulUserFAddFSub(U))
+                 return false;
+               // 1. Check if the other operand is extract from lane 0 or lane
+               // equivalent to 0.
+               // 2. In case of SLP, if the other operand is not extract from
+               // same tree, we bail out since we can not analyze that extract.
+               return UserToScalarAndIdx.at(U).size() == 2 &&
+                      all_of(UserToScalarAndIdx.at(U), [&](auto &P) {
+                        if (ScalarIdxPair == P)
+                          return true; // Skip.
+                        return IsExtractLaneEquivalentToZero(
+                            P.second, Val->getScalarSizeInBits());
+                      });
+             });
+    } else {
+      const ExtractElementInst *EE = cast<ExtractElementInst>(I);
+
+      const ConstantInt *IdxOp = dyn_cast<ConstantInt>(EE->getIndexOperand());
+      if (!IdxOp)
+        return false;
+
+      if (!TypeConstraintsOnEESatisfied(EE->getVectorOperand()->getType(),
+                                        EE->getType()))
+        return false;
+
+      return !EE->users().empty() && all_of(EE->users(), [&](const User *U) {
+        if (!IsUserFMulScalarTy(U) || IsFMulUserFAddFSub(U))
+          return false;
+
+        // Check if the other operand of extractelement is also extractelement
+        // from lane equivalent to 0.
+        const BinaryOperator *BO = cast<BinaryOperator>(U);
+        const ExtractElementInst *OtherEE = dyn_cast<ExtractElementInst>(
+            BO->getOperand(0) == EE ? BO->getOperand(1) : BO->getOperand(0));
+        if (OtherEE) {
+          const ConstantInt *IdxOp =
+              dyn_cast<ConstantInt>(OtherEE->getIndexOperand());
+          if (!IdxOp)
+            return false;
+          return IsExtractLaneEquivalentToZero(
+              cast<ConstantInt>(OtherEE->getIndexOperand())
+                  ->getValue()
+                  .getZExtValue(),
+              OtherEE->getType()->getScalarSizeInBits());
+        }
+        return true;
+      });
+    }
+    return false;
+  };
+
+  if (std::holds_alternative<const unsigned>(InstOrOpcode)) {
+    const unsigned &Opcode = get<const unsigned>(InstOrOpcode);
+    if (Opcode == Instruction::ExtractElement && ExtractCanFuseWithFmul())
+      return 0;
+  } else if (I && I->getOpcode() == Instruction::ExtractElement &&
+             ExtractCanFuseWithFmul())
+    return 0;
+
   // All other insert/extracts cost this much.
   return ST->getVectorInsertExtractBaseCost();
 }
@@ -3207,6 +3349,19 @@ InstructionCost AArch64TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
   return getVectorInstrCostHelper(nullptr, Val, Index, HasRealUse);
 }
 
+InstructionCost AArch64TTIImpl::getVectorInstrCost(
+    unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+    Value *Op0, Value *Op1, Value *Scalar,
+    const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+        &ScalarAndIdxToUser,
+    const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+        &UserToScalarAndIdx) {
+  bool HasRealUse =
+      Opcode == Instruction::InsertElement && Op0 && !isa<UndefValue>(Op0);
+  return getVectorInstrCostHelper(Opcode, Val, Index, HasRealUse, Scalar,
+                                  ScalarAndIdxToUser, UserToScalarAndIdx);
+}
+
 InstructionCost AArch64TTIImpl::getVectorInstrCost(const Instruction &I,
                                                    Type *Val,
                                                    TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index 28e45207596ecd..12952ba41708b1 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -24,6 +24,7 @@
 #include "llvm/CodeGen/BasicTTIImpl.h"
 #include "llvm/IR/Function.h"
 #include "llvm/IR/Intrinsics.h"
+#include <climits>
 #include <cstdint>
 #include <optional>
 
@@ -66,8 +67,15 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
   // 'Val' and 'Index' are forwarded from 'getVectorInstrCost'; 'HasRealUse'
   // indicates whether the vector instruction is available in the input IR or
   // just imaginary in vectorizer passes.
-  InstructionCost getVectorInstrCostHelper(const Instruction *I, Type *Val,
-                                           unsigned Index, bool HasRealUse);
+  InstructionCost getVectorInstrCostHelper(
+      std::variant<const Instruction *, const unsigned> InstOrOpcode, Type *Val,
+      unsigned Index, bool HasRealUse, Value *Scalar = nullptr,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser =
+              DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>(),
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx = DenseMap<
+              Value *, SmallVector<std::pair<Value *, unsigned>, 4>>());
 
 public:
   explicit AArch64TTIImpl(const AArch64TargetMachine *TM, const Function &F)
@@ -185,6 +193,15 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
   InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index, Value *Op0, Value *Op1);
+
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx);
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index);
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 401597af35bdac..90a070fc3b3c48 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -11633,6 +11633,17 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
   std::optional<DenseMap<Value *, unsigned>> ValueToExtUses;
   DenseMap<const TreeEntry *, DenseSet<Value *>> ExtractsCount;
   SmallPtrSet<Value *, 4> ScalarOpsFromCasts;
+  // Keep track of {Scalar, Index} -> User and User -> {Scalar, Index}.
+  // On AArch64, this helps in fusing a mov instruction, associated with
+  // extractelement, with fmul in the backend so that extractelement is free.
+  DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+      ScalarAndIdxToUser;
+  DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+      UserToScalarAndIdx;
+  for (ExternalUser &EU : ExternalUses) {
+    UserToScalarAndIdx[EU.User].push_back({EU.Scalar, EU.Lane});
+    ScalarAndIdxToUser[{EU.Scalar, EU.Lane}].push_back(EU.User);
+  }
   for (ExternalUser &EU : ExternalUses) {
     // Uses by ephemeral values are free (because the ephemeral value will be
     // removed prior to code generation, and so the extraction will be
@@ -11739,8 +11750,9 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
       ExtraCost = TTI->getExtractWithExtendCost(Extend, EU.Scalar->getType(),
                                                 VecTy, EU.Lane);
     } else {
-      ExtraCost = TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy,
-                                          CostKind, EU.Lane);
+   ...
[truncated]
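
The diff is cut off just before the replacement call. Based on the signatures added above, the new SLP call site would presumably look roughly like this (a hypothetical reconstruction, not the verbatim patch):

      // Hypothetical reconstruction: forward the external-user maps so the
      // AArch64 cost model can recognize extracts that fuse with fmul.
      ExtraCost = TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy,
                                          CostKind, EU.Lane, nullptr, nullptr,
                                          EU.Scalar, ScalarAndIdxToUser,
                                          UserToScalarAndIdx);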

llvmbot (Member) commented Oct 8, 2024

@llvm/pr-subscribers-llvm-analysis


llvmbot (Member) commented Oct 8, 2024

@llvm/pr-subscribers-llvm-transforms

+                        if (ScalarIdxPair == P)
+                          return true; // Skip.
+                        return IsExtractLaneEquivalentToZero(
+                            P.second, Val->getScalarSizeInBits());
+                      });
+             });
+    } else {
+      const ExtractElementInst *EE = cast<ExtractElementInst>(I);
+
+      const ConstantInt *IdxOp = dyn_cast<ConstantInt>(EE->getIndexOperand());
+      if (!IdxOp)
+        return false;
+
+      if (!TypeConstraintsOnEESatisfied(EE->getVectorOperand()->getType(),
+                                        EE->getType()))
+        return false;
+
+      return !EE->users().empty() && all_of(EE->users(), [&](const User *U) {
+        if (!IsUserFMulScalarTy(U) || IsFMulUserFAddFSub(U))
+          return false;
+
+        // Check if the other operand of extractelement is also extractelement
+        // from lane equivalent to 0.
+        const BinaryOperator *BO = cast<BinaryOperator>(U);
+        const ExtractElementInst *OtherEE = dyn_cast<ExtractElementInst>(
+            BO->getOperand(0) == EE ? BO->getOperand(1) : BO->getOperand(0));
+        if (OtherEE) {
+          const ConstantInt *IdxOp =
+              dyn_cast<ConstantInt>(OtherEE->getIndexOperand());
+          if (!IdxOp)
+            return false;
+          return IsExtractLaneEquivalentToZero(
+              cast<ConstantInt>(OtherEE->getIndexOperand())
+                  ->getValue()
+                  .getZExtValue(),
+              OtherEE->getType()->getScalarSizeInBits());
+        }
+        return true;
+      });
+    }
+    return false;
+  };
+
+  if (std::holds_alternative<const unsigned>(InstOrOpcode)) {
+    const unsigned &Opcode = get<const unsigned>(InstOrOpcode);
+    if (Opcode == Instruction::ExtractElement && ExtractCanFuseWithFmul())
+      return 0;
+  } else if (I && I->getOpcode() == Instruction::ExtractElement &&
+             ExtractCanFuseWithFmul())
+    return 0;
+
   // All other insert/extracts cost this much.
   return ST->getVectorInsertExtractBaseCost();
 }
@@ -3207,6 +3349,19 @@ InstructionCost AArch64TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
   return getVectorInstrCostHelper(nullptr, Val, Index, HasRealUse);
 }
 
+InstructionCost AArch64TTIImpl::getVectorInstrCost(
+    unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+    Value *Op0, Value *Op1, Value *Scalar,
+    const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+        &ScalarAndIdxToUser,
+    const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+        &UserToScalarAndIdx) {
+  bool HasRealUse =
+      Opcode == Instruction::InsertElement && Op0 && !isa<UndefValue>(Op0);
+  return getVectorInstrCostHelper(Opcode, Val, Index, HasRealUse, Scalar,
+                                  ScalarAndIdxToUser, UserToScalarAndIdx);
+}
+
 InstructionCost AArch64TTIImpl::getVectorInstrCost(const Instruction &I,
                                                    Type *Val,
                                                    TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index 28e45207596ecd..12952ba41708b1 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -24,6 +24,7 @@
 #include "llvm/CodeGen/BasicTTIImpl.h"
 #include "llvm/IR/Function.h"
 #include "llvm/IR/Intrinsics.h"
+#include <climits>
 #include <cstdint>
 #include <optional>
 
@@ -66,8 +67,15 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
   // 'Val' and 'Index' are forwarded from 'getVectorInstrCost'; 'HasRealUse'
   // indicates whether the vector instruction is available in the input IR or
   // just imaginary in vectorizer passes.
-  InstructionCost getVectorInstrCostHelper(const Instruction *I, Type *Val,
-                                           unsigned Index, bool HasRealUse);
+  InstructionCost getVectorInstrCostHelper(
+      std::variant<const Instruction *, const unsigned> InstOrOpcode, Type *Val,
+      unsigned Index, bool HasRealUse, Value *Scalar = nullptr,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser =
+              DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>(),
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx = DenseMap<
+              Value *, SmallVector<std::pair<Value *, unsigned>, 4>>());
 
 public:
   explicit AArch64TTIImpl(const AArch64TargetMachine *TM, const Function &F)
@@ -185,6 +193,15 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
   InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index, Value *Op0, Value *Op1);
+
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx);
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index);
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 401597af35bdac..90a070fc3b3c48 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -11633,6 +11633,17 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
   std::optional<DenseMap<Value *, unsigned>> ValueToExtUses;
   DenseMap<const TreeEntry *, DenseSet<Value *>> ExtractsCount;
   SmallPtrSet<Value *, 4> ScalarOpsFromCasts;
+  // Keep track of {Scalar, Index} -> User and User -> {Scalar, Index}.
+  // On AArch64, this helps in fusing a mov instruction, associated with
+  // extractelement, with fmul in the backend so that extractelement is free.
+  DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+      ScalarAndIdxToUser;
+  DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+      UserToScalarAndIdx;
+  for (ExternalUser &EU : ExternalUses) {
+    UserToScalarAndIdx[EU.User].push_back({EU.Scalar, EU.Lane});
+    ScalarAndIdxToUser[{EU.Scalar, EU.Lane}].push_back(EU.User);
+  }
   for (ExternalUser &EU : ExternalUses) {
     // Uses by ephemeral values are free (because the ephemeral value will be
     // removed prior to code generation, and so the extraction will be
@@ -11739,8 +11750,9 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
       ExtraCost = TTI->getExtractWithExtendCost(Extend, EU.Scalar->getType(),
                                                 VecTy, EU.Lane);
     } else {
-      ExtraCost = TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy,
-                                          CostKind, EU.Lane);
+   ...
[truncated]
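One detail in the hunk above that is easy to miss: IsExtractLaneEquivalentToZero accepts any index whose bit offset is a multiple of the vector register width, because after legalization such a lane sits at the bottom of some register. A hedged worked example (plain C++, assuming 128-bit Neon registers; the standalone function is hypothetical):

#include <cassert>

// Sketch of the lane-equivalence rule from the patch: an extract index counts
// as "lane 0" when Idx * EltSz is a multiple of the register width, i.e. the
// element lands at bit 0 of one of the registers the wide vector splits into.
static bool isExtractLaneEquivalentToZero(unsigned Idx, unsigned EltSz,
                                          unsigned RegWidth = 128) {
  return Idx == 0 || (Idx * EltSz) % RegWidth == 0;
}

int main() {
  assert(isExtractLaneEquivalentToZero(0, 64));  // <2 x double>, lane 0
  assert(!isExtractLaneEquivalentToZero(1, 64)); // lane 1 still needs v0.d[1]
  assert(isExtractLaneEquivalentToZero(2, 64));  // <4 x double>: lane 0 of the
                                                 // second 128-bit register
  assert(isExtractLaneEquivalentToZero(4, 32));  // <8 x float>: likewise
  return 0;
}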

@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch from 2471f47 to 85ca8b2 Compare October 9, 2024 07:57
@github-actions

github-actions bot commented Oct 9, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch 2 times, most recently from c88143d to 08e2879 Compare October 10, 2024 05:59
@sushgokh
Contributor Author

ping

Collaborator

@davemgreen davemgreen left a comment

Hi - I think what this patch is trying to do makes a lot of sense. I guess there might be a reluctance due to the way it has to do it. There is quite a lot of new code. In the past we have tried to have the cost model functions able to test generic (non-existent) instructions. Sometimes they need to know more info (especially about uses and operands), so operands get added and sometimes a CxtI instruction to get the uses from. Am I right that the approach of passing a CxtI instruction doesn't work very well in this case because there is no instruction yet? The extract is new and the mul the new user of it. It would be nice if the interface was fairly generic if we can make it, not specific to the SLP vectorizer.

unsigned Index,
bool HasRealUse) {
InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(
std::variant<const Instruction *, const unsigned> InstOrOpcode, Type *Val,
Collaborator

Can you remove variant? I don't think it is necessary if we always pass the Opcode and then treat the instruction like we do elsewhere as an optional argument.

Contributor Author

  1. Sometimes the Instruction itself is passed to this API to get more information, so const Instruction *I has to be one of the arguments.
  2. The thinking behind using a single variable for Instruction and Opcode (using variant) is that Instruction and Opcode are somewhat related, but we never pass both unless it's required in the future.

Collaborator

I would change this to avoid all the variant code:

getVectorInstrCostHelper(unsigned Opcode, Type *Val, unsigned Index, bool HasRealUse, const Instruction *CxtI)

If you have a CxtI, you can easily get the Opcode from it when calling the function. Then it just needs to check if (CxtI) for any checks that use the instruction. It feels like a lot of the added code that has to handle Opcode and I could then be folded together into a single version.
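A minimal sketch of that shape, for illustration only (the helper name and the defaulted CxtI parameter are assumed here, not code from the patch):

#include "llvm/IR/Instruction.h"
#include "llvm/IR/Type.h"
#include "llvm/Support/InstructionCost.h"
#include <cassert>
using namespace llvm;

// Opcode is always passed; CxtI is optional context instead of a variant.
static InstructionCost
getVectorInstrCostHelperSketch(unsigned Opcode, Type *Val, unsigned Index,
                               bool HasRealUse,
                               const Instruction *CxtI = nullptr) {
  assert(Val->isVectorTy() && "This must be a vector type");
  if (CxtI) {
    // Checks that need a real instruction (e.g. walking CxtI->users()) are
    // simply guarded here, rather than switching on a variant alternative.
  }
  // ... costing logic shared by the opcode-only and instruction paths ...
  return 0;
}

A caller with a real instruction would pass CxtI->getOpcode() plus CxtI; a vectorizer query would pass just the opcode.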

Contributor Author

You really can't merge the code for instruction and opcode, at least for this patch, because the change won't be reflected in the cost-model tests.

  1. When opcode != null, less information is available and we bail out early in general.
  2. When I != null, more analysis needs to be done.

I will keep separate instruction and opcode parameters as you say.


// InstCombine combines fmul with fadd/fsub. Hence, extractelement fusion
// with fmul does not happen.
auto IsFMulUserFAddFSub = [](const Value *FMul) {
Collaborator

Why is this needed? Can a fma not do the same trick?

Contributor Author

Yes, you are right.
What this check does is bail out of the optimisation if there is a chance of generating an fma.
I was getting a regression either in one of the regression tests or in the SPEC17 benchmarks before this check (don't remember which one exactly).

Collaborator

Do you have more details? This patch feels quite complex for what it does, and anything we can do to simplify it would help.

We would usually want to avoid things that are O(n) in the number of uses, more so if it is O(n^2).

Contributor Author

Let's say the situation goes like this:

define double @foo(<2 x double> %a, double %b)
{
        %1 = extractelement <2 x double> %a, i32 0
        %2 = extractelement <2 x double> %a, i32 1
        %3 = fmul double %1, %2
        %4 = fadd double %3, %b
        ret double %4
}

Codegen with ./llc -mtriple=aarch64 -fp-contract=fast test.ll is

foo:                                    // @foo
        .cfi_startproc
// %bb.0:
        mov     d2, v0.d[1]
        fmadd   d0, d0, d2, d1
        ret

So, even if the requirements for fmul and extractelement fusion are met, if the fmul has an fadd/fsub user, we need to bail out.
Hence the patch checks whether the fmul has fadd/fsub users.

Collaborator

Oh I see. That might be a problem with the tablegen patterns. I believe it works for other indices and when using -O3 (it takes a different route, using the machine combiner). I'll see if I can put together a fix.

Contributor Author

Fixing just the tablegen might not be sufficient. This might need fixing in instcombine (I believe it generates the fma intrinsic).
Additionally, I think fma is better than fmul/extract fusion (I think I had measured it on Neoverse-v2).

Collaborator

There are some extra patterns in #114976 that might or might not help. Can you give more details on the instcombine/fma issue? I think it might be OK because we are only looking at mul here, and DAG will create the fmas depending on the optimization level (but that is hopefully fine with the new patterns).

Contributor Author

Thanks for the patch #114976. This fixes the concern I had cited in terms of the additional code. I will now just remove the code.

@sushgokh
Contributor Author

sushgokh commented Oct 24, 2024

Hi - I think what this patch is trying to do makes a lot of sense. I guess there might be a reluctance due to the way it has to do it. There is quite a lot of new code. In the past we have tried to have the cost model functions able to test generic (non-existent) instructions. Sometimes they need to know more info (especially about uses and operands), so operands get added and sometimes a CxtI instruction to get the uses from. Am I right that the approach of passing a CxtI instruction doesn't work very well in this case because there is no instruction yet? The extract is new and the mul the new user of it. It would be nice if the interface was fairly generic if we can make it, not specific to the SLP vectorizer.

Currently, the way SLP vectorization works is:

  1. If the extract user is used out-of-SLP-tree, you know the exact user instruction.
  2. If the extract user is going to be used in-SLP-tree, we don't have information about the user. User = nullptr for such cases. SLP updates the users at a later stage, post vectorization.

So, currently, the patch is just catering to out-of-tree users (and I realised this after uploading the patch, due to my limited knowledge). I will amend this patch for in-tree users.

Now, why not have an API which uses something like CxtI?
If the user is out-of-tree, I can directly pass the user as a single argument. But for an in-tree user, we have no information (User = nullptr). In that case, we need to do more analysis, and for that reason I need to pass an array carrying all the extract information as an API argument.
I hope this answers your question.
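For illustration, a hedged sketch of the kind of query these maps enable for a user whose extract does not exist as an instruction yet (the map types mirror the patch; the query function itself is hypothetical):

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Value.h"
using namespace llvm;

using ScalarAndIdx = std::pair<Value *, unsigned>;

// For a would-be extract of (Scalar, Lane), check that every recorded user
// reads exactly two extracts from this tree: the precondition for running the
// lane-0-equivalence test on the fmul's other operand.
static bool usersAreAnalyzable(
    const ScalarAndIdx &Key,
    const DenseMap<ScalarAndIdx, SmallVector<Value *, 4>> &ScalarAndIdxToUser,
    const DenseMap<Value *, SmallVector<ScalarAndIdx, 4>> &UserToScalarAndIdx) {
  auto It = ScalarAndIdxToUser.find(Key);
  if (It == ScalarAndIdxToUser.end())
    return false; // no recorded user for this (Scalar, Lane)
  return all_of(It->second, [&](Value *U) {
    // An operand extracted outside the tree cannot be analyzed, so the
    // extract must then be given its normal (non-zero) cost.
    return UserToScalarAndIdx.at(U).size() == 2;
  });
}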

@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch from 08e2879 to 72e6397 Compare November 4, 2024 05:29
@sushgokh sushgokh requested a review from davemgreen November 4, 2024 05:31
Collaborator

@davemgreen davemgreen left a comment

When you say:

I will amend this patch for in-tree users.

Do you mean to do that here?

@sushgokh
Contributor Author

sushgokh commented Nov 4, 2024

When you say:

I will amend this patch for in-tree users.

Do you mean to do that here?

Yes, amended the current patch

@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch from 72e6397 to f4f9396 Compare November 5, 2024 05:18
@sushgokh sushgokh requested a review from davemgreen November 5, 2024 05:19
Collaborator

@davemgreen davemgreen left a comment

Can you look at fixing what is wrong with the testing?


@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch 3 times, most recently from 242576a to 5205ae1 Compare November 6, 2024 19:21
@sushgokh
Contributor Author

sushgokh commented Nov 6, 2024

Can you look at fixing what is wrong with the testing?

Check was missing. Fixed it.

@sushgokh sushgokh requested a review from davemgreen November 6, 2024 19:42
Collaborator

@davemgreen davemgreen left a comment

Thanks. High level comment - this sounds OK and I don't know of a better way to get the same result. If @alexey-bataev doesn't disagree about the SLP parts and we can clean up the AArch64 code a little then this seems OK to me.

Comment on lines 3283 to 3284
if (!TypeConstraintsOnEESatisfied(Val, Val->getScalarType()))
return false;
Collaborator

Can the TypeConstraintsOnEESatisfied in both sides of the if be hoisted? I believe Val should always be a valid type. It might then be OK to inline TypeConstraintsOnEESatisfied and IsAllowedScalarTy, as each is only used once.

Contributor Author

No, it can't be hoisted.

Val can be a vector of half floats without fullfp16 support.

Collaborator

Can you explain what you mean?

In this case I didn't mean to remove the checking of the types, just to move it up to above the if (Scalar) {. Then remove the version that uses EE->getVectorOperand()->getType() below, as that should always be equal to Val if it exists.
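Concretely, the hoisted check could look something like this (a sketch under the assumption that Val always matches the extract's vector operand type; the free function is hypothetical):

#include "llvm/IR/DerivedTypes.h"
using namespace llvm;

// One type check, done once before the Scalar/instruction split, covering
// both paths; mirrors IsAllowedScalarTy from the patch.
static bool typeConstraintsSatisfied(const Type *Val, bool HasFullFP16) {
  const Type *ScalarTy = Val->getScalarType();
  bool AllowedScalar = ScalarTy->isFloatTy() || ScalarTy->isDoubleTy() ||
                       (ScalarTy->isHalfTy() && HasFullFP16);
  return isa<FixedVectorType>(Val) && AllowedScalar;
}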

Contributor Author

Ah, got it. Yes, hoisted the check. I initially forgot to do that since, for I != nullptr, the control flow is different and this idea didn't occur to me. Thanks.

@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch from 5205ae1 to 22c6544 Compare November 11, 2024 07:34
@sushgokh sushgokh requested a review from davemgreen November 11, 2024 07:39
Collaborator

@davemgreen davemgreen left a comment

Other than pulling up the type checks this LGTM.

…er possible

In case of Neon, if there exists extractelement from lane != 0 such that
  1. extractelement does not necessitate a move from vector_reg -> GPR.
  2. extractelement result feeds into fmul.
  3. Other operand of fmul is a scalar or extractelement from lane 0 or lane equivalent to 0.
  then the extractelement can be merged with fmul in the backend and it incurs no cost.
  e.g.
  define double @foo(<2 x double> %a) {
    %1 = extractelement <2 x double> %a, i32 0
    %2 = extractelement <2 x double> %a, i32 1
    %res = fmul double %1, %2
    ret double %res
  }
  %2 and %res can be merged in the backend to generate:
  fmul    d0, d0, v0.d[1]

The change was tested with SPEC FP(C/C++) on Neoverse-v2.
Compile time impact: None
Performance impact: Observing 1.3-1.7% uplift on lbm benchmark with -flto depending upon the config.
@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch from 22c6544 to 532e3d2 Compare November 13, 2024 05:40
@sushgokh sushgokh merged commit 9991ea2 into llvm:main Nov 13, 2024
5 of 7 checks passed
@sushgokh sushgokh deleted the extractelement-costing-with-fmul-user branch November 19, 2024 11:12

Labels

backend:AArch64
llvm:analysis (Includes value tracking, cost tables and constant folding)
llvm:transforms
vectorizers