[VPlan] Get Addr computation cost with scalar type if it is uniform for gather/scatter. #150371


Open

ElvisWang123 wants to merge 2 commits into main from lv-fix-addrComp-cost

Conversation

ElvisWang123
Contributor

This patch queries getAddressComputationCost() with a scalar type if the address is uniform. This makes the cost of gather/scatter more accurate.

In the current LV, a non-consecutive VPWidenMemoryRecipe (gather/scatter) accounts for the cost of the address computation. But in some cases the address is uniform across all lanes, so it can be computed with a scalar type and broadcast.

I have a follow-up optimization that converts gathers/scatters with uniform memory accesses into a scalar load/store + broadcast (and a select if needed). With that optimization, this temporary change can be removed.

This patch is preparation for #149955.
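
For illustration, here is a hypothetical loop of the kind this affects (not taken from this PR's tests). The pointer src is loop-invariant, so if the conditional load is widened as a masked gather, its address is uniform across all lanes and could instead be computed once as a scalar and broadcast:

// Hypothetical example, not part of this PR. 'src' is loop-invariant; the load
// is guarded by cond[i], so the vectorizer cannot simply hoist it and may widen
// it as a masked gather whose address is the same scalar pointer in every lane.
void foo(int *dst, const int *src, const int *cond, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = cond[i] ? *src : 0;
}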

@llvmbot
Member

llvmbot commented Jul 24, 2025

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Elvis Wang (ElvisWang123)

Changes

This patch queries getAddressComputationCost() with a scalar type if the address is uniform. This makes the cost of gather/scatter more accurate.

In the current LV, a non-consecutive VPWidenMemoryRecipe (gather/scatter) accounts for the cost of the address computation. But in some cases the address is uniform across all lanes, so it can be computed with a scalar type and broadcast.

I have a follow-up optimization that converts gathers/scatters with uniform memory accesses into a scalar load/store + broadcast (and a select if needed). With that optimization, this temporary change can be removed.

This patch is preparation for #149955.


Full diff: https://github.com/llvm/llvm-project/pull/150371.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+6)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+12-3)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 3ce9d29d34553..7adb87f4557f8 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -6932,6 +6932,12 @@ static bool planContainsAdditionalSimplifications(VPlan &Plan,
   auto Iter = vp_depth_first_deep(Plan.getVectorLoopRegion()->getEntry());
   for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(Iter)) {
     for (VPRecipeBase &R : *VPBB) {
+      if (auto *MR = dyn_cast<VPWidenMemoryRecipe>(&R)) {
+        // The address computation cost can be query as scalar type if the
+        // address is uniform.
+        if (!MR->isConsecutive() && vputils::isSingleScalar(MR->getAddr()))
+          return true;
+      }
       if (auto *IR = dyn_cast<VPInterleaveRecipe>(&R)) {
         auto *IG = IR->getInterleaveGroup();
         unsigned NumMembers = IG->getNumMembers();
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 57b713d3dfcb9..e8a3951bbeb20 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -3083,9 +3083,18 @@ InstructionCost VPWidenMemoryRecipe::computeCost(ElementCount VF,
     const Value *Ptr = getLoadStorePointerOperand(&Ingredient);
     assert(!Reverse &&
            "Inconsecutive memory access should not have the order.");
-    return Ctx.TTI.getAddressComputationCost(Ty) +
-           Ctx.TTI.getGatherScatterOpCost(Opcode, Ty, Ptr, IsMasked, Alignment,
-                                          Ctx.CostKind, &Ingredient);
+    InstructionCost Cost = 0;
+
+    // If the address value is uniform across all lane, then the address can be
+    // calculated with scalar type and broacast.
+    if (vputils::isSingleScalar(getAddr()))
+      Cost += Ctx.TTI.getAddressComputationCost(Ty->getScalarType());
+    else
+      Cost += Ctx.TTI.getAddressComputationCost(Ty);
+
+    return Cost + Ctx.TTI.getGatherScatterOpCost(Opcode, Ty, Ptr, IsMasked,
+                                                 Alignment, Ctx.CostKind,
+                                                 &Ingredient);
   }
 
   InstructionCost Cost = 0;

// If the address value is uniform across all lane, then the address can be
// calculated with scalar type and broacast.
if (vputils::isSingleScalar(getAddr()))
Cost += Ctx.TTI.getAddressComputationCost(Ty->getScalarType());
Contributor

Looks fine for RISC-V but is this accurate for e.g. x86? I think the address might still be broadcasted? cc @alexey-bataev

Member

Yes, I think this is too optimistic

@ElvisWang123
Contributor Author

Closing since it is too optimistic. Will try another method to fix the regression caused by #149955.

@lukel97
Contributor

lukel97 commented Aug 7, 2025

Hi, after thinking about this a bit more I think this PR is actually NFC, and is probably the easiest way forward.

  1. getAddressComputationCost is only called with a SCEV in one place, getMemInstScalarizationCost
  2. getMemInstScalarizationCost is only used to drive the widening decision + legacy cost model, we're not currently passing in the SCEV anywhere in VPlan
  3. The only targets that utilise the SCEV (ARM, AArch64 and X86) only distinguish between vector and non-vector types when the SCEV is passed.
  4. So this PR shouldn't have any effect.

I really can't think of a better way to avoid the regression in #149955. Is it possible to reopen this PR? Hopefully all of this will become simpler once the widening decisions are driven by VPlan/done as VPlan transformations themselves.

Pasting in the TTI hooks for reference:

InstructionCost ARMTTIImpl::getAddressComputationCost(Type *Ty,
                                                      ScalarEvolution *SE,
                                                      const SCEV *Ptr) const {
  // Address computations in vectorized code with non-consecutive addresses will
  // likely result in more instructions compared to scalar code where the
  // computation can more often be merged into the index mode. The resulting
  // extra micro-ops can significantly decrease throughput.
  unsigned NumVectorInstToHideOverhead = 10;
  int MaxMergeDistance = 64;

  if (ST->hasNEON()) {
    if (Ty->isVectorTy() && SE &&
        !BaseT::isConstantStridedAccessLessThan(SE, Ptr, MaxMergeDistance + 1))
      return NumVectorInstToHideOverhead;

    // In many cases the address computation is not merged into the instruction
    // addressing mode.
    return 1;
  }
  return BaseT::getAddressComputationCost(Ty, SE, Ptr);
}


InstructionCost
AArch64TTIImpl::getAddressComputationCost(Type *Ty, ScalarEvolution *SE,
                                          const SCEV *Ptr) const {
  // Address computations in vectorized code with non-consecutive addresses will
  // likely result in more instructions compared to scalar code where the
  // computation can more often be merged into the index mode. The resulting
  // extra micro-ops can significantly decrease throughput.
  unsigned NumVectorInstToHideOverhead = NeonNonConstStrideOverhead;
  int MaxMergeDistance = 64;

  if (Ty->isVectorTy() && SE &&
      !BaseT::isConstantStridedAccessLessThan(SE, Ptr, MaxMergeDistance + 1))
    return NumVectorInstToHideOverhead;

  // In many cases the address computation is not merged into the instruction
  // addressing mode.
  return 1;
}


InstructionCost X86TTIImpl::getAddressComputationCost(Type *Ty,
                                                      ScalarEvolution *SE,
                                                      const SCEV *Ptr) const {
  // Address computations in vectorized code with non-consecutive addresses will
  // likely result in more instructions compared to scalar code where the
  // computation can more often be merged into the index mode. The resulting
  // extra micro-ops can significantly decrease throughput.
  const unsigned NumVectorInstToHideOverhead = 10;

  // Cost modeling of Strided Access Computation is hidden by the indexing
  // modes of X86 regardless of the stride value. We don't believe that there
  // is a difference between constant strided access in general and constant
  // strided value which is less than or equal to 64.
  // Even in the case of (loop invariant) stride whose value is not known at
  // compile time, the address computation will not incur more than one extra
  // ADD instruction.
  if (Ty->isVectorTy() && SE && !ST->hasAVX2()) {
    // TODO: AVX2 is the current cut-off because we don't have correct
    //       interleaving costs for prior ISA's.
    if (!BaseT::isStridedAccess(Ptr))
      return NumVectorInstToHideOverhead;
    if (!BaseT::getConstantStrideStep(SE, Ptr))
      return 1;
  }

  return BaseT::getAddressComputationCost(Ty, SE, Ptr);
}
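
For reference, the shared default implementation is approximated below (paraphrased from memory, not quoted from the current tree); it ignores the type entirely, which is consistent with the observation above that, without a SCEV, none of these hooks distinguish scalar from vector types:

// Approximate default behaviour (assumption, paraphrased): with no
// target-specific handling, address computation is modelled as free, so the
// scalar-vs-vector type distinction only matters in the overrides above.
InstructionCost getAddressComputationCost(Type *Ty, ScalarEvolution *SE,
                                          const SCEV *Ptr) const {
  return 0;
}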

…or gather/scatter.

This patch queries `getAddressComputationCost()` with a scalar type if the
address is uniform. This makes the cost of gather/scatter more accurate.

In the current LV, a non-consecutive VPWidenMemoryRecipe (gather/scatter)
accounts for the cost of the address computation. But in some cases the
address is uniform across lanes, so it can be computed with a scalar type
and broadcast.

I have a follow-up optimization that converts gathers/scatters with uniform
memory accesses into a scalar load/store + broadcast. With that optimization,
this temporary change can be removed.
@ElvisWang123
Contributor Author

Thanks @lukel97, will reopen this.

ElvisWang123 reopened this Aug 7, 2025
// The address computation cost can be query as scalar type if the
// address is uniform.
if (!MR->isConsecutive() && vputils::isSingleScalar(MR->getAddr()))
return true;
Contributor

Were you seeing any cost model assertions with this PR? I guess the VPlan cost model never passed in the pointer SCEV to begin with so is it already out of sync on other targets?

Contributor Author

After rebasing, this is no longer needed. Removed, thanks!

ElvisWang123 force-pushed the lv-fix-addrComp-cost branch from d56bab9 to 8989f33 on August 8, 2025 at 06:05
Contributor

lukel97 left a comment

LGTM.

Btw, I think it would make more sense to change the Type argument of getAddressComputationCost to the type of the pointer, not the value. But we don't do this consistently today.

LoopVectorizationCostModel::getMemInstScalarizationCost passes in the pointer type:

  // Get the cost of the scalar memory instruction and address computation.
  InstructionCost Cost =
      VF.getFixedValue() * TTI.getAddressComputationCost(PtrTy, SE, PtrSCEV);

X86 calls it PtrTy:

InstructionCost getAddressComputationCost(Type *PtrTy, ScalarEvolution *SE,

But VectorCombine passes in the element type:

    ScalarizedCost += TTI.getAddressComputationCost(VecTy->getElementType());

As does VPWidenMemoryRecipe::computeCost.

I think we should fix this in a separate PR.
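
For illustration only, a rough sketch of what that separate change could look like in VPWidenMemoryRecipe::computeCost, reusing the names from the diff above; this is hypothetical and not part of this PR:

// Hypothetical follow-up sketch: cost the address computation on the pointer
// type rather than the value type, scalarized when the address is uniform.
Type *PtrTy = toVectorTy(Ptr->getType(), VF);
if (vputils::isSingleScalar(getAddr()))
  PtrTy = PtrTy->getScalarType();
Cost += Ctx.TTI.getAddressComputationCost(PtrTy);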

Comment on lines +3104 to +3105
// If the address value is uniform across all lane, then the address can be
// calculated with scalar type and broacast.
Contributor

Suggested change
// If the address value is uniform across all lane, then the address can be
// calculated with scalar type and broacast.
// If the address value is uniform across all lanes, then the address can be
// calculated with scalar type and broadcast.

Contributor

fhahn left a comment

It should be possible to add a test for this?

@lukel97
Contributor

lukel97 commented Aug 8, 2025

> It should be possible to add a test for this?

I think this is NFC with the current TTIs, since they only use the type when the SCEV argument is passed. We don't pass it in VPWidenMemoryRecipe::computeCost.

It makes a difference with #149955 though because RISC-V will start checking the type without the SCEV. Should the changes be bundled up into that PR?
