Skip to content

Conversation

@ppenzin
Copy link
Contributor

@ppenzin ppenzin commented Jan 28, 2025

Add tune knob for N*Log2(N) vrgather.vv cost.

@llvmbot
Copy link
Member

llvmbot commented Jan 28, 2025

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-backend-risc-v

Author: Petr Penzin (ppenzin)

Changes

WIP, for review

Add tune knob for N*Log2(N) vrgather.vv cost.


Full diff: https://github.com/llvm/llvm-project/pull/124664.diff

4 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVFeatures.td (+4)
  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+5)
  • (modified) llvm/lib/Target/RISCV/RISCVProcessors.td (+1)
  • (modified) llvm/test/CodeGen/RISCV/features-info.ll (+1)
diff --git a/llvm/lib/Target/RISCV/RISCVFeatures.td b/llvm/lib/Target/RISCV/RISCVFeatures.td
index 4119dd77804f1a..966c56185b3fd9 100644
--- a/llvm/lib/Target/RISCV/RISCVFeatures.td
+++ b/llvm/lib/Target/RISCV/RISCVFeatures.td
@@ -1365,6 +1365,10 @@ def FeatureUnalignedVectorMem
                       "true", "Has reasonably performant unaligned vector "
                       "loads and stores">;
 
+def TuneFastVRGather
+   : SubtargetFeature<"fast-vrgather", "HasFastVRGather",
+                      "true", "Has vrgather.vv with LMUL*log2(LMUL) latency">;
+
 def TunePostRAScheduler : SubtargetFeature<"use-postra-scheduler",
     "UsePostRAScheduler", "true", "Schedule again after register allocation">;
 
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 5e5bc0819a10cc..e49c1e7ce9edbc 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -2848,6 +2848,11 @@ InstructionCost RISCVTargetLowering::getLMULCost(MVT VT) const {
 /// is generally quadratic in the number of vreg implied by LMUL.  Note that
 /// operand (index and possibly mask) are handled separately.
 InstructionCost RISCVTargetLowering::getVRGatherVVCost(MVT VT) const {
+  auto LMULCost = getLMULCost(VT);
+  if (true && Subtarget.hasFastVRGather() && LMULCost.isValid()) {
+    unsigned Log = Log2_64(*LMULCost.getValue());
+    return LMULCost * Log;
+  }
   return getLMULCost(VT) * getLMULCost(VT);
 }
 
diff --git a/llvm/lib/Target/RISCV/RISCVProcessors.td b/llvm/lib/Target/RISCV/RISCVProcessors.td
index 6dfed7ddeb9f63..4f6c9a0229d51b 100644
--- a/llvm/lib/Target/RISCV/RISCVProcessors.td
+++ b/llvm/lib/Target/RISCV/RISCVProcessors.td
@@ -490,6 +490,7 @@ def TENSTORRENT_ASCALON_D8 : RISCVProcessorModel<"tt-ascalon-d8",
                                                   FeatureUnalignedScalarMem,
                                                   FeatureUnalignedVectorMem]),
                                                  [TuneNoDefaultUnroll,
+                                                  TuneFastVRGather,
                                                   TuneOptimizedZeroStrideLoad,
                                                   TunePostRAScheduler]>;
 
diff --git a/llvm/test/CodeGen/RISCV/features-info.ll b/llvm/test/CodeGen/RISCV/features-info.ll
index 70fbda47a14a14..dab9bf92cef17d 100644
--- a/llvm/test/CodeGen/RISCV/features-info.ll
+++ b/llvm/test/CodeGen/RISCV/features-info.ll
@@ -31,6 +31,7 @@
 ; CHECK:   experimental-zvbc32e             - 'Zvbc32e' (Vector Carryless Multiplication with 32-bits elements).
 ; CHECK:   experimental-zvkgs               - 'Zvkgs' (Vector-Scalar GCM instructions for Cryptography).
 ; CHECK:   f                                - 'F' (Single-Precision Floating-Point).
+; CHECK:   fast-vrgather                    - Has vrgather.vv with LMUL*log2(LMUL) latency
 ; CHECK:   forced-atomics                   - Assume that lock-free native-width atomics are available.
 ; CHECK:   h                                - 'H' (Hypervisor).
 ; CHECK:   i                                - 'I' (Base Integer Instruction Set).

Copy link
Contributor

@lukel97 lukel97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, could you add a new RUN line in test/Analysis/CostModel/RISCV/rvv/shuffle-permute that has the fast flag enabled?

@llvmbot llvmbot added the llvm:analysis Includes value tracking, cost tables and constant folding label Jan 28, 2025
@ppenzin
Copy link
Contributor Author

ppenzin commented Jan 28, 2025

@lukel97 does this work and should I try to remove undef from the test?

@github-actions
Copy link

github-actions bot commented Jan 28, 2025

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' cb4f24b0e5c4e7c463e59120af4f13ab81519047 8a48a6c6edf2c317d1a18e1a7db72a6cae84d154 llvm/lib/Target/RISCV/RISCVISelLowering.cpp llvm/lib/Target/RISCV/RISCVSubtarget.h llvm/test/Analysis/CostModel/RISCV/shuffle-permute.ll llvm/test/CodeGen/RISCV/features-info.ll

The following files introduce new uses of undef:

  • llvm/test/Analysis/CostModel/RISCV/shuffle-permute.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.

@ppenzin
Copy link
Contributor Author

ppenzin commented Jan 29, 2025

I took a brief look, it doesn't look like it would be very difficult to convert undef into arguments, I can give that a go in a separate PR.

Copy link
Contributor

@lukel97 lukel97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can ignore the undef warning for now, pretty much every test in that directory is guilty of using undef!

Copy link
Contributor

@lukel97 lukel97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reuse the LMULCost local here? No reason to call it 3 times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, addressed

Copy link
Collaborator

@preames preames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ppenzin ppenzin changed the title [WIP][RISCV] Tune flag for fast vrgather.vv [RISCV] Tune flag for fast vrgather.vv Jan 30, 2025
@ppenzin
Copy link
Contributor Author

ppenzin commented Jan 30, 2025

Since there were a few approvals, I've removed WIP. @topperc what do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we test larger LMUL? This is only testing up to LMUL=2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are testing 16 x i64 and 16 x double, which gets LMULCost of 8; with quadratic estimate it leads to total cost of 139 and with the fast flag it is 59.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is 256-bit i64 vector cheaper than 256-bit i32 and i16 vectors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because IndexCost is cheaper by one, same before this patch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking, but what if the vrgather.vv is not with LMUL*log2(LMUL) complexity?

Copy link
Contributor Author

@ppenzin ppenzin Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK, that is the only other alternative I know of personally. We can rename flag to be something like 'HasLog2Vrgather' instead of just "fast". Or change the description to say something like "non-quadratic". I don't have a strong preference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about adding an enum to represent the cost model?

enum VRGatherCostModel {
    Quadratic, // Default
    Log2,
    ....
}
def TuneLog2VRGather
   : SubtargetFeature<"log2-vrgather", "VRGatherCostModel",
                      "Log2", "Has vrgather.vv with LMUL*log2(LMUL) latency">;

This is more extensible I think. I don't know the details but I remember that the Ventana's V2 has an optimized vrgather.vv implementation as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I can do that. @topperc and @lukel97 any thoughts from maintainability point of view? Also, anyone from Ventana would like to share their input? (@mgudim ?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many different types of cost model would we have? I would have thought LMUL * log2(LMUL) would be as good as a uarch can get. I'm not strongly opinionated on this though, an enum sounds reasonable to me.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong preference here, and would tend to bias towards support what we know of now, then change later if multiple options show up.

Copy link
Collaborator

@topperc topperc Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early x280 cost is linear in VL. Later x280 is constant time for VL<= (VLEN/2)/SEW (single DLEN), but linear in VL for everything else. Later x280 is quadratic in LMUL*2 except for fractional LMUL or when element at a time would be faster (i.e. there are less elements in a destination DLEN than the number of possible source DLENs for them).

p470/p670 is quadratic in LMUL in the worst case. But there is effort made to be linear in number of source VLEN actually used with some overhead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enum it is then, I suppose 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be addressed, PTAL

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment needs to be updated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

Copy link
Collaborator

@topperc topperc Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early x280 cost is linear in VL. Later x280 is constant time for VL<= (VLEN/2)/SEW (single DLEN), but linear in VL for everything else. Later x280 is quadratic in LMUL*2 except for fractional LMUL or when element at a time would be faster (i.e. there are less elements in a destination DLEN than the number of possible source DLENs for them).

p470/p670 is quadratic in LMUL in the worst case. But there is effort made to be linear in number of source VLEN actually used with some overhead.

@github-actions
Copy link

github-actions bot commented Feb 21, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff cb4f24b0e5c4e7c463e59120af4f13ab81519047 8a48a6c6edf2c317d1a18e1a7db72a6cae84d154 --extensions cpp,h -- llvm/lib/Target/RISCV/RISCVISelLowering.cpp llvm/lib/Target/RISCV/RISCVSubtarget.h
View the diff from clang-format here.
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 6d3241e79a..21a71f926a 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -2867,7 +2867,6 @@ InstructionCost RISCVTargetLowering::getLMULCost(MVT VT) const {
   return Cost;
 }
 
-
 /// Return the cost of a vrgather.vv instruction for the type VT.  vrgather.vv
 /// may be quadratic in the number of vreg implied by LMUL, and is assumed to
 /// be by default.  VRGatherCostModel reflects available options.  Note that
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.h b/llvm/lib/Target/RISCV/RISCVSubtarget.h
index cc9aef2d52..d03afe1c23 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.h
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.h
@@ -160,7 +160,9 @@ public:
   /// initializeProperties().
   RISCVProcFamilyEnum getProcFamily() const { return RISCVProcFamily; }
 
-  RISCVVRGatherCostModelEnum getVRGatherCostModel() const { return RISCVVRGatherCostModel; }
+  RISCVVRGatherCostModelEnum getVRGatherCostModel() const {
+    return RISCVVRGatherCostModel;
+  }
 
 #define GET_SUBTARGETINFO_MACRO(ATTRIBUTE, DEFAULT, GETTER) \
   bool GETTER() const { return ATTRIBUTE; }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VRGarther -> VRGather

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something bad happened to this comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, good catch. Should be addressed.

Copy link
Contributor

@wangpc-pp wangpc-pp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Collaborator

@topperc topperc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ppenzin
Copy link
Contributor Author

ppenzin commented Mar 3, 2025

Can anyone help me merge this? I can rebase if needed.

@mshockwave
Copy link
Member

Can anyone help me merge this? I can rebase if needed.

Could you fix the test failure on test/CodeGen/RISCV/features-info.ll? I think that is relevant

@ppenzin
Copy link
Contributor Author

ppenzin commented Mar 3, 2025

Should be addressed now, but I did add it a line to release notes which requires a rebase on top of current main.

@mshockwave
Copy link
Member

Should be addressed now, but I did add it a line to release notes which requires a rebase on top of current main.

LGTM, please rebase then I'll help you to merge

@mshockwave mshockwave merged commit b44fbde into llvm:main Mar 4, 2025
6 of 10 checks passed
@mshockwave
Copy link
Member

Can anyone help me merge this? I can rebase if needed.

I think there are several patches under your belt now. Feel free to request for commit access.

@ppenzin ppenzin deleted the fast-vrgather branch March 4, 2025 05:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend:RISC-V llvm:analysis Includes value tracking, cost tables and constant folding

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants