-
Notifications
You must be signed in to change notification settings - Fork 15.2k
[RISCV] Tune flag for fast vrgather.vv #124664
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-backend-risc-v Author: Petr Penzin (ppenzin) ChangesWIP, for review Add tune knob for N*Log2(N) vrgather.vv cost. Full diff: https://github.com/llvm/llvm-project/pull/124664.diff 4 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVFeatures.td b/llvm/lib/Target/RISCV/RISCVFeatures.td
index 4119dd77804f1a..966c56185b3fd9 100644
--- a/llvm/lib/Target/RISCV/RISCVFeatures.td
+++ b/llvm/lib/Target/RISCV/RISCVFeatures.td
@@ -1365,6 +1365,10 @@ def FeatureUnalignedVectorMem
"true", "Has reasonably performant unaligned vector "
"loads and stores">;
+def TuneFastVRGather
+ : SubtargetFeature<"fast-vrgather", "HasFastVRGather",
+ "true", "Has vrgather.vv with LMUL*log2(LMUL) latency">;
+
def TunePostRAScheduler : SubtargetFeature<"use-postra-scheduler",
"UsePostRAScheduler", "true", "Schedule again after register allocation">;
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 5e5bc0819a10cc..e49c1e7ce9edbc 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -2848,6 +2848,11 @@ InstructionCost RISCVTargetLowering::getLMULCost(MVT VT) const {
/// is generally quadratic in the number of vreg implied by LMUL. Note that
/// operand (index and possibly mask) are handled separately.
InstructionCost RISCVTargetLowering::getVRGatherVVCost(MVT VT) const {
+ auto LMULCost = getLMULCost(VT);
+ if (true && Subtarget.hasFastVRGather() && LMULCost.isValid()) {
+ unsigned Log = Log2_64(*LMULCost.getValue());
+ return LMULCost * Log;
+ }
return getLMULCost(VT) * getLMULCost(VT);
}
diff --git a/llvm/lib/Target/RISCV/RISCVProcessors.td b/llvm/lib/Target/RISCV/RISCVProcessors.td
index 6dfed7ddeb9f63..4f6c9a0229d51b 100644
--- a/llvm/lib/Target/RISCV/RISCVProcessors.td
+++ b/llvm/lib/Target/RISCV/RISCVProcessors.td
@@ -490,6 +490,7 @@ def TENSTORRENT_ASCALON_D8 : RISCVProcessorModel<"tt-ascalon-d8",
FeatureUnalignedScalarMem,
FeatureUnalignedVectorMem]),
[TuneNoDefaultUnroll,
+ TuneFastVRGather,
TuneOptimizedZeroStrideLoad,
TunePostRAScheduler]>;
diff --git a/llvm/test/CodeGen/RISCV/features-info.ll b/llvm/test/CodeGen/RISCV/features-info.ll
index 70fbda47a14a14..dab9bf92cef17d 100644
--- a/llvm/test/CodeGen/RISCV/features-info.ll
+++ b/llvm/test/CodeGen/RISCV/features-info.ll
@@ -31,6 +31,7 @@
; CHECK: experimental-zvbc32e - 'Zvbc32e' (Vector Carryless Multiplication with 32-bits elements).
; CHECK: experimental-zvkgs - 'Zvkgs' (Vector-Scalar GCM instructions for Cryptography).
; CHECK: f - 'F' (Single-Precision Floating-Point).
+; CHECK: fast-vrgather - Has vrgather.vv with LMUL*log2(LMUL) latency
; CHECK: forced-atomics - Assume that lock-free native-width atomics are available.
; CHECK: h - 'H' (Hypervisor).
; CHECK: i - 'I' (Base Integer Instruction Set).
|
lukel97
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, could you add a new RUN line in test/Analysis/CostModel/RISCV/rvv/shuffle-permute that has the fast flag enabled?
|
@lukel97 does this work and should I try to remove undef from the test? |
You can test this locally with the following command:git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' cb4f24b0e5c4e7c463e59120af4f13ab81519047 8a48a6c6edf2c317d1a18e1a7db72a6cae84d154 llvm/lib/Target/RISCV/RISCVISelLowering.cpp llvm/lib/Target/RISCV/RISCVSubtarget.h llvm/test/Analysis/CostModel/RISCV/shuffle-permute.ll llvm/test/CodeGen/RISCV/features-info.llThe following files introduce new uses of undef:
Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields In tests, avoid using For example, this is considered a bad practice: define void @fn() {
...
br i1 undef, ...
}Please use the following instead: define void @fn(i1 %cond) {
...
br i1 %cond, ...
}Please refer to the Undefined Behavior Manual for more information. |
|
I took a brief look, it doesn't look like it would be very difficult to convert |
lukel97
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can ignore the undef warning for now, pretty much every test in that directory is guilty of using undef!
lukel97
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we reuse the LMULCost local here? No reason to call it 3 times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, addressed
preames
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Since there were a few approvals, I've removed WIP. @topperc what do you think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we test larger LMUL? This is only testing up to LMUL=2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are testing 16 x i64 and 16 x double, which gets LMULCost of 8; with quadratic estimate it leads to total cost of 139 and with the fast flag it is 59.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is 256-bit i64 vector cheaper than 256-bit i32 and i16 vectors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because IndexCost is cheaper by one, same before this patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not blocking, but what if the vrgather.vv is not with LMUL*log2(LMUL) complexity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDK, that is the only other alternative I know of personally. We can rename flag to be something like 'HasLog2Vrgather' instead of just "fast". Or change the description to say something like "non-quadratic". I don't have a strong preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about adding an enum to represent the cost model?
enum VRGatherCostModel {
Quadratic, // Default
Log2,
....
}def TuneLog2VRGather
: SubtargetFeature<"log2-vrgather", "VRGatherCostModel",
"Log2", "Has vrgather.vv with LMUL*log2(LMUL) latency">;
This is more extensible I think. I don't know the details but I remember that the Ventana's V2 has an optimized vrgather.vv implementation as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many different types of cost model would we have? I would have thought LMUL * log2(LMUL) would be as good as a uarch can get. I'm not strongly opinionated on this though, an enum sounds reasonable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong preference here, and would tend to bias towards support what we know of now, then change later if multiple options show up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Early x280 cost is linear in VL. Later x280 is constant time for VL<= (VLEN/2)/SEW (single DLEN), but linear in VL for everything else. Later x280 is quadratic in LMUL*2 except for fractional LMUL or when element at a time would be faster (i.e. there are less elements in a destination DLEN than the number of possible source DLENs for them).
p470/p670 is quadratic in LMUL in the worst case. But there is effort made to be linear in number of source VLEN actually used with some overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enum it is then, I suppose 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should be addressed, PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment needs to be updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Early x280 cost is linear in VL. Later x280 is constant time for VL<= (VLEN/2)/SEW (single DLEN), but linear in VL for everything else. Later x280 is quadratic in LMUL*2 except for fractional LMUL or when element at a time would be faster (i.e. there are less elements in a destination DLEN than the number of possible source DLENs for them).
p470/p670 is quadratic in LMUL in the worst case. But there is effort made to be linear in number of source VLEN actually used with some overhead.
You can test this locally with the following command:git-clang-format --diff cb4f24b0e5c4e7c463e59120af4f13ab81519047 8a48a6c6edf2c317d1a18e1a7db72a6cae84d154 --extensions cpp,h -- llvm/lib/Target/RISCV/RISCVISelLowering.cpp llvm/lib/Target/RISCV/RISCVSubtarget.hView the diff from clang-format here.diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 6d3241e79a..21a71f926a 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -2867,7 +2867,6 @@ InstructionCost RISCVTargetLowering::getLMULCost(MVT VT) const {
return Cost;
}
-
/// Return the cost of a vrgather.vv instruction for the type VT. vrgather.vv
/// may be quadratic in the number of vreg implied by LMUL, and is assumed to
/// be by default. VRGatherCostModel reflects available options. Note that
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.h b/llvm/lib/Target/RISCV/RISCVSubtarget.h
index cc9aef2d52..d03afe1c23 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.h
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.h
@@ -160,7 +160,9 @@ public:
/// initializeProperties().
RISCVProcFamilyEnum getProcFamily() const { return RISCVProcFamily; }
- RISCVVRGatherCostModelEnum getVRGatherCostModel() const { return RISCVVRGatherCostModel; }
+ RISCVVRGatherCostModelEnum getVRGatherCostModel() const {
+ return RISCVVRGatherCostModel;
+ }
#define GET_SUBTARGETINFO_MACRO(ATTRIBUTE, DEFAULT, GETTER) \
bool GETTER() const { return ATTRIBUTE; }
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VRGarther -> VRGather
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something bad happened to this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, good catch. Should be addressed.
wangpc-pp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
topperc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Can anyone help me merge this? I can rebase if needed. |
Could you fix the test failure on test/CodeGen/RISCV/features-info.ll? I think that is relevant |
|
Should be addressed now, but I did add it a line to release notes which requires a rebase on top of current main. |
LGTM, please rebase then I'll help you to merge |
Add tune knob for N*Log2(N) vrgather.vv cost.
Rename option to explicitly say "log", change tune flag to set one of the values.
I think there are several patches under your belt now. Feel free to request for commit access. |
Add tune knob for N*Log2(N) vrgather.vv cost.