[NVPTX] Improve support for {ex2,lg2}.approx #120519
Conversation
@llvm/pr-subscribers-llvm-ir @llvm/pr-subscribers-backend-nvptx

Author: Princeton Ferro (Prince781)

Changes: Lower llvm.exp2 to ex2.approx for f32 and all vectors of f32.

Full diff: https://github.com/llvm/llvm-project/pull/120519.diff

3 Files Affected:
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index 5c1f717694a4c7..a922ce0ae104f1 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -968,7 +968,8 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
setOperationAction(ISD::CopyToReg, MVT::i128, Custom);
setOperationAction(ISD::CopyFromReg, MVT::i128, Custom);
- // No FEXP2, FLOG2. The PTX ex2 and log2 functions are always approximate.
+ setOperationAction(ISD::FEXP2, MVT::f32, Legal);
+ // No FLOG2. The PTX log2 function is always approximate.
// No FPOW or FREM in PTX.
// Now deduce the information based on the above mentioned
diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index abaf8e0b0ec1f8..6677a29e0d07d0 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -518,6 +518,19 @@ multiclass F3_fma_component<string OpcStr, SDNode OpNode> {
Requires<[hasBF16Math, noFMA]>;
}
+// Template for operations which take one f32 operand. Provides two
+// instructions: <OpcStr>.f32, and <OpcStr>.ftz.f32 (flush subnormal inputs and
+// results to zero).
+multiclass F1<string OpcStr, SDNode OpNode> {
+ def f32_ftz : NVPTXInst<(outs Float32Regs:$dst), (ins Float32Regs:$a),
+ !strconcat(OpcStr, ".ftz.f32 \t$dst, $a;"),
+ [(set Float32Regs:$dst, (OpNode Float32Regs:$a))]>,
+ Requires<[doF32FTZ]>;
+ def f32 : NVPTXInst<(outs Float32Regs:$dst), (ins Float32Regs:$a),
+ !strconcat(OpcStr, ".f32 \t$dst, $a;"),
+ [(set Float32Regs:$dst, (OpNode Float32Regs:$a))]>;
+}
+
// Template for operations which take two f32 or f64 operands. Provides three
// instructions: <OpcStr>.f64, <OpcStr>.f32, and <OpcStr>.ftz.f32 (flush
// subnormal inputs and results to zero).
@@ -1204,6 +1217,8 @@ defm FNEG_H: F2_Support_Half<"neg", fneg>;
defm FSQRT : F2<"sqrt.rn", fsqrt>;
+defm FEXP2 : F1<"ex2.approx", fexp2>;
+
//
// F16 NEG
//
diff --git a/llvm/test/CodeGen/NVPTX/fexp2.ll b/llvm/test/CodeGen/NVPTX/fexp2.ll
new file mode 100644
index 00000000000000..247629865cdd74
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/fexp2.ll
@@ -0,0 +1,47 @@
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_52 -mattr=+ptx86 | FileCheck --check-prefixes=CHECK %s
+; RUN: %if ptxas-12.6 %{ llc < %s -march=nvptx64 -mcpu=sm_52 -mattr=+ptx86 | %ptxas-verify -arch=sm_52 %}
+source_filename = "fexp2.ll"
+target datalayout = "e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-i128:128:128-f32:32:32-f64:64:64-f128:128:128-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64-a:8:8"
+target triple = "nvptx64-nvidia-cuda"
+
+; CHECK-LABEL: exp2_test
+define ptx_kernel void @exp2_test(ptr %a, ptr %res) local_unnamed_addr {
+entry:
+ %in = load float, ptr %a, align 4
+ ; CHECK: ex2.approx.f32 [[D1:%f[0-9]+]], [[S1:%f[0-9]+]]
+ %exp2 = call float @llvm.exp2.f32(float %in)
+ ; CHECK: st.global.f32 {{.*}}, [[D1]]
+ store float %exp2, ptr %res, align 4
+ ret void
+}
+
+; CHECK-LABEL: exp2_ftz_test
+define ptx_kernel void @exp2_ftz_test(ptr %a, ptr %res) local_unnamed_addr #0 {
+entry:
+ %in = load float, ptr %a, align 4
+ ; CHECK: ex2.approx.ftz.f32 [[D1:%f[0-9]+]], [[S1:%f[0-9]+]]
+ %exp2 = call float @llvm.exp2.f32(float %in)
+ ; CHECK: st.global.f32 {{.*}}, [[D1]]
+ store float %exp2, ptr %res, align 4
+ ret void
+}
+
+; CHECK-LABEL: exp2_test_v
+define ptx_kernel void @exp2_test_v(ptr %a, ptr %res) local_unnamed_addr {
+entry:
+ %in = load <4 x float>, ptr %a, align 16
+ ; CHECK: ex2.approx.f32 [[D1:%f[0-9]+]], [[S1:%f[0-9]+]]
+ ; CHECK: ex2.approx.f32 [[D2:%f[0-9]+]], [[S2:%f[0-9]+]]
+ ; CHECK: ex2.approx.f32 [[D3:%f[0-9]+]], [[S3:%f[0-9]+]]
+ ; CHECK: ex2.approx.f32 [[D4:%f[0-9]+]], [[S4:%f[0-9]+]]
+ %exp2 = call <4 x float> @llvm.exp2.v4f32(<4 x float> %in)
+ ; CHECK: st.global.v4.f32 {{.*}}, {{[{]}}[[D4]], [[D3]], [[D2]], [[D1]]{{[}]}}
+ store <4 x float> %exp2, ptr %res, align 16
+ ret void
+}
+
+declare float @llvm.exp2.f32(float %val)
+
+declare <4 x float> @llvm.exp2.v4f32(<4 x float> %val)
+
+attributes #0 = {"denormal-fp-math"="preserve-sign"}
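As the `exp2_test_v` checks in the test file above illustrate, PTX has no vector `ex2` instruction, so the `<4 x float>` intrinsic is scalarized into four `ex2.approx.f32` operations. Semantically this is just elementwise `2**x`; the sketch below models that behavior (it is an illustration of the semantics, not the compiler's actual legalization code):

```python
def exp2_v4(v: list) -> list:
    # Elementwise 2**x over a 4-wide vector -- the semantics preserved
    # when the DAG scalarizes llvm.exp2.v4f32 into four scalar ops.
    assert len(v) == 4
    return [2.0 ** x for x in v]
```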
AlexMaclean left a comment:
Nice, barring some minor stylistic issues to clean up, this looks good to me. Any chance you could add the (b)f16 variants as well?
AlexMaclean left a comment:
Nice, LGTM
Artem-B left a comment:
I'm not sure that lowering fexp2 to ex2.approx is a good idea.
At the very least it should've been conditional to some sort of fast math flag allowing reduced precision.
Prince781 replied: I think it's not a bad idea, since there is no non-approximate implementation in PTX, which is something users of NVPTX should know. Making the lowering only work for fast-math would break unoptimized code. Having to use inline PTX to access exp2() is too cumbersome, especially when using vectors.
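For a concrete sense of the precision trade-off being debated, here is a toy software model of an approximate exp2: split `x` into integer and fractional parts and evaluate a low-order polynomial for the fractional part. The coefficients are truncated Taylor terms chosen purely for illustration; this is not the algorithm `ex2.approx` actually implements, which has much tighter error bounds documented in the PTX ISA.

```python
import math

def approx_exp2(x: float) -> float:
    """Toy reduced-precision exp2: 2**x = 2**i * 2**f, cubic for 2**f.

    Illustrative only; NOT the algorithm ex2.approx implements.
    """
    i = math.floor(x)
    f = x - i  # f is in [0, 1)
    # Truncated Taylor series of 2**f about 0, in Horner form:
    # 1 + f*ln2 + (f*ln2)**2/2 + (f*ln2)**3/6
    ln2 = 0.6931471805599453
    p = 1.0 + f * (ln2 + f * (ln2**2 / 2.0 + f * (ln2**3 / 6.0)))
    return math.ldexp(p, i)
```

The cubic drifts from the exact value as `f` approaches 1, which is far worse than the hardware instruction, but it shows why an approximate lowering can break code that expects host and device exp2 to agree bit-for-bit.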
We have explicit flags to enable approximate reciprocal and sqrt, and these instructions should follow a similar pattern.

I agree that enabling them automatically for fast-math may be confusing (though it may be worth checking whether we have similar situations on other platforms that could give us some guidelines on how to handle this). Letting the user enable these instructions explicitly should work. Letting the compiler generate low-precision results will likely break things at runtime (there's a lot of existing code assuming that host/device compilations will produce nearly identical results). I'd prefer things to fail early, in a painfully obvious way, if the compiler can't do something correctly.
Okay, this feature is now behind the flags ….
Updates: …

Updated with more improvements. Please see the commit message / first comment for more details!
@AlexMaclean Tried with the …. It would be nice if SelectionDAG supported something like "preserve …". Anyway, I think these changes can be merged now.
- Add support for `@llvm.exp2()`:
- LLVM: `float` -> PTX: `ex2.approx{.ftz}.f32`
- LLVM: `half` -> PTX: `ex2.approx.f16`
- LLVM: `<2 x half>` -> PTX: `ex2.approx.f16x2`
- LLVM: `bfloat` -> PTX: `ex2.approx.ftz.bf16`
- LLVM: `<2 x bfloat>` -> PTX: `ex2.approx.ftz.bf16x2`
- Any operations with non-native vector widths are expanded. On
targets not supporting f16/bf16, values are promoted to f32.
- Add *CONDITIONAL* support for `@llvm.log2()` [^1]:
- LLVM: `float` -> PTX: `lg2.approx{.ftz}.f32`
- Support for f16/bf16 is emulated by promoting values to f32.
[^1]: CUDA implements `exp2()` with `ex2.approx`, but `log2()` is
implemented differently, so this is off by default. To enable, use the
flag `-nvptx-approx-log2f32`.
Ping
AlexMaclean left a comment:
LGTM
Pinging one of the code owners to merge this.
Thanks @Artem-B!
LLVM Buildbot has detected a new failure on builder …. Full details are available at: https://lab.llvm.org/buildbot/#/builders/154/builds/10390. Here is the relevant piece of the build log for reference: …
Also, I thought I'd ask here: do you know how I can gain write access? I emailed Chris Lattner but he didn't respond. |
See https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access, though I believe they're in the process of updating that to something like requiring 5 commits and two existing contributors to +1.
@Prince781 It appears that the tests are generating 32-bit PTX, which is no longer supported by recent CUDA versions. You can reproduce it by running the tests with ….
The triple is just missing ….
@jhuber6 Thank you!