
Conversation

@Prince781 (Contributor) commented Dec 19, 2024

  • Add support for @llvm.exp2():

    • LLVM: float -> PTX: ex2.approx{.ftz}.f32
    • LLVM: half -> PTX: ex2.approx.f16
    • LLVM: <2 x half> -> PTX: ex2.approx.f16x2
    • LLVM: bfloat -> PTX: ex2.approx.ftz.bf16
    • LLVM: <2 x bfloat> -> PTX: ex2.approx.ftz.bf16x2
    • Any operations with non-native vector widths are expanded. On
      targets not supporting f16/bf16, values are promoted to f32.
  • Add CONDITIONAL support for @llvm.log2() [^1]:

    • LLVM: float -> PTX: lg2.approx{.ftz}.f32
    • Support for f16/bf16 is emulated by promoting values to f32.

[^1]: CUDA implements exp2() with ex2.approx but log2() is
implemented differently, so this is off by default. To enable it, use the
flag -nvptx-approx-log2f32.
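
The .ftz modifier that appears in several of these mappings flushes subnormal f32 inputs and results to zero. As a rough reference model of that flush-to-zero behavior only (a Python sketch; the real ex2.approx is a reduced-precision hardware approximation, modeled here with exact arithmetic, and the function names are mine, not PTX):

```python
import math

F32_MIN_NORMAL = 2.0 ** -126  # smallest positive normal float32

def ftz(x: float) -> float:
    # The .ftz modifier flushes subnormal f32 values to (signed) zero.
    if x != 0.0 and abs(x) < F32_MIN_NORMAL:
        return math.copysign(0.0, x)
    return x

def ex2_approx_ftz(x: float) -> float:
    # Reference model of ex2.approx.ftz.f32: flush a subnormal input,
    # compute 2^x, then flush a subnormal result. The real instruction is a
    # reduced-precision approximation; this sketch uses exact arithmetic.
    return ftz(2.0 ** ftz(x))
```

For instance, ex2_approx_ftz(-140.0) yields 0.0 because 2^-140 is subnormal in f32, whereas the non-.ftz form would keep the subnormal value.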

@llvmbot (Member) commented Dec 19, 2024

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-nvptx

Author: Princeton Ferro (Prince781)

Changes

Lower llvm.exp2 to ex2.approx for f32 and all vectors of f32.


Full diff: https://github.com/llvm/llvm-project/pull/120519.diff

3 Files Affected:

  • (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (+2-1)
  • (modified) llvm/lib/Target/NVPTX/NVPTXInstrInfo.td (+15)
  • (added) llvm/test/CodeGen/NVPTX/fexp2.ll (+47)
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index 5c1f717694a4c7..a922ce0ae104f1 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -968,7 +968,8 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
   setOperationAction(ISD::CopyToReg, MVT::i128, Custom);
   setOperationAction(ISD::CopyFromReg, MVT::i128, Custom);
 
-  // No FEXP2, FLOG2.  The PTX ex2 and log2 functions are always approximate.
+  setOperationAction(ISD::FEXP2, MVT::f32, Legal);
+  // No FLOG2. The PTX log2 function is always approximate.
   // No FPOW or FREM in PTX.
 
   // Now deduce the information based on the above mentioned
diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index abaf8e0b0ec1f8..6677a29e0d07d0 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -518,6 +518,19 @@ multiclass F3_fma_component<string OpcStr, SDNode OpNode> {
                Requires<[hasBF16Math, noFMA]>;
 }
 
+// Template for operations which take one f32 operand.  Provides two
+// instructions: <OpcStr>.f32, and <OpcStr>.ftz.f32 (flush subnormal inputs and
+// results to zero).
+multiclass F1<string OpcStr, SDNode OpNode> {
+   def f32_ftz : NVPTXInst<(outs Float32Regs:$dst), (ins Float32Regs:$a),
+                           !strconcat(OpcStr, ".ftz.f32 \t$dst, $a;"),
+                           [(set Float32Regs:$dst, (OpNode Float32Regs:$a))]>,
+                           Requires<[doF32FTZ]>;
+   def f32 :     NVPTXInst<(outs Float32Regs:$dst), (ins Float32Regs:$a),
+                           !strconcat(OpcStr, ".f32 \t$dst, $a;"),
+                           [(set Float32Regs:$dst, (OpNode Float32Regs:$a))]>;
+}
+
 // Template for operations which take two f32 or f64 operands.  Provides three
 // instructions: <OpcStr>.f64, <OpcStr>.f32, and <OpcStr>.ftz.f32 (flush
 // subnormal inputs and results to zero).
@@ -1204,6 +1217,8 @@ defm FNEG_H: F2_Support_Half<"neg", fneg>;
 
 defm FSQRT : F2<"sqrt.rn", fsqrt>;
 
+defm FEXP2 : F1<"ex2.approx", fexp2>;
+
 //
 // F16 NEG
 //
diff --git a/llvm/test/CodeGen/NVPTX/fexp2.ll b/llvm/test/CodeGen/NVPTX/fexp2.ll
new file mode 100644
index 00000000000000..247629865cdd74
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/fexp2.ll
@@ -0,0 +1,47 @@
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_52 -mattr=+ptx86 | FileCheck --check-prefixes=CHECK %s
+; RUN: %if ptxas-12.6 %{ llc < %s -march=nvptx64 -mcpu=sm_52 -mattr=+ptx86 | %ptxas-verify -arch=sm_52 %}
+source_filename = "fexp2.ll"
+target datalayout = "e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-i128:128:128-f32:32:32-f64:64:64-f128:128:128-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64-a:8:8"
+target triple = "nvptx64-nvidia-cuda"
+
+; CHECK-LABEL: exp2_test
+define ptx_kernel void @exp2_test(ptr %a, ptr %res) local_unnamed_addr {
+entry:
+  %in = load float, ptr %a, align 4
+  ; CHECK: ex2.approx.f32 [[D1:%f[0-9]+]], [[S1:%f[0-9]+]]
+  %exp2 = call float @llvm.exp2.f32(float %in)
+  ; CHECK: st.global.f32 {{.*}}, [[D1]]
+  store float %exp2, ptr %res, align 4
+  ret void
+}
+
+; CHECK-LABEL: exp2_ftz_test
+define ptx_kernel void @exp2_ftz_test(ptr %a, ptr %res) local_unnamed_addr #0 {
+entry:
+  %in = load float, ptr %a, align 4
+  ; CHECK: ex2.approx.ftz.f32 [[D1:%f[0-9]+]], [[S1:%f[0-9]+]]
+  %exp2 = call float @llvm.exp2.f32(float %in)
+  ; CHECK: st.global.f32 {{.*}}, [[D1]]
+  store float %exp2, ptr %res, align 4
+  ret void
+}
+
+; CHECK-LABEL: exp2_test_v
+define ptx_kernel void @exp2_test_v(ptr %a, ptr %res) local_unnamed_addr {
+entry:
+  %in = load <4 x float>, ptr %a, align 16
+  ; CHECK: ex2.approx.f32 [[D1:%f[0-9]+]], [[S1:%f[0-9]+]]
+  ; CHECK: ex2.approx.f32 [[D2:%f[0-9]+]], [[S2:%f[0-9]+]]
+  ; CHECK: ex2.approx.f32 [[D3:%f[0-9]+]], [[S3:%f[0-9]+]]
+  ; CHECK: ex2.approx.f32 [[D4:%f[0-9]+]], [[S4:%f[0-9]+]]
+  %exp2 = call <4 x float> @llvm.exp2.v4f32(<4 x float> %in)
+  ; CHECK: st.global.v4.f32 {{.*}}, {{[{]}}[[D4]], [[D3]], [[D2]], [[D1]]{{[}]}}
+  store <4 x float> %exp2, ptr %res, align 16
+  ret void
+}
+
+declare float @llvm.exp2.f32(float %val)
+
+declare <4 x float> @llvm.exp2.v4f32(<4 x float> %val)
+
+attributes #0 = {"denormal-fp-math"="preserve-sign"}

@AlexMaclean requested a review from Artem-B on December 19, 2024
@AlexMaclean (Member) left a comment

Nice; barring some minor stylistic issues to clean up, this looks good to me. Any chance you could add the (b)f16 variants as well?

@Prince781 force-pushed the dev/pferro/nvptx-fexp2 branch 4 times, most recently from 5aeb9f8 to d61ba61 on December 19, 2024
@Prince781 force-pushed the dev/pferro/nvptx-fexp2 branch from d61ba61 to 99ddd72 on December 19, 2024
@AlexMaclean (Member) left a comment

Nice, LGTM

@Artem-B (Member) left a comment

I'm not sure that lowering fexp2 to ex2.approx is a good idea.

At the very least it should've been conditional to some sort of fast math flag allowing reduced precision.

@Prince781 (Contributor, Author) commented Dec 19, 2024

I'm not sure that lowering fexp2 to ex2.approx is a good idea.

At the very least it should've been conditional to some sort of fast math flag allowing reduced precision.

I think it's not a bad idea since there is no non-approximate implementation in PTX, which is something users of NVPTX should know. Making the lowering only work for fast-math would break unoptimized code.

Having to use inline PTX to access exp2() is too cumbersome, especially when using vectors.

@Prince781 force-pushed the dev/pferro/nvptx-fexp2 branch from 99ddd72 to e1a68bf on December 19, 2024
@Prince781 changed the title from "[NVPTX] Support llvm.exp2 for f32 and vector of f32" to "[NVPTX] Support llvm.{exp2,log2} for f32 and vector of f32" on December 19, 2024
@Prince781 force-pushed the dev/pferro/nvptx-fexp2 branch 2 times, most recently from 3761255 to bcc74fa on December 19, 2024
@Artem-B (Member) commented Dec 19, 2024

We have explicit flags to enable approximate reciprocal and sqrt, and these instructions should follow a similar pattern.

"nvptx-prec-divf32", cl::Hidden,

I agree that enabling them automatically for fast-math may be confusing (though it may be worth checking whether we have similar situations on other platforms that could give us some guidelines on how to handle this).

Letting the user enable these instructions explicitly should work.

Letting the compiler generate low-precision results will likely break things at runtime (there's a lot of existing code assuming that host/device compilations will produce nearly identical results). I'd prefer things to fail early, in a painfully obvious way, if the compiler can't do something correctly.

@Prince781 force-pushed the dev/pferro/nvptx-fexp2 branch from bcc74fa to 322982b on December 19, 2024
@Prince781 (Contributor, Author)

Okay, this feature is now behind the flags -nvptx-approx-exp2f32 and -nvptx-approx-log2f32, which are off by default.

@Prince781 force-pushed the dev/pferro/nvptx-fexp2 branch from 322982b to 1e5be93 on December 19, 2024
@Prince781 force-pushed the dev/pferro/nvptx-fexp2 branch 2 times, most recently from 0fc5638 to 5619291 on December 19, 2024
@Prince781 (Contributor, Author) commented Dec 19, 2024

Updates:

  • Added f16 and bf16 variants, which promote to f32.
  • Support is off by default. Users turn it on with either -nvptx-approx-{log2,exp2}f32 or -enable-unsafe-fp-math.
  • Added expected-failure tests for when support is not requested.

@Prince781 changed the title from "[NVPTX] Support exp2 and log2 for f32/f16/bf16 and vectors" to "[NVPTX] Improve support for {ex2,lg2}.approx" on December 24, 2024
@Prince781 (Contributor, Author)

Updated with more improvements. Please see commit message / first comment for more details!

@Prince781 force-pushed the dev/pferro/nvptx-fexp2 branch from faaca98 to 72f468e on December 24, 2024
@Prince781 force-pushed the dev/pferro/nvptx-fexp2 branch 3 times, most recently from 53aae32 to b42a67d on December 25, 2024
@Prince781 (Contributor, Author)

@AlexMaclean I tried with the afn flag on @llvm.log2(). This works only if you don't also have non-native operations that get expanded; e.g., f16 = flog2 afn t0 will be expanded to f16 = fptrunc (f32 flog2 (f32 fpextend t0)), where SelectionDAG drops afn on the new f32 flog2 node, causing a crash.

It would be nice if SelectionDAG supported something like "preserve afn".

Anyway, I think these changes can be merged now.
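
The promote-compute-truncate expansion described above can be sketched roughly as follows (a Python model, not SelectionDAG code; to_f16 and log2_f16_via_promotion are illustrative names, and lg2.approx's reduced precision is not modeled, with math.log2 standing in for it):

```python
import math
import struct

def to_f16(x: float) -> float:
    # Round a Python float to IEEE half precision and back; struct's 'e'
    # format performs the f16 rounding.
    return struct.unpack('e', struct.pack('e', x))[0]

def log2_f16_via_promotion(x: float) -> float:
    # Model of the f16 lowering: promote to f32, apply the f32 log2
    # (lg2.approx.f32 on the real target), truncate back to f16.
    return to_f16(math.log2(x))
```

This mirrors the f16 = fptrunc (f32 flog2 (f32 fpextend t0)) shape of the expansion; any fast-math flags on the original node would have to be reattached to the new f32 node, which is the part SelectionDAG drops.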

@Prince781 force-pushed the dev/pferro/nvptx-fexp2 branch from b42a67d to ba0caf9 on January 3, 2025
@Prince781 force-pushed the dev/pferro/nvptx-fexp2 branch from ba0caf9 to f711117 on January 3, 2025
@Prince781 force-pushed the dev/pferro/nvptx-fexp2 branch from f711117 to 6e95b75 on January 3, 2025
@Prince781 force-pushed the dev/pferro/nvptx-fexp2 branch from 6e95b75 to 71d90aa on January 3, 2025
@Prince781 (Contributor, Author)

Ping

@AlexMaclean (Member) left a comment

LGTM

@Prince781 (Contributor, Author)

Pinging one of the code owners to merge this.

@Artem-B merged commit 3ba339b into llvm:main on January 16, 2025
8 checks passed
@Prince781 (Contributor, Author)

Thanks @Artem-B!

@llvm-ci (Collaborator) commented Jan 16, 2025

LLVM Buildbot has detected a new failure on builder clang-armv8-quick running on linaro-clang-armv8-quick while building llvm at step 5 "ninja check 1".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/154/builds/10390

Here is the relevant piece of the build log, for reference:
Step 5 (ninja check 1) failure: stage 1 checked (failure)
******************** TEST 'lit :: googletest-timeout.py' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 9
not env -u FILECHECK_OPTS "/usr/bin/python3.10" /home/tcwg-buildbot/worker/clang-armv8-quick/llvm/llvm/utils/lit/lit.py -j1 --order=lexical -v Inputs/googletest-timeout    --param gtest_filter=InfiniteLoopSubTest --timeout=1 > /home/tcwg-buildbot/worker/clang-armv8-quick/stage1/utils/lit/tests/Output/googletest-timeout.py.tmp.cmd.out
# executed command: not env -u FILECHECK_OPTS /usr/bin/python3.10 /home/tcwg-buildbot/worker/clang-armv8-quick/llvm/llvm/utils/lit/lit.py -j1 --order=lexical -v Inputs/googletest-timeout --param gtest_filter=InfiniteLoopSubTest --timeout=1
# .---command stderr------------
# | lit.py: /home/tcwg-buildbot/worker/clang-armv8-quick/llvm/llvm/utils/lit/lit/main.py:72: note: The test suite configuration requested an individual test timeout of 0 seconds but a timeout of 1 seconds was requested on the command line. Forcing timeout to be 1 seconds.
# `-----------------------------
# RUN: at line 11
FileCheck --check-prefix=CHECK-INF < /home/tcwg-buildbot/worker/clang-armv8-quick/stage1/utils/lit/tests/Output/googletest-timeout.py.tmp.cmd.out /home/tcwg-buildbot/worker/clang-armv8-quick/stage1/utils/lit/tests/googletest-timeout.py
# executed command: FileCheck --check-prefix=CHECK-INF /home/tcwg-buildbot/worker/clang-armv8-quick/stage1/utils/lit/tests/googletest-timeout.py
# .---command stderr------------
# | /home/tcwg-buildbot/worker/clang-armv8-quick/stage1/utils/lit/tests/googletest-timeout.py:34:14: error: CHECK-INF: expected string not found in input
# | # CHECK-INF: Timed Out: 1
# |              ^
# | <stdin>:13:29: note: scanning from here
# | Reached timeout of 1 seconds
# |                             ^
# | <stdin>:37:2: note: possible intended match here
# |  Timed Out: 2 (100.00%)
# |  ^
# | 
# | Input file: <stdin>
# | Check file: /home/tcwg-buildbot/worker/clang-armv8-quick/stage1/utils/lit/tests/googletest-timeout.py
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |             .
# |             .
# |             .
# |             8:  
# |             9:  
# |            10: -- 
# |            11: exit: -9 
# |            12: -- 
# |            13: Reached timeout of 1 seconds 
# | check:34'0                                 X error: no match found
# |            14: ******************** 
# | check:34'0     ~~~~~~~~~~~~~~~~~~~~~
# |            15: TIMEOUT: googletest-timeout :: DummySubDir/OneTest.py/1/2 (2 of 2) 
# | check:34'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# |            16: ******************** TEST 'googletest-timeout :: DummySubDir/OneTest.py/1/2' FAILED ******************** 
# | check:34'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# |            17: Script(shard): 
# | check:34'0     ~~~~~~~~~~~~~~~
...

@Prince781 (Contributor, Author)

@Artem-B

Also, I thought I'd ask here: do you know how I can gain write access? I emailed Chris Lattner but he didn't respond.

@jhuber6 (Contributor) commented Jan 16, 2025

@Artem-B

Also, I thought I'd ask here: do you know how I can gain write access? I emailed Chris Lattner but he didn't respond.

See https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access, though I believe they're in the process of updating that to something like requiring 5 commits and two existing contributors to +1.

@Prince781 deleted the dev/pferro/nvptx-fexp2 branch on January 17, 2025
@Artem-B (Member) commented Jan 21, 2025

@Prince781 It appears that the tests are generating 32-bit PTX, which is no longer supported by recent CUDA versions.

[  1] ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
[  2] ; RUN: llc < %s -mcpu=sm_20 -mattr=+ptx32 | FileCheck --check-prefixes=CHECK %s [OK]
llc < third_party/llvm/llvm-project/llvm/test/CodeGen/NVPTX/f32-lg2.ll -mcpu=sm_20 -mattr=+ptx32 | third_party/llvm/llvm-project/llvm/FileCheck --allow-unused-prefixes --check-prefixes=CHECK third_party/llvm/llvm-project/llvm/test/CodeGen/NVPTX/f32-lg2.ll
[  3] ; RUN: %if ptxas %{ llc < %s -mcpu=sm_20 -mattr=+ptx32 | %ptxas-verify %} [FAIL]
 llc < third_party/llvm/llvm-project/llvm/test/CodeGen/NVPTX/f32-lg2.ll -mcpu=sm_20 -mattr=+ptx32 | third_party/gpus/cuda/_virtual_includes/_stage_runtime/third_party/gpus/cuda/bin/ptxas -arch=sm_60 -c -o /dev/null - 
ptxas warning :  64 Bit host architecture (--machine) being used mismatches with .address_size of 32 bits
ptxas fatal   :  32-Bit compilation is no longer supported
Command failed: exit status 255

You can reproduce it by running the tests with LLVM_PTXAS_EXECUTABLE=/path/to/cuda-12.6.0/bin/ptxas

@jhuber6 (Contributor) commented Jan 21, 2025

The triple is just missing 64; I can probably fix it along with something else.

@Prince781 (Contributor, Author)

@jhuber6 Thank you!
