@jlebar
Member

@jlebar jlebar commented Apr 9, 2025

PTX 7.1 introduces the concept of a "sink" register, `_`, which is a register
to which writes are ignored.

This patch makes us use sink registers where possible, instead of using
explicit temp registers.

This results in cleaner assembly, and also works around a problem we
encountered in some private workloads.

(Unfortunately the tablegen is not particularly clean. But then again, it's
tablegen...)
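
For illustration, here is roughly what the change looks like in the emitted PTX when extracting the low half of a 64-bit register. The register names are taken from the `bswap.ll` test in this patch; treat this as a sketch of the two forms rather than complete, assemblable output.

```ptx
// Without PTX 7.1: a scoped temporary receives the unused high half.
{ .reg .b32 tmp; mov.b64 {%r1, tmp}, %rd1; }

// With PTX 7.1: the sink register `_` discards the unused high half.
mov.b64 {%r1, _}, %rd1;
```

Both forms leave the low 32 bits of `%rd1` in `%r1`; the second one needs no extra register declaration or enclosing scope.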

@llvmbot
Member

llvmbot commented Apr 9, 2025

@llvm/pr-subscribers-backend-nvptx

Author: Justin Lebar (jlebar)

Full diff: https://github.com/llvm/llvm-project/pull/134957.diff

2 Files Affected:

  • (modified) llvm/lib/Target/NVPTX/NVPTXInstrInfo.td (+45-6)
  • (modified) llvm/test/CodeGen/NVPTX/bswap.ll (+14-7)
diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index 8b34ce4f1001c..5be740e929bec 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -3463,10 +3463,36 @@ let hasSideEffects = false in {
                              "{{ .reg .b32 tmp; mov.b64 {$low, tmp}, $s; }}",
                              []>;
 
+  // PTX 7.1 lets you avoid a temp register and just use _ as a "sink" for the
+  // unused high/low part.
+  def I32toI16H_ptx71  : NVPTXInst<(outs Int16Regs:$high),
+                             (ins Int32Regs:$s),
+                             "mov.b32 \t{{_, $high}}, $s;",
+                             []>, Requires<[hasPTX<71>]>;
+  def I32toI16L_ptx71  : NVPTXInst<(outs Int16Regs:$low),
+                             (ins Int32Regs:$s),
+                             "mov.b32 \t{{$low, _}}, $s;",
+                             []>, Requires<[hasPTX<71>]>;
+  def I64toI32H_ptx71  : NVPTXInst<(outs Int32Regs:$high),
+                             (ins Int64Regs:$s),
+                             "mov.b64 \t{{_, $high}}, $s;",
+                             []>, Requires<[hasPTX<71>]>;
+  def I64toI32L_ptx71  : NVPTXInst<(outs Int32Regs:$low),
+                             (ins Int64Regs:$s),
+                             "mov.b64 \t{{$low, _}}, $s;",
+                             []>, Requires<[hasPTX<71>]>;
 }
 
-// Using partial vectorized move produces better SASS code for extraction of
-// upper/lower parts of an integer.
+def : Pat<(i16 (trunc (srl Int32Regs:$s, (i32 16)))),
+          (I32toI16H_ptx71 Int32Regs:$s)>, Requires<[hasPTX<71>]>;
+def : Pat<(i16 (trunc (sra Int32Regs:$s, (i32 16)))),
+          (I32toI16H_ptx71 Int32Regs:$s)>, Requires<[hasPTX<71>]>;
+def : Pat<(i32 (trunc (srl Int64Regs:$s, (i32 32)))),
+          (I64toI32H_ptx71 Int64Regs:$s)>, Requires<[hasPTX<71>]>;
+def : Pat<(i32 (trunc (sra Int64Regs:$s, (i32 32)))),
+          (I64toI32H_ptx71 Int64Regs:$s)>, Requires<[hasPTX<71>]>;
+
+// Fall back to the old way if we don't have PTX 7.1.
 def : Pat<(i16 (trunc (srl Int32Regs:$s, (i32 16)))),
           (I32toI16H Int32Regs:$s)>;
 def : Pat<(i16 (trunc (sra Int32Regs:$s, (i32 16)))),
@@ -3480,10 +3506,15 @@ def: Pat<(i32 (sext (extractelt (v2i16 Int32Regs:$src), 0))),
          (CVT_INREG_s32_s16 Int32Regs:$src)>;
 
 foreach vt = [v2f16, v2bf16, v2i16] in {
-def : Pat<(extractelt (vt Int32Regs:$src), 0),
-          (I32toI16L Int32Regs:$src)>;
-def : Pat<(extractelt (vt Int32Regs:$src), 1),
-          (I32toI16H Int32Regs:$src)>;
+  def : Pat<(extractelt (vt Int32Regs:$src), 0),
+            (I32toI16L_ptx71 Int32Regs:$src)>, Requires<[hasPTX<71>]>;
+  def : Pat<(extractelt (vt Int32Regs:$src), 1),
+            (I32toI16H_ptx71 Int32Regs:$src)>, Requires<[hasPTX<71>]>;
+
+  def : Pat<(extractelt (vt Int32Regs:$src), 0),
+            (I32toI16L Int32Regs:$src)>;
+  def : Pat<(extractelt (vt Int32Regs:$src), 1),
+            (I32toI16H Int32Regs:$src)>;
 }
 def : Pat<(v2f16 (build_vector (f16 Int16Regs:$a), (f16 Int16Regs:$b))),
           (V2I16toI32 Int16Regs:$a, Int16Regs:$b)>;
@@ -3866,6 +3897,14 @@ def : Pat <
   (v2i16 (bswap v2i16:$a)),
   (INT_NVVM_PRMT Int32Regs:$a, (i32 0), (i32 0x2301))>;
 
+def : Pat <
+  (i64 (bswap i64:$a)),
+  (V2I32toI64
+    (INT_NVVM_PRMT (I64toI32H_ptx71 Int64Regs:$a), (i32 0), (i32 0x0123)),
+    (INT_NVVM_PRMT (I64toI32L_ptx71 Int64Regs:$a), (i32 0), (i32 0x0123)))>,
+  Requires<[hasPTX<71>]>;
+
+// Fall back to the old way if we don't have PTX 7.1.
 def : Pat <
   (i64 (bswap i64:$a)),
   (V2I32toI64
diff --git a/llvm/test/CodeGen/NVPTX/bswap.ll b/llvm/test/CodeGen/NVPTX/bswap.ll
index 3f929ec6a75d0..60d776d1f09d0 100644
--- a/llvm/test/CodeGen/NVPTX/bswap.ll
+++ b/llvm/test/CodeGen/NVPTX/bswap.ll
@@ -1,6 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
-; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 | FileCheck %s
-; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_20 | %ptxas-verify %}
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 -mattr=+ptx70 | FileCheck -check-prefixes CHECK,PTX70 %s
+; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_20 -mattr=+ptx70 | %ptxas-verify %}
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 -mattr=+ptx71 | FileCheck -check-prefixes CHECK,PTX71 %s
+; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_20 -mattr=+ptx71 | %ptxas-verify %}
 
 target triple = "nvptx64-nvidia-cuda"
 
@@ -60,11 +62,16 @@ define i64 @bswap64(i64 %a) {
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.u64 %rd1, [bswap64_param_0];
-; CHECK-NEXT:    { .reg .b32 tmp; mov.b64 {%r1, tmp}, %rd1; }
-; CHECK-NEXT:    prmt.b32 %r2, %r1, 0, 291;
-; CHECK-NEXT:    { .reg .b32 tmp; mov.b64 {tmp, %r3}, %rd1; }
-; CHECK-NEXT:    prmt.b32 %r4, %r3, 0, 291;
-; CHECK-NEXT:    mov.b64 %rd2, {%r4, %r2};
+; PTX70-NEXT:    { .reg .b32 tmp; mov.b64 {%r1, tmp}, %rd1; }
+; PTX70-NEXT:    prmt.b32 %r2, %r1, 0, 291;
+; PTX70-NEXT:    { .reg .b32 tmp; mov.b64 {tmp, %r3}, %rd1; }
+; PTX70-NEXT:    prmt.b32 %r4, %r3, 0, 291;
+; PTX70-NEXT:    mov.b64 %rd2, {%r4, %r2};
+; PTX71-NEXT:    mov.b64         {%r1, _}, %rd1;
+; PTX71-NEXT:    prmt.b32        %r2, %r1, 0, 291;
+; PTX71-NEXT:    mov.b64         {_, %r3}, %rd1;
+; PTX71-NEXT:    prmt.b32        %r4, %r3, 0, 291;
+; PTX71-NEXT:    mov.b64         %rd2, {%r4, %r2};
 ; CHECK-NEXT:    st.param.b64 [func_retval0+0], %rd2;
 ; CHECK-NEXT:    ret;
   %b = tail call i64 @llvm.bswap.i64(i64 %a)

@jlebar jlebar requested a review from Artem-B April 9, 2025 00:37
@jlebar
Member Author

jlebar commented Apr 9, 2025

Updated the branch to HEAD.

@jlebar jlebar requested a review from peterbell10 April 9, 2025 01:16
@Artem-B Artem-B requested a review from AlexMaclean April 9, 2025 18:40
Member

@Artem-B Artem-B left a comment

LGTM with a couple of nits.

Comment on lines +71 to +73
; PTX71-NEXT: mov.b64 {%r1, _}, %rd1;
; PTX71-NEXT: prmt.b32 %r2, %r1, 0, 291;
; PTX71-NEXT: mov.b64 {_, %r3}, %rd1;
Member

Not this patch's problem, but we're doing something silly here. Those two mov instructions should have been a single splitting move.
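
As a sketch of what that could look like (hypothetical output, not something this patch produces), a single unpacking `mov.b64` can write both halves at once, so no sink register or temporary is needed. The register names follow the PTX71 check lines in `bswap.ll`.

```ptx
// Hypothetical: one splitting move replaces the two sink-register moves.
mov.b64  {%r1, %r3}, %rd1;   // %r1 = low 32 bits, %r3 = high 32 bits
prmt.b32 %r2, %r1, 0, 291;   // selector 0x0123 reverses the bytes of %r1
prmt.b32 %r4, %r3, 0, 291;   // selector 0x0123 reverses the bytes of %r3
mov.b64  %rd2, {%r4, %r2};   // repack with the two halves swapped
```

Whether instruction selection can be taught to pick this form is a separate question from this patch.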

@jlebar jlebar merged commit 337a4d5 into llvm:main Apr 10, 2025
11 checks passed
@jlebar
Member Author

jlebar commented Apr 10, 2025

Merged, thank you!

@jlebar jlebar deleted the tmp-regs branch April 10, 2025 19:19