[NVPTX] Use sink registers instead of temp registers where possible. #134957

jlebar · 2025-04-09T00:37:15Z

PTX 7.1 introduces the concept of a "sink" register, _, which is a register
to which writes are ignored.

This patch makes us use sink registers where possible, instead of using
explicit temp registers.

This results in cleaner assembly, and also works around a problem we
encountered in some private workloads.

(Unfortunately the tablegen is not particularly clean. But then again, it's
tablegen...)
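
For illustration, the two forms can be compared side by side on the high half of a 64-bit value. This is a minimal sketch, not compiler output: the function names `hi32_tmp` and `hi32_sink`, the parameter names, and the `.target sm_52` line are assumptions chosen for the example. The two `mov.b64` forms themselves are the ones that appear in the patch's instruction strings.

```ptx
.version 7.1
.target sm_52          // assumed target; the patch's test uses -mcpu=sm_20
.address_size 64

// Pre-7.1 style: a scoped temporary register absorbs the unused low half.
.visible .func (.param .b32 func_retval0) hi32_tmp(.param .b64 hi32_tmp_param_0)
{
    .reg .b32 %r<2>;
    .reg .b64 %rd<2>;

    ld.param.u64 %rd1, [hi32_tmp_param_0];
    { .reg .b32 tmp; mov.b64 {tmp, %r1}, %rd1; }
    st.param.b32 [func_retval0+0], %r1;
    ret;
}

// PTX 7.1 style: the sink register _ discards the unused low half.
.visible .func (.param .b32 func_retval0) hi32_sink(.param .b64 hi32_sink_param_0)
{
    .reg .b32 %r<2>;
    .reg .b64 %rd<2>;

    ld.param.u64 %rd1, [hi32_sink_param_0];
    mov.b64 {_, %r1}, %rd1;
    st.param.b32 [func_retval0+0], %r1;
    ret;
}
```

Both functions should return the same value; the second simply avoids declaring a register whose contents are never read.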

llvmbot · 2025-04-09T00:37:49Z

@llvm/pr-subscribers-backend-nvptx

Author: Justin Lebar (jlebar)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/134957.diff

2 Files Affected:

(modified) llvm/lib/Target/NVPTX/NVPTXInstrInfo.td (+45-6)
(modified) llvm/test/CodeGen/NVPTX/bswap.ll (+14-7)

diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index 8b34ce4f1001c..5be740e929bec 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -3463,10 +3463,36 @@ let hasSideEffects = false in {
                              "{{ .reg .b32 tmp; mov.b64 {$low, tmp}, $s; }}",
                              []>;
 
+  // PTX 7.1 lets you avoid a temp register and just use _ as a "sink" for the
+  // unused high/low part.
+  def I32toI16H_ptx71  : NVPTXInst<(outs Int16Regs:$high),
+                             (ins Int32Regs:$s),
+                             "mov.b32 \t{{_, $high}}, $s;",
+                             []>, Requires<[hasPTX<71>]>;
+  def I32toI16L_ptx71  : NVPTXInst<(outs Int16Regs:$low),
+                             (ins Int32Regs:$s),
+                             "mov.b32 \t{{$low, _}}, $s;",
+                             []>, Requires<[hasPTX<71>]>;
+  def I64toI32H_ptx71  : NVPTXInst<(outs Int32Regs:$high),
+                             (ins Int64Regs:$s),
+                             "mov.b64 \t{{_, $high}}, $s;",
+                             []>, Requires<[hasPTX<71>]>;
+  def I64toI32L_ptx71  : NVPTXInst<(outs Int32Regs:$low),
+                             (ins Int64Regs:$s),
+                             "mov.b64 \t{{$low, _}}, $s;",
+                             []>, Requires<[hasPTX<71>]>;
 }
 
-// Using partial vectorized move produces better SASS code for extraction of
-// upper/lower parts of an integer.
+def : Pat<(i16 (trunc (srl Int32Regs:$s, (i32 16)))),
+          (I32toI16H_ptx71 Int32Regs:$s)>, Requires<[hasPTX<71>]>;
+def : Pat<(i16 (trunc (sra Int32Regs:$s, (i32 16)))),
+          (I32toI16H_ptx71 Int32Regs:$s)>, Requires<[hasPTX<71>]>;
+def : Pat<(i32 (trunc (srl Int64Regs:$s, (i32 32)))),
+          (I64toI32H_ptx71 Int64Regs:$s)>, Requires<[hasPTX<71>]>;
+def : Pat<(i32 (trunc (sra Int64Regs:$s, (i32 32)))),
+          (I64toI32H_ptx71 Int64Regs:$s)>, Requires<[hasPTX<71>]>;
+
+// Fall back to the old way if we don't have PTX 7.1.
 def : Pat<(i16 (trunc (srl Int32Regs:$s, (i32 16)))),
           (I32toI16H Int32Regs:$s)>;
 def : Pat<(i16 (trunc (sra Int32Regs:$s, (i32 16)))),
@@ -3480,10 +3506,15 @@ def: Pat<(i32 (sext (extractelt (v2i16 Int32Regs:$src), 0))),
          (CVT_INREG_s32_s16 Int32Regs:$src)>;
 
 foreach vt = [v2f16, v2bf16, v2i16] in {
-def : Pat<(extractelt (vt Int32Regs:$src), 0),
-          (I32toI16L Int32Regs:$src)>;
-def : Pat<(extractelt (vt Int32Regs:$src), 1),
-          (I32toI16H Int32Regs:$src)>;
+  def : Pat<(extractelt (vt Int32Regs:$src), 0),
+            (I32toI16L_ptx71 Int32Regs:$src)>, Requires<[hasPTX<71>]>;
+  def : Pat<(extractelt (vt Int32Regs:$src), 1),
+            (I32toI16H_ptx71 Int32Regs:$src)>, Requires<[hasPTX<71>]>;
+
+  def : Pat<(extractelt (vt Int32Regs:$src), 0),
+            (I32toI16L Int32Regs:$src)>;
+  def : Pat<(extractelt (vt Int32Regs:$src), 1),
+            (I32toI16H Int32Regs:$src)>;
 }
 def : Pat<(v2f16 (build_vector (f16 Int16Regs:$a), (f16 Int16Regs:$b))),
           (V2I16toI32 Int16Regs:$a, Int16Regs:$b)>;
@@ -3866,6 +3897,14 @@ def : Pat <
   (v2i16 (bswap v2i16:$a)),
   (INT_NVVM_PRMT Int32Regs:$a, (i32 0), (i32 0x2301))>;
 
+def : Pat <
+  (i64 (bswap i64:$a)),
+  (V2I32toI64
+    (INT_NVVM_PRMT (I64toI32H_ptx71 Int64Regs:$a), (i32 0), (i32 0x0123)),
+    (INT_NVVM_PRMT (I64toI32L_ptx71 Int64Regs:$a), (i32 0), (i32 0x0123)))>,
+  Requires<[hasPTX<71>]>;
+
+// Fall back to the old way if we don't have PTX 7.1.
 def : Pat <
   (i64 (bswap i64:$a)),
   (V2I32toI64
diff --git a/llvm/test/CodeGen/NVPTX/bswap.ll b/llvm/test/CodeGen/NVPTX/bswap.ll
index 3f929ec6a75d0..60d776d1f09d0 100644
--- a/llvm/test/CodeGen/NVPTX/bswap.ll
+++ b/llvm/test/CodeGen/NVPTX/bswap.ll
@@ -1,6 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
-; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 | FileCheck %s
-; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_20 | %ptxas-verify %}
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 -mattr=+ptx70 | FileCheck -check-prefixes CHECK,PTX70 %s
+; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_20 -mattr=+ptx70 | %ptxas-verify %}
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 -mattr=+ptx71 | FileCheck -check-prefixes CHECK,PTX71 %s
+; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_20 -mattr=+ptx71 | %ptxas-verify %}
 
 target triple = "nvptx64-nvidia-cuda"
 
@@ -60,11 +62,16 @@ define i64 @bswap64(i64 %a) {
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.u64 %rd1, [bswap64_param_0];
-; CHECK-NEXT:    { .reg .b32 tmp; mov.b64 {%r1, tmp}, %rd1; }
-; CHECK-NEXT:    prmt.b32 %r2, %r1, 0, 291;
-; CHECK-NEXT:    { .reg .b32 tmp; mov.b64 {tmp, %r3}, %rd1; }
-; CHECK-NEXT:    prmt.b32 %r4, %r3, 0, 291;
-; CHECK-NEXT:    mov.b64 %rd2, {%r4, %r2};
+; PTX70-NEXT:    { .reg .b32 tmp; mov.b64 {%r1, tmp}, %rd1; }
+; PTX70-NEXT:    prmt.b32 %r2, %r1, 0, 291;
+; PTX70-NEXT:    { .reg .b32 tmp; mov.b64 {tmp, %r3}, %rd1; }
+; PTX70-NEXT:    prmt.b32 %r4, %r3, 0, 291;
+; PTX70-NEXT:    mov.b64 %rd2, {%r4, %r2};
+; PTX71-NEXT:    mov.b64         {%r1, _}, %rd1;
+; PTX71-NEXT:    prmt.b32        %r2, %r1, 0, 291;
+; PTX71-NEXT:    mov.b64         {_, %r3}, %rd1;
+; PTX71-NEXT:    prmt.b32        %r4, %r3, 0, 291;
+; PTX71-NEXT:    mov.b64         %rd2, {%r4, %r2};
 ; CHECK-NEXT:    st.param.b64 [func_retval0+0], %rd2;
 ; CHECK-NEXT:    ret;
   %b = tail call i64 @llvm.bswap.i64(i64 %a)

llvm/lib/Target/NVPTX/NVPTXInstrInfo.td

jlebar · 2025-04-09T01:16:28Z

Updated the branch to HEAD.

llvm/lib/Target/NVPTX/NVPTXInstrInfo.td

Artem-B

LGTM with a couple of nits.

Artem-B · 2025-04-09T18:51:54Z

llvm/test/CodeGen/NVPTX/bswap.ll

+; PTX71-NEXT:    mov.b64         {%r1, _}, %rd1;
+; PTX71-NEXT:    prmt.b32        %r2, %r1, 0, 291;
+; PTX71-NEXT:    mov.b64         {_, %r3}, %rd1;


Not this patch's problem, but we're doing something silly here. Those two mov instructions should've been just one splitting move.
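
A sketch of what the single splitting move could look like for `bswap64`, hand-written rather than generated by llc. The function signature, register declarations, and `.target sm_52` are assumptions modeled on the test's CHECK lines; the only intended change from the checked output is that one unpacking `mov.b64` fills both halves, so neither a temp nor a sink register is needed.

```ptx
.version 7.1
.target sm_52          // assumed target; the patch's test uses -mcpu=sm_20
.address_size 64

.visible .func (.param .b64 func_retval0) bswap64(.param .b64 bswap64_param_0)
{
    .reg .b32 %r<5>;
    .reg .b64 %rd<3>;

    ld.param.u64 %rd1, [bswap64_param_0];
    // One unpacking move: %r1 receives the low half, %r3 the high half.
    mov.b64 {%r1, %r3}, %rd1;
    // Selector 291 (0x0123) reverses the bytes within each 32-bit half.
    prmt.b32 %r2, %r1, 0, 291;
    prmt.b32 %r4, %r3, 0, 291;
    // Repack with the halves swapped: swapped high becomes the new low.
    mov.b64 %rd2, {%r4, %r2};
    st.param.b64 [func_retval0+0], %rd2;
    ret;
}
```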

jlebar · 2025-04-10T18:13:38Z

Merged, thank you!

llvmbot added the backend:NVPTX label Apr 9, 2025

jlebar requested a review from Artem-B April 9, 2025 00:37

peterbell10 reviewed Apr 9, 2025

llvm/lib/Target/NVPTX/NVPTXInstrInfo.td

jlebar force-pushed the tmp-regs branch from a568efb to 2d56172 Compare April 9, 2025 01:16

jlebar requested a review from peterbell10 April 9, 2025 01:16

Prince781 reviewed Apr 9, 2025

llvm/lib/Target/NVPTX/NVPTXInstrInfo.td

Artem-B requested a review from AlexMaclean April 9, 2025 18:40

Artem-B approved these changes Apr 9, 2025

jlebar force-pushed the tmp-regs branch from 2d56172 to 066c154 Compare April 10, 2025 07:06

jlebar force-pushed the tmp-regs branch from 066c154 to e70c287 Compare April 10, 2025 07:08

jlebar merged commit 337a4d5 into llvm:main Apr 10, 2025
11 checks passed

jlebar deleted the tmp-regs branch April 10, 2025 19:19

5 participants
