[NVPTX] Use sink registers instead of temp registers where possible. #134957
@llvm/pr-subscribers-backend-nvptx
Author: Justin Lebar (jlebar)
Changes
PTX 7.1 introduces the concept of a "sink" register, `_`, which is a register to which writes are ignored.
This patch makes us use sink registers where possible, instead of using explicit temp registers.
This results in cleaner assembly, and also works around a problem we encountered in some private workloads.
(Unfortunately the tablegen is not particularly clean. But then again, it's tablegen...)
Full diff: https://github.com/llvm/llvm-project/pull/134957.diff
2 Files Affected:
diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index 8b34ce4f1001c..5be740e929bec 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -3463,10 +3463,36 @@ let hasSideEffects = false in {
"{{ .reg .b32 tmp; mov.b64 {$low, tmp}, $s; }}",
[]>;
+ // PTX 7.1 lets you avoid a temp register and just use _ as a "sink" for the
+ // unused high/low part.
+ def I32toI16H_ptx71 : NVPTXInst<(outs Int16Regs:$high),
+ (ins Int32Regs:$s),
+ "mov.b32 \t{{_, $high}}, $s;",
+ []>, Requires<[hasPTX<71>]>;
+ def I32toI16L_ptx71 : NVPTXInst<(outs Int16Regs:$low),
+ (ins Int32Regs:$s),
+ "mov.b32 \t{{$low, _}}, $s;",
+ []>, Requires<[hasPTX<71>]>;
+ def I64toI32H_ptx71 : NVPTXInst<(outs Int32Regs:$high),
+ (ins Int64Regs:$s),
+ "mov.b64 \t{{_, $high}}, $s;",
+ []>, Requires<[hasPTX<71>]>;
+ def I64toI32L_ptx71 : NVPTXInst<(outs Int32Regs:$low),
+ (ins Int64Regs:$s),
+ "mov.b64 \t{{$low, _}}, $s;",
+ []>, Requires<[hasPTX<71>]>;
}
-// Using partial vectorized move produces better SASS code for extraction of
-// upper/lower parts of an integer.
+def : Pat<(i16 (trunc (srl Int32Regs:$s, (i32 16)))),
+ (I32toI16H_ptx71 Int32Regs:$s)>, Requires<[hasPTX<71>]>;
+def : Pat<(i16 (trunc (sra Int32Regs:$s, (i32 16)))),
+ (I32toI16H_ptx71 Int32Regs:$s)>, Requires<[hasPTX<71>]>;
+def : Pat<(i32 (trunc (srl Int64Regs:$s, (i32 32)))),
+ (I64toI32H_ptx71 Int64Regs:$s)>, Requires<[hasPTX<71>]>;
+def : Pat<(i32 (trunc (sra Int64Regs:$s, (i32 32)))),
+ (I64toI32H_ptx71 Int64Regs:$s)>, Requires<[hasPTX<71>]>;
+
+// Fall back to the old way if we don't have PTX 7.1.
def : Pat<(i16 (trunc (srl Int32Regs:$s, (i32 16)))),
(I32toI16H Int32Regs:$s)>;
def : Pat<(i16 (trunc (sra Int32Regs:$s, (i32 16)))),
@@ -3480,10 +3506,15 @@ def: Pat<(i32 (sext (extractelt (v2i16 Int32Regs:$src), 0))),
(CVT_INREG_s32_s16 Int32Regs:$src)>;
foreach vt = [v2f16, v2bf16, v2i16] in {
-def : Pat<(extractelt (vt Int32Regs:$src), 0),
- (I32toI16L Int32Regs:$src)>;
-def : Pat<(extractelt (vt Int32Regs:$src), 1),
- (I32toI16H Int32Regs:$src)>;
+ def : Pat<(extractelt (vt Int32Regs:$src), 0),
+ (I32toI16L_ptx71 Int32Regs:$src)>, Requires<[hasPTX<71>]>;
+ def : Pat<(extractelt (vt Int32Regs:$src), 1),
+ (I32toI16H_ptx71 Int32Regs:$src)>, Requires<[hasPTX<71>]>;
+
+ def : Pat<(extractelt (vt Int32Regs:$src), 0),
+ (I32toI16L Int32Regs:$src)>;
+ def : Pat<(extractelt (vt Int32Regs:$src), 1),
+ (I32toI16H Int32Regs:$src)>;
}
def : Pat<(v2f16 (build_vector (f16 Int16Regs:$a), (f16 Int16Regs:$b))),
(V2I16toI32 Int16Regs:$a, Int16Regs:$b)>;
@@ -3866,6 +3897,14 @@ def : Pat <
(v2i16 (bswap v2i16:$a)),
(INT_NVVM_PRMT Int32Regs:$a, (i32 0), (i32 0x2301))>;
+def : Pat <
+ (i64 (bswap i64:$a)),
+ (V2I32toI64
+ (INT_NVVM_PRMT (I64toI32H_ptx71 Int64Regs:$a), (i32 0), (i32 0x0123)),
+ (INT_NVVM_PRMT (I64toI32L_ptx71 Int64Regs:$a), (i32 0), (i32 0x0123)))>,
+ Requires<[hasPTX<71>]>;
+
+// Fall back to the old way if we don't have PTX 7.1.
def : Pat <
(i64 (bswap i64:$a)),
(V2I32toI64
diff --git a/llvm/test/CodeGen/NVPTX/bswap.ll b/llvm/test/CodeGen/NVPTX/bswap.ll
index 3f929ec6a75d0..60d776d1f09d0 100644
--- a/llvm/test/CodeGen/NVPTX/bswap.ll
+++ b/llvm/test/CodeGen/NVPTX/bswap.ll
@@ -1,6 +1,8 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
-; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 | FileCheck %s
-; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_20 | %ptxas-verify %}
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 -mattr=+ptx70 | FileCheck -check-prefixes CHECK,PTX70 %s
+; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_20 -mattr=+ptx70 | %ptxas-verify %}
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 -mattr=+ptx71 | FileCheck -check-prefixes CHECK,PTX71 %s
+; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_20 -mattr=+ptx71 | %ptxas-verify %}
target triple = "nvptx64-nvidia-cuda"
@@ -60,11 +62,16 @@ define i64 @bswap64(i64 %a) {
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
; CHECK-NEXT: ld.param.u64 %rd1, [bswap64_param_0];
-; CHECK-NEXT: { .reg .b32 tmp; mov.b64 {%r1, tmp}, %rd1; }
-; CHECK-NEXT: prmt.b32 %r2, %r1, 0, 291;
-; CHECK-NEXT: { .reg .b32 tmp; mov.b64 {tmp, %r3}, %rd1; }
-; CHECK-NEXT: prmt.b32 %r4, %r3, 0, 291;
-; CHECK-NEXT: mov.b64 %rd2, {%r4, %r2};
+; PTX70-NEXT: { .reg .b32 tmp; mov.b64 {%r1, tmp}, %rd1; }
+; PTX70-NEXT: prmt.b32 %r2, %r1, 0, 291;
+; PTX70-NEXT: { .reg .b32 tmp; mov.b64 {tmp, %r3}, %rd1; }
+; PTX70-NEXT: prmt.b32 %r4, %r3, 0, 291;
+; PTX70-NEXT: mov.b64 %rd2, {%r4, %r2};
+; PTX71-NEXT: mov.b64 {%r1, _}, %rd1;
+; PTX71-NEXT: prmt.b32 %r2, %r1, 0, 291;
+; PTX71-NEXT: mov.b64 {_, %r3}, %rd1;
+; PTX71-NEXT: prmt.b32 %r4, %r3, 0, 291;
+; PTX71-NEXT: mov.b64 %rd2, {%r4, %r2};
; CHECK-NEXT: st.param.b64 [func_retval0+0], %rd2;
; CHECK-NEXT: ret;
%b = tail call i64 @llvm.bswap.i64(i64 %a)
Updated the branch to HEAD.
Artem-B
left a comment
LGTM with a couple of nits.
; PTX71-NEXT: mov.b64 {%r1, _}, %rd1;
; PTX71-NEXT: prmt.b32 %r2, %r1, 0, 291;
; PTX71-NEXT: mov.b64 {_, %r3}, %rd1;
Not this patch's problem, but we're doing something silly here. Those two mov instructions should've been just one splitting move.
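To make the "one splitting move" remark concrete, here is a rough Python model of the 64-bit byte swap that the test checks. It is only a sketch of the arithmetic, not code from the patch: `split_b64`, `prmt_b32`, and `bswap64` are hypothetical names, and the `prmt` model is an assumed simplification that covers the default mode only and ignores the sign-replication bit of each selector nibble.

```python
MASK32 = 0xFFFFFFFF


def split_b64(x):
    """Model of `mov.b64 {lo, hi}, x`: return (low 32 bits, high 32 bits)."""
    return x & MASK32, (x >> 32) & MASK32


def prmt_b32(a, b, selector):
    """Simplified model of `prmt.b32 d, a, b, selector` (default mode).

    Source bytes 0-3 come from a and bytes 4-7 from b. Each selector nibble
    picks one source byte; the sign-replication bit is ignored (assumption).
    """
    src = ((b & MASK32) << 32) | (a & MASK32)
    out = 0
    for i in range(4):
        idx = (selector >> (4 * i)) & 0x7
        out |= ((src >> (8 * idx)) & 0xFF) << (8 * i)
    return out


def bswap64(x):
    # A single split yields both halves, instead of two moves that each
    # discard one half.
    lo, hi = split_b64(x)
    # Selector 0x0123 (decimal 291) reverses the four bytes of the first operand.
    new_lo = prmt_b32(hi, 0, 0x0123)  # reversed high half becomes the low half
    new_hi = prmt_b32(lo, 0, 0x0123)  # reversed low half becomes the high half
    return (new_hi << 32) | new_lo
```

In this model, both `prmt` inputs come from one call to `split_b64`, which mirrors the single splitting move the comment suggests.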
PTX 7.1 introduces the concept of a "sink" register, `_`, which is a register to which writes are ignored. This patch makes us use sink registers where possible, instead of using explicit temp registers. This results in cleaner assembly, and also works around a problem we encountered in some private workloads. (Unfortunately the tablegen is not particularly clean. But then again, it's tablegen...)
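As a loose illustration of the sink-register idea, the following Python sketch models the two split moves from the PTX 7.1 test output. It is a model under assumptions, not PTX and not part of the patch: `mov_b64_split` is a hypothetical helper, and Python's `_` is an ordinary variable used by convention as a throwaway, whereas the PTX sink discards the write entirely.

```python
MASK32 = 0xFFFFFFFF


def mov_b64_split(src):
    """Model of `mov.b64 {lo, hi}, src`: return (low 32 bits, high 32 bits)."""
    return src & MASK32, (src >> 32) & MASK32


rd1 = 0x1122334455667788

# Analogous to `mov.b64 {%r1, _}, %rd1;`: keep the low half, drop the high half.
r1, _ = mov_b64_split(rd1)

# Analogous to `mov.b64 {_, %r3}, %rd1;`: keep the high half, drop the low half.
_, r3 = mov_b64_split(rd1)
```

The pre-7.1 form reaches the same result by declaring a scoped `tmp` register for the unused half, which is what the `{ .reg .b32 tmp; ... }` lines in the PTX70 checks show.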
Merged, thank you!