
Conversation

Prince781 (Contributor) commented Aug 13, 2025

Since v2f32 is legal but v2i32 is not, sequences of operations like `bitcast (build_vector)` are lowered inefficiently.

@Prince781 Prince781 self-assigned this Aug 13, 2025
llvmbot (Member) commented Aug 13, 2025

@llvm/pr-subscribers-backend-nvptx

Author: Princeton Ferro (Prince781)

Changes

Transform:

 v2f32 (bitcast (v2i32 build_vector i32:A, i32:B))
 --->
 v2f32 (build_vector (f32 (bitcast i32:A)), (f32 (bitcast i32:B)))

Since v2f32 is legal but v2i32 is not, a v2i32 build_vector would be legalized into bitwise ops on i64, when we want each 32-bit element to be in its own register.

Fixes #153109
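For intuition, the two forms are bit-for-bit equivalent because each f32 lane occupies exactly the bits of the corresponding i32 lane. A quick standalone check of that equivalence (plain Python using `struct`; illustrative only, not part of this patch):

```python
import struct

def bitcast_i32_to_f32(x: int) -> float:
    # reinterpret the 32-bit pattern of x as an IEEE-754 binary32
    return struct.unpack("<f", struct.pack("<I", x & 0xFFFFFFFF))[0]

def v2f32_bitcast_of_build_vector(a: int, b: int):
    # v2f32 (bitcast (v2i32 build_vector a, b)):
    # pack both lanes into 8 bytes, then reinterpret as two f32s
    packed = struct.pack("<II", a & 0xFFFFFFFF, b & 0xFFFFFFFF)
    return struct.unpack("<ff", packed)

def build_vector_of_bitcasts(a: int, b: int):
    # v2f32 (build_vector (f32 (bitcast a)), (f32 (bitcast b)))
    return (bitcast_i32_to_f32(a), bitcast_i32_to_f32(b))

for a, b in [(0, 0), (0x3F800000, 0xC0490FDB), (0xDEADBEEF, 0x12345678)]:
    lhs = v2f32_bitcast_of_build_vector(a, b)
    rhs = build_vector_of_bitcasts(a, b)
    # compare raw bit patterns so the check is exact
    assert struct.pack("<ff", *lhs) == struct.pack("<ff", *rhs)
```

Since the element sizes match (32 bits on both sides), the bitcast commutes with the build_vector, which is exactly what the combine exploits.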


Full diff: https://github.com/llvm/llvm-project/pull/153478.diff

2 Files Affected:

  • (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (+27-4)
  • (added) llvm/test/CodeGen/NVPTX/f32x2-inlineasm.ll (+64)
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index 3daf25d551520..fcabe49e09c6c 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -892,10 +892,11 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
   setOperationAction(ISD::UMUL_LOHI, MVT::i64, Expand);
 
   // We have some custom DAG combine patterns for these nodes
-  setTargetDAGCombine({ISD::ADD, ISD::AND, ISD::EXTRACT_VECTOR_ELT, ISD::FADD,
-                       ISD::MUL, ISD::SHL, ISD::SREM, ISD::UREM, ISD::VSELECT,
-                       ISD::BUILD_VECTOR, ISD::ADDRSPACECAST, ISD::LOAD,
-                       ISD::STORE, ISD::ZERO_EXTEND, ISD::SIGN_EXTEND});
+  setTargetDAGCombine({ISD::ADD, ISD::AND, ISD::BITCAST,
+                       ISD::EXTRACT_VECTOR_ELT, ISD::FADD, ISD::MUL, ISD::SHL,
+                       ISD::SREM, ISD::UREM, ISD::VSELECT, ISD::BUILD_VECTOR,
+                       ISD::ADDRSPACECAST, ISD::LOAD, ISD::STORE,
+                       ISD::ZERO_EXTEND, ISD::SIGN_EXTEND});
 
   // setcc for f16x2 and bf16x2 needs special handling to prevent
   // legalizer's attempt to scalarize it due to v2i1 not being legal.
@@ -5334,6 +5335,26 @@ static SDValue PerformANDCombine(SDNode *N,
   return SDValue();
 }
 
+static SDValue combineBitcast(SDNode *N, TargetLowering::DAGCombinerInfo &DCI) {
+  const SDValue &Input = N->getOperand(0);
+  const MVT FromVT = Input.getSimpleValueType();
+  const MVT ToVT = N->getSimpleValueType(0);
+  const SDLoc DL(N);
+
+  if (Input.getOpcode() != ISD::BUILD_VECTOR || ToVT != MVT::v2f32 ||
+      !(FromVT.isVector() &&
+        FromVT.getVectorNumElements() == ToVT.getVectorNumElements()))
+    return SDValue();
+
+  const MVT ToEltVT = ToVT.getVectorElementType();
+
+  // pull in build_vector through bitcast
+  return DCI.DAG.getBuildVector(
+      ToVT, DL,
+      {DCI.DAG.getBitcast(ToEltVT, Input.getOperand(0)),
+       DCI.DAG.getBitcast(ToEltVT, Input.getOperand(1))});
+}
+
 static SDValue PerformREMCombine(SDNode *N,
                                  TargetLowering::DAGCombinerInfo &DCI,
                                  CodeGenOptLevel OptLevel) {
@@ -6007,6 +6028,8 @@ SDValue NVPTXTargetLowering::PerformDAGCombine(SDNode *N,
     return combineADDRSPACECAST(N, DCI);
   case ISD::AND:
     return PerformANDCombine(N, DCI);
+  case ISD::BITCAST:
+    return combineBitcast(N, DCI);
   case ISD::SIGN_EXTEND:
   case ISD::ZERO_EXTEND:
     return combineMulWide(N, DCI, OptLevel);
diff --git a/llvm/test/CodeGen/NVPTX/f32x2-inlineasm.ll b/llvm/test/CodeGen/NVPTX/f32x2-inlineasm.ll
new file mode 100644
index 0000000000000..4d80ee68faac6
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/f32x2-inlineasm.ll
@@ -0,0 +1,64 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mcpu=sm_90a -O0 -disable-post-ra -frame-pointer=all          \
+; RUN: -verify-machineinstrs | FileCheck --check-prefixes=CHECK %s
+; RUN: %if ptxas-12.7 %{                                                      \
+; RUN:  llc < %s -mcpu=sm_90a -O0 -disable-post-ra -frame-pointer=all         \
+; RUN:  -verify-machineinstrs | %ptxas-verify -arch=sm_90a                    \
+; RUN: %}
+
+target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
+target triple = "nvptx64-nvidia-cuda"
+
+define ptx_kernel void @kernel1() {
+; CHECK-LABEL: kernel1(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b32 %r<11>;
+; CHECK-NEXT:    .reg .b64 %rd<3>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0: // %bb
+; CHECK-NEXT:    mov.b32 %r3, 0;
+; CHECK-NEXT:    mov.b32 %r4, %r3;
+; CHECK-NEXT:    mov.b32 %r1, %r3;
+; CHECK-NEXT:    mov.b32 %r2, %r4;
+; CHECK-NEXT:    // begin inline asm
+; CHECK-NEXT:    { .reg .pred p; setp.ne.b32 p, 66, 0; wgmma.mma_async.sync.aligned.m64n64k16.f32.bf16.bf16 {%r1,%r2}, 64, 65, p, 67, 68, 69, 70; }
+; CHECK-EMPTY:
+; CHECK-NEXT:    // end inline asm
+; CHECK-NEXT:    mov.b32 %r5, %r3;
+; CHECK-NEXT:    // begin inline asm
+; CHECK-NEXT:     mbarrier.arrive.release.cta.shared::cta.b64 %rd1, [%r5], 1; // XXSTART
+; CHECK-NEXT:    // end inline asm
+; CHECK-NEXT:    wgmma.wait_group.sync.aligned 0;
+; CHECK-NEXT:    mov.b32 %r6, %r3;
+; CHECK-NEXT:    // begin inline asm
+; CHECK-NEXT:     mbarrier.arrive.release.cta.shared::cta.b64 %rd2, [%r6], 1; // XXEND
+; CHECK-NEXT:    // end inline asm
+; CHECK-NEXT:    mul.rn.f32 %r7, %r1, 0f00000000;
+; CHECK-NEXT:    mul.rn.f32 %r8, %r2, 0f00000000;
+; CHECK-NEXT:    add.rn.f32 %r9, %r8, %r7;
+; CHECK-NEXT:    shfl.sync.bfly.b32 %r10, %r9, 0, 0, 0;
+; CHECK-NEXT:    ret;
+bb:
+  %i = call { <1 x float>, <1 x float> } asm sideeffect "{ .reg .pred p; setp.ne.b32 p, 66, 0; wgmma.mma_async.sync.aligned.m64n64k16.f32.bf16.bf16 {$0,$1}, 64, 65, p, 67, 68, 69, 70; }\0A", "=f,=f,0,1"(<1 x float> zeroinitializer, <1 x float> zeroinitializer)
+  %i1 = extractvalue { <1 x float>, <1 x float> } %i, 0
+  %i2 = extractvalue { <1 x float>, <1 x float> } %i, 1
+  %i3 = call i64 asm sideeffect " mbarrier.arrive.release.cta.shared::cta.b64 $0, [$1], 1;  // XXSTART ", "=l,r"(ptr addrspace(3) null)
+  call void @llvm.nvvm.wgmma.wait_group.sync.aligned(i64 0)
+  %i4 = shufflevector <1 x float> %i1, <1 x float> %i2, <2 x i32> <i32 0, i32 1>
+  %i5 = call i64 asm sideeffect " mbarrier.arrive.release.cta.shared::cta.b64 $0, [$1], 1; // XXEND ", "=l,r"(ptr addrspace(3) null)
+  %i6 = fmul <2 x float> %i4, zeroinitializer
+  %i7 = extractelement <2 x float> %i6, i64 0
+  %i8 = extractelement <2 x float> %i6, i64 1
+  %i9 = fadd float %i8, %i7
+  %i10 = bitcast float %i9 to <1 x i32>
+  %i11 = extractelement <1 x i32> %i10, i64 0
+  %i12 = call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 0, i32 %i11, i32 0, i32 0)
+  ret void
+}
+
+declare void @llvm.nvvm.wgmma.wait_group.sync.aligned(i64 immarg) #0
+
+declare i32 @llvm.nvvm.shfl.sync.bfly.i32(i32, i32, i32, i32) #1
+
+attributes #0 = { convergent nounwind }
+attributes #1 = { convergent nocallback nounwind memory(inaccessiblemem: readwrite) }

@Prince781 Prince781 force-pushed the dev/pferro/improve-shufflevector-v2f32-nvptx branch from 1e355f1 to ce79a0d on August 13, 2025 19:48
@Prince781 Prince781 force-pushed the dev/pferro/improve-shufflevector-v2f32-nvptx branch 2 times, most recently from e13ac9f to 99a54b3 on August 13, 2025 20:46
Prince781 (Contributor, Author)

Updated the code and simplified test cases.

@Prince781 Prince781 force-pushed the dev/pferro/improve-shufflevector-v2f32-nvptx branch from 99a54b3 to 8db1673 on August 13, 2025 21:41
AlexMaclean (Member) left a comment

This looks reasonable to me, although I have not had a chance to dig into the issue this is trying to address very deeply. Please wait for @Artem-B to take a look as well.

P.S. I wonder if we should consider just making v2i32 a legal type.

Prince781 (Contributor, Author)

@AlexMaclean

> P.S. I wonder if we should consider just making v2i32 a legal type.

I thought about this as well. I'm hesitant to introduce another legal type and plumb it through NVPTX (although it should be far less effort than v2f32) when there aren't any instructions that support v2i32. If we see more bug reports related to this issue then I'll reconsider it.

@Prince781 Prince781 force-pushed the dev/pferro/improve-shufflevector-v2f32-nvptx branch from 6b7a81a to ddab65c on August 13, 2025 22:34
Prince781 (Contributor, Author)

Looks like CI is red because of a broken Arm/Thumb2 test, which should have nothing to do with this change.

Prince781 (Contributor, Author)

@Artem-B ping for review.

Prince781 (Contributor, Author)

Hi @Artem-B,

What are your thoughts on still having this checked in? This doesn't fix your bug, but it still improves PTX codegen even on sm1xx. We don't want to be emitting these logical ops under any circumstances.

Artem-B (Member) commented Aug 20, 2025

It sounds reasonable, but it looks like we're fixing one particular case when we may want a more general fix for the construction of small vectors that fit into b32/b64, regardless of their element type.

LGTM, but add a TODO that we may need a more general solution.

Prince781 (Contributor, Author) commented Aug 21, 2025

> we potentially may want a more general fix for construction of small vectors that fit into b32/b64 regardless of their element type.

I think we shouldn't need to worry about v2i16/v2f16/v2bf16 as all of these types are legal, so bitcasting between them is hassle-free. I'll investigate legalizing v2i32 as @AlexMaclean had suggested.

@Prince781 Prince781 force-pushed the dev/pferro/improve-shufflevector-v2f32-nvptx branch from ddab65c to 3155ed2 on August 21, 2025 17:07
github-actions bot commented Aug 21, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@Prince781 Prince781 force-pushed the dev/pferro/improve-shufflevector-v2f32-nvptx branch 2 times, most recently from 0449a2d to d75d057 on August 21, 2025 19:56
@Prince781 Prince781 enabled auto-merge (squash) August 21, 2025 19:57
@Prince781 Prince781 disabled auto-merge August 21, 2025 19:57
Artem-B (Member) commented Sep 24, 2025

> I've gone ahead and legalized v2i32 and codegen looks better not only in this case but with how we lower 256-bit loads and stores.

I'm not sure I can see it. Do we need to add more test cases to show the benefits? All I see in the test changes is that we prefer b64/v2.b64 loads + splitting moves, over direct loads of v2/v4.b32 into 32-bit regs, and in some cases we're not using 256-bit loads where we should (I think). What am I missing?

Prince781 (Contributor, Author) commented Sep 25, 2025

> > I've gone ahead and legalized v2i32 and codegen looks better not only in this case but with how we lower 256-bit loads and stores.
>
> I'm not sure I can see it.

Sorry, my comment was out of date. The "improvements" were miscompiles on <sm_100 that I fixed.

@Prince781 Prince781 force-pushed the dev/pferro/improve-shufflevector-v2f32-nvptx branch 2 times, most recently from 68ffcd7 to dbf16b6 on September 25, 2025 23:13
Artem-B (Member) left a comment
Still LGTM, but I'm not sure we need the O3 test run.

Since v2f32 is legal but v2i32 is not, this causes some sequences of
operations like bitcast (build_vector) to be lowered inefficiently.
This reverts commit 34859a13d0e4b99d70c250d1dad95392291b61a0.
@Prince781 Prince781 force-pushed the dev/pferro/improve-shufflevector-v2f32-nvptx branch from c2ed0e4 to 9213919 on September 25, 2025 23:49
@Prince781 Prince781 merged commit e05bad4 into llvm:main Sep 26, 2025
9 checks passed
Artem-B (Member) commented Oct 2, 2025

It appears that the patch may have introduced a bug: https://godbolt.org/z/EPPhGEGxT

Looks like we need to expand the trunc operation, too.

Artem-B (Member) commented Oct 2, 2025

This patch appears to fix the crash.

That said, we should be able to lower the truncation v2i32->v2i16 into a single PRMT instruction. LLVM currently generates two PRMT instructions that could be combined. We could use custom lowering for the op, as it would help on all GPU variants.

diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index 3ac7c2874408..48e539037dcc 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -638,6 +638,9 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
   // No support for these operations with v2f32/v2i32
   setOperationAction(ISD::INSERT_VECTOR_ELT, {MVT::v2f32, MVT::v2i32}, Expand);
   setOperationAction(ISD::VECTOR_SHUFFLE, {MVT::v2f32, MVT::v2i32}, Expand);
+
+  setOperationAction(ISD::TRUNCATE, MVT::v2i16, Expand);
+
   // Need custom lowering in case the index is dynamic.
   if (STI.hasF32x2Instructions())
     setOperationAction(ISD::EXTRACT_VECTOR_ELT, {MVT::v2f32, MVT::v2i32},
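As a sanity check on the single-PRMT claim, here is an illustrative Python model of PRMT's default byte-select mode (not NVPTX code; the helper names are made up for this sketch). Selector 0x5410 picks bytes 0, 1, 4, 5 from the two source registers, i.e. the low halfword of each i32 lane, which is exactly the packed v2i16 result of the truncation:

```python
def prmt(a: int, b: int, sel: int) -> int:
    """Model of PTX prmt.b32 in default mode: bytes 0-3 come from
    register a, bytes 4-7 from register b; each selector nibble
    picks one source byte for the corresponding result byte."""
    src = ((b & 0xFFFFFFFF) << 32) | (a & 0xFFFFFFFF)
    out = 0
    for i in range(4):
        idx = (sel >> (4 * i)) & 0x7  # byte index 0..7 (sign-replicate bit ignored here)
        out |= ((src >> (8 * idx)) & 0xFF) << (8 * i)
    return out

def trunc_v2i32_to_v2i16(a: int, b: int) -> int:
    # reference semantics: keep the low 16 bits of each lane,
    # packed as <2 x i16> in one 32-bit register (little-endian)
    return (a & 0xFFFF) | ((b & 0xFFFF) << 16)

for a, b in [(0x11223344, 0x55667788), (0xFFFFFFFF, 0x00000001)]:
    assert prmt(a, b, 0x5410) == trunc_v2i32_to_v2i16(a, b)
```

So one `prmt.b32 d, a, b, 0x5410` over the two lanes would replace the two PRMTs LLVM currently emits.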

Artem-B added a commit to Artem-B/llvm-project that referenced this pull request Oct 2, 2025
llvm#153478 made v2i32 legal on newer GPUs, but we can not lower all operations yet.
Expand the `trunc` operation until we implement efficient lowering.
Artem-B added a commit that referenced this pull request Oct 2, 2025
#153478 made v2i32 legal on newer GPUs, but we can not lower all
operations yet. Expand the `trunc/ext` operation until we implement
efficient lowering.
mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Oct 3, 2025
mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Oct 3, 2025
Artem-B added a commit that referenced this pull request Oct 9, 2025
…162391)

Follow-up on #153478 and #161715.

v2i32 register class exists mostly to facilitate v2f32's use of integer
registers. There are no actual instructions that can apply to v2i32
directly (except bitwise logical ops). Everything else must be done
elementwise.
svkeerthy pushed a commit that referenced this pull request Oct 9, 2025
DharuniRAcharya pushed a commit to DharuniRAcharya/llvm-project that referenced this pull request Oct 13, 2025
akadutta pushed a commit to akadutta/llvm-project that referenced this pull request Oct 14, 2025