
Conversation

@linuxrocks123
Contributor

This is to demonstrate that setOperationAction does not work to solve SWDEV-562122. See #165626

@github-actions

github-actions bot commented Nov 8, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff origin/main HEAD --extensions cpp,h -- llvm/lib/Target/AMDGPU/SIISelLowering.cpp llvm/lib/Target/AMDGPU/SIISelLowering.h --diff_from_common_commit

⚠️
The reproduction instructions above might return results for more than one PR
in a stack if you are using a stacked PR workflow. You can limit the results by
changing origin/main to the base branch/commit you want to compare against.
⚠️

View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index b39c705ff..0ae07e378 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -190,7 +190,7 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
 
     // We don't want the default expansion of 16-bit ABS since we can
     // sign-extend and use the 32-bit ABS operation for 16-bit ABS with SGPRs
-    setOperationAction(ISD::ABS, {MVT::i8,MVT::i16}, Custom);
+    setOperationAction(ISD::ABS, {MVT::i8, MVT::i16}, Custom);
     setOperationAction(ISD::SUB, {MVT::i8}, Custom);
   }
 
@@ -7301,7 +7301,7 @@ static SDValue lowerLaneOp(const SITargetLowering &TLI, SDNode *N,
 void SITargetLowering::ReplaceNodeResults(SDNode *N,
                                           SmallVectorImpl<SDValue> &Results,
                                           SelectionDAG &DAG) const {
-  switch (N->getOpcode()) {    
+  switch (N->getOpcode()) {
   case ISD::INSERT_VECTOR_ELT: {
     if (SDValue Res = lowerINSERT_VECTOR_ELT(SDValue(N, 0), DAG))
       Results.push_back(Res);
@@ -16912,7 +16912,7 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
   switch (N->getOpcode()) {
   case ISD::ABS:
     if (N->getValueType(0) == MVT::i16 || N->getValueType(0) == MVT::i8)
-      return lowerABSi16(SDValue(N,0), DCI.DAG);
+      return lowerABSi16(SDValue(N, 0), DCI.DAG);
     break;
   case ISD::ADD:
     return performAddCombine(N, DCI);


// divergent means will not end up using SGPRs
if (Op->isDivergent())
  return SDValue();
Contributor

This is the opposite of what you want:

Suggested change
-  return SDValue();
+  return Op;

To get the default expansion, you return the original node. Returning SDValue() means "treat as legal".

Contributor Author

linuxrocks123 commented Nov 11, 2025

@arsenm are you sure about that? See https://github.com/linuxrocks123/llvm-project/blob/97c9dddc96f3576fed0762344ce84b2c48e16671/llvm/lib/Target/AMDGPU/SIISelLowering.cpp#L16891. I don't see how -O0 could be working for us if SDValue() means "treat as legal."

Contributor

That is PerformDAGCombine, which is an optional optimization. This is required lowering. The custom lowering falls back on the default expansion:

if (SDValue Res = TLI.LowerOperation(SDValue(Node, 0), DAG)) {
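
Paraphrasing the surrounding legalizer logic from memory (a simplified sketch, not verbatim LLVM code; the helper name is mine): a non-null result from LowerOperation replaces the node, while a null SDValue lets the legalizer fall through to the generic expansion.

#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/TargetLowering.h"
using namespace llvm;

static void legalizeCustomSketch(SDNode *Node, const TargetLowering &TLI,
                                 SelectionDAG &DAG) {
  if (SDValue Res = TLI.LowerOperation(SDValue(Node, 0), DAG)) {
    DAG.ReplaceAllUsesWith(Node, Res.getNode()); // use the custom result
    return;
  }
  // Null result: the legalizer proceeds to the default (Expand) handling.
}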

Contributor

Or it's the other way around.

@linuxrocks123
Contributor Author

Hi @arsenm, in this PR, I have redundantly used every target-specific hook suggested by you or @RKSimon to demonstrate that none of them are functioning for our intended purpose. If I am using one of those hooks incorrectly, a review comment explaining how that hook could be made to work for i8 would allow me to change the implementation of #165626 back to using the appropriate hook. If I'm not, then we know the current implementation approach used by #165626 is optimal.

Thank you for your other comments, but, since this PR is not intended for merge, it does not need any review except to verify that I have not missed a way to make one of these hooks work. For that reason, I have resolved all existing conversations to ensure that I don't miss a comment from you or @RKSimon explaining how a hook could be made to function.

Regarding i8 and i16 versus just i16, I don't care whether we fix abs in one PR or two, but, even if we only do i16 in #165626, the approach used there should be an approach that will work for both types in order to avoid unnecessary churn later on. Thanks again for your help.

@arsenm
Contributor

arsenm commented Nov 12, 2025

What is the issue here? This looks like it's working fine?

@@ -0,0 +1,1478 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=amdgcn -mcpu=tahiti -o - < %s | FileCheck %s --check-prefixes=SDAG6
Contributor

Should use an explicit -global-isel=0 for the DAG RUN lines.
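
For example, adapting the RUN line quoted above:

; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=tahiti -o - < %s | FileCheck %s --check-prefixes=SDAG6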

@arsenm
Contributor

arsenm commented Nov 12, 2025

I checked out this branch and it seems to work fine? I don't see what the issue is.

@llvmbot
Member

llvmbot commented Nov 12, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Patrick Simmons (linuxrocks123)

Changes

This is to demonstrate that setOperationAction does not work to solve SWDEV-562122. See #165626


Patch is 85.67 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/167064.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+43-2)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.h (+1)
  • (removed) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.abs.ll (-600)
  • (modified) llvm/test/CodeGen/AMDGPU/absdiff.ll (+2-6)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.abs.ll (+1478)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 8bb28084159e8..95910b68f9c7d 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -42,6 +42,7 @@
 #include "llvm/IR/IntrinsicsR600.h"
 #include "llvm/IR/MDBuilder.h"
 #include "llvm/Support/CommandLine.h"
+#include "llvm/Support/Compiler.h"
 #include "llvm/Support/KnownBits.h"
 #include "llvm/Support/ModRef.h"
 #include "llvm/Transforms/Utils/LowerAtomic.h"
@@ -177,6 +178,10 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
     addRegisterClass(MVT::v32i16, &AMDGPU::SGPR_512RegClass);
     addRegisterClass(MVT::v32f16, &AMDGPU::SGPR_512RegClass);
     addRegisterClass(MVT::v32bf16, &AMDGPU::SGPR_512RegClass);
+
+    // We don't want the default expansion of 16-bit ABS since we can
+    // sign-extend and use the 32-bit ABS operation for 16-bit ABS with SGPRs
+    setOperationAction(ISD::ABS, {MVT::i8,MVT::i16}, Custom);
   }
 
   addRegisterClass(MVT::v32i32, &AMDGPU::VReg_1024RegClass);
@@ -974,7 +979,8 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
                        Custom);
   }
 
-  setTargetDAGCombine({ISD::ADD,
+  setTargetDAGCombine({ISD::ABS,
+                       ISD::ADD,
                        ISD::PTRADD,
                        ISD::UADDO_CARRY,
                        ISD::SUB,
@@ -6774,6 +6780,9 @@ SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
   case ISD::DEBUGTRAP:
     return lowerDEBUGTRAP(Op, DAG);
   case ISD::ABS:
+    if (Op.getValueType() == MVT::i16 || Op.getValueType() == MVT::i8)
+      return lowerABSi16(Op, DAG);
+    LLVM_FALLTHROUGH;
   case ISD::FABS:
   case ISD::FNEG:
   case ISD::FCANONICALIZE:
@@ -7272,7 +7281,7 @@ static SDValue lowerLaneOp(const SITargetLowering &TLI, SDNode *N,
 void SITargetLowering::ReplaceNodeResults(SDNode *N,
                                           SmallVectorImpl<SDValue> &Results,
                                           SelectionDAG &DAG) const {
-  switch (N->getOpcode()) {
+  switch (N->getOpcode()) {    
   case ISD::INSERT_VECTOR_ELT: {
     if (SDValue Res = lowerINSERT_VECTOR_ELT(SDValue(N, 0), DAG))
       Results.push_back(Res);
@@ -7450,6 +7459,15 @@ void SITargetLowering::ReplaceNodeResults(SDNode *N,
     Results.push_back(lowerFSQRTF16(SDValue(N, 0), DAG));
     break;
   }
+  case ISD::ABS:
+    if (N->getValueType(0) == MVT::i16 || N->getValueType(0) == MVT::i8) {
+      SDValue result = lowerABSi16(SDValue(N, 0), DAG);
+      if(result!=SDValue()) {
+        Results.push_back(result);
+        return;
+      }
+    }
+    LLVM_FALLTHROUGH;
   default:
     AMDGPUTargetLowering::ReplaceNodeResults(N, Results, DAG);
     break;
@@ -8139,6 +8157,25 @@ SDValue SITargetLowering::lowerDEBUGTRAP(SDValue Op, SelectionDAG &DAG) const {
   return DAG.getNode(AMDGPUISD::TRAP, SL, MVT::Other, Ops);
 }
 
+// sign-extend and use the 32-bit ABS operation for 16-bit ABS with SGPRs
+SDValue SITargetLowering::lowerABSi16(SDValue Op, SelectionDAG &DAG) const {
+  assert(Op.getOpcode() == ISD::ABS &&
+         "Tried to select abs with non-abs opcode.");
+  assert((Op.getValueType() == MVT::i16 || Op.getValueType() == MVT::i8) &&
+         "Tried to select abs i16 lowering with non-i16 type.");
+
+  // divergent means will not end up using SGPRs
+  if (Op->isDivergent())
+    return SDValue();
+
+  //(abs i16 (i16 op1)) -> (trunc i16 (abs i32 (sext i32 (i16 op1))))
+  SDValue Src = Op.getOperand(0);
+  SDLoc DL(Src);
+  SDValue SExtSrc = DAG.getSExtOrTrunc(Src, DL, MVT::i32);
+  SDValue ExtAbs = DAG.getNode(ISD::ABS, DL, MVT::i32, SExtSrc);
+  return DAG.getNode(ISD::TRUNCATE, DL, Op.getValueType(), ExtAbs);
+}
+
 SDValue SITargetLowering::getSegmentAperture(unsigned AS, const SDLoc &DL,
                                              SelectionDAG &DAG) const {
   if (Subtarget->hasApertureRegs()) {
@@ -16855,6 +16892,10 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
     return SDValue();
 
   switch (N->getOpcode()) {
+  case ISD::ABS:
+    if (N->getValueType(0) == MVT::i16 || N->getValueType(0) == MVT::i8)
+      return lowerABSi16(SDValue(N,0), DCI.DAG);
+    break;
   case ISD::ADD:
     return performAddCombine(N, DCI);
   case ISD::PTRADD:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index 74e58f4272e10..25e94851c24df 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -184,6 +184,7 @@ class SITargetLowering final : public AMDGPUTargetLowering {
   SDValue lowerTrapHsaQueuePtr(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerTrapHsa(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerDEBUGTRAP(SDValue Op, SelectionDAG &DAG) const;
+  SDValue lowerABSi16(SDValue Op, SelectionDAG &DAG) const;
 
   SDNode *adjustWritemask(MachineSDNode *&N, SelectionDAG &DAG) const;
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.abs.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.abs.ll
deleted file mode 100644
index 6facdfdec64ae..0000000000000
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.abs.ll
+++ /dev/null
@@ -1,600 +0,0 @@
-; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=tahiti -o - < %s | FileCheck %s --check-prefixes=GFX,GFX6
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=fiji -o - < %s | FileCheck %s --check-prefixes=GFX,GFX8
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 -o - < %s | FileCheck %s --check-prefixes=GFX,GFX10
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1250 -o - < %s | FileCheck %s --check-prefixes=GFX,GFX1250
-
-declare i16 @llvm.abs.i16(i16, i1)
-declare i32 @llvm.abs.i32(i32, i1)
-declare i64 @llvm.abs.i64(i64, i1)
-declare <2 x i8> @llvm.abs.v2i8(<2 x i8>, i1)
-declare <3 x i8> @llvm.abs.v3i8(<3 x i8>, i1)
-declare <2 x i16> @llvm.abs.v2i16(<2 x i16>, i1)
-declare <3 x i16> @llvm.abs.v3i16(<3 x i16>, i1)
-declare <4 x i32> @llvm.abs.v4i32(<4 x i32>, i1)
-
-define amdgpu_cs i16 @abs_sgpr_i16(i16 inreg %arg) {
-; GFX6-LABEL: abs_sgpr_i16:
-; GFX6:       ; %bb.0:
-; GFX6-NEXT:    s_sext_i32_i16 s0, s0
-; GFX6-NEXT:    s_abs_i32 s0, s0
-; GFX6-NEXT:    ; return to shader part epilog
-;
-; GFX8-LABEL: abs_sgpr_i16:
-; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_sext_i32_i16 s0, s0
-; GFX8-NEXT:    s_abs_i32 s0, s0
-; GFX8-NEXT:    ; return to shader part epilog
-;
-; GFX10-LABEL: abs_sgpr_i16:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    s_sext_i32_i16 s0, s0
-; GFX10-NEXT:    s_abs_i32 s0, s0
-; GFX10-NEXT:    ; return to shader part epilog
-;
-; GFX1250-LABEL: abs_sgpr_i16:
-; GFX1250:       ; %bb.0:
-; GFX1250-NEXT:    s_sext_i32_i16 s0, s0
-; GFX1250-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX1250-NEXT:    s_abs_i32 s0, s0
-; GFX1250-NEXT:    ; return to shader part epilog
-  %res = call i16 @llvm.abs.i16(i16 %arg, i1 false)
-  ret i16 %res
-}
-
-define amdgpu_cs i32 @abs_sgpr_i32(i32 inreg %arg) {
-; GFX-LABEL: abs_sgpr_i32:
-; GFX:       ; %bb.0:
-; GFX-NEXT:    s_abs_i32 s0, s0
-; GFX-NEXT:    ; return to shader part epilog
-  %res = call i32 @llvm.abs.i32(i32 %arg, i1 false)
-  ret i32 %res
-}
-
-define amdgpu_cs i64 @abs_sgpr_i64(i64 inreg %arg) {
-; GFX6-LABEL: abs_sgpr_i64:
-; GFX6:       ; %bb.0:
-; GFX6-NEXT:    s_ashr_i32 s2, s1, 31
-; GFX6-NEXT:    s_add_u32 s0, s0, s2
-; GFX6-NEXT:    s_mov_b32 s3, s2
-; GFX6-NEXT:    s_addc_u32 s1, s1, s2
-; GFX6-NEXT:    s_xor_b64 s[0:1], s[0:1], s[2:3]
-; GFX6-NEXT:    ; return to shader part epilog
-;
-; GFX8-LABEL: abs_sgpr_i64:
-; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_ashr_i32 s2, s1, 31
-; GFX8-NEXT:    s_add_u32 s0, s0, s2
-; GFX8-NEXT:    s_mov_b32 s3, s2
-; GFX8-NEXT:    s_addc_u32 s1, s1, s2
-; GFX8-NEXT:    s_xor_b64 s[0:1], s[0:1], s[2:3]
-; GFX8-NEXT:    ; return to shader part epilog
-;
-; GFX10-LABEL: abs_sgpr_i64:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    s_ashr_i32 s2, s1, 31
-; GFX10-NEXT:    s_add_u32 s0, s0, s2
-; GFX10-NEXT:    s_mov_b32 s3, s2
-; GFX10-NEXT:    s_addc_u32 s1, s1, s2
-; GFX10-NEXT:    s_xor_b64 s[0:1], s[0:1], s[2:3]
-; GFX10-NEXT:    ; return to shader part epilog
-;
-; GFX1250-LABEL: abs_sgpr_i64:
-; GFX1250:       ; %bb.0:
-; GFX1250-NEXT:    s_ashr_i32 s2, s1, 31
-; GFX1250-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX1250-NEXT:    s_mov_b32 s3, s2
-; GFX1250-NEXT:    s_add_nc_u64 s[0:1], s[0:1], s[2:3]
-; GFX1250-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX1250-NEXT:    s_xor_b64 s[0:1], s[0:1], s[2:3]
-; GFX1250-NEXT:    ; return to shader part epilog
-  %res = call i64 @llvm.abs.i64(i64 %arg, i1 false)
-  ret i64 %res
-}
-
-define amdgpu_cs <4 x i32> @abs_sgpr_v4i32(<4 x i32> inreg %arg) {
-; GFX-LABEL: abs_sgpr_v4i32:
-; GFX:       ; %bb.0:
-; GFX-NEXT:    s_abs_i32 s0, s0
-; GFX-NEXT:    s_abs_i32 s1, s1
-; GFX-NEXT:    s_abs_i32 s2, s2
-; GFX-NEXT:    s_abs_i32 s3, s3
-; GFX-NEXT:    ; return to shader part epilog
-  %res = call <4 x i32> @llvm.abs.v4i32(<4 x i32> %arg, i1 false)
-  ret <4 x i32> %res
-}
-
-define i16 @abs_vgpr_i16(i16 %arg) {
-; GFX6-LABEL: abs_vgpr_i16:
-; GFX6:       ; %bb.0:
-; GFX6-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT:    v_bfe_i32 v0, v0, 0, 16
-; GFX6-NEXT:    v_sub_i32_e32 v1, vcc, 0, v0
-; GFX6-NEXT:    v_max_i32_e32 v0, v0, v1
-; GFX6-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX8-LABEL: abs_vgpr_i16:
-; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    v_sub_u16_e32 v1, 0, v0
-; GFX8-NEXT:    v_max_i16_e32 v0, v0, v1
-; GFX8-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX10-LABEL: abs_vgpr_i16:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    v_sub_nc_u16 v1, 0, v0
-; GFX10-NEXT:    v_max_i16 v0, v0, v1
-; GFX10-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1250-LABEL: abs_vgpr_i16:
-; GFX1250:       ; %bb.0:
-; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT:    s_wait_kmcnt 0x0
-; GFX1250-NEXT:    v_sub_nc_u16 v1, 0, v0
-; GFX1250-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1250-NEXT:    v_max_i16 v0, v0, v1
-; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
-  %res = call i16 @llvm.abs.i16(i16 %arg, i1 false)
-  ret i16 %res
-}
-
-define i32 @abs_vgpr_i32(i32 %arg) {
-; GFX6-LABEL: abs_vgpr_i32:
-; GFX6:       ; %bb.0:
-; GFX6-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT:    v_sub_i32_e32 v1, vcc, 0, v0
-; GFX6-NEXT:    v_max_i32_e32 v0, v0, v1
-; GFX6-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX8-LABEL: abs_vgpr_i32:
-; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    v_sub_u32_e32 v1, vcc, 0, v0
-; GFX8-NEXT:    v_max_i32_e32 v0, v0, v1
-; GFX8-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX10-LABEL: abs_vgpr_i32:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    v_sub_nc_u32_e32 v1, 0, v0
-; GFX10-NEXT:    v_max_i32_e32 v0, v0, v1
-; GFX10-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1250-LABEL: abs_vgpr_i32:
-; GFX1250:       ; %bb.0:
-; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT:    s_wait_kmcnt 0x0
-; GFX1250-NEXT:    v_sub_nc_u32_e32 v1, 0, v0
-; GFX1250-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1250-NEXT:    v_max_i32_e32 v0, v0, v1
-; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
-  %res = call i32 @llvm.abs.i32(i32 %arg, i1 false)
-  ret i32 %res
-}
-
-define i64 @abs_vgpr_i64(i64 %arg) {
-; GFX6-LABEL: abs_vgpr_i64:
-; GFX6:       ; %bb.0:
-; GFX6-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT:    v_ashrrev_i32_e32 v2, 31, v1
-; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
-; GFX6-NEXT:    v_addc_u32_e32 v1, vcc, v1, v2, vcc
-; GFX6-NEXT:    v_xor_b32_e32 v0, v0, v2
-; GFX6-NEXT:    v_xor_b32_e32 v1, v1, v2
-; GFX6-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX8-LABEL: abs_vgpr_i64:
-; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    v_ashrrev_i32_e32 v2, 31, v1
-; GFX8-NEXT:    v_add_u32_e32 v0, vcc, v0, v2
-; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, v1, v2, vcc
-; GFX8-NEXT:    v_xor_b32_e32 v0, v0, v2
-; GFX8-NEXT:    v_xor_b32_e32 v1, v1, v2
-; GFX8-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX10-LABEL: abs_vgpr_i64:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    v_ashrrev_i32_e32 v2, 31, v1
-; GFX10-NEXT:    v_add_co_u32 v0, vcc_lo, v0, v2
-; GFX10-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, v1, v2, vcc_lo
-; GFX10-NEXT:    v_xor_b32_e32 v0, v0, v2
-; GFX10-NEXT:    v_xor_b32_e32 v1, v1, v2
-; GFX10-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1250-LABEL: abs_vgpr_i64:
-; GFX1250:       ; %bb.0:
-; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT:    s_wait_kmcnt 0x0
-; GFX1250-NEXT:    v_ashrrev_i32_e32 v2, 31, v1
-; GFX1250-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT:    v_mov_b32_e32 v3, v2
-; GFX1250-NEXT:    v_add_nc_u64_e32 v[0:1], v[0:1], v[2:3]
-; GFX1250-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX1250-NEXT:    v_xor_b32_e32 v0, v0, v2
-; GFX1250-NEXT:    v_xor_b32_e32 v1, v1, v2
-; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
-  %res = call i64 @llvm.abs.i64(i64 %arg, i1 false)
-  ret i64 %res
-}
-
-define <4 x i32> @abs_vgpr_v4i32(<4 x i32> %arg) {
-; GFX6-LABEL: abs_vgpr_v4i32:
-; GFX6:       ; %bb.0:
-; GFX6-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, 0, v0
-; GFX6-NEXT:    v_max_i32_e32 v0, v0, v4
-; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, 0, v1
-; GFX6-NEXT:    v_max_i32_e32 v1, v1, v4
-; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, 0, v2
-; GFX6-NEXT:    v_max_i32_e32 v2, v2, v4
-; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, 0, v3
-; GFX6-NEXT:    v_max_i32_e32 v3, v3, v4
-; GFX6-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX8-LABEL: abs_vgpr_v4i32:
-; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    v_sub_u32_e32 v4, vcc, 0, v0
-; GFX8-NEXT:    v_max_i32_e32 v0, v0, v4
-; GFX8-NEXT:    v_sub_u32_e32 v4, vcc, 0, v1
-; GFX8-NEXT:    v_max_i32_e32 v1, v1, v4
-; GFX8-NEXT:    v_sub_u32_e32 v4, vcc, 0, v2
-; GFX8-NEXT:    v_max_i32_e32 v2, v2, v4
-; GFX8-NEXT:    v_sub_u32_e32 v4, vcc, 0, v3
-; GFX8-NEXT:    v_max_i32_e32 v3, v3, v4
-; GFX8-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX10-LABEL: abs_vgpr_v4i32:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    v_sub_nc_u32_e32 v4, 0, v0
-; GFX10-NEXT:    v_sub_nc_u32_e32 v5, 0, v1
-; GFX10-NEXT:    v_sub_nc_u32_e32 v6, 0, v2
-; GFX10-NEXT:    v_sub_nc_u32_e32 v7, 0, v3
-; GFX10-NEXT:    v_max_i32_e32 v0, v0, v4
-; GFX10-NEXT:    v_max_i32_e32 v1, v1, v5
-; GFX10-NEXT:    v_max_i32_e32 v2, v2, v6
-; GFX10-NEXT:    v_max_i32_e32 v3, v3, v7
-; GFX10-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1250-LABEL: abs_vgpr_v4i32:
-; GFX1250:       ; %bb.0:
-; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT:    s_wait_kmcnt 0x0
-; GFX1250-NEXT:    v_dual_sub_nc_u32 v4, 0, v0 :: v_dual_sub_nc_u32 v5, 0, v1
-; GFX1250-NEXT:    v_dual_sub_nc_u32 v6, 0, v2 :: v_dual_sub_nc_u32 v7, 0, v3
-; GFX1250-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX1250-NEXT:    v_max_i32_e32 v0, v0, v4
-; GFX1250-NEXT:    v_max_i32_e32 v1, v1, v5
-; GFX1250-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX1250-NEXT:    v_max_i32_e32 v2, v2, v6
-; GFX1250-NEXT:    v_max_i32_e32 v3, v3, v7
-; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
-  %res = call <4 x i32> @llvm.abs.v4i32(<4 x i32> %arg, i1 false)
-  ret <4 x i32> %res
-}
-
-define amdgpu_cs <2 x i8> @abs_sgpr_v2i8(<2 x i8> inreg %arg) {
-; GFX-LABEL: abs_sgpr_v2i8:
-; GFX:       ; %bb.0:
-; GFX-NEXT:    s_sext_i32_i8 s0, s0
-; GFX-NEXT:    s_sext_i32_i8 s1, s1
-; GFX-NEXT:    s_abs_i32 s0, s0
-; GFX-NEXT:    s_abs_i32 s1, s1
-; GFX-NEXT:    ; return to shader part epilog
-  %res = call <2 x i8> @llvm.abs.v2i8(<2 x i8> %arg, i1 false)
-  ret <2 x i8> %res
-}
-
-define <2 x i8> @abs_vgpr_v2i8(<2 x i8> %arg) {
-; GFX6-LABEL: abs_vgpr_v2i8:
-; GFX6:       ; %bb.0:
-; GFX6-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT:    v_bfe_i32 v0, v0, 0, 8
-; GFX6-NEXT:    v_sub_i32_e32 v2, vcc, 0, v0
-; GFX6-NEXT:    v_bfe_i32 v1, v1, 0, 8
-; GFX6-NEXT:    v_max_i32_e32 v0, v0, v2
-; GFX6-NEXT:    v_sub_i32_e32 v2, vcc, 0, v1
-; GFX6-NEXT:    v_max_i32_e32 v1, v1, v2
-; GFX6-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX8-LABEL: abs_vgpr_v2i8:
-; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    v_mov_b32_e32 v2, 0
-; GFX8-NEXT:    v_sub_u16_sdwa v3, v2, sext(v0) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX8-NEXT:    v_sub_u16_sdwa v2, v2, sext(v1) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX8-NEXT:    v_max_i16_sdwa v0, sext(v0), v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX8-NEXT:    v_max_i16_sdwa v1, sext(v1), v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX8-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX10-LABEL: abs_vgpr_v2i8:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    v_bfe_i32 v0, v0, 0, 8
-; GFX10-NEXT:    v_bfe_i32 v1, v1, 0, 8
-; GFX10-NEXT:    v_sub_nc_u16 v2, 0, v0
-; GFX10-NEXT:    v_sub_nc_u16 v3, 0, v1
-; GFX10-NEXT:    v_max_i16 v0, v0, v2
-; GFX10-NEXT:    v_max_i16 v1, v1, v3
-; GFX10-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1250-LABEL: abs_vgpr_v2i8:
-; GFX1250:       ; %bb.0:
-; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT:    s_wait_kmcnt 0x0
-; GFX1250-NEXT:    v_bfe_i32 v0, v0, 0, 8
-; GFX1250-NEXT:    v_bfe_i32 v1, v1, 0, 8
-; GFX1250-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX1250-NEXT:    v_sub_nc_u16 v2, 0, v0
-; GFX1250-NEXT:    v_sub_nc_u16 v3, 0, v1
-; GFX1250-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX1250-NEXT:    v_max_i16 v0, v0, v2
-; GFX1250-NEXT:    v_max_i16 v1, v1, v3
-; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
-  %res = call <2 x i8> @llvm.abs.v2i8(<2 x i8> %arg, i1 false)
-  ret <2 x i8> %res
-}
-
-define amdgpu_cs <3 x i8> @abs_sgpr_v3i8(<3 x i8> inreg %arg) {
-; GFX-LABEL: abs_sgpr_v3i8:
-; GFX:       ; %bb.0:
-; GFX-NEXT:    s_sext_i32_i8 s0, s0
-; GFX-NEXT:    s_sext_i32_i8 s1, s1
-; GFX-NEXT:    s_sext_i32_i8 s2, s2
-; GFX-NEXT:    s_abs_i32 s0, s0
-; GFX-NEXT:    s_abs_i32 s1, s1
-; GFX-NEXT:    s_abs_i32 s2, s2
-; GFX-NEXT:    ; return to shader part epilog
-  %res = call <3 x i8> @llvm.abs.v3i8(<3 x i8> %arg, i1 false)
-  ret <3 x i8> %res
-}
-
-define <3 x i8> @abs_vgpr_v3i8(<3 x i8>  %arg) {
-; GFX6-LABEL: abs_vgpr_v3i8:
-; GFX6:       ; %bb.0:
-; GFX6-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT:    v_bfe_i32 v0, v0, 0, 8
-; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, 0, v0
-; GFX6-NEXT:    v_bfe_i32 v1, v1, 0, 8
-; GFX6-NEXT:    v_max_i32_e32 v0, v0, v3
-; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, 0, v1
-; GFX6-NEXT:    v_bfe_i32 v2, v2, 0, 8
-; GFX6-NEXT:    v_max_i32_e32 v1, v1, v3
-; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, 0, v2
-; GFX6-NEXT:    v_max_i32_e32 v2, v2, v3
-; GFX6-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX8-LABEL: abs_vgpr_v3i8:
-; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    v_mov_b32_e32 v3, 0
-; GFX8-NEXT:    v_sub_u16_sdwa v4, v3, sext(v0) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX8-NEXT:    v_max_i16_sdwa v0, sext(v0), v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX8-NEXT:    v_sub_u16_sdwa v4, v3, sext(v1) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX8-NEXT:    v_sub_u16_sdwa v3, v3, sext(v2) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX8-NEXT:    v_max_i16_sdwa v1, sext(v1), v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX8-NEXT:    v_max_i16_sdwa v2, sext(v2), v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX8-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX10-LABEL: abs_vgpr_v3i8:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    v_bfe_i32 v0, v0, 0, 8
-; GFX10-NEXT:    v_bfe_i32 v1, v1, 0, 8
-; GF...
[truncated]

@linuxrocks123
Contributor Author

@arsenm here are the outputs of llvm.abs.ll for PR #165626 (good.txt) and this PR (bad.txt). You can compare them in a tool such as meld to see that the code for abs_sgpr_i8_neg is significantly worse in this PR for every sub-architecture because the target-independent expandABS is transforming it.

fiji.bad.txt
fiji.good.txt
gfx1010.bad.txt
gfx1010.good.txt
gfx1250.bad.txt
gfx1250.good.txt
tahiti.bad.txt
tahiti.good.txt

@arsenm
Contributor

arsenm commented Nov 12, 2025

This PR is missing any handling of the neg (abs) case. I'd expect that to be a separate combine rooted on the neg, not something that changes the abs lowering.
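
A hypothetical shape for such a combine (a sketch only, not from this patch; the helper name and exact guards are my own): match (sub 0, (abs x)) on uniform i8/i16 and widen it to the 32-bit form, leaving the ISD::ABS lowering untouched.

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Rooted on the SUB node: rewrite 0 - abs(x) as trunc(0 - abs(sext x)).
static SDValue combineNegOfSmallAbs(SDNode *N, SelectionDAG &DAG) {
  EVT VT = N->getValueType(0);
  if (VT != MVT::i8 && VT != MVT::i16)
    return SDValue();
  SDValue LHS = N->getOperand(0), RHS = N->getOperand(1);
  if (!isNullConstant(LHS) || RHS.getOpcode() != ISD::ABS || N->isDivergent())
    return SDValue();
  SDLoc DL(N);
  SDValue Ext = DAG.getSExtOrTrunc(RHS.getOperand(0), DL, MVT::i32);
  SDValue Abs = DAG.getNode(ISD::ABS, DL, MVT::i32, Ext);
  SDValue Neg = DAG.getNode(ISD::SUB, DL, MVT::i32,
                            DAG.getConstant(0, DL, MVT::i32), Abs);
  return DAG.getNode(ISD::TRUNCATE, DL, VT, Neg);
}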

@linuxrocks123
Contributor Author

@arsenm ideally the negative case should not need separate code, since (sub 0, (abs x)) contains an instance of (abs x) that can be transformed exactly the same way as the positive case. However, perhaps overriding the hook for sub i8 / sub i16 with a thunk to the existing transformation will make it work. I'll try that.

@arsenm
Contributor

arsenm commented Nov 13, 2025

> @arsenm ideally the negative case should not need separate code, since (sub 0, (abs x)) contains an instance of (abs x) that can be transformed exactly the same way as the positive case. However, perhaps overriding the hook for sub i8 / sub i16 with a thunk to the existing transformation will make it work. I'll try that.

You don't need to make the full transform, just do the reassociation to push the neg through the abs in a combine, without touching any lowering. The risk is that something is actively undoing that, in which case you may need one of the TargetLowering hooks to avoid infinite-looping the combiner.

@jayfoad
Contributor

jayfoad commented Nov 13, 2025

> @arsenm ideally the negative case should not need separate code, since (sub 0, (abs x)) contains an instance of (abs x) that can be transformed exactly the same way as the positive case. However, perhaps overriding the hook for sub i8 / sub i16 with a thunk to the existing transformation will make it work. I'll try that.
>
> You don't need to make the full transform, just do the reassociation to push the neg through the abs in a combine, without touching any lowering. The risk is that something is actively undoing that, in which case you may need one of the TargetLowering hooks to avoid infinite-looping the combiner.

You can't push a neg through an abs: (neg (abs x)) is not the same as (abs (neg x)). For example, with x = 5, (neg (abs x)) = -5 but (abs (neg x)) = 5.

@linuxrocks123
Contributor Author

@arsenm @jayfoad this latest commit is my best effort to make abs_i8_neg generate good code using the SelectionDAG hooks. It still doesn't work.

If we strongly want to do this without overriding expandABS, I had a thought that maybe it would be possible to change the upstream expandABS to perform this transformation on all targets, but I haven't looked into that. Please let me know if you have any opinions on that.
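
For reference, the shift-based form of the generic expansion looks roughly like this (paraphrasing TargetLowering::expandABS from memory; the real code prefers smax/umin-based forms when those are legal, and the helper name here is mine):

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// abs(x) = xor(add(x, s), s) where s = sra(x, bitwidth - 1)
static SDValue expandAbsShiftForm(SDNode *N, SelectionDAG &DAG) {
  SDLoc DL(N);
  EVT VT = N->getValueType(0);
  SDValue X = N->getOperand(0);
  SDValue Shift =
      DAG.getNode(ISD::SRA, DL, VT, X,
                  DAG.getConstant(VT.getScalarSizeInBits() - 1, DL, VT));
  SDValue Add = DAG.getNode(ISD::ADD, DL, VT, X, Shift);
  return DAG.getNode(ISD::XOR, DL, VT, Add, Shift);
}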

Comment on lines +6808 to +6816
if (Op.getValueType() == MVT::i8)
  if (isNullConstant(Op.getOperand(0)) &&
      Op.getOperand(1).getOpcode() == ISD::ABS)
    return DAG.getNode(
        ISD::TRUNCATE, SDLoc(Op), MVT::i8,
        DAG.getNode(ISD::SUB, SDLoc(Op), MVT::i32,
                    DAG.getConstant(0, SDLoc(Op), MVT::i32),
                    lowerABSi16(Op.getOperand(1), DAG).getOperand(0)));
  else
Contributor

This looks essentially correct, but it's in the wrong place. This is not mandatory lowering, so this should go in PerformDAGCombine. Doesn't this also need the divergence check?

Contributor Author

linuxrocks123 commented Nov 19, 2025

It's also in performSubCombine; it doesn't work there, either. And yes, it's missing the divergence check, but adding restrictions to the optimization wouldn't make it work.

//(abs i16 (i16 op1)) -> (trunc i16 (abs i32 (sext i32 (i16 op1))))
SDValue Src = Op.getOperand(0);
SDLoc DL(Src);
SDValue SExtSrc = DAG.getSExtOrTrunc(Src, DL, MVT::i32);
Contributor

You can slightly generalize this to work on vectors.

Contributor Author

Does our ISA have a vector equivalent of s_sext_i16 / s_sext_i8?

Contributor

There's v_bfe_i32.
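
A sketch of that generalization (illustrative only; the helper name is mine): widen each element to i32, reuse the scalar path, and truncate back, so the VGPR sign-extends can map to v_bfe_i32 as noted above.

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Hypothetical widened form handling both scalar and vector i8/i16.
static SDValue lowerSmallAbsViaI32(SDValue Op, SelectionDAG &DAG) {
  EVT VT = Op.getValueType();
  EVT WideVT =
      VT.isVector() ? EVT::getVectorVT(*DAG.getContext(), MVT::i32,
                                       VT.getVectorElementCount())
                    : EVT(MVT::i32);
  SDLoc DL(Op);
  SDValue Ext = DAG.getSExtOrTrunc(Op.getOperand(0), DL, WideVT);
  SDValue Abs = DAG.getNode(ISD::ABS, DL, WideVT, Ext);
  return DAG.getNode(ISD::TRUNCATE, DL, VT, Abs);
}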

@arsenm
Contributor

arsenm commented Nov 15, 2025

> If we strongly want to do this without overriding expandABS, I had a thought that maybe it would be possible to change the upstream expandABS to perform this transformation on all targets, but I haven't looked into that. Please let me know if you have any opinions on that.

It doesn't generally apply though, due to the conditional on the divergence so I'm not sure touching the generic code helps at all

@linuxrocks123
Contributor Author

linuxrocks123 commented Nov 19, 2025

@arsenm I have resolved the comments unrelated to this PR's purpose of demonstrating that this approach doesn't work for -abs(i8).
