[AMDGPU] [DO NOT MERGE] Nonsuccessful Attempt At Using SelectionDAG Hooks for abs i8/i16 #167064
Conversation
You can test this locally with the following command:

  git-clang-format --diff origin/main HEAD --extensions cpp,h -- llvm/lib/Target/AMDGPU/SIISelLowering.cpp llvm/lib/Target/AMDGPU/SIISelLowering.h --diff_from_common_commit

View the diff from clang-format here.

diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index b39c705ff..0ae07e378 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -190,7 +190,7 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
// We don't want the default expansion of 16-bit ABS since we can
// sign-extend and use the 32-bit ABS operation for 16-bit ABS with SGPRs
- setOperationAction(ISD::ABS, {MVT::i8,MVT::i16}, Custom);
+ setOperationAction(ISD::ABS, {MVT::i8, MVT::i16}, Custom);
setOperationAction(ISD::SUB, {MVT::i8}, Custom);
}
@@ -7301,7 +7301,7 @@ static SDValue lowerLaneOp(const SITargetLowering &TLI, SDNode *N,
void SITargetLowering::ReplaceNodeResults(SDNode *N,
SmallVectorImpl<SDValue> &Results,
SelectionDAG &DAG) const {
- switch (N->getOpcode()) {
+ switch (N->getOpcode()) {
case ISD::INSERT_VECTOR_ELT: {
if (SDValue Res = lowerINSERT_VECTOR_ELT(SDValue(N, 0), DAG))
Results.push_back(Res);
@@ -16912,7 +16912,7 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
switch (N->getOpcode()) {
case ISD::ABS:
if (N->getValueType(0) == MVT::i16 || N->getValueType(0) == MVT::i8)
- return lowerABSi16(SDValue(N,0), DCI.DAG);
+ return lowerABSi16(SDValue(N, 0), DCI.DAG);
break;
case ISD::ADD:
return performAddCombine(N, DCI);
  // divergent means will not end up using SGPRs
  if (Op->isDivergent())
    return SDValue();
This is the opposite of what you want:
Suggested change:
-    return SDValue();
+    return Op;
To get the default expansion, you return the original node. Returning SDValue() means the node is treated as legal.
@arsenm are you sure about that? See https://github.com/linuxrocks123/llvm-project/blob/97c9dddc96f3576fed0762344ce84b2c48e16671/llvm/lib/Target/AMDGPU/SIISelLowering.cpp#L16891. I don't see how -O0 could be working for us if SDValue() means "treat as legal."
That is PerformDAGCombine, which is an optional optimization. This is required lowering. The custom lowering falls back on the default expansion:
  if (SDValue Res = TLI.LowerOperation(SDValue(Node, 0), DAG)) {
Or it's the other way around
Hi @arsenm, in this PR I have redundantly used every target-specific hook suggested by you or @RKSimon to demonstrate that none of them works for our intended purpose. If I am using one of those hooks incorrectly, a review comment explaining how that hook could be made to work for […] would be appreciated. Thank you for your other comments, but since this PR is not intended for merge, it does not need any review except to verify that I have not missed a way to make one of these hooks work. For that reason, I have resolved all existing conversations to ensure that I don't miss a comment from you or @RKSimon explaining how a hook could be made to function. Regarding […]
What is the issue here? This looks like it's working fine?
@@ -0,0 +1,1478 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=amdgcn -mcpu=tahiti -o - < %s | FileCheck %s --check-prefixes=SDAG6
Should use explicit -global-isel=0 for the dag run lines
I checked out this branch and it seems to work fine? I don't see what the issue is.
@llvm/pr-subscribers-backend-amdgpu

Author: Patrick Simmons (linuxrocks123)

Changes

This is to demonstrate that setOperationAction does not work to solve SWDEV-562122. See #165626

Patch is 85.67 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/167064.diff

5 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 8bb28084159e8..95910b68f9c7d 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -42,6 +42,7 @@
#include "llvm/IR/IntrinsicsR600.h"
#include "llvm/IR/MDBuilder.h"
#include "llvm/Support/CommandLine.h"
+#include "llvm/Support/Compiler.h"
#include "llvm/Support/KnownBits.h"
#include "llvm/Support/ModRef.h"
#include "llvm/Transforms/Utils/LowerAtomic.h"
@@ -177,6 +178,10 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
addRegisterClass(MVT::v32i16, &AMDGPU::SGPR_512RegClass);
addRegisterClass(MVT::v32f16, &AMDGPU::SGPR_512RegClass);
addRegisterClass(MVT::v32bf16, &AMDGPU::SGPR_512RegClass);
+
+ // We don't want the default expansion of 16-bit ABS since we can
+ // sign-extend and use the 32-bit ABS operation for 16-bit ABS with SGPRs
+ setOperationAction(ISD::ABS, {MVT::i8,MVT::i16}, Custom);
}
addRegisterClass(MVT::v32i32, &AMDGPU::VReg_1024RegClass);
@@ -974,7 +979,8 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
Custom);
}
- setTargetDAGCombine({ISD::ADD,
+ setTargetDAGCombine({ISD::ABS,
+ ISD::ADD,
ISD::PTRADD,
ISD::UADDO_CARRY,
ISD::SUB,
@@ -6774,6 +6780,9 @@ SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
case ISD::DEBUGTRAP:
return lowerDEBUGTRAP(Op, DAG);
case ISD::ABS:
+ if (Op.getValueType() == MVT::i16 || Op.getValueType() == MVT::i8)
+ return lowerABSi16(Op, DAG);
+ LLVM_FALLTHROUGH;
case ISD::FABS:
case ISD::FNEG:
case ISD::FCANONICALIZE:
@@ -7272,7 +7281,7 @@ static SDValue lowerLaneOp(const SITargetLowering &TLI, SDNode *N,
void SITargetLowering::ReplaceNodeResults(SDNode *N,
SmallVectorImpl<SDValue> &Results,
SelectionDAG &DAG) const {
- switch (N->getOpcode()) {
+ switch (N->getOpcode()) {
case ISD::INSERT_VECTOR_ELT: {
if (SDValue Res = lowerINSERT_VECTOR_ELT(SDValue(N, 0), DAG))
Results.push_back(Res);
@@ -7450,6 +7459,15 @@ void SITargetLowering::ReplaceNodeResults(SDNode *N,
Results.push_back(lowerFSQRTF16(SDValue(N, 0), DAG));
break;
}
+ case ISD::ABS:
+ if (N->getValueType(0) == MVT::i16 || N->getValueType(0) == MVT::i8) {
+ SDValue result = lowerABSi16(SDValue(N, 0), DAG);
+ if(result!=SDValue()) {
+ Results.push_back(result);
+ return;
+ }
+ }
+ LLVM_FALLTHROUGH;
default:
AMDGPUTargetLowering::ReplaceNodeResults(N, Results, DAG);
break;
@@ -8139,6 +8157,25 @@ SDValue SITargetLowering::lowerDEBUGTRAP(SDValue Op, SelectionDAG &DAG) const {
return DAG.getNode(AMDGPUISD::TRAP, SL, MVT::Other, Ops);
}
+// sign-extend and use the 32-bit ABS operation for 16-bit ABS with SGPRs
+SDValue SITargetLowering::lowerABSi16(SDValue Op, SelectionDAG &DAG) const {
+ assert(Op.getOpcode() == ISD::ABS &&
+ "Tried to select abs with non-abs opcode.");
+ assert((Op.getValueType() == MVT::i16 || Op.getValueType() == MVT::i8) &&
+ "Tried to select abs i16 lowering with non-i16 type.");
+
+ // divergent means will not end up using SGPRs
+ if (Op->isDivergent())
+ return SDValue();
+
+ //(abs i16 (i16 op1)) -> (trunc i16 (abs i32 (sext i32 (i16 op1))))
+ SDValue Src = Op.getOperand(0);
+ SDLoc DL(Src);
+ SDValue SExtSrc = DAG.getSExtOrTrunc(Src, DL, MVT::i32);
+ SDValue ExtAbs = DAG.getNode(ISD::ABS, DL, MVT::i32, SExtSrc);
+ return DAG.getNode(ISD::TRUNCATE, DL, Op.getValueType(), ExtAbs);
+}
+
SDValue SITargetLowering::getSegmentAperture(unsigned AS, const SDLoc &DL,
SelectionDAG &DAG) const {
if (Subtarget->hasApertureRegs()) {
@@ -16855,6 +16892,10 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
return SDValue();
switch (N->getOpcode()) {
+ case ISD::ABS:
+ if (N->getValueType(0) == MVT::i16 || N->getValueType(0) == MVT::i8)
+ return lowerABSi16(SDValue(N,0), DCI.DAG);
+ break;
case ISD::ADD:
return performAddCombine(N, DCI);
case ISD::PTRADD:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index 74e58f4272e10..25e94851c24df 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -184,6 +184,7 @@ class SITargetLowering final : public AMDGPUTargetLowering {
SDValue lowerTrapHsaQueuePtr(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerTrapHsa(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerDEBUGTRAP(SDValue Op, SelectionDAG &DAG) const;
+ SDValue lowerABSi16(SDValue Op, SelectionDAG &DAG) const;
SDNode *adjustWritemask(MachineSDNode *&N, SelectionDAG &DAG) const;
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.abs.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.abs.ll
deleted file mode 100644
index 6facdfdec64ae..0000000000000
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.abs.ll
+++ /dev/null
@@ -1,600 +0,0 @@
-; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=tahiti -o - < %s | FileCheck %s --check-prefixes=GFX,GFX6
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=fiji -o - < %s | FileCheck %s --check-prefixes=GFX,GFX8
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 -o - < %s | FileCheck %s --check-prefixes=GFX,GFX10
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1250 -o - < %s | FileCheck %s --check-prefixes=GFX,GFX1250
-
-declare i16 @llvm.abs.i16(i16, i1)
-declare i32 @llvm.abs.i32(i32, i1)
-declare i64 @llvm.abs.i64(i64, i1)
-declare <2 x i8> @llvm.abs.v2i8(<2 x i8>, i1)
-declare <3 x i8> @llvm.abs.v3i8(<3 x i8>, i1)
-declare <2 x i16> @llvm.abs.v2i16(<2 x i16>, i1)
-declare <3 x i16> @llvm.abs.v3i16(<3 x i16>, i1)
-declare <4 x i32> @llvm.abs.v4i32(<4 x i32>, i1)
-
-define amdgpu_cs i16 @abs_sgpr_i16(i16 inreg %arg) {
-; GFX6-LABEL: abs_sgpr_i16:
-; GFX6: ; %bb.0:
-; GFX6-NEXT: s_sext_i32_i16 s0, s0
-; GFX6-NEXT: s_abs_i32 s0, s0
-; GFX6-NEXT: ; return to shader part epilog
-;
-; GFX8-LABEL: abs_sgpr_i16:
-; GFX8: ; %bb.0:
-; GFX8-NEXT: s_sext_i32_i16 s0, s0
-; GFX8-NEXT: s_abs_i32 s0, s0
-; GFX8-NEXT: ; return to shader part epilog
-;
-; GFX10-LABEL: abs_sgpr_i16:
-; GFX10: ; %bb.0:
-; GFX10-NEXT: s_sext_i32_i16 s0, s0
-; GFX10-NEXT: s_abs_i32 s0, s0
-; GFX10-NEXT: ; return to shader part epilog
-;
-; GFX1250-LABEL: abs_sgpr_i16:
-; GFX1250: ; %bb.0:
-; GFX1250-NEXT: s_sext_i32_i16 s0, s0
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
-; GFX1250-NEXT: s_abs_i32 s0, s0
-; GFX1250-NEXT: ; return to shader part epilog
- %res = call i16 @llvm.abs.i16(i16 %arg, i1 false)
- ret i16 %res
-}
-
-define amdgpu_cs i32 @abs_sgpr_i32(i32 inreg %arg) {
-; GFX-LABEL: abs_sgpr_i32:
-; GFX: ; %bb.0:
-; GFX-NEXT: s_abs_i32 s0, s0
-; GFX-NEXT: ; return to shader part epilog
- %res = call i32 @llvm.abs.i32(i32 %arg, i1 false)
- ret i32 %res
-}
-
-define amdgpu_cs i64 @abs_sgpr_i64(i64 inreg %arg) {
-; GFX6-LABEL: abs_sgpr_i64:
-; GFX6: ; %bb.0:
-; GFX6-NEXT: s_ashr_i32 s2, s1, 31
-; GFX6-NEXT: s_add_u32 s0, s0, s2
-; GFX6-NEXT: s_mov_b32 s3, s2
-; GFX6-NEXT: s_addc_u32 s1, s1, s2
-; GFX6-NEXT: s_xor_b64 s[0:1], s[0:1], s[2:3]
-; GFX6-NEXT: ; return to shader part epilog
-;
-; GFX8-LABEL: abs_sgpr_i64:
-; GFX8: ; %bb.0:
-; GFX8-NEXT: s_ashr_i32 s2, s1, 31
-; GFX8-NEXT: s_add_u32 s0, s0, s2
-; GFX8-NEXT: s_mov_b32 s3, s2
-; GFX8-NEXT: s_addc_u32 s1, s1, s2
-; GFX8-NEXT: s_xor_b64 s[0:1], s[0:1], s[2:3]
-; GFX8-NEXT: ; return to shader part epilog
-;
-; GFX10-LABEL: abs_sgpr_i64:
-; GFX10: ; %bb.0:
-; GFX10-NEXT: s_ashr_i32 s2, s1, 31
-; GFX10-NEXT: s_add_u32 s0, s0, s2
-; GFX10-NEXT: s_mov_b32 s3, s2
-; GFX10-NEXT: s_addc_u32 s1, s1, s2
-; GFX10-NEXT: s_xor_b64 s[0:1], s[0:1], s[2:3]
-; GFX10-NEXT: ; return to shader part epilog
-;
-; GFX1250-LABEL: abs_sgpr_i64:
-; GFX1250: ; %bb.0:
-; GFX1250-NEXT: s_ashr_i32 s2, s1, 31
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX1250-NEXT: s_mov_b32 s3, s2
-; GFX1250-NEXT: s_add_nc_u64 s[0:1], s[0:1], s[2:3]
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
-; GFX1250-NEXT: s_xor_b64 s[0:1], s[0:1], s[2:3]
-; GFX1250-NEXT: ; return to shader part epilog
- %res = call i64 @llvm.abs.i64(i64 %arg, i1 false)
- ret i64 %res
-}
-
-define amdgpu_cs <4 x i32> @abs_sgpr_v4i32(<4 x i32> inreg %arg) {
-; GFX-LABEL: abs_sgpr_v4i32:
-; GFX: ; %bb.0:
-; GFX-NEXT: s_abs_i32 s0, s0
-; GFX-NEXT: s_abs_i32 s1, s1
-; GFX-NEXT: s_abs_i32 s2, s2
-; GFX-NEXT: s_abs_i32 s3, s3
-; GFX-NEXT: ; return to shader part epilog
- %res = call <4 x i32> @llvm.abs.v4i32(<4 x i32> %arg, i1 false)
- ret <4 x i32> %res
-}
-
-define i16 @abs_vgpr_i16(i16 %arg) {
-; GFX6-LABEL: abs_vgpr_i16:
-; GFX6: ; %bb.0:
-; GFX6-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT: v_bfe_i32 v0, v0, 0, 16
-; GFX6-NEXT: v_sub_i32_e32 v1, vcc, 0, v0
-; GFX6-NEXT: v_max_i32_e32 v0, v0, v1
-; GFX6-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX8-LABEL: abs_vgpr_i16:
-; GFX8: ; %bb.0:
-; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT: v_sub_u16_e32 v1, 0, v0
-; GFX8-NEXT: v_max_i16_e32 v0, v0, v1
-; GFX8-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX10-LABEL: abs_vgpr_i16:
-; GFX10: ; %bb.0:
-; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT: v_sub_nc_u16 v1, 0, v0
-; GFX10-NEXT: v_max_i16 v0, v0, v1
-; GFX10-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX1250-LABEL: abs_vgpr_i16:
-; GFX1250: ; %bb.0:
-; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: v_sub_nc_u16 v1, 0, v0
-; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX1250-NEXT: v_max_i16 v0, v0, v1
-; GFX1250-NEXT: s_set_pc_i64 s[30:31]
- %res = call i16 @llvm.abs.i16(i16 %arg, i1 false)
- ret i16 %res
-}
-
-define i32 @abs_vgpr_i32(i32 %arg) {
-; GFX6-LABEL: abs_vgpr_i32:
-; GFX6: ; %bb.0:
-; GFX6-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT: v_sub_i32_e32 v1, vcc, 0, v0
-; GFX6-NEXT: v_max_i32_e32 v0, v0, v1
-; GFX6-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX8-LABEL: abs_vgpr_i32:
-; GFX8: ; %bb.0:
-; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT: v_sub_u32_e32 v1, vcc, 0, v0
-; GFX8-NEXT: v_max_i32_e32 v0, v0, v1
-; GFX8-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX10-LABEL: abs_vgpr_i32:
-; GFX10: ; %bb.0:
-; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT: v_sub_nc_u32_e32 v1, 0, v0
-; GFX10-NEXT: v_max_i32_e32 v0, v0, v1
-; GFX10-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX1250-LABEL: abs_vgpr_i32:
-; GFX1250: ; %bb.0:
-; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: v_sub_nc_u32_e32 v1, 0, v0
-; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX1250-NEXT: v_max_i32_e32 v0, v0, v1
-; GFX1250-NEXT: s_set_pc_i64 s[30:31]
- %res = call i32 @llvm.abs.i32(i32 %arg, i1 false)
- ret i32 %res
-}
-
-define i64 @abs_vgpr_i64(i64 %arg) {
-; GFX6-LABEL: abs_vgpr_i64:
-; GFX6: ; %bb.0:
-; GFX6-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT: v_ashrrev_i32_e32 v2, 31, v1
-; GFX6-NEXT: v_add_i32_e32 v0, vcc, v0, v2
-; GFX6-NEXT: v_addc_u32_e32 v1, vcc, v1, v2, vcc
-; GFX6-NEXT: v_xor_b32_e32 v0, v0, v2
-; GFX6-NEXT: v_xor_b32_e32 v1, v1, v2
-; GFX6-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX8-LABEL: abs_vgpr_i64:
-; GFX8: ; %bb.0:
-; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT: v_ashrrev_i32_e32 v2, 31, v1
-; GFX8-NEXT: v_add_u32_e32 v0, vcc, v0, v2
-; GFX8-NEXT: v_addc_u32_e32 v1, vcc, v1, v2, vcc
-; GFX8-NEXT: v_xor_b32_e32 v0, v0, v2
-; GFX8-NEXT: v_xor_b32_e32 v1, v1, v2
-; GFX8-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX10-LABEL: abs_vgpr_i64:
-; GFX10: ; %bb.0:
-; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT: v_ashrrev_i32_e32 v2, 31, v1
-; GFX10-NEXT: v_add_co_u32 v0, vcc_lo, v0, v2
-; GFX10-NEXT: v_add_co_ci_u32_e32 v1, vcc_lo, v1, v2, vcc_lo
-; GFX10-NEXT: v_xor_b32_e32 v0, v0, v2
-; GFX10-NEXT: v_xor_b32_e32 v1, v1, v2
-; GFX10-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX1250-LABEL: abs_vgpr_i64:
-; GFX1250: ; %bb.0:
-; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: v_ashrrev_i32_e32 v2, 31, v1
-; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_mov_b32_e32 v3, v2
-; GFX1250-NEXT: v_add_nc_u64_e32 v[0:1], v[0:1], v[2:3]
-; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX1250-NEXT: v_xor_b32_e32 v0, v0, v2
-; GFX1250-NEXT: v_xor_b32_e32 v1, v1, v2
-; GFX1250-NEXT: s_set_pc_i64 s[30:31]
- %res = call i64 @llvm.abs.i64(i64 %arg, i1 false)
- ret i64 %res
-}
-
-define <4 x i32> @abs_vgpr_v4i32(<4 x i32> %arg) {
-; GFX6-LABEL: abs_vgpr_v4i32:
-; GFX6: ; %bb.0:
-; GFX6-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT: v_sub_i32_e32 v4, vcc, 0, v0
-; GFX6-NEXT: v_max_i32_e32 v0, v0, v4
-; GFX6-NEXT: v_sub_i32_e32 v4, vcc, 0, v1
-; GFX6-NEXT: v_max_i32_e32 v1, v1, v4
-; GFX6-NEXT: v_sub_i32_e32 v4, vcc, 0, v2
-; GFX6-NEXT: v_max_i32_e32 v2, v2, v4
-; GFX6-NEXT: v_sub_i32_e32 v4, vcc, 0, v3
-; GFX6-NEXT: v_max_i32_e32 v3, v3, v4
-; GFX6-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX8-LABEL: abs_vgpr_v4i32:
-; GFX8: ; %bb.0:
-; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT: v_sub_u32_e32 v4, vcc, 0, v0
-; GFX8-NEXT: v_max_i32_e32 v0, v0, v4
-; GFX8-NEXT: v_sub_u32_e32 v4, vcc, 0, v1
-; GFX8-NEXT: v_max_i32_e32 v1, v1, v4
-; GFX8-NEXT: v_sub_u32_e32 v4, vcc, 0, v2
-; GFX8-NEXT: v_max_i32_e32 v2, v2, v4
-; GFX8-NEXT: v_sub_u32_e32 v4, vcc, 0, v3
-; GFX8-NEXT: v_max_i32_e32 v3, v3, v4
-; GFX8-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX10-LABEL: abs_vgpr_v4i32:
-; GFX10: ; %bb.0:
-; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT: v_sub_nc_u32_e32 v4, 0, v0
-; GFX10-NEXT: v_sub_nc_u32_e32 v5, 0, v1
-; GFX10-NEXT: v_sub_nc_u32_e32 v6, 0, v2
-; GFX10-NEXT: v_sub_nc_u32_e32 v7, 0, v3
-; GFX10-NEXT: v_max_i32_e32 v0, v0, v4
-; GFX10-NEXT: v_max_i32_e32 v1, v1, v5
-; GFX10-NEXT: v_max_i32_e32 v2, v2, v6
-; GFX10-NEXT: v_max_i32_e32 v3, v3, v7
-; GFX10-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX1250-LABEL: abs_vgpr_v4i32:
-; GFX1250: ; %bb.0:
-; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: v_dual_sub_nc_u32 v4, 0, v0 :: v_dual_sub_nc_u32 v5, 0, v1
-; GFX1250-NEXT: v_dual_sub_nc_u32 v6, 0, v2 :: v_dual_sub_nc_u32 v7, 0, v3
-; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX1250-NEXT: v_max_i32_e32 v0, v0, v4
-; GFX1250-NEXT: v_max_i32_e32 v1, v1, v5
-; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX1250-NEXT: v_max_i32_e32 v2, v2, v6
-; GFX1250-NEXT: v_max_i32_e32 v3, v3, v7
-; GFX1250-NEXT: s_set_pc_i64 s[30:31]
- %res = call <4 x i32> @llvm.abs.v4i32(<4 x i32> %arg, i1 false)
- ret <4 x i32> %res
-}
-
-define amdgpu_cs <2 x i8> @abs_sgpr_v2i8(<2 x i8> inreg %arg) {
-; GFX-LABEL: abs_sgpr_v2i8:
-; GFX: ; %bb.0:
-; GFX-NEXT: s_sext_i32_i8 s0, s0
-; GFX-NEXT: s_sext_i32_i8 s1, s1
-; GFX-NEXT: s_abs_i32 s0, s0
-; GFX-NEXT: s_abs_i32 s1, s1
-; GFX-NEXT: ; return to shader part epilog
- %res = call <2 x i8> @llvm.abs.v2i8(<2 x i8> %arg, i1 false)
- ret <2 x i8> %res
-}
-
-define <2 x i8> @abs_vgpr_v2i8(<2 x i8> %arg) {
-; GFX6-LABEL: abs_vgpr_v2i8:
-; GFX6: ; %bb.0:
-; GFX6-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT: v_bfe_i32 v0, v0, 0, 8
-; GFX6-NEXT: v_sub_i32_e32 v2, vcc, 0, v0
-; GFX6-NEXT: v_bfe_i32 v1, v1, 0, 8
-; GFX6-NEXT: v_max_i32_e32 v0, v0, v2
-; GFX6-NEXT: v_sub_i32_e32 v2, vcc, 0, v1
-; GFX6-NEXT: v_max_i32_e32 v1, v1, v2
-; GFX6-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX8-LABEL: abs_vgpr_v2i8:
-; GFX8: ; %bb.0:
-; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT: v_mov_b32_e32 v2, 0
-; GFX8-NEXT: v_sub_u16_sdwa v3, v2, sext(v0) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX8-NEXT: v_sub_u16_sdwa v2, v2, sext(v1) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX8-NEXT: v_max_i16_sdwa v0, sext(v0), v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX8-NEXT: v_max_i16_sdwa v1, sext(v1), v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX8-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX10-LABEL: abs_vgpr_v2i8:
-; GFX10: ; %bb.0:
-; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT: v_bfe_i32 v0, v0, 0, 8
-; GFX10-NEXT: v_bfe_i32 v1, v1, 0, 8
-; GFX10-NEXT: v_sub_nc_u16 v2, 0, v0
-; GFX10-NEXT: v_sub_nc_u16 v3, 0, v1
-; GFX10-NEXT: v_max_i16 v0, v0, v2
-; GFX10-NEXT: v_max_i16 v1, v1, v3
-; GFX10-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX1250-LABEL: abs_vgpr_v2i8:
-; GFX1250: ; %bb.0:
-; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: v_bfe_i32 v0, v0, 0, 8
-; GFX1250-NEXT: v_bfe_i32 v1, v1, 0, 8
-; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX1250-NEXT: v_sub_nc_u16 v2, 0, v0
-; GFX1250-NEXT: v_sub_nc_u16 v3, 0, v1
-; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX1250-NEXT: v_max_i16 v0, v0, v2
-; GFX1250-NEXT: v_max_i16 v1, v1, v3
-; GFX1250-NEXT: s_set_pc_i64 s[30:31]
- %res = call <2 x i8> @llvm.abs.v2i8(<2 x i8> %arg, i1 false)
- ret <2 x i8> %res
-}
-
-define amdgpu_cs <3 x i8> @abs_sgpr_v3i8(<3 x i8> inreg %arg) {
-; GFX-LABEL: abs_sgpr_v3i8:
-; GFX: ; %bb.0:
-; GFX-NEXT: s_sext_i32_i8 s0, s0
-; GFX-NEXT: s_sext_i32_i8 s1, s1
-; GFX-NEXT: s_sext_i32_i8 s2, s2
-; GFX-NEXT: s_abs_i32 s0, s0
-; GFX-NEXT: s_abs_i32 s1, s1
-; GFX-NEXT: s_abs_i32 s2, s2
-; GFX-NEXT: ; return to shader part epilog
- %res = call <3 x i8> @llvm.abs.v3i8(<3 x i8> %arg, i1 false)
- ret <3 x i8> %res
-}
-
-define <3 x i8> @abs_vgpr_v3i8(<3 x i8> %arg) {
-; GFX6-LABEL: abs_vgpr_v3i8:
-; GFX6: ; %bb.0:
-; GFX6-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT: v_bfe_i32 v0, v0, 0, 8
-; GFX6-NEXT: v_sub_i32_e32 v3, vcc, 0, v0
-; GFX6-NEXT: v_bfe_i32 v1, v1, 0, 8
-; GFX6-NEXT: v_max_i32_e32 v0, v0, v3
-; GFX6-NEXT: v_sub_i32_e32 v3, vcc, 0, v1
-; GFX6-NEXT: v_bfe_i32 v2, v2, 0, 8
-; GFX6-NEXT: v_max_i32_e32 v1, v1, v3
-; GFX6-NEXT: v_sub_i32_e32 v3, vcc, 0, v2
-; GFX6-NEXT: v_max_i32_e32 v2, v2, v3
-; GFX6-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX8-LABEL: abs_vgpr_v3i8:
-; GFX8: ; %bb.0:
-; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT: v_mov_b32_e32 v3, 0
-; GFX8-NEXT: v_sub_u16_sdwa v4, v3, sext(v0) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX8-NEXT: v_max_i16_sdwa v0, sext(v0), v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX8-NEXT: v_sub_u16_sdwa v4, v3, sext(v1) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX8-NEXT: v_sub_u16_sdwa v3, v3, sext(v2) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
-; GFX8-NEXT: v_max_i16_sdwa v1, sext(v1), v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX8-NEXT: v_max_i16_sdwa v2, sext(v2), v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
-; GFX8-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX10-LABEL: abs_vgpr_v3i8:
-; GFX10: ; %bb.0:
-; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT: v_bfe_i32 v0, v0, 0, 8
-; GFX10-NEXT: v_bfe_i32 v1, v1, 0, 8
-; GF...
[truncated]
@arsenm here are the outputs of fiji.bad.txt
Force-pushed from 97c9ddd to 043a068.
This PR is missing any handling of the neg (abs) case. I'd expect that to be a separate combine rooted on the neg, not something that changes the abs lowering |
@arsenm ideally the negative case should not need separate code, since […]
You don't need to make the full transform, just do the reassociate to push the neg through the abs in a combine without touching any lowering. The risk is if something is actively undoing that, in which case you may need one of the TargetLowering hooks to avoid infinite looping the combiner |
You can't push a neg through an abs. |
@arsenm @jayfoad this latest commit is my best effort to make […] work. If we strongly want to do this without overriding […]
  if (Op.getValueType() == MVT::i8)
    if (isNullConstant(Op.getOperand(0)) &&
        Op.getOperand(1).getOpcode() == ISD::ABS)
      return DAG.getNode(
          ISD::TRUNCATE, SDLoc(Op), MVT::i8,
          DAG.getNode(ISD::SUB, SDLoc(Op), MVT::i32,
                      DAG.getConstant(0, SDLoc(Op), MVT::i32),
                      lowerABSi16(Op.getOperand(1), DAG).getOperand(0)));
    else
This looks essentially correct, but it's in the wrong place. This is not mandatory lowering, so this should go in PerformDAGCombine. Doesn't this also need the divergence check?
It's also in performSubCombine. No worky there, either. And yes it's missing divergence check, but adding inhibitions to the optimization wouldn't make it worky.
  //(abs i16 (i16 op1)) -> (trunc i16 (abs i32 (sext i32 (i16 op1))))
  SDValue Src = Op.getOperand(0);
  SDLoc DL(Src);
  SDValue SExtSrc = DAG.getSExtOrTrunc(Src, DL, MVT::i32);
You can slightly generalize this to work on vectors
Does our ISA have a vector equivalent of s_sext_i16 / s_sext_i8?
There's v_bfe_i32
It doesn't generally apply, though, due to the conditional on the divergence, so I'm not sure touching the generic code helps at all.
@arsenm resolved comments unrelated to the "this approach doesn't work for […]" issue.