[X86] Use an FP-based expansion for v4i32 ctlz on SSE2-only targets #167034
@llvm/pr-subscribers-backend-x86
Author: None (NishiB137)
Changes
Fixes #161746
This pull request implements a new optimization for the `ISD::CTLZ` operation, as suggested in the issue.
Benchmark Results (`llvm-mca` on `core2`)

| Implementation | Instructions <br/> (Per Iteration) | Total Cycles <br/> (Per Iteration) | IPC <br/> (Average) | Block RThroughput <br/> (Average) |
|---|---|---|---|---|
| Old SSE2 (Integer Fallback) | 39 <br/> (3900 / 100) | 45.01 <br/> (4501 / 100) | 0.87 | 7.3 |
| New SSE2 (FP Fallback) | 18 <br/> (1800 / 100) | 21.04 <br/> (2104 / 100) | 0.86 | 4.0 |

The llvm-mca result files:
ctlz-v4i32-fp-1-after-mca.txt
ctlz-v4i32-fp-1-before-mca.txt
ctlz-v4i32-fp-2-after-mca.txt
ctlz-v4i32-fp-2-before-mca.txt
Full diff: https://github.com/llvm/llvm-project/pull/167034.diff
6 Files Affected:
- (modified) llvm/include/llvm/CodeGen/TargetLowering.h (+5)
- (modified) llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp (+47)
- (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+13)
- (added) llvm/test/CodeGen/X86/ctlz-v4i32-fp-1.ll (+26)
- (added) llvm/test/CodeGen/X86/ctlz-v4i32-fp-2.ll (+28)
- (added) llvm/test/CodeGen/X86/ctlz-v4i32-fp-3.ll (+23)
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 98565f423df3e..bb58f48cfdb5c 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -5543,6 +5543,11 @@ class LLVM_ABI TargetLowering : public TargetLoweringBase {
/// \returns The expansion result or SDValue() if it fails.
SDValue expandVPCTLZ(SDNode *N, SelectionDAG &DAG) const;
+ /// Expands a CTLZ node into a sequence of floating point operations.
+ /// \param N Node to expand
+ /// \returns The expansion result or SDValue() if it fails.
+ SDValue expandCTLZWithFP(SDNode *N, SelectionDAG &DAG) const;
+
/// Expand CTTZ via Table Lookup.
/// \param N Node to expand
/// \returns The expansion result or SDValue() if it fails.
diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index b51d6649af2ec..d6ab6b2fe77e2 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -9480,6 +9480,53 @@ SDValue TargetLowering::expandVPCTLZ(SDNode *Node, SelectionDAG &DAG) const {
return DAG.getNode(ISD::VP_CTPOP, dl, VT, Op, Mask, VL);
}
+
+SDValue TargetLowering::expandCTLZWithFP(SDNode *Node, SelectionDAG &DAG) const {
+ SDLoc dl(Node);
+ SDValue Op = Node->getOperand(0);
+ EVT VT = Op.getValueType();
+
+ assert(VT.isVector() && "This expansion is intended for vectors");
+
+ EVT EltVT = VT.getVectorElementType();
+ EVT FloatVT, CmpVT;
+ unsigned BitWidth, MantissaBits, ExponentBias;
+
+ // Converting to float type
+ if (EltVT == MVT::i32) {
+ FloatVT = VT.changeVectorElementType(MVT::f32);
+ BitWidth = 32;
+ MantissaBits = 23;
+ ExponentBias = 127;
+ }
+ else {
+ return SDValue();
+ }
+
+ // Handling the case for when Op == 0 which is stored in ZeroRes
+ CmpVT = getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT);
+ SDValue Zero = DAG.getConstant(0, dl, VT);
+ SDValue IsZero = DAG.getSetCC(dl, CmpVT, Op, Zero, ISD::SETEQ);
+ SDValue ZeroRes = DAG.getConstant(BitWidth, dl, VT);
+
+ // Handling the case for Non-zero inputs using the algorithm mentioned below
+ SDValue Float = DAG.getNode(ISD::UINT_TO_FP, dl, FloatVT, Op);
+ SDValue FloatBits = DAG.getNode(ISD::BITCAST, dl, VT, Float);
+ SDValue Exp = DAG.getNode(ISD::SRL, dl, VT, FloatBits, DAG.getConstant(MantissaBits, dl, VT));
+ SDValue MSBIndex = DAG.getNode(ISD::SUB, dl, VT, Exp, DAG.getConstant(ExponentBias, dl, VT));
+ SDValue NonZeroRes = DAG.getNode(ISD::SUB, dl, VT, DAG.getConstant(BitWidth - 1, dl, VT), MSBIndex);
+
+ //Returns the respective DAG Node based on the input being zero or non-zero
+ return DAG.getNode(ISD::VSELECT, dl, VT, IsZero, ZeroRes, NonZeroRes);
+
+ // pseudocode :
+ // if(x==0) return 32;
+ // float f = (float) x;
+ // int i = bitcast<int>(f);
+ // int ilog2 = (i >> 23) - 127;
+ // return 31 - ilog2;
+}
+
SDValue TargetLowering::CTTZTableLookup(SDNode *Node, SelectionDAG &DAG,
const SDLoc &DL, EVT VT, SDValue Op,
unsigned BitWidth) const {
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 05a854a0bf3fa..bdea6c4734908 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -1348,6 +1348,12 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
setOperationAction(ISD::SUB, MVT::i32, Custom);
}
+ if (!Subtarget.useSoftFloat() && Subtarget.hasSSE2() &&
+ !Subtarget.hasSSSE3()) {
+ setOperationAction(ISD::CTLZ, MVT::v4i32, Custom);
+ setOperationAction(ISD::CTLZ_ZERO_UNDEF, MVT::v4i32, Custom);
+ }
+
if (!Subtarget.useSoftFloat() && Subtarget.hasSSE41()) {
for (MVT RoundedTy : {MVT::f32, MVT::f64, MVT::v4f32, MVT::v2f64}) {
setOperationAction(ISD::FFLOOR, RoundedTy, Legal);
@@ -29039,6 +29045,13 @@ static SDValue LowerVectorCTLZ(SDValue Op, const SDLoc &DL,
if (VT.is512BitVector() && !Subtarget.hasBWI())
return splitVectorIntUnary(Op, DAG, DL);
+ if (VT == MVT::v4i32 && Subtarget.hasSSE2() && !Subtarget.hasSSSE3()) {
+ const TargetLowering &TLI = DAG.getTargetLoweringInfo();
+ SDValue New = TLI.expandCTLZWithFP(Op.getNode(), DAG);
+ if (New.getNode())
+ return New;
+ }
+
assert(Subtarget.hasSSSE3() && "Expected SSSE3 support for PSHUFB");
return LowerVectorCTLZInRegLUT(Op, DL, Subtarget, DAG);
}
diff --git a/llvm/test/CodeGen/X86/ctlz-v4i32-fp-1.ll b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-1.ll
new file mode 100644
index 0000000000000..20467b3799875
--- /dev/null
+++ b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-1.ll
@@ -0,0 +1,26 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2,-ssse3 -o - | FileCheck %s
+
+define <4 x i32> @test_v4i32_sse2(<4 x i32> %a) #0 {
+; CHECK-LABEL: test_v4i32_sse2:
+; CHECK: # %bb.0:
+
+; Zero test (strict CTLZ needs select)
+; CHECK-DAG: pcmpeqd %xmm{{[0-9]+}}, %xmm{{[0-9]+}}
+
+; Exponent extraction + bias arithmetic (order-free)
+; CHECK-DAG: psrld {{\$}}23, %xmm{{[0-9]+}}
+; CHECK-DAG: psubd %xmm{{[0-9]+}}, %xmm{{[0-9]+}}
+
+; Select/merge (could be por/pandn etc.)
+; CHECK: por %xmm{{[0-9]+}}, %xmm{{[0-9]+}}
+
+; Must NOT use SSSE3 LUT path
+; CHECK-NOT: pshufb
+
+; CHECK: retq
+ %res = call <4 x i32> @llvm.ctlz.v4i32(<4 x i32> %a, i1 false)
+ ret <4 x i32> %res
+}
+
+declare <4 x i32> @llvm.ctlz.v4i32(<4 x i32>, i1)
+attributes #0 = { "optnone" }
diff --git a/llvm/test/CodeGen/X86/ctlz-v4i32-fp-2.ll b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-2.ll
new file mode 100644
index 0000000000000..6949fe4110e58
--- /dev/null
+++ b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-2.ll
@@ -0,0 +1,28 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2,-ssse3 -o - | FileCheck %s
+
+define <4 x i32> @test_v4i32_sse2_zero_undef(<4 x i32> %a) #0 {
+; CHECK-LABEL: test_v4i32_sse2_zero_undef:
+
+; zero check
+; CHECK-DAG: pcmpeqd
+
+; FP-based mantissa/exponent steps (order may vary)
+; CHECK-DAG: psrld $16
+; CHECK-DAG: subps
+; CHECK-DAG: psrld $23
+; CHECK-DAG: psubd
+
+; merge/select
+; CHECK: pandn
+; CHECK: por
+
+; CHECK-NOT: pshufb
+
+; CHECK: retq
+
+ %res = call <4 x i32> @llvm.ctlz.v4i32(<4 x i32> %a, i1 true)
+ ret <4 x i32> %res
+}
+
+declare <4 x i32> @llvm.ctlz.v4i32(<4 x i32>, i1)
+attributes #0 = { "optnone" }
diff --git a/llvm/test/CodeGen/X86/ctlz-v4i32-fp-3.ll b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-3.ll
new file mode 100644
index 0000000000000..8d10c17223a21
--- /dev/null
+++ b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-3.ll
@@ -0,0 +1,23 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+ssse3 -o - | FileCheck %s
+
+; This verifies that **with SSSE3 enabled**, we use the LUT-based `pshufb`
+; implementation and *not* the floating-point exponent trick.
+
+define <4 x i32> @test_v4i32_ssse3(<4 x i32> %a) {
+; CHECK-LABEL: test_v4i32_ssse3:
+; CHECK: # %bb.0:
+
+; Must use SSSE3 table LUT:
+; CHECK: pshufb
+
+; Must NOT use FP exponent trick:
+; CHECK-NOT: cvtdq2ps
+; CHECK-NOT: psrld $23
+; CHECK-NOT: psubd
+
+; CHECK: retq
+ %res = call <4 x i32> @llvm.ctlz.v4i32(<4 x i32> %a, i1 false)
+ ret <4 x i32> %res
+}
+
+declare <4 x i32> @llvm.ctlz.v4i32(<4 x i32>, i1)
| BitWidth = 32; | ||
| MantissaBits = 23; | ||
| ExponentBias = 127; |
Should get these from the fltSemantics instead of hardcoding it
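For illustration, here is a standalone sketch of deriving these two constants instead of hardcoding them. It uses `std::numeric_limits` as a stand-in and is not LLVM's `fltSemantics`/`APFloat` API; the helper names `mantissaBits` and `exponentBias` are hypothetical, and the sketch assumes an IEEE 754 binary format.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helpers (not part of LLVM): derive the layout constants of an
// IEEE 754 binary format from the type's properties.

// Number of stored fraction bits: the precision minus the implicit leading 1.
template <typename FloatT> constexpr unsigned mantissaBits() {
  return std::numeric_limits<FloatT>::digits - 1;
}

// Exponent bias: for IEEE binary formats this is max_exponent - 1.
template <typename FloatT> constexpr unsigned exponentBias() {
  return std::numeric_limits<FloatT>::max_exponent - 1;
}

// The literals from the submitted patch fall out of the float properties.
static_assert(mantissaBits<float>() == 23, "f32 stores 23 fraction bits");
static_assert(exponentBias<float>() == 127, "f32 exponent bias is 127");
```

The same two helpers give 52 and 1023 for `double`, so an expansion written against them would not need a second set of magic numbers for a wider float type.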
| // pseudocode : | ||
| // if(x==0) return 32; | ||
| // float f = (float) x; | ||
| // int i = bitcast<int>(f); | ||
| // int ilog2 = (i >> 23) - 127; | ||
| // return 31 - ilog2; |
Move to the start, not the end
| SDValue Op = Node->getOperand(0); | ||
| EVT VT = Op.getValueType(); | ||
|
|
||
| assert(VT.isVector() && "This expansion is intended for vectors"); |
Can just write to not have this limitation
e712022 to
078cc7a
Compare
Thank you for the review!
Regarding the limitation, can you please explain a little more about what exactly we need to do? Do we need to use this function for scalars as well? Thanks!
In the issue description i32 was converted to f64, are you sure it works in all cases with f32?
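A scalar sketch of why this question matters, assuming the default round-to-nearest-even conversion. The function names are made up for the example; `clz32_via_f32` follows the pseudocode from the patch, and `clz32_ref` is a plain loop used only as a reference.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reference: count leading zeros with a plain loop (returns 32 for x == 0).
inline unsigned clz32_ref(uint32_t x) {
  unsigned n = 0;
  for (uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
    ++n;
  return n;
}

// The f32-based variant from the patch's pseudocode.
inline unsigned clz32_via_f32(uint32_t x) {
  if (x == 0)
    return 32;
  // f32 has a 24-bit significand, so inputs with more than 24 significant
  // bits are rounded, possibly up to the next power of two.
  float f = static_cast<float>(x);
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits); // bitcast<int>(f)
  return 31u - ((bits >> 23) - 127u);
}

// 0x01FFFFFF (2^25 - 1) rounds up to 2^25 when converted to f32, so the
// exponent is one too large and the f32 variant reports 6 instead of 7.
inline bool f32VariantIsOffByOne() {
  return clz32_ref(0x01FFFFFFu) == 7 && clz32_via_f32(0x01FFFFFFu) == 6;
}
```

Inputs that fit in 24 bits convert exactly, which is why the mismatch only shows up for larger values.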
You can test this locally with the following command:
git-clang-format --diff origin/main HEAD --extensions cpp,h -- llvm/include/llvm/CodeGen/TargetLowering.h llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp llvm/lib/Target/X86/X86ISelLowering.cpp --diff_from_common_commit
View the diff from clang-format here.
diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index b1aa884f5..97f026bcf 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -9481,8 +9481,8 @@ SDValue TargetLowering::expandVPCTLZ(SDNode *Node, SelectionDAG &DAG) const {
return DAG.getNode(ISD::VP_CTPOP, dl, VT, Op, Mask, VL);
}
-
-SDValue TargetLowering::expandCTLZWithFP(SDNode *Node, SelectionDAG &DAG) const {
+SDValue TargetLowering::expandCTLZWithFP(SDNode *Node,
+ SelectionDAG &DAG) const {
// pseudocode :
// if(x==0) return 32;
// float f = (float) x;
@@ -9508,8 +9508,7 @@ SDValue TargetLowering::expandCTLZWithFP(SDNode *Node, SelectionDAG &DAG) const
MantissaBits = APFloat::semanticsPrecision(Sem) - 1;
ExponentBias =
static_cast<unsigned>(-APFloat::semanticsMinExponent(Sem) + 1);
- }
- else {
+ } else {
return SDValue();
}
@@ -9522,11 +9521,14 @@ SDValue TargetLowering::expandCTLZWithFP(SDNode *Node, SelectionDAG &DAG) const
// Handling the case for Non-zero inputs using the algorithm mentioned below
SDValue Float = DAG.getNode(ISD::UINT_TO_FP, dl, FloatVT, Op);
SDValue FloatBits = DAG.getNode(ISD::BITCAST, dl, VT, Float);
- SDValue Exp = DAG.getNode(ISD::SRL, dl, VT, FloatBits, DAG.getConstant(MantissaBits, dl, VT));
- SDValue MSBIndex = DAG.getNode(ISD::SUB, dl, VT, Exp, DAG.getConstant(ExponentBias, dl, VT));
- SDValue NonZeroRes = DAG.getNode(ISD::SUB, dl, VT, DAG.getConstant(BitWidth - 1, dl, VT), MSBIndex);
-
- //Returns the respective DAG Node based on the input being zero or non-zero
+ SDValue Exp = DAG.getNode(ISD::SRL, dl, VT, FloatBits,
+ DAG.getConstant(MantissaBits, dl, VT));
+ SDValue MSBIndex =
+ DAG.getNode(ISD::SUB, dl, VT, Exp, DAG.getConstant(ExponentBias, dl, VT));
+ SDValue NonZeroRes = DAG.getNode(
+ ISD::SUB, dl, VT, DAG.getConstant(BitWidth - 1, dl, VT), MSBIndex);
+
+ // Returns the respective DAG Node based on the input being zero or non-zero
return DAG.getNode(ISD::VSELECT, dl, VT, IsZero, ZeroRes, NonZeroRes);
}
|
| // Handling the case for when Op == 0 which is stored in ZeroRes | ||
| CmpVT = getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT); | ||
| SDValue Zero = DAG.getConstant(0, dl, VT); | ||
| SDValue IsZero = DAG.getSetCC(dl, CmpVT, Op, Zero, ISD::SETEQ); | ||
| SDValue ZeroRes = DAG.getConstant(BitWidth, dl, VT); |
You can omit the Op == 0 handling for CTLZ_ZERO_UNDEF.
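A scalar sketch of that distinction, under the assumption that the input is converted to `double` so the conversion is exact. The template parameter `ZeroIsUndef` and the function name `clz32` are hypothetical; they only model the difference between `CTLZ` and `CTLZ_ZERO_UNDEF`, whose result for a zero input is undefined.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// ZeroIsUndef == false models CTLZ (must return 32 for x == 0).
// ZeroIsUndef == true models CTLZ_ZERO_UNDEF (any result is allowed for 0),
// so the compare-and-select on zero can be dropped.
template <bool ZeroIsUndef> inline uint32_t clz32(uint32_t x) {
  if (!ZeroIsUndef && x == 0)
    return 32;
  double d = static_cast<double>(x); // exact: f64 has a 53-bit significand
  uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits);
  uint32_t exp = static_cast<uint32_t>(bits >> 52); // biased exponent
  return (31u + 1023u) - exp;
}
```

With `ZeroIsUndef` set, the zero input simply falls through the arithmetic and produces a meaningless value, which is acceptable only because the operation leaves that case undefined.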
| SDValue MSBIndex = DAG.getNode(ISD::SUB, dl, VT, Exp, DAG.getConstant(ExponentBias, dl, VT)); | ||
| SDValue NonZeroRes = DAG.getNode(ISD::SUB, dl, VT, DAG.getConstant(BitWidth - 1, dl, VT), MSBIndex); |
Suggest emitting (BitWidth - 1 + ExponentBias) - Exp directly here, instead of emitting two SUBs and relying on SelectionDAG to combine them for you.
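The two forms are arithmetically identical, which a small scalar sketch can show. The function names are hypothetical and the constants are the f32 values from the patch (`BitWidth = 32`, `ExponentBias = 127`).

```cpp
#include <cassert>
#include <cstdint>

// Two-subtraction form, as in the submitted expansion:
//   MSBIndex = Exp - ExponentBias; result = (BitWidth - 1) - MSBIndex
constexpr uint32_t clzFromExpTwoSubs(uint32_t Exp) {
  return (32u - 1u) - (Exp - 127u);
}

// Single-subtraction form suggested in review:
//   result = (BitWidth - 1 + ExponentBias) - Exp
constexpr uint32_t clzFromExpOneSub(uint32_t Exp) {
  return (32u - 1u + 127u) - Exp;
}

// Biased exponent 127 is the value 1.0 (input x == 1), so clz is 31.
static_assert(clzFromExpTwoSubs(127u) == clzFromExpOneSub(127u), "");
static_assert(clzFromExpOneSub(127u) == 31u, "");
```

Folding the constant by hand saves one node in the emitted sequence without depending on a later combine to do it.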
| @@ -0,0 +1,26 @@ | |||
| ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2,-ssse3 -o - | FileCheck %s | |||
all 3 of these new test files can be dropped - just regenerate the vector-lzcnt-128.ll checks with the update script
| } | ||
| else { | ||
| return SDValue(); | ||
| } |
-  }
-  else {
-    return SDValue();
-  }
+  } else {
+    return SDValue();
+  }
| // Handling the case for Non-zero inputs using the algorithm mentioned below | ||
| SDValue Float = DAG.getNode(ISD::UINT_TO_FP, dl, FloatVT, Op); | ||
| SDValue FloatBits = DAG.getNode(ISD::BITCAST, dl, VT, Float); | ||
| SDValue Exp = DAG.getNode(ISD::SRL, dl, VT, FloatBits, DAG.getConstant(MantissaBits, dl, VT)); |
Should use getShiftAmountTy / getShiftAmountConstant
| if (VT == MVT::v4i32 && Subtarget.hasSSE2() && !Subtarget.hasSSSE3()) { | ||
| const TargetLowering &TLI = DAG.getTargetLoweringInfo(); | ||
| SDValue New = TLI.expandCTLZWithFP(Op.getNode(), DAG); | ||
| if (New.getNode()) |
-    if (New.getNode())
+    if (New)
| return splitVectorIntUnary(Op, DAG, DL); | ||
|
|
||
| if (VT == MVT::v4i32 && Subtarget.hasSSE2() && !Subtarget.hasSSSE3()) { | ||
| const TargetLowering &TLI = DAG.getTargetLoweringInfo(); |
-    const TargetLowering &TLI = DAG.getTargetLoweringInfo();
|
|
||
| if (VT == MVT::v4i32 && Subtarget.hasSSE2() && !Subtarget.hasSSSE3()) { | ||
| const TargetLowering &TLI = DAG.getTargetLoweringInfo(); | ||
| SDValue New = TLI.expandCTLZWithFP(Op.getNode(), DAG); |
-    SDValue New = TLI.expandCTLZWithFP(Op.getNode(), DAG);
+    SDValue New = expandCTLZWithFP(Op.getNode(), DAG);
RKSimon
left a comment
vXi32 types must be converted to vXf64 - not vXf32 - only smaller types (or if you can prove the upper 16 bits are zero) can use vXf32
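As a sketch of the "smaller types" case, here is a scalar 16-bit count-leading-zeros that goes through f32. Every 16-bit value fits in the 24-bit f32 significand, so the conversion never rounds. The function names are hypothetical, and the exhaustive check against a loop-based reference is only there to back up the claim.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reference: count leading zeros of a 16-bit value (returns 16 for x == 0).
inline unsigned clz16_ref(uint16_t x) {
  unsigned n = 0;
  for (unsigned bit = 0x8000u; bit != 0 && (x & bit) == 0; bit >>= 1)
    ++n;
  return n;
}

// f32-based variant: safe here because the conversion is always exact.
inline unsigned clz16_via_f32(uint16_t x) {
  if (x == 0)
    return 16;
  float f = static_cast<float>(x);
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  return 15u - ((bits >> 23) - 127u); // 15 - index of the highest set bit
}

// Compare against the reference for every 16-bit input.
inline bool matchesReferenceForAllInputs() {
  for (uint32_t v = 0; v <= 0xFFFFu; ++v)
    if (clz16_via_f32(static_cast<uint16_t>(v)) !=
        clz16_ref(static_cast<uint16_t>(v)))
      return false;
  return true;
}
```

For full 32-bit elements the same exactness argument needs the 53-bit significand of f64.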
Fixes #161746
This pull request implements a new optimization for the `ISD::CTLZ` operation, as suggested in the issue. It uses a floating-point-based algorithm for `v4i32` vectors on x86 targets that have `SSE2` but do not have `SSSE3`.
- `TargetLowering::expandCTLZWithFP` was added. This function implements the `clz(x) = 31 - ((bitcast<int>((float)x) >> 23) - 127)` algorithm, complete with a `VSELECT` to handle the `clz(0) == 32` edge case.
- The `X86TargetLowering::LowerVectorCTLZ` function was updated. It now checks for the `+sse2,-ssse3` feature combination and calls `expandCTLZWithFP` for `v4i32` vectors.
- `v4i32`: the generic integer `Expand` logic (for `SSE2`).

Benchmark Results (`llvm-mca` on `core2`)
The `llvm-mca` analysis from the testcases (`ctlz-v4i32-fp-1.ll`, `ctlz-v4i32-fp-2.ll`) confirms the benefit of this change.

| Implementation | Instructions <br/> (Per Iteration) | Total Cycles <br/> (Per Iteration) | IPC <br/> (Average) | Block RThroughput <br/> (Average) |
|---|---|---|---|---|
| Old SSE2 (Integer Fallback) | 39 <br/> (3900 / 100) | 45.01 <br/> (4501 / 100) | 0.87 | 7.3 |
| New SSE2 (FP Fallback) | 18 <br/> (1800 / 100) | 21.04 <br/> (2104 / 100) | 0.86 | 4.0 |

The llvm-mca result files:
ctlz-v4i32-fp-1-after-mca.txt
ctlz-v4i32-fp-1-before-mca.txt
ctlz-v4i32-fp-2-after-mca.txt
ctlz-v4i32-fp-2-before-mca.txt