[X86] Use an FP-based expansion for v4i32 ctlz on SSE2-only targets #167034

NishiB137 · 2025-11-07T22:47:07Z

This pull request implements a new optimization for the ISD::CTLZ operation, as suggested in the issue. It uses a floating-point-based algorithm for v4i32 vectors on x86 targets that have SSE2 but do not have SSSE3.

A new generic function, TargetLowering::expandCTLZWithFP, was added. It implements the algorithm clz(x) = 31 - ((bitcast<int>((float)x) >> 23) - 127), with a VSELECT to handle the clz(0) == 32 edge case.
The X86TargetLowering::LowerVectorCTLZ function was updated. It now checks for the +sse2,-ssse3 feature combination and calls expandCTLZWithFP for v4i32 vectors.
The new implementation was benchmarked against the existing fallback for v4i32: the generic integer Expand logic (for SSE2).
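
As a rough illustration of the formula above, here is a scalar C++ model of the expansion. The name `ctlz32ViaFloat` is made up for this sketch and is not part of the patch, which builds SelectionDAG nodes rather than doing scalar arithmetic. The model is only exercised here on inputs that float represents exactly.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Scalar model of the described expansion (hypothetical helper name).
// Steps mirror the DAG sequence: UINT_TO_FP, BITCAST, SRL by 23,
// SUB of the bias, SUB from 31, and a select for the zero input.
inline uint32_t ctlz32ViaFloat(uint32_t X) {
  if (X == 0)
    return 32; // clz(0) is defined as the bit width
  float F = static_cast<float>(X); // UINT_TO_FP
  uint32_t Bits;
  static_assert(sizeof(Bits) == sizeof(F), "expects a 32-bit float");
  std::memcpy(&Bits, &F, sizeof(Bits)); // BITCAST
  uint32_t Exp = Bits >> 23;            // biased exponent, sign bit is 0
  assert(Exp >= 127 && "a nonzero integer converts to a value >= 1.0");
  uint32_t MSBIndex = Exp - 127; // floor(log2(X)) when F == X
  return 31 - MSBIndex;
}
```

For example, `ctlz32ViaFloat(1)` yields 31 and `ctlz32ViaFloat(0x80000000u)` yields 0.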

Benchmark Results (`llvm-mca` on `core2`)

The llvm-mca analysis from the testcases (ctlz-v4i32-fp-1.ll, ctlz-v4i32-fp-2.ll) confirms the benefit of this change.

Implementation	Instructions (Per Iteration)	Total Cycles (Per Iteration)	IPC (Average)	Block RThroughput (Average)
Old `SSE2` (Integer Fallback)	39 (3900 / 100)	45.01 (4501 / 100)	0.87	7.3
New `SSE2` (FP Fallback)	18 (1800 / 100)	21.04 (2104 / 100)	0.86	4.0

The llvm-mca result files:
ctlz-v4i32-fp-1-after-mca.txt
ctlz-v4i32-fp-1-before-mca.txt
ctlz-v4i32-fp-2-after-mca.txt
ctlz-v4i32-fp-2-before-mca.txt

github-actions · 2025-11-07T22:47:29Z

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

llvmbot · 2025-11-07T22:47:58Z

@llvm/pr-subscribers-backend-x86

Author: None (NishiB137)

Changes

Fixes #161746

This pull request implements a new optimization for the ISD::CTLZ operation, as suggested in the issue. It uses a floating-point-based algorithm for v4i32 vectors on x86 targets that have SSE2 but do not have SSSE3.

A new generic function, TargetLowering::expandCTLZWithFP, was added. It implements the algorithm clz(x) = 31 - ((bitcast<int>((float)x) >> 23) - 127), with a VSELECT to handle the clz(0) == 32 edge case.
The X86TargetLowering::LowerVectorCTLZ function was updated. It now checks for the +sse2,-ssse3 feature combination and calls expandCTLZWithFP for v4i32 vectors.
The new implementation was benchmarked against the existing fallback for v4i32: the generic integer Expand logic (for SSE2).

Benchmark Results (`llvm-mca` on `core2`)

The llvm-mca analysis from the testcases (ctlz-v4i32-fp-1.ll, ctlz-v4i32-fp-2.ll) confirms the benefit of this change.

Implementation	Instructions (Per Iteration)	Total Cycles (Per Iteration)	IPC (Average)	Block RThroughput (Average)
Old `SSE2` (Integer Fallback)	39 (3900 / 100)	45.01 (4501 / 100)	0.87	7.3
New `SSE2` (FP Fallback)	18 (1800 / 100)	21.04 (2104 / 100)	0.86	4.0

The llvm-mca result files:
ctlz-v4i32-fp-1-after-mca.txt
ctlz-v4i32-fp-1-before-mca.txt
ctlz-v4i32-fp-2-after-mca.txt
ctlz-v4i32-fp-2-before-mca.txt

Full diff: https://github.com/llvm/llvm-project/pull/167034.diff

6 Files Affected:

(modified) llvm/include/llvm/CodeGen/TargetLowering.h (+5)
(modified) llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp (+47)
(modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+13)
(added) llvm/test/CodeGen/X86/ctlz-v4i32-fp-1.ll (+26)
(added) llvm/test/CodeGen/X86/ctlz-v4i32-fp-2.ll (+28)
(added) llvm/test/CodeGen/X86/ctlz-v4i32-fp-3.ll (+23)

diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 98565f423df3e..bb58f48cfdb5c 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -5543,6 +5543,11 @@ class LLVM_ABI TargetLowering : public TargetLoweringBase {
   /// \returns The expansion result or SDValue() if it fails.
   SDValue expandVPCTLZ(SDNode *N, SelectionDAG &DAG) const;
 
+  /// Expands a CTLZ node into a sequence of floating point operations.
+  /// \param N Node to expand
+  /// \returns The expansion result or SDValue() if it fails.
+  SDValue expandCTLZWithFP(SDNode *N, SelectionDAG &DAG) const;
+
   /// Expand CTTZ via Table Lookup.
   /// \param N Node to expand
   /// \returns The expansion result or SDValue() if it fails.
diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index b51d6649af2ec..d6ab6b2fe77e2 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -9480,6 +9480,53 @@ SDValue TargetLowering::expandVPCTLZ(SDNode *Node, SelectionDAG &DAG) const {
   return DAG.getNode(ISD::VP_CTPOP, dl, VT, Op, Mask, VL);
 }
 
+
+SDValue TargetLowering::expandCTLZWithFP(SDNode *Node, SelectionDAG &DAG) const {
+  SDLoc dl(Node);
+  SDValue Op = Node->getOperand(0);
+  EVT VT = Op.getValueType();
+
+  assert(VT.isVector() && "This expansion is intended for vectors");
+
+  EVT EltVT = VT.getVectorElementType();
+  EVT FloatVT, CmpVT;
+  unsigned BitWidth, MantissaBits, ExponentBias;
+
+  // Converting to float type
+  if (EltVT == MVT::i32) {
+    FloatVT = VT.changeVectorElementType(MVT::f32);
+    BitWidth = 32;
+    MantissaBits = 23;
+    ExponentBias = 127;
+  } 
+  else {
+    return SDValue();
+  }
+
+  // Handling the case for when Op == 0 which is stored in ZeroRes
+  CmpVT = getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT);
+  SDValue Zero = DAG.getConstant(0, dl, VT);
+  SDValue IsZero = DAG.getSetCC(dl, CmpVT, Op, Zero, ISD::SETEQ);
+  SDValue ZeroRes = DAG.getConstant(BitWidth, dl, VT);
+
+  // Handling the case for Non-zero inputs using the algorithm mentioned below
+  SDValue Float = DAG.getNode(ISD::UINT_TO_FP, dl, FloatVT, Op);
+  SDValue FloatBits = DAG.getNode(ISD::BITCAST, dl, VT, Float);
+  SDValue Exp = DAG.getNode(ISD::SRL, dl, VT, FloatBits, DAG.getConstant(MantissaBits, dl, VT));
+  SDValue MSBIndex = DAG.getNode(ISD::SUB, dl, VT, Exp, DAG.getConstant(ExponentBias, dl, VT));
+  SDValue NonZeroRes = DAG.getNode(ISD::SUB, dl, VT, DAG.getConstant(BitWidth - 1, dl, VT), MSBIndex);
+
+  //Returns the respective DAG Node based on the input being zero or non-zero
+  return DAG.getNode(ISD::VSELECT, dl, VT, IsZero, ZeroRes, NonZeroRes);
+
+  // pseudocode : 
+  // if(x==0) return 32;
+  // float f = (float) x;
+  // int i = bitcast<int>(f);
+  // int ilog2 = (i >> 23) - 127;
+  // return 31 - ilog2;
+}
+
 SDValue TargetLowering::CTTZTableLookup(SDNode *Node, SelectionDAG &DAG,
                                         const SDLoc &DL, EVT VT, SDValue Op,
                                         unsigned BitWidth) const {
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 05a854a0bf3fa..bdea6c4734908 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -1348,6 +1348,12 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
     setOperationAction(ISD::SUB,                MVT::i32, Custom);
   }
 
+  if (!Subtarget.useSoftFloat() && Subtarget.hasSSE2() &&
+      !Subtarget.hasSSSE3()) {
+    setOperationAction(ISD::CTLZ, MVT::v4i32, Custom);
+    setOperationAction(ISD::CTLZ_ZERO_UNDEF, MVT::v4i32, Custom);
+  }
+
   if (!Subtarget.useSoftFloat() && Subtarget.hasSSE41()) {
     for (MVT RoundedTy : {MVT::f32, MVT::f64, MVT::v4f32, MVT::v2f64}) {
       setOperationAction(ISD::FFLOOR,            RoundedTy,  Legal);
@@ -29039,6 +29045,13 @@ static SDValue LowerVectorCTLZ(SDValue Op, const SDLoc &DL,
   if (VT.is512BitVector() && !Subtarget.hasBWI())
     return splitVectorIntUnary(Op, DAG, DL);
 
+  if (VT == MVT::v4i32 && Subtarget.hasSSE2() && !Subtarget.hasSSSE3()) {
+    const TargetLowering &TLI = DAG.getTargetLoweringInfo();
+    SDValue New = TLI.expandCTLZWithFP(Op.getNode(), DAG);
+    if (New.getNode())
+      return New;
+  }
+
   assert(Subtarget.hasSSSE3() && "Expected SSSE3 support for PSHUFB");
   return LowerVectorCTLZInRegLUT(Op, DL, Subtarget, DAG);
 }
diff --git a/llvm/test/CodeGen/X86/ctlz-v4i32-fp-1.ll b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-1.ll
new file mode 100644
index 0000000000000..20467b3799875
--- /dev/null
+++ b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-1.ll
@@ -0,0 +1,26 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2,-ssse3 -o - | FileCheck %s
+
+define <4 x i32> @test_v4i32_sse2(<4 x i32> %a) #0 {
+; CHECK-LABEL: test_v4i32_sse2:
+; CHECK:       # %bb.0:
+
+; Zero test (strict CTLZ needs select)
+; CHECK-DAG:   pcmpeqd %xmm{{[0-9]+}}, %xmm{{[0-9]+}}
+
+; Exponent extraction + bias arithmetic (order-free)
+; CHECK-DAG:   psrld {{\$}}23, %xmm{{[0-9]+}}
+; CHECK-DAG:   psubd %xmm{{[0-9]+}}, %xmm{{[0-9]+}}
+
+; Select/merge (could be por/pandn etc.)
+; CHECK:       por %xmm{{[0-9]+}}, %xmm{{[0-9]+}}
+
+; Must NOT use SSSE3 LUT path
+; CHECK-NOT:   pshufb
+
+; CHECK:       retq
+  %res = call <4 x i32> @llvm.ctlz.v4i32(<4 x i32> %a, i1 false)
+  ret <4 x i32> %res
+}
+
+declare <4 x i32> @llvm.ctlz.v4i32(<4 x i32>, i1)
+attributes #0 = { "optnone" }
diff --git a/llvm/test/CodeGen/X86/ctlz-v4i32-fp-2.ll b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-2.ll
new file mode 100644
index 0000000000000..6949fe4110e58
--- /dev/null
+++ b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-2.ll
@@ -0,0 +1,28 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2,-ssse3 -o - | FileCheck %s
+
+define <4 x i32> @test_v4i32_sse2_zero_undef(<4 x i32> %a) #0 {
+; CHECK-LABEL: test_v4i32_sse2_zero_undef:
+
+; zero check
+; CHECK-DAG:   pcmpeqd
+
+; FP-based mantissa/exponent steps (order may vary)
+; CHECK-DAG:   psrld     $16
+; CHECK-DAG:   subps
+; CHECK-DAG:   psrld     $23
+; CHECK-DAG:   psubd
+
+; merge/select
+; CHECK:       pandn
+; CHECK:       por
+
+; CHECK-NOT:   pshufb
+
+; CHECK: retq
+
+  %res = call <4 x i32> @llvm.ctlz.v4i32(<4 x i32> %a, i1 true)
+  ret <4 x i32> %res
+}
+
+declare <4 x i32> @llvm.ctlz.v4i32(<4 x i32>, i1)
+attributes #0 = { "optnone" }
diff --git a/llvm/test/CodeGen/X86/ctlz-v4i32-fp-3.ll b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-3.ll
new file mode 100644
index 0000000000000..8d10c17223a21
--- /dev/null
+++ b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-3.ll
@@ -0,0 +1,23 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+ssse3 -o - | FileCheck %s
+
+; This verifies that **with SSSE3 enabled**, we use the LUT-based `pshufb`
+; implementation and *not* the floating-point exponent trick.
+
+define <4 x i32> @test_v4i32_ssse3(<4 x i32> %a) {
+; CHECK-LABEL: test_v4i32_ssse3:
+; CHECK:       # %bb.0:
+
+; Must use SSSE3 table LUT:
+; CHECK:       pshufb
+
+; Must NOT use FP exponent trick:
+; CHECK-NOT:   cvtdq2ps
+; CHECK-NOT:   psrld $23
+; CHECK-NOT:   psubd
+
+; CHECK:       retq
+  %res = call <4 x i32> @llvm.ctlz.v4i32(<4 x i32> %a, i1 false)
+  ret <4 x i32> %res
+}
+
+declare <4 x i32> @llvm.ctlz.v4i32(<4 x i32>, i1)

llvmbot · 2025-11-07T22:47:58Z

@llvm/pr-subscribers-llvm-selectiondag

Author: None (NishiB137)

arsenm · 2025-11-07T22:51:07Z

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp

+    BitWidth = 32;
+    MantissaBits = 23;
+    ExponentBias = 127;


Should get these from the fltSemantics instead of hardcoding it
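
One way to read this suggestion, sketched outside of LLVM: derive the mantissa width and exponent bias from the floating-point format instead of writing 23 and 127. Here `std::numeric_limits` stands in for the `fltSemantics` queries, and `FPLayout` is a hypothetical name used only for this illustration.

```cpp
#include <limits>

// Hypothetical helper: exponent-field parameters derived from the type.
template <typename FloatT> struct FPLayout {
  static_assert(std::numeric_limits<FloatT>::is_iec559,
                "expects an IEEE-754 binary format");
  // digits counts the implicit leading bit, so the stored mantissa
  // field is one bit narrower.
  static constexpr unsigned MantissaBits =
      std::numeric_limits<FloatT>::digits - 1;
  // max_exponent is one more than the largest unbiased exponent; for
  // IEEE binary formats that largest exponent equals the bias.
  static constexpr unsigned ExponentBias =
      std::numeric_limits<FloatT>::max_exponent - 1;
};

static_assert(FPLayout<float>::MantissaBits == 23, "f32 mantissa field");
static_assert(FPLayout<float>::ExponentBias == 127, "f32 exponent bias");
```

The same template gives 52 and 1023 for `double`, so a wider conversion type would not need new constants.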

arsenm · 2025-11-07T22:51:18Z

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp

+  // pseudocode : 
+  // if(x==0) return 32;
+  // float f = (float) x;
+  // int i = bitcast<int>(f);
+  // int ilog2 = (i >> 23) - 127;
+  // return 31 - ilog2;


Move to the start, not the end

arsenm · 2025-11-08T04:50:46Z

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp

+  SDValue Op = Node->getOperand(0);
+  EVT VT = Op.getValueType();
+
+  assert(VT.isVector() && "This expansion is intended for vectors");


Can just write to not have this limitation

NishiB137 · 2025-11-08T09:49:12Z

Thank you for the review!

Replaced the hardcoded constants with values from fltSemantics.
Moved the pseudocode comment to the start of the function.

Regarding the limitation, can you please explain a little more about what exactly we need to do? Do we need to use this function for scalars as well?

Thanks!

RKSimon · 2025-11-08T12:16:06Z

In the issue description, i32 was converted to f64. Are you sure it works in all cases with f32?
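
A small scalar experiment suggests why this question matters. A float keeps 24 significant bits, so converting a 32-bit integer can round up to the next power of two and shift the exponent by one; a double keeps 53 bits and converts every 32-bit integer exactly. The helper names below are hypothetical, and the sketch assumes the default round-to-nearest mode.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical scalar models of the two candidate conversions.
inline uint32_t ctlz32ViaF32(uint32_t X) {
  assert(X != 0 && "zero is handled separately by the select");
  float F = static_cast<float>(X); // may round: 24 significant bits
  uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  return 31 - ((Bits >> 23) - 127);
}

inline uint32_t ctlz32ViaF64(uint32_t X) {
  assert(X != 0 && "zero is handled separately by the select");
  double D = static_cast<double>(X); // exact: 53 significant bits
  uint64_t Bits;
  std::memcpy(&Bits, &D, sizeof(Bits));
  return 31 - (static_cast<uint32_t>(Bits >> 52) - 1023);
}
```

With these models, `0x7FFFFFFF` (true count 1) comes out as 0 through float because the conversion rounds to 2^31, while the double path returns 1. The smallest input that goes wrong through float is `0x01FFFFFF`, which has 25 significant bits.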

github-actions · 2025-11-10T10:34:16Z

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:

git-clang-format --diff origin/main HEAD --extensions cpp,h -- llvm/include/llvm/CodeGen/TargetLowering.h llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp llvm/lib/Target/X86/X86ISelLowering.cpp --diff_from_common_commit

⚠️
The reproduction instructions above might return results for more than one PR
in a stack if you are using a stacked PR workflow. You can limit the results by
changing origin/main to the base branch/commit you want to compare against.
⚠️

View the diff from clang-format here.

diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index b1aa884f5..97f026bcf 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -9481,8 +9481,8 @@ SDValue TargetLowering::expandVPCTLZ(SDNode *Node, SelectionDAG &DAG) const {
   return DAG.getNode(ISD::VP_CTPOP, dl, VT, Op, Mask, VL);
 }
 
-
-SDValue TargetLowering::expandCTLZWithFP(SDNode *Node, SelectionDAG &DAG) const {
+SDValue TargetLowering::expandCTLZWithFP(SDNode *Node,
+                                         SelectionDAG &DAG) const {
   // pseudocode :
   // if(x==0) return 32;
   // float f = (float) x;
@@ -9508,8 +9508,7 @@ SDValue TargetLowering::expandCTLZWithFP(SDNode *Node, SelectionDAG &DAG) const
     MantissaBits = APFloat::semanticsPrecision(Sem) - 1;
     ExponentBias =
         static_cast<unsigned>(-APFloat::semanticsMinExponent(Sem) + 1);
-  } 
-  else {
+  } else {
     return SDValue();
   }
 
@@ -9522,11 +9521,14 @@ SDValue TargetLowering::expandCTLZWithFP(SDNode *Node, SelectionDAG &DAG) const
   // Handling the case for Non-zero inputs using the algorithm mentioned below
   SDValue Float = DAG.getNode(ISD::UINT_TO_FP, dl, FloatVT, Op);
   SDValue FloatBits = DAG.getNode(ISD::BITCAST, dl, VT, Float);
-  SDValue Exp = DAG.getNode(ISD::SRL, dl, VT, FloatBits, DAG.getConstant(MantissaBits, dl, VT));
-  SDValue MSBIndex = DAG.getNode(ISD::SUB, dl, VT, Exp, DAG.getConstant(ExponentBias, dl, VT));
-  SDValue NonZeroRes = DAG.getNode(ISD::SUB, dl, VT, DAG.getConstant(BitWidth - 1, dl, VT), MSBIndex);
-
-  //Returns the respective DAG Node based on the input being zero or non-zero
+  SDValue Exp = DAG.getNode(ISD::SRL, dl, VT, FloatBits,
+                            DAG.getConstant(MantissaBits, dl, VT));
+  SDValue MSBIndex =
+      DAG.getNode(ISD::SUB, dl, VT, Exp, DAG.getConstant(ExponentBias, dl, VT));
+  SDValue NonZeroRes = DAG.getNode(
+      ISD::SUB, dl, VT, DAG.getConstant(BitWidth - 1, dl, VT), MSBIndex);
+
+  // Returns the respective DAG Node based on the input being zero or non-zero
   return DAG.getNode(ISD::VSELECT, dl, VT, IsZero, ZeroRes, NonZeroRes);
 }

jayfoad · 2025-11-10T11:40:59Z

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp

+  // Handling the case for when Op == 0 which is stored in ZeroRes
+  CmpVT = getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT);
+  SDValue Zero = DAG.getConstant(0, dl, VT);
+  SDValue IsZero = DAG.getSetCC(dl, CmpVT, Op, Zero, ISD::SETEQ);
+  SDValue ZeroRes = DAG.getConstant(BitWidth, dl, VT);


You can omit the Op == 0 handling for CTLZ_ZERO_UNDEF.
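
A scalar sketch of what this could look like, with the zero check kept only for the plain CTLZ form. The template parameter `ZeroUndef` and the function name are hypothetical stand-ins for checking the node's opcode. The sketch converts through double so that the model is exact for every 32-bit input; the patch as posted converts to float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model; ZeroUndef mirrors ISD::CTLZ_ZERO_UNDEF.
template <bool ZeroUndef> uint32_t ctlz32(uint32_t X) {
  double D = static_cast<double>(X);
  uint64_t Bits;
  std::memcpy(&Bits, &D, sizeof(Bits));
  uint32_t NonZeroRes = 31 - (static_cast<uint32_t>(Bits >> 52) - 1023);
  if (ZeroUndef) {
    // The caller promises X != 0, so no compare or select is needed.
    assert(X != 0 && "result is unspecified for zero");
    return NonZeroRes;
  }
  // Plain CTLZ: select the bit width for a zero input.
  return X == 0 ? 32 : NonZeroRes;
}
```

In the DAG version this would amount to skipping the SETCC and VSELECT nodes when the opcode is CTLZ_ZERO_UNDEF.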

jayfoad · 2025-11-10T11:42:23Z

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp

+  SDValue MSBIndex = DAG.getNode(ISD::SUB, dl, VT, Exp, DAG.getConstant(ExponentBias, dl, VT));
+  SDValue NonZeroRes = DAG.getNode(ISD::SUB, dl, VT, DAG.getConstant(BitWidth - 1, dl, VT), MSBIndex);


Suggest emitting (BitWidth - 1 + ExponentBias) - Exp directly here, instead of emitting two SUBs and relying on SelectionDAG to combine them for you.
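
The two forms are algebraically the same, since (BitWidth - 1) - (Exp - ExponentBias) = (BitWidth - 1 + ExponentBias) - Exp, which is 158 - Exp for i32 via f32. A minimal check in C++, using hypothetical function names and the constants from the patch:

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t BitWidth = 32;
constexpr uint32_t ExponentBias = 127;

// Two subtractions, as in the patch under review.
constexpr uint32_t twoSubs(uint32_t Exp) {
  return (BitWidth - 1) - (Exp - ExponentBias);
}

// One subtraction from a folded constant, as suggested.
constexpr uint32_t oneSub(uint32_t Exp) {
  return (BitWidth - 1 + ExponentBias) - Exp;
}

// Compares both forms over every possible 8-bit exponent field value.
inline void checkFormsAgree() {
  for (uint32_t Exp = 0; Exp < 256; ++Exp)
    assert(twoSubs(Exp) == oneSub(Exp));
}
```

Because the arithmetic is unsigned and wraps modulo 2^32, the two forms agree even for exponent values below the bias.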

RKSimon · 2025-11-10T12:20:23Z

llvm/test/CodeGen/X86/ctlz-v4i32-fp-1.ll

@@ -0,0 +1,26 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2,-ssse3 -o - | FileCheck %s


all 3 of these new test files can be dropped - just regenerate the vector-lzcnt-128.ll checks with the update script

arsenm · 2025-11-10T18:04:21Z

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp

+  } 
+  else {
+    return SDValue();
+  }


Suggested change

-  }
-  else {
-    return SDValue();
-  }
+  } else {
+    return SDValue();
+  }

arsenm · 2025-11-10T18:05:04Z

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp

+  // Handling the case for Non-zero inputs using the algorithm mentioned below
+  SDValue Float = DAG.getNode(ISD::UINT_TO_FP, dl, FloatVT, Op);
+  SDValue FloatBits = DAG.getNode(ISD::BITCAST, dl, VT, Float);
+  SDValue Exp = DAG.getNode(ISD::SRL, dl, VT, FloatBits, DAG.getConstant(MantissaBits, dl, VT));


Should use getShiftAmountTy / getShiftAmountConstant

arsenm · 2025-11-10T18:05:19Z

llvm/lib/Target/X86/X86ISelLowering.cpp

+  if (VT == MVT::v4i32 && Subtarget.hasSSE2() && !Subtarget.hasSSSE3()) {
+    const TargetLowering &TLI = DAG.getTargetLoweringInfo();
+    SDValue New = TLI.expandCTLZWithFP(Op.getNode(), DAG);
+    if (New.getNode())


Suggested change

-    if (New.getNode())
+    if (New)

arsenm · 2025-11-10T18:05:28Z

llvm/lib/Target/X86/X86ISelLowering.cpp

    return splitVectorIntUnary(Op, DAG, DL);

+  if (VT == MVT::v4i32 && Subtarget.hasSSE2() && !Subtarget.hasSSSE3()) {
+    const TargetLowering &TLI = DAG.getTargetLoweringInfo();


Suggested change

-    const TargetLowering &TLI = DAG.getTargetLoweringInfo();

arsenm · 2025-11-10T18:05:38Z

llvm/lib/Target/X86/X86ISelLowering.cpp


+  if (VT == MVT::v4i32 && Subtarget.hasSSE2() && !Subtarget.hasSSSE3()) {
+    const TargetLowering &TLI = DAG.getTargetLoweringInfo();
+    SDValue New = TLI.expandCTLZWithFP(Op.getNode(), DAG);


Suggested change

-    SDValue New = TLI.expandCTLZWithFP(Op.getNode(), DAG);
+    SDValue New = expandCTLZWithFP(Op.getNode(), DAG);

RKSimon

vXi32 types must be converted to vXf64, not vXf32. Only smaller types (or cases where you can prove the upper 16 bits are zero) can use vXf32.
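
To illustrate the "smaller types" case, here is a scalar sketch for 16-bit elements converted through float, checked against a bit-by-bit reference over all 65536 inputs. Every 16-bit value fits in float's 24-bit significand, so the conversion never rounds. All names are hypothetical and only serve this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model for 16-bit elements converted through float.
inline uint32_t ctlz16ViaF32(uint16_t X) {
  if (X == 0)
    return 16;
  float F = static_cast<float>(X); // exact for every 16-bit input
  uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  uint32_t Exp = Bits >> 23;
  assert(Exp >= 127 && "a nonzero integer converts to a value >= 1.0");
  return 15 - (Exp - 127);
}

// Bit-by-bit reference, used only for checking.
inline uint32_t ctlz16Reference(uint16_t X) {
  uint32_t N = 0;
  for (uint32_t Mask = 0x8000; Mask != 0 && (X & Mask) == 0; Mask >>= 1)
    ++N;
  return N;
}

// Returns true if the float-based model matches the reference everywhere.
inline bool matchesReferenceForAllInputs() {
  for (uint32_t X = 0; X <= 0xFFFF; ++X) {
    uint16_t V = static_cast<uint16_t>(X);
    if (ctlz16ViaF32(V) != ctlz16Reference(V))
      return false;
  }
  return true;
}
```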

[CodeGen] Add expandCTLZWithFP helper to TargetLowering supporting vXi32 types

8b1bb2f

llvmbot added the backend:X86 and llvm:SelectionDAG (SelectionDAGISel as well) labels Nov 7, 2025

arsenm reviewed Nov 7, 2025

arsenm reviewed Nov 8, 2025

RKSimon self-requested a review November 8, 2025 08:42

[X86] Add SSE2 FP-based v4i32 CTLZ lowering and tests

078cc7a

VindhyaP312 force-pushed the fp-ctlz-lowering branch from e712022 to 078cc7a November 8, 2025 09:38

jayfoad reviewed Nov 10, 2025

RKSimon requested changes Nov 10, 2025

arsenm reviewed Nov 10, 2025

RKSimon requested changes Nov 10, 2025
