@NishiB137

Fixes #161746

This pull request implements a new optimization for the ISD::CTLZ operation, as suggested in the issue. It uses a floating-point-based algorithm for v4i32 vectors on x86 targets that have SSE2 but do not have SSSE3.

  1. A new generic function, TargetLowering::expandCTLZWithFP, was added. This function implements the clz(x) = 31 - ((bitcast<int>((float)x) >> 23) - 127) algorithm, complete with a VSELECT to handle the clz(0) == 32 edge case.
  2. The X86TargetLowering::LowerVectorCTLZ function was updated. It now checks for the +sse2,-ssse3 feature combination and calls expandCTLZWithFP for v4i32 vectors.
  3. The new implementation was benchmarked against the existing fallback for v4i32: the generic integer Expand logic (for SSE2).
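The formula in item 1 can be sketched for a single lane in plain C++. This is a minimal illustration, not the SelectionDAG code: the helper name `clz32_via_f32` is hypothetical, and it assumes `float` is IEEE-754 binary32 with the default round-to-nearest conversion. Inputs below 2^24 convert to `float` exactly; larger inputs may round during conversion, which this sketch does not address.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical scalar helper mirroring the formula in item 1 (not LLVM code).
// Assumes float is IEEE-754 binary32.
inline uint32_t clz32_via_f32(uint32_t x) {
  if (x == 0)
    return 32;                          // edge case the VSELECT handles
  float f = static_cast<float>(x);      // corresponds to UINT_TO_FP
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);  // corresponds to BITCAST
  uint32_t exp = bits >> 23;            // biased exponent (sign bit is 0)
  uint32_t msb = exp - 127;             // index of the highest set bit
  return 31 - msb;
}
```

For example, `clz32_via_f32(1)` yields 31 and `clz32_via_f32(0x80000000u)` yields 0, since both inputs are powers of two and convert exactly.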

Benchmark Results (llvm-mca on core2)

The llvm-mca analysis of the test cases (ctlz-v4i32-fp-1.ll, ctlz-v4i32-fp-2.ll) shows the new lowering using fewer than half the instructions and cycles of the old one.

| Implementation | Instructions (per iteration) | Total cycles (per iteration) | IPC (average) | Block RThroughput (average) |
|---|---|---|---|---|
| Old SSE2 (integer fallback) | 39 (3900 / 100) | 45.01 (4501 / 100) | 0.87 | 7.3 |
| New SSE2 (FP fallback) | 18 (1800 / 100) | 21.04 (2104 / 100) | 0.86 | 4.0 |

The llvm-mca result files:
ctlz-v4i32-fp-1-after-mca.txt
ctlz-v4i32-fp-1-before-mca.txt
ctlz-v4i32-fp-2-after-mca.txt
ctlz-v4i32-fp-2-before-mca.txt

@github-actions

github-actions bot commented Nov 7, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added backend:X86 llvm:SelectionDAG SelectionDAGISel as well labels Nov 7, 2025
@llvmbot
Member

llvmbot commented Nov 7, 2025

@llvm/pr-subscribers-backend-x86

Author: None (NishiB137)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/167034.diff

6 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetLowering.h (+5)
  • (modified) llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp (+47)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+13)
  • (added) llvm/test/CodeGen/X86/ctlz-v4i32-fp-1.ll (+26)
  • (added) llvm/test/CodeGen/X86/ctlz-v4i32-fp-2.ll (+28)
  • (added) llvm/test/CodeGen/X86/ctlz-v4i32-fp-3.ll (+23)
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 98565f423df3e..bb58f48cfdb5c 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -5543,6 +5543,11 @@ class LLVM_ABI TargetLowering : public TargetLoweringBase {
   /// \returns The expansion result or SDValue() if it fails.
   SDValue expandVPCTLZ(SDNode *N, SelectionDAG &DAG) const;
 
+  /// Expands a CTLZ node into a sequence of floating point operations.
+  /// \param N Node to expand
+  /// \returns The expansion result or SDValue() if it fails.
+  SDValue expandCTLZWithFP(SDNode *N, SelectionDAG &DAG) const;
+
   /// Expand CTTZ via Table Lookup.
   /// \param N Node to expand
   /// \returns The expansion result or SDValue() if it fails.
diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index b51d6649af2ec..d6ab6b2fe77e2 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -9480,6 +9480,53 @@ SDValue TargetLowering::expandVPCTLZ(SDNode *Node, SelectionDAG &DAG) const {
   return DAG.getNode(ISD::VP_CTPOP, dl, VT, Op, Mask, VL);
 }
 
+
+SDValue TargetLowering::expandCTLZWithFP(SDNode *Node, SelectionDAG &DAG) const {
+  SDLoc dl(Node);
+  SDValue Op = Node->getOperand(0);
+  EVT VT = Op.getValueType();
+
+  assert(VT.isVector() && "This expansion is intended for vectors");
+
+  EVT EltVT = VT.getVectorElementType();
+  EVT FloatVT, CmpVT;
+  unsigned BitWidth, MantissaBits, ExponentBias;
+
+  // Converting to float type
+  if (EltVT == MVT::i32) {
+    FloatVT = VT.changeVectorElementType(MVT::f32);
+    BitWidth = 32;
+    MantissaBits = 23;
+    ExponentBias = 127;
+  } 
+  else {
+    return SDValue();
+  }
+
+  // Handling the case for when Op == 0 which is stored in ZeroRes
+  CmpVT = getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT);
+  SDValue Zero = DAG.getConstant(0, dl, VT);
+  SDValue IsZero = DAG.getSetCC(dl, CmpVT, Op, Zero, ISD::SETEQ);
+  SDValue ZeroRes = DAG.getConstant(BitWidth, dl, VT);
+
+  // Handling the case for Non-zero inputs using the algorithm mentioned below
+  SDValue Float = DAG.getNode(ISD::UINT_TO_FP, dl, FloatVT, Op);
+  SDValue FloatBits = DAG.getNode(ISD::BITCAST, dl, VT, Float);
+  SDValue Exp = DAG.getNode(ISD::SRL, dl, VT, FloatBits, DAG.getConstant(MantissaBits, dl, VT));
+  SDValue MSBIndex = DAG.getNode(ISD::SUB, dl, VT, Exp, DAG.getConstant(ExponentBias, dl, VT));
+  SDValue NonZeroRes = DAG.getNode(ISD::SUB, dl, VT, DAG.getConstant(BitWidth - 1, dl, VT), MSBIndex);
+
+  //Returns the respective DAG Node based on the input being zero or non-zero
+  return DAG.getNode(ISD::VSELECT, dl, VT, IsZero, ZeroRes, NonZeroRes);
+
+  // pseudocode : 
+  // if(x==0) return 32;
+  // float f = (float) x;
+  // int i = bitcast<int>(f);
+  // int ilog2 = (i >> 23) - 127;
+  // return 31 - ilog2;
+}
+
 SDValue TargetLowering::CTTZTableLookup(SDNode *Node, SelectionDAG &DAG,
                                         const SDLoc &DL, EVT VT, SDValue Op,
                                         unsigned BitWidth) const {
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 05a854a0bf3fa..bdea6c4734908 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -1348,6 +1348,12 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
     setOperationAction(ISD::SUB,                MVT::i32, Custom);
   }
 
+  if (!Subtarget.useSoftFloat() && Subtarget.hasSSE2() &&
+      !Subtarget.hasSSSE3()) {
+    setOperationAction(ISD::CTLZ, MVT::v4i32, Custom);
+    setOperationAction(ISD::CTLZ_ZERO_UNDEF, MVT::v4i32, Custom);
+  }
+
   if (!Subtarget.useSoftFloat() && Subtarget.hasSSE41()) {
     for (MVT RoundedTy : {MVT::f32, MVT::f64, MVT::v4f32, MVT::v2f64}) {
       setOperationAction(ISD::FFLOOR,            RoundedTy,  Legal);
@@ -29039,6 +29045,13 @@ static SDValue LowerVectorCTLZ(SDValue Op, const SDLoc &DL,
   if (VT.is512BitVector() && !Subtarget.hasBWI())
     return splitVectorIntUnary(Op, DAG, DL);
 
+  if (VT == MVT::v4i32 && Subtarget.hasSSE2() && !Subtarget.hasSSSE3()) {
+    const TargetLowering &TLI = DAG.getTargetLoweringInfo();
+    SDValue New = TLI.expandCTLZWithFP(Op.getNode(), DAG);
+    if (New.getNode())
+      return New;
+  }
+
   assert(Subtarget.hasSSSE3() && "Expected SSSE3 support for PSHUFB");
   return LowerVectorCTLZInRegLUT(Op, DL, Subtarget, DAG);
 }
diff --git a/llvm/test/CodeGen/X86/ctlz-v4i32-fp-1.ll b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-1.ll
new file mode 100644
index 0000000000000..20467b3799875
--- /dev/null
+++ b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-1.ll
@@ -0,0 +1,26 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2,-ssse3 -o - | FileCheck %s
+
+define <4 x i32> @test_v4i32_sse2(<4 x i32> %a) #0 {
+; CHECK-LABEL: test_v4i32_sse2:
+; CHECK:       # %bb.0:
+
+; Zero test (strict CTLZ needs select)
+; CHECK-DAG:   pcmpeqd %xmm{{[0-9]+}}, %xmm{{[0-9]+}}
+
+; Exponent extraction + bias arithmetic (order-free)
+; CHECK-DAG:   psrld {{\$}}23, %xmm{{[0-9]+}}
+; CHECK-DAG:   psubd %xmm{{[0-9]+}}, %xmm{{[0-9]+}}
+
+; Select/merge (could be por/pandn etc.)
+; CHECK:       por %xmm{{[0-9]+}}, %xmm{{[0-9]+}}
+
+; Must NOT use SSSE3 LUT path
+; CHECK-NOT:   pshufb
+
+; CHECK:       retq
+  %res = call <4 x i32> @llvm.ctlz.v4i32(<4 x i32> %a, i1 false)
+  ret <4 x i32> %res
+}
+
+declare <4 x i32> @llvm.ctlz.v4i32(<4 x i32>, i1)
+attributes #0 = { "optnone" }
diff --git a/llvm/test/CodeGen/X86/ctlz-v4i32-fp-2.ll b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-2.ll
new file mode 100644
index 0000000000000..6949fe4110e58
--- /dev/null
+++ b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-2.ll
@@ -0,0 +1,28 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2,-ssse3 -o - | FileCheck %s
+
+define <4 x i32> @test_v4i32_sse2_zero_undef(<4 x i32> %a) #0 {
+; CHECK-LABEL: test_v4i32_sse2_zero_undef:
+
+; zero check
+; CHECK-DAG:   pcmpeqd
+
+; FP-based mantissa/exponent steps (order may vary)
+; CHECK-DAG:   psrld     $16
+; CHECK-DAG:   subps
+; CHECK-DAG:   psrld     $23
+; CHECK-DAG:   psubd
+
+; merge/select
+; CHECK:       pandn
+; CHECK:       por
+
+; CHECK-NOT:   pshufb
+
+; CHECK: retq
+
+  %res = call <4 x i32> @llvm.ctlz.v4i32(<4 x i32> %a, i1 true)
+  ret <4 x i32> %res
+}
+
+declare <4 x i32> @llvm.ctlz.v4i32(<4 x i32>, i1)
+attributes #0 = { "optnone" }
diff --git a/llvm/test/CodeGen/X86/ctlz-v4i32-fp-3.ll b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-3.ll
new file mode 100644
index 0000000000000..8d10c17223a21
--- /dev/null
+++ b/llvm/test/CodeGen/X86/ctlz-v4i32-fp-3.ll
@@ -0,0 +1,23 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+ssse3 -o - | FileCheck %s
+
+; This verifies that **with SSSE3 enabled**, we use the LUT-based `pshufb`
+; implementation and *not* the floating-point exponent trick.
+
+define <4 x i32> @test_v4i32_ssse3(<4 x i32> %a) {
+; CHECK-LABEL: test_v4i32_ssse3:
+; CHECK:       # %bb.0:
+
+; Must use SSSE3 table LUT:
+; CHECK:       pshufb
+
+; Must NOT use FP exponent trick:
+; CHECK-NOT:   cvtdq2ps
+; CHECK-NOT:   psrld $23
+; CHECK-NOT:   psubd
+
+; CHECK:       retq
+  %res = call <4 x i32> @llvm.ctlz.v4i32(<4 x i32> %a, i1 false)
+  ret <4 x i32> %res
+}
+
+declare <4 x i32> @llvm.ctlz.v4i32(<4 x i32>, i1)

@llvmbot
Member

llvmbot commented Nov 7, 2025

@llvm/pr-subscribers-llvm-selectiondag

Comment on lines 9498 to 9500
BitWidth = 32;
MantissaBits = 23;
ExponentBias = 127;
Contributor

These should come from the fltSemantics instead of being hardcoded.
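
As a rough standard-library analogue of this suggestion, the mantissa width and exponent bias can be derived from the float type rather than written as literals. The `FloatLayout` struct below is hypothetical and uses `std::numeric_limits`, not LLVM's APFloat API; it assumes an IEEE-754 type.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: derives layout constants from the type instead of
// hardcoding 23 and 127. A standard-library analogue, not LLVM's APFloat API.
template <typename F> struct FloatLayout {
  static_assert(std::numeric_limits<F>::is_iec559, "assumes IEEE-754 layout");
  // digits counts the implicit leading bit, so the stored mantissa is one
  // bit narrower.
  static constexpr unsigned MantissaBits = std::numeric_limits<F>::digits - 1;
  // For IEEE-754 formats, max_exponent equals the exponent bias plus one.
  static constexpr unsigned ExponentBias =
      std::numeric_limits<F>::max_exponent - 1;
};
```

With this, `FloatLayout<float>` gives 23 and 127, and `FloatLayout<double>` gives 52 and 1023, so the same expansion logic could be parameterized on the float type.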

Comment on lines 9522 to 9527
// pseudocode :
// if(x==0) return 32;
// float f = (float) x;
// int i = bitcast<int>(f);
// int ilog2 = (i >> 23) - 127;
// return 31 - ilog2;
Contributor

Move to the start, not the end

SDValue Op = Node->getOperand(0);
EVT VT = Op.getValueType();

assert(VT.isVector() && "This expansion is intended for vectors");
Contributor

This can be written without the vector-only limitation.

@RKSimon RKSimon self-requested a review November 8, 2025 08:42
@NishiB137
Author

Thank you for the review!

  • replaced the hardcoded constants with values from fltSemantics
  • moved the pseudocode comment to the start of the function

Regarding the limitation, can you please explain a little more about what exactly we need to do? Do we need to use this function for scalars as well?

Thanks!

@RKSimon
Collaborator

RKSimon commented Nov 8, 2025

In the issue description, i32 was converted to f64. Are you sure it works in all cases with f32?
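
The concern can be checked on a scalar with a small experiment. The sketch below extracts the unbiased exponent through `float` and through `double`; the helper names are hypothetical, and it assumes IEEE-754 formats and the default round-to-nearest mode. A 32-bit value with more than 24 significant bits can round up to the next power of two in `float`, which shifts the exponent by one.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration only. Both assume x != 0,
// IEEE-754 float/double, and round-to-nearest conversion.
inline uint32_t exponent_via_f32(uint32_t x) {
  float f = static_cast<float>(x);      // may round: only 24 bits of precision
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  return (bits >> 23) - 127;            // unbiased exponent
}

inline uint32_t exponent_via_f64(uint32_t x) {
  double d = static_cast<double>(x);    // exact: 53 bits of precision
  uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits);
  return static_cast<uint32_t>(bits >> 52) - 1023;
}
```

For x = 0x01FFFFFF the highest set bit is bit 24, but the `float` conversion rounds to 2^25, so `exponent_via_f32` reports 25 while `exponent_via_f64` reports 24. A clz computed from the `float` exponent would therefore be off by one for this input.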

@github-actions

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff origin/main HEAD --extensions cpp,h -- llvm/include/llvm/CodeGen/TargetLowering.h llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp llvm/lib/Target/X86/X86ISelLowering.cpp --diff_from_common_commit

⚠️
The reproduction instructions above might return results for more than one PR
in a stack if you are using a stacked PR workflow. You can limit the results by
changing origin/main to the base branch/commit you want to compare against.
⚠️

View the diff from clang-format here.
diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index b1aa884f5..97f026bcf 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -9481,8 +9481,8 @@ SDValue TargetLowering::expandVPCTLZ(SDNode *Node, SelectionDAG &DAG) const {
   return DAG.getNode(ISD::VP_CTPOP, dl, VT, Op, Mask, VL);
 }
 
-
-SDValue TargetLowering::expandCTLZWithFP(SDNode *Node, SelectionDAG &DAG) const {
+SDValue TargetLowering::expandCTLZWithFP(SDNode *Node,
+                                         SelectionDAG &DAG) const {
   // pseudocode :
   // if(x==0) return 32;
   // float f = (float) x;
@@ -9508,8 +9508,7 @@ SDValue TargetLowering::expandCTLZWithFP(SDNode *Node, SelectionDAG &DAG) const
     MantissaBits = APFloat::semanticsPrecision(Sem) - 1;
     ExponentBias =
         static_cast<unsigned>(-APFloat::semanticsMinExponent(Sem) + 1);
-  } 
-  else {
+  } else {
     return SDValue();
   }
 
@@ -9522,11 +9521,14 @@ SDValue TargetLowering::expandCTLZWithFP(SDNode *Node, SelectionDAG &DAG) const
   // Handling the case for Non-zero inputs using the algorithm mentioned below
   SDValue Float = DAG.getNode(ISD::UINT_TO_FP, dl, FloatVT, Op);
   SDValue FloatBits = DAG.getNode(ISD::BITCAST, dl, VT, Float);
-  SDValue Exp = DAG.getNode(ISD::SRL, dl, VT, FloatBits, DAG.getConstant(MantissaBits, dl, VT));
-  SDValue MSBIndex = DAG.getNode(ISD::SUB, dl, VT, Exp, DAG.getConstant(ExponentBias, dl, VT));
-  SDValue NonZeroRes = DAG.getNode(ISD::SUB, dl, VT, DAG.getConstant(BitWidth - 1, dl, VT), MSBIndex);
-
-  //Returns the respective DAG Node based on the input being zero or non-zero
+  SDValue Exp = DAG.getNode(ISD::SRL, dl, VT, FloatBits,
+                            DAG.getConstant(MantissaBits, dl, VT));
+  SDValue MSBIndex =
+      DAG.getNode(ISD::SUB, dl, VT, Exp, DAG.getConstant(ExponentBias, dl, VT));
+  SDValue NonZeroRes = DAG.getNode(
+      ISD::SUB, dl, VT, DAG.getConstant(BitWidth - 1, dl, VT), MSBIndex);
+
+  // Returns the respective DAG Node based on the input being zero or non-zero
   return DAG.getNode(ISD::VSELECT, dl, VT, IsZero, ZeroRes, NonZeroRes);
 }
 

Comment on lines +9516 to +9520
// Handling the case for when Op == 0 which is stored in ZeroRes
CmpVT = getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT);
SDValue Zero = DAG.getConstant(0, dl, VT);
SDValue IsZero = DAG.getSetCC(dl, CmpVT, Op, Zero, ISD::SETEQ);
SDValue ZeroRes = DAG.getConstant(BitWidth, dl, VT);
Contributor

You can omit the Op == 0 handling for CTLZ_ZERO_UNDEF.

Comment on lines +9526 to +9527
SDValue MSBIndex = DAG.getNode(ISD::SUB, dl, VT, Exp, DAG.getConstant(ExponentBias, dl, VT));
SDValue NonZeroRes = DAG.getNode(ISD::SUB, dl, VT, DAG.getConstant(BitWidth - 1, dl, VT), MSBIndex);
Contributor

Suggest emitting (BitWidth - 1 + ExponentBias) - Exp directly here, instead of emitting two SUBs and relying on SelectionDAG to combine them for you.
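
The arithmetic behind this suggestion can be checked in isolation. The sketch below uses the i32/f32 constants from the patch; the function names are hypothetical and the code only illustrates the identity in unsigned 32-bit arithmetic, not the DAG nodes themselves.

```cpp
#include <cassert>
#include <cstdint>

// Constants for the i32/f32 case discussed in the patch.
constexpr uint32_t BitWidth = 32;
constexpr uint32_t ExponentBias = 127;

// Two subtractions, in the shape the patch under review emits.
constexpr uint32_t two_subs(uint32_t Exp) {
  return (BitWidth - 1) - (Exp - ExponentBias);
}

// One subtraction from a pre-folded constant (158 for these values).
constexpr uint32_t one_sub(uint32_t Exp) {
  return (BitWidth - 1 + ExponentBias) - Exp;
}

static_assert(BitWidth - 1 + ExponentBias == 158, "folded constant");
static_assert(two_subs(127) == one_sub(127), "same result for Exp = 127");
```

Because unsigned subtraction wraps modulo 2^32, the two forms agree for every `Exp`, so emitting the folded form directly saves a node without changing the result.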

@@ -0,0 +1,26 @@
; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2,-ssse3 -o - | FileCheck %s
Collaborator

All three of these new test files can be dropped; just regenerate the vector-lzcnt-128.ll checks with the update script.

Comment on lines +9511 to +9514
}
else {
return SDValue();
}
Contributor

Suggested change
-  }
-  else {
-    return SDValue();
-  }
+  } else {
+    return SDValue();
+  }

// Handling the case for Non-zero inputs using the algorithm mentioned below
SDValue Float = DAG.getNode(ISD::UINT_TO_FP, dl, FloatVT, Op);
SDValue FloatBits = DAG.getNode(ISD::BITCAST, dl, VT, Float);
SDValue Exp = DAG.getNode(ISD::SRL, dl, VT, FloatBits, DAG.getConstant(MantissaBits, dl, VT));
Contributor

Should use getShiftAmountTy / getShiftAmountConstant

if (VT == MVT::v4i32 && Subtarget.hasSSE2() && !Subtarget.hasSSSE3()) {
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
SDValue New = TLI.expandCTLZWithFP(Op.getNode(), DAG);
if (New.getNode())
Contributor

Suggested change
-    if (New.getNode())
+    if (New)

return splitVectorIntUnary(Op, DAG, DL);

if (VT == MVT::v4i32 && Subtarget.hasSSE2() && !Subtarget.hasSSSE3()) {
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
Contributor

Suggested change
-    const TargetLowering &TLI = DAG.getTargetLoweringInfo();


if (VT == MVT::v4i32 && Subtarget.hasSSE2() && !Subtarget.hasSSSE3()) {
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
SDValue New = TLI.expandCTLZWithFP(Op.getNode(), DAG);
Contributor

Suggested change
-    SDValue New = TLI.expandCTLZWithFP(Op.getNode(), DAG);
+    SDValue New = expandCTLZWithFP(Op.getNode(), DAG);

Collaborator

@RKSimon RKSimon left a comment

vXi32 types must be converted to vXf64, not vXf32. Only smaller types (or cases where you can prove the upper 16 bits are zero) can use vXf32.
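
A scalar sketch of the f64-based variant follows. The helper name `clz32_via_f64` is hypothetical and the code assumes `double` is IEEE-754 binary64; it illustrates the arithmetic only and is not the vector lowering itself. Every 32-bit integer fits in the 53-bit significand of a `double`, so the conversion never rounds.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical scalar helper (not LLVM code). Assumes double is IEEE-754
// binary64: 52 stored mantissa bits and an exponent bias of 1023.
inline uint32_t clz32_via_f64(uint32_t x) {
  if (x == 0)
    return 32;                           // clz(0) is defined as the bit width
  double d = static_cast<double>(x);     // exact for every 32-bit input
  uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits);   // reinterpret the bit pattern
  uint32_t exp = static_cast<uint32_t>(bits >> 52);  // biased exponent
  return (31 + 1023) - exp;              // single subtraction, folded constant
}
```

For example, `clz32_via_f64(0x01FFFFFFu)` yields 7 and `clz32_via_f64(0xFFFFFFFFu)` yields 0, both inputs whose `float` conversion would round up to a power of two.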

Development

Successfully merging this pull request may close these issues.

Implement clz using floating point math if clz instruction is not available

6 participants