Skip to content

[VPlan] Add ExtractLane VPInst to extract across multiple parts. #148817

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 27, 2025

Conversation

fhahn
Copy link
Contributor

@fhahn fhahn commented Jul 15, 2025

This patch adds a new ExtractLane VPInstruction which extracts across multiple parts using a wide index, to be used in combination with FirstActiveLane.

The patch updates early-exit codegen to use it instead ExtractElement, which is only per-part. With this change, interleaving should work correctly with early-exit loops.

The patch removes the restrictions added in 6f43754 (#145877), but does not yet automatically select interleave counts > 1 for early-exit loops.

I'll share a patch as follow-up. The cost of extracting a lane adds non-trivial overhead in the exit block, so that should be considered when picking the interleave count.

@llvmbot
Copy link
Member

llvmbot commented Jul 15, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: Florian Hahn (fhahn)

Changes

This patch adds a new ExtractLane VPInstruction which extracts across multiple parts using a wide index, to be used in combination with FirstActiveLane.

The patch updates early-exit codegen to use it instead ExtractElement, which is only per-part. With this change, interleaving should work correctly with early-exit loops.

The patch removes the restrictions added in 6f43754 (#145877), but does not yet automatically select interleave counts > 1 for early-exit loops.

I'll share a patch as follow-up. The cost of extracting a lane adds non-trivial overhead in the exit block, so that should be considered when picking the interleave count.


Patch is 66.75 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/148817.diff

10 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (-6)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+4)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (+2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+33-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+4-4)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp (+7)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/single-early-exit-interleave.ll (+56-6)
  • (modified) llvm/test/Transforms/LoopVectorize/single-early-exit-interleave-hint.ll (+15-8)
  • (modified) llvm/test/Transforms/LoopVectorize/single-early-exit-interleave.ll (+334-32)
  • (modified) llvm/test/Transforms/LoopVectorize/vector-loop-backedge-elimination-early-exit.ll (+25-22)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 06db89a89bc38..f4e1ddf58a9b3 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -10069,12 +10069,6 @@ bool LoopVectorizePass::processLoop(Loop *L) {
   // Get user vectorization factor and interleave count.
   ElementCount UserVF = Hints.getWidth();
   unsigned UserIC = Hints.getInterleave();
-  if (LVL.hasUncountableEarlyExit() && UserIC != 1) {
-    UserIC = 1;
-    reportVectorizationInfo("Interleaving not supported for loops "
-                            "with uncountable early exits",
-                            "InterleaveEarlyExitDisabled", ORE, L);
-  }
 
   // Plan how to best vectorize.
   LVP.plan(UserVF, UserIC);
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 703cfe969577d..a81dc0bb0bef6 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1012,6 +1012,10 @@ class LLVM_ABI_FOR_TEST VPInstruction : public VPRecipeWithIRFlags,
     ReductionStartVector,
     // Creates a step vector starting from 0 to VF with a step of 1.
     StepVector,
+    /// Extracts a single lane (first operand) from a set of vector operands.
+    /// The lane specifies an index into a vector formed by combining all vector
+    /// operands (all operands after the first one).
+    ExtractLane,
 
   };
 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index b27a7ffeed208..a0f5f10beb9fa 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -109,6 +109,8 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
   case VPInstruction::BuildStructVector:
   case VPInstruction::BuildVector:
     return SetResultTyFromOp();
+  case VPInstruction::ExtractLane:
+    return inferScalarType(R->getOperand(1));
   case VPInstruction::FirstActiveLane:
     return Type::getIntNTy(Ctx, 64);
   case VPInstruction::ExtractLastElement:
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 7dd844aec1242..967adbac56c87 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -860,6 +860,31 @@ Value *VPInstruction::generate(VPTransformState &State) {
       Res = Builder.CreateOr(Res, State.get(Op));
     return Builder.CreateOrReduce(Res);
   }
+  case VPInstruction::ExtractLane: {
+    Value *LaneToExtract = State.get(getOperand(0), true);
+    Type *IdxTy = State.TypeAnalysis.inferScalarType(getOperand(0));
+    Value *Res = nullptr;
+    Value *RuntimeVF = getRuntimeVF(State.Builder, IdxTy, State.VF);
+
+    for (unsigned Idx = 1; Idx != getNumOperands(); ++Idx) {
+      Value *VectorStart =
+          Builder.CreateMul(RuntimeVF, ConstantInt::get(IdxTy, Idx - 1));
+      Value *VectorIdx = Idx == 1
+                             ? LaneToExtract
+                             : Builder.CreateSub(LaneToExtract, VectorStart);
+      Value *Ext = State.VF.isScalar()
+                       ? State.get(getOperand(Idx))
+                       : Builder.CreateExtractElement(
+                             State.get(getOperand(Idx)), VectorIdx);
+      if (Res) {
+        Value *Cmp = Builder.CreateICmpUGE(LaneToExtract, VectorStart);
+        Res = Builder.CreateSelect(Cmp, Ext, Res);
+      } else {
+        Res = Ext;
+      }
+    }
+    return Res;
+  }
   case VPInstruction::FirstActiveLane: {
     if (getNumOperands() == 1) {
       Value *Mask = State.get(getOperand(0));
@@ -918,7 +943,8 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
   }
 
   switch (getOpcode()) {
-  case Instruction::ExtractElement: {
+  case Instruction::ExtractElement:
+  case VPInstruction::ExtractLane: {
     // Add on the cost of extracting the element.
     auto *VecTy = toVectorTy(Ctx.Types.inferScalarType(getOperand(0)), VF);
     return Ctx.TTI.getVectorInstrCost(Instruction::ExtractElement, VecTy,
@@ -980,6 +1006,7 @@ bool VPInstruction::isVectorToScalar() const {
   return getOpcode() == VPInstruction::ExtractLastElement ||
          getOpcode() == VPInstruction::ExtractPenultimateElement ||
          getOpcode() == Instruction::ExtractElement ||
+         getOpcode() == VPInstruction::ExtractLane ||
          getOpcode() == VPInstruction::FirstActiveLane ||
          getOpcode() == VPInstruction::ComputeAnyOfResult ||
          getOpcode() == VPInstruction::ComputeFindIVResult ||
@@ -1038,6 +1065,7 @@ bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
   case VPInstruction::BuildVector:
   case VPInstruction::CalculateTripCountMinusVF:
   case VPInstruction::CanonicalIVIncrementForPart:
+  case VPInstruction::ExtractLane:
   case VPInstruction::ExtractLastElement:
   case VPInstruction::ExtractPenultimateElement:
   case VPInstruction::FirstActiveLane:
@@ -1063,6 +1091,7 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const {
   default:
     return false;
   case Instruction::ExtractElement:
+  case VPInstruction::ExtractLane:
     return Op == getOperand(1);
   case Instruction::PHI:
     return true;
@@ -1164,6 +1193,9 @@ void VPInstruction::print(raw_ostream &O, const Twine &Indent,
   case VPInstruction::BuildVector:
     O << "buildvector";
     break;
+  case VPInstruction::ExtractLane:
+    O << "extract-lane";
+    break;
   case VPInstruction::ExtractLastElement:
     O << "extract-last-element";
     break;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 6a3b3e6e41955..338001820d593 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -774,10 +774,10 @@ static VPValue *optimizeEarlyExitInductionUser(VPlan &Plan,
   using namespace VPlanPatternMatch;
 
   VPValue *Incoming, *Mask;
-  if (!match(Op, m_VPInstruction<Instruction::ExtractElement>(
-                     m_VPValue(Incoming),
+  if (!match(Op, m_VPInstruction<VPInstruction::ExtractLane>(
                      m_VPInstruction<VPInstruction::FirstActiveLane>(
-                         m_VPValue(Mask)))))
+                         m_VPValue(Mask)),
+                     m_VPValue(Incoming))))
     return nullptr;
 
   auto *WideIV = getOptimizableIVOf(Incoming);
@@ -2831,7 +2831,7 @@ void VPlanTransforms::handleUncountableEarlyExit(
           VPInstruction::FirstActiveLane, {CondToEarlyExit}, nullptr,
           "first.active.lane");
       IncomingFromEarlyExit = EarlyExitB.createNaryOp(
-          Instruction::ExtractElement, {IncomingFromEarlyExit, FirstActiveLane},
+          VPInstruction::ExtractLane, {FirstActiveLane, IncomingFromEarlyExit},
           nullptr, "early.exit.value");
       ExitIRI->setOperand(EarlyExitIdx, IncomingFromEarlyExit);
     }
diff --git a/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp b/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
index b89cd21595efd..871e37ef3966a 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
@@ -363,6 +363,13 @@ void UnrollState::unrollBlock(VPBlockBase *VPB) {
       continue;
     }
     VPValue *Op0;
+    if (match(&R, m_VPInstruction<VPInstruction::ExtractLane>(
+                      m_VPValue(Op0), m_VPValue(Op1)))) {
+      addUniformForAllParts(cast<VPInstruction>(&R));
+      for (unsigned Part = 1; Part != UF; ++Part)
+        R.addOperand(getValueForPart(Op1, Part));
+      continue;
+    }
     if (match(&R, m_VPInstruction<VPInstruction::ExtractLastElement>(
                       m_VPValue(Op0))) ||
         match(&R, m_VPInstruction<VPInstruction::ExtractPenultimateElement>(
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/single-early-exit-interleave.ll b/llvm/test/Transforms/LoopVectorize/AArch64/single-early-exit-interleave.ll
index 61ef3cef603fa..c7be4593c6a9c 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/single-early-exit-interleave.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/single-early-exit-interleave.ll
@@ -14,15 +14,16 @@ define i64 @same_exit_block_pre_inc_use1() #0 {
 ; CHECK-NEXT:    call void @init_mem(ptr [[P1]], i64 1024)
 ; CHECK-NEXT:    call void @init_mem(ptr [[P2]], i64 1024)
 ; CHECK-NEXT:    [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP1:%.*]] = mul nuw i64 [[TMP0]], 16
-; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK-NEXT:    [[TMP1:%.*]] = mul nuw i64 [[TMP0]], 64
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 510, [[TMP1]]
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK:       vector.ph:
 ; CHECK-NEXT:    [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP3:%.*]] = mul nuw i64 [[TMP2]], 16
+; CHECK-NEXT:    [[TMP3:%.*]] = mul nuw i64 [[TMP2]], 64
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 510, [[TMP3]]
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 510, [[N_MOD_VF]]
 ; CHECK-NEXT:    [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP5:%.*]] = mul nuw i64 [[TMP4]], 16
+; CHECK-NEXT:    [[TMP5:%.*]] = mul nuw i64 [[TMP4]], 64
 ; CHECK-NEXT:    [[INDEX_NEXT:%.*]] = add i64 3, [[N_VEC]]
 ; CHECK-NEXT:    br label [[LOOP:%.*]]
 ; CHECK:       vector.body:
@@ -30,13 +31,43 @@ define i64 @same_exit_block_pre_inc_use1() #0 {
 ; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = add i64 3, [[INDEX1]]
 ; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[P1]], i64 [[OFFSET_IDX]]
 ; CHECK-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i32 0
+; CHECK-NEXT:    [[TMP18:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP19:%.*]] = mul nuw i64 [[TMP18]], 16
+; CHECK-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i64 [[TMP19]]
+; CHECK-NEXT:    [[TMP29:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP36:%.*]] = mul nuw i64 [[TMP29]], 32
+; CHECK-NEXT:    [[TMP37:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i64 [[TMP36]]
+; CHECK-NEXT:    [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP38:%.*]] = mul nuw i64 [[TMP15]], 48
+; CHECK-NEXT:    [[TMP54:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i64 [[TMP38]]
 ; CHECK-NEXT:    [[WIDE_LOAD4:%.*]] = load <vscale x 16 x i8>, ptr [[TMP8]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD2:%.*]] = load <vscale x 16 x i8>, ptr [[TMP11]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD3:%.*]] = load <vscale x 16 x i8>, ptr [[TMP37]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD5:%.*]] = load <vscale x 16 x i8>, ptr [[TMP54]], align 1
 ; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[P2]], i64 [[OFFSET_IDX]]
 ; CHECK-NEXT:    [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[TMP9]], i32 0
+; CHECK-NEXT:    [[TMP20:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP21:%.*]] = mul nuw i64 [[TMP20]], 16
+; CHECK-NEXT:    [[TMP22:%.*]] = getelementptr inbounds i8, ptr [[TMP9]], i64 [[TMP21]]
+; CHECK-NEXT:    [[TMP23:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP24:%.*]] = mul nuw i64 [[TMP23]], 32
+; CHECK-NEXT:    [[TMP25:%.*]] = getelementptr inbounds i8, ptr [[TMP9]], i64 [[TMP24]]
+; CHECK-NEXT:    [[TMP26:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP27:%.*]] = mul nuw i64 [[TMP26]], 48
+; CHECK-NEXT:    [[TMP28:%.*]] = getelementptr inbounds i8, ptr [[TMP9]], i64 [[TMP27]]
 ; CHECK-NEXT:    [[WIDE_LOAD8:%.*]] = load <vscale x 16 x i8>, ptr [[TMP10]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD6:%.*]] = load <vscale x 16 x i8>, ptr [[TMP22]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD7:%.*]] = load <vscale x 16 x i8>, ptr [[TMP25]], align 1
+; CHECK-NEXT:    [[WIDE_LOAD9:%.*]] = load <vscale x 16 x i8>, ptr [[TMP28]], align 1
 ; CHECK-NEXT:    [[TMP32:%.*]] = icmp ne <vscale x 16 x i8> [[WIDE_LOAD4]], [[WIDE_LOAD8]]
+; CHECK-NEXT:    [[TMP30:%.*]] = icmp ne <vscale x 16 x i8> [[WIDE_LOAD2]], [[WIDE_LOAD6]]
+; CHECK-NEXT:    [[TMP31:%.*]] = icmp ne <vscale x 16 x i8> [[WIDE_LOAD3]], [[WIDE_LOAD7]]
+; CHECK-NEXT:    [[TMP59:%.*]] = icmp ne <vscale x 16 x i8> [[WIDE_LOAD5]], [[WIDE_LOAD9]]
 ; CHECK-NEXT:    [[INDEX_NEXT3]] = add nuw i64 [[INDEX1]], [[TMP5]]
-; CHECK-NEXT:    [[TMP12:%.*]] = call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> [[TMP32]])
+; CHECK-NEXT:    [[TMP33:%.*]] = or <vscale x 16 x i1> [[TMP32]], [[TMP30]]
+; CHECK-NEXT:    [[TMP34:%.*]] = or <vscale x 16 x i1> [[TMP33]], [[TMP31]]
+; CHECK-NEXT:    [[TMP35:%.*]] = or <vscale x 16 x i1> [[TMP34]], [[TMP59]]
+; CHECK-NEXT:    [[TMP12:%.*]] = call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> [[TMP35]])
 ; CHECK-NEXT:    [[TMP13:%.*]] = icmp eq i64 [[INDEX_NEXT3]], [[N_VEC]]
 ; CHECK-NEXT:    [[TMP14:%.*]] = or i1 [[TMP12]], [[TMP13]]
 ; CHECK-NEXT:    br i1 [[TMP14]], label [[MIDDLE_SPLIT:%.*]], label [[LOOP]], !llvm.loop [[LOOP0:![0-9]+]]
@@ -46,8 +77,27 @@ define i64 @same_exit_block_pre_inc_use1() #0 {
 ; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 510, [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[CMP_N]], label [[LOOP_END:%.*]], label [[SCALAR_PH]]
 ; CHECK:       vector.early.exit:
+; CHECK-NEXT:    [[TMP39:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP40:%.*]] = mul nuw i64 [[TMP39]], 16
+; CHECK-NEXT:    [[TMP41:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> [[TMP59]], i1 true)
+; CHECK-NEXT:    [[TMP42:%.*]] = mul i64 [[TMP40]], 3
+; CHECK-NEXT:    [[TMP43:%.*]] = add i64 [[TMP42]], [[TMP41]]
+; CHECK-NEXT:    [[TMP44:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> [[TMP31]], i1 true)
+; CHECK-NEXT:    [[TMP45:%.*]] = mul i64 [[TMP40]], 2
+; CHECK-NEXT:    [[TMP46:%.*]] = add i64 [[TMP45]], [[TMP44]]
+; CHECK-NEXT:    [[TMP47:%.*]] = icmp ne i64 [[TMP44]], [[TMP40]]
+; CHECK-NEXT:    [[TMP48:%.*]] = select i1 [[TMP47]], i64 [[TMP46]], i64 [[TMP43]]
+; CHECK-NEXT:    [[TMP49:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> [[TMP30]], i1 true)
+; CHECK-NEXT:    [[TMP50:%.*]] = mul i64 [[TMP40]], 1
+; CHECK-NEXT:    [[TMP51:%.*]] = add i64 [[TMP50]], [[TMP49]]
+; CHECK-NEXT:    [[TMP52:%.*]] = icmp ne i64 [[TMP49]], [[TMP40]]
+; CHECK-NEXT:    [[TMP53:%.*]] = select i1 [[TMP52]], i64 [[TMP51]], i64 [[TMP48]]
 ; CHECK-NEXT:    [[TMP61:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> [[TMP32]], i1 true)
-; CHECK-NEXT:    [[TMP16:%.*]] = add i64 [[INDEX1]], [[TMP61]]
+; CHECK-NEXT:    [[TMP55:%.*]] = mul i64 [[TMP40]], 0
+; CHECK-NEXT:    [[TMP56:%.*]] = add i64 [[TMP55]], [[TMP61]]
+; CHECK-NEXT:    [[TMP57:%.*]] = icmp ne i64 [[TMP61]], [[TMP40]]
+; CHECK-NEXT:    [[TMP58:%.*]] = select i1 [[TMP57]], i64 [[TMP56]], i64 [[TMP53]]
+; CHECK-NEXT:    [[TMP16:%.*]] = add i64 [[INDEX1]], [[TMP58]]
 ; CHECK-NEXT:    [[TMP17:%.*]] = add i64 3, [[TMP16]]
 ; CHECK-NEXT:    br label [[LOOP_END]]
 ; CHECK:       scalar.ph:
diff --git a/llvm/test/Transforms/LoopVectorize/single-early-exit-interleave-hint.ll b/llvm/test/Transforms/LoopVectorize/single-early-exit-interleave-hint.ll
index de8a3c5a8eaf2..6c2ae2048cf7f 100644
--- a/llvm/test/Transforms/LoopVectorize/single-early-exit-interleave-hint.ll
+++ b/llvm/test/Transforms/LoopVectorize/single-early-exit-interleave-hint.ll
@@ -1,13 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
-; REQUIRES: asserts
-; RUN: opt -p loop-vectorize -enable-early-exit-vectorization -force-vector-width=4 \
-; RUN:   -debug-only=loop-vectorize -S %s 2>%t | FileCheck --check-prefix=VF4IC4 %s
-; RUN: cat %t | FileCheck --check-prefix=DEBUG %s
+; RUN: opt -p loop-vectorize -enable-early-exit-vectorization -force-vector-width=4 -S %s | FileCheck --check-prefix=VF4IC4 %s
 
 declare void @init_mem(ptr, i64);
 
-; DEBUG: Interleaving not supported for loops with uncountable early exits
-
 define i64 @multi_exiting_to_different_exits_live_in_exit_values() {
 ; VF4IC4-LABEL: define i64 @multi_exiting_to_different_exits_live_in_exit_values() {
 ; VF4IC4-NEXT:  [[ENTRY:.*]]:
@@ -20,10 +15,22 @@ define i64 @multi_exiting_to_different_exits_live_in_exit_values() {
 ; VF4IC4-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
 ; VF4IC4-NEXT:    [[TMP0:%.*]] = getelementptr inbounds i32, ptr [[SRC]], i64 [[INDEX]]
 ; VF4IC4-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i32 0
+; VF4IC4-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i32 4
+; VF4IC4-NEXT:    [[TMP13:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i32 8
+; VF4IC4-NEXT:    [[TMP14:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i32 12
 ; VF4IC4-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP1]], align 4
+; VF4IC4-NEXT:    [[WIDE_LOAD1:%.*]] = load <4 x i32>, ptr [[TMP12]], align 4
+; VF4IC4-NEXT:    [[WIDE_LOAD2:%.*]] = load <4 x i32>, ptr [[TMP13]], align 4
+; VF4IC4-NEXT:    [[WIDE_LOAD3:%.*]] = load <4 x i32>, ptr [[TMP14]], align 4
 ; VF4IC4-NEXT:    [[TMP2:%.*]] = icmp eq <4 x i32> [[WIDE_LOAD]], splat (i32 10)
-; VF4IC4-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
-; VF4IC4-NEXT:    [[TMP3:%.*]] = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> [[TMP2]])
+; VF4IC4-NEXT:    [[TMP6:%.*]] = icmp eq <4 x i32> [[WIDE_LOAD1]], splat (i32 10)
+; VF4IC4-NEXT:    [[TMP7:%.*]] = icmp eq <4 x i32> [[WIDE_LOAD2]], splat (i32 10)
+; VF4IC4-NEXT:    [[TMP8:%.*]] = icmp eq <4 x i32> [[WIDE_LOAD3]], splat (i32 10)
+; VF4IC4-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
+; VF4IC4-NEXT:    [[TMP9:%.*]] = or <4 x i1> [[TMP2]], [[TMP6]]
+; VF4IC4-NEXT:    [[TMP10:%.*]] = or <4 x i1> [[TMP9]], [[TMP7]]
+; VF4IC4-NEXT:    [[TMP11:%.*]] = or <4 x i1> [[TMP10]], [[TMP8]]
+; VF4IC4-NEXT:    [[TMP3:%.*]] = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> [[TMP11]])
 ; VF4IC4-NEXT:    [[TMP4:%.*]] = icmp eq i64 [[INDEX_NEXT]], 128
 ; VF4IC4-NEXT:    [[TMP5:%.*]] = or i1 [[TMP3]], [[TMP4]]
 ; VF4IC4-NEXT:    br i1 [[TMP5]], label %[[MIDDLE_SPLIT:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
diff --git a/llvm/test/Transforms/LoopVectorize/single-early-exit-interleave.ll b/llvm/test/Transforms/LoopVectorize/single-early-exit-interleave.ll
index 0f99ed576f1fe..6b758785f2512 100644
--- a/llvm/test/Transforms/LoopVectorize/single-early-exit-interleave.ll
+++ b/llvm/test/Transforms/LoopVectorize/single-early-exit-interleave.ll
@@ -15,10 +15,22 @@ define i64 @multi_exiting_to_different_exits_live_in_exit_values() {
 ; VF4IC4-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; VF4IC4-NEXT:    [[TMP0:%.*]] = getelementptr inbounds i32, ptr [[SRC]], i64 [[INDEX]]
 ; VF4IC4-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i32 0
+; VF4IC4-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i32 4
+; VF4IC4-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i32 8
+; VF4IC4-NEXT:    [[TMP13:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i32 12
 ; VF4IC4-NEXT:    [[WIDE_LOAD3:%.*]] = load <4 x i32>, ptr [[TMP1]], align 4
+; VF4IC4-NEXT:    [[WIDE_LOAD1:%.*]] = load <4 x i32>, ptr [[TMP2]], align 4
+; VF4IC4-NEXT:    [[WIDE_LOAD2:%.*]] = load <4 x i32>, ptr [[TMP12]], align 4
+; VF4IC4-NEXT:    [[WIDE_LOAD4:%.*]] = load <4 x i32>, ptr [[TMP13]], align 4
 ; VF4IC4-NEXT:    [[TMP8:%.*]] = icmp eq <4 x i32> [[WIDE_LOAD3]], splat (i32 10)
-; VF4IC4-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
-; VF4IC4-NEXT:    [[TMP3:%.*]] = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> [[TMP8]])
+; VF4IC4-NEXT:    [[TMP6:%.*]] = icmp eq <4 x i32> [[WIDE_LOAD1]], splat (i32 10)
+; VF4IC4-NEXT:    [[TMP7:%.*]] = icmp eq <4 x i32> [[WIDE_LOAD2]], splat (i32 10)
+; VF4IC4-NEXT:    [[TMP14:%.*]] = icmp eq <4 x i32> [[WIDE_LOAD4]], splat (i32 10)
+; VF4IC4-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
+; VF4IC4-NEXT:    [[TMP9:%.*]] = or <4 x i1> [[TMP8]], [[TMP6]]
+; VF4IC4-NEXT:    [[TMP10:%.*]] = or <4 x i1> [[TMP9]], [[TMP7]]
+; VF4IC4-NEXT:    [[TMP11:%.*]] = or <4 x i1> [[TMP10]], [[TMP14]]
+; VF4IC4-NEXT:    [[TMP3:%.*]] = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> [[TMP11]])
 ; VF4IC4-NEXT:    [[TMP4:%.*]] = icmp eq i64 [[INDEX_NEXT]], 128
 ; VF4IC4-NEXT:    [[TMP5:%.*]] = or i1 [[TMP3]], [[TMP4]]
 ; VF4IC4-NEXT:    br i1 [[TMP5]], ...
[truncated]

fhahn added a commit to fhahn/llvm-project that referenced this pull request Jul 16, 2025
…(WIP)

Building on top of llvm#148817, use
ExtractLane + FirstActiveLane to support vectorizing external users when
tail-folding.

Currently marked as WIP as there is a regression when
-prefer-predicate-over-epilogue=predicate-else-scalar-epilogue is used
because we bail out when building VPlans, so cannot recover and switch
to non-tail-folding.

Ideally we would have built both VPlans
(llvm#148882).
fhahn added a commit to fhahn/llvm-project that referenced this pull request Jul 16, 2025
…(WIP)

Building on top of llvm#148817, use
ExtractLane + FirstActiveLane to support vectorizing external users when
tail-folding.

Currently marked as WIP as there is a regression when
-prefer-predicate-over-epilogue=predicate-else-scalar-epilogue is used
because we bail out when building VPlans, so cannot recover and switch
to non-tail-folding.

Ideally we would have built both VPlans
(llvm#148882).

See also llvm#148603

Depends on llvm#148817 (included in
the PR).
fhahn added a commit to fhahn/llvm-project that referenced this pull request Jul 16, 2025
…(WIP)

Building on top of llvm#148817, use
ExtractLane + FirstActiveLane to support vectorizing external users when
tail-folding.

Currently marked as WIP as there is a regression when
-prefer-predicate-over-epilogue=predicate-else-scalar-epilogue is used
because we bail out when building VPlans, so cannot recover and switch
to non-tail-folding.

Ideally we would have built both VPlans
(llvm#148882).

See also llvm#148603

Depends on llvm#148817 (included in
the PR).
Copy link
Contributor

@david-arm david-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @fhahn, sorry to do this but I think the lane calculation when interleaving is broken at the moment.

; CHECK-NEXT: [[TMP44:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> [[TMP31]], i1 true)
; CHECK-NEXT: [[TMP45:%.*]] = mul i64 [[TMP40]], 2
; CHECK-NEXT: [[TMP46:%.*]] = add i64 [[TMP45]], [[TMP44]]
; CHECK-NEXT: [[TMP47:%.*]] = icmp ne i64 [[TMP44]], [[TMP40]]
Copy link
Contributor

@david-arm david-arm Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realised this IR is broken I think. Since the cttz intrinsics have been split up into parts you can have situations like this:

[[TMP41:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> [[TMP59]], i1 true) -> 3
and
[[TMP44:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> [[TMP31]], i1 true) -> 7

which is because there can be multiple triggering exit conditions, but the first exit condition was in part 1. Therefore, it's not valid to use the condition [[TMP44]] != [[TMP40]] for selecting between [[TMP46]] and [[TMP43]]. I think really this should be

; CHECK-NEXT:    [[TMP47:%.*]] = icmp eq i64 [[TMP41]], [[TMP40]]
; CHECK-NEXT:    [[TMP48:%.*]] = select i1 [[TMP47]], i64 [[TMP46]], i64 [[TMP43]]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or something like that!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I might be missing something, but I think we should check & select the parts in reverse order, so that the final select (TMP58), picks the offset based on part 0, if part 0 has an active lane, otherwise select offset for part 1 if it has an active lane and so on. Some of the check pattern variables don't have consecutive numbers, can be a bit confusing.

For the example, below is what I think should happen, but I am not sure if I missed anything.

[[TMP41] should handle the mask for part 3 ([[TMP59]] should be the compare for the last part)
[[TMP44]] should handle the mask for part 2 ([[TMP31]] should be the compare for the second-to-last part)

[[TMP47]] should be true if there is an active lane in part 2. (llvm.experimental.cttz.elts should return the number of vector elements if no lane is active)

[[TMP47:%.*]] = icmp ne i64 [[TMP44]], [[TMP40]]
[[TMP48:%.*]] = select i1 [[TMP47]], i64 [[TMP46]], i64 [[TMP43]]

So [[TMP48]] should select [[TMP46], if there's an active lane for part 2, which is the count based on the active lanes in part 2. Only if there's no active lane in part 2 we should select the one based on part 3.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, my bad. I got confused because the IR for each part is being generated in reverse order. I'll remove the "request changes"!

@fhahn fhahn force-pushed the vplan-extract-lane-early-exit-interleave branch from 0b5ce1a to f71b4ec Compare July 23, 2025 14:18
@david-arm david-arm self-requested a review July 23, 2025 14:35
@fhahn fhahn force-pushed the vplan-extract-lane-early-exit-interleave branch from f71b4ec to efa987c Compare July 25, 2025 09:32
Copy link
Contributor

@lukel97 lukel97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. With only one vector this behaves like a ExtractElement VPInstruction right? Is it possible to replace ExtractElement with this later?

fhahn added 3 commits July 26, 2025 20:33
This patch adds a new ExtractLane VPInstruction which extracts across
multiple parts using a wide index, to be used in combination with
FirstActiveLane.

The patch updates early-exit codegen to use it instead ExtractElement,
which is only per-part. With this change, interleaving should work
correctly with early-exit loops.

The patch removes the restrictions added in 6f43754 (llvm#145877), but
does not yet automatically select interleave counts > 1 for early-exit
loops.

I'll share a patch as follow-up. The cost of extracting a lane adds
non-trivial overhead in the exit block, so that should be considered
when picking the interleave count.
@fhahn fhahn force-pushed the vplan-extract-lane-early-exit-interleave branch from efa987c to 58b39b0 Compare July 26, 2025 19:42
@fhahn
Copy link
Contributor Author

fhahn commented Jul 27, 2025

LGTM. With only one vector this behaves like a ExtractElement VPInstruction right? Is it possible to replace ExtractElement with this later?

Yep that should be possible, will look into this a potential follow-up, thanks!

@fhahn fhahn merged commit 80c43b6 into llvm:main Jul 27, 2025
9 checks passed
@fhahn fhahn deleted the vplan-extract-lane-early-exit-interleave branch July 27, 2025 07:08
@llvm-ci
Copy link
Collaborator

llvm-ci commented Jul 27, 2025

LLVM Buildbot has detected a new failure on builder fuchsia-x86_64-linux running on fuchsia-debian-64-us-central1-b-1 while building llvm at step 4 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/11/builds/20470

Here is the relevant piece of the build log for the reference
Step 4 (annotate) failure: 'python ../llvm-zorg/zorg/buildbot/builders/annotated/fuchsia-linux.py ...' (failure)
...
[473/2522] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.issignaling.dir/issignaling.cpp.obj
[474/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_width_uc.dir/stdc_bit_width_uc.cpp.obj
[475/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_width_ui.dir/stdc_bit_width_ui.cpp.obj
[476/2522] Building CXX object libc/src/inttypes/CMakeFiles/libc.src.inttypes.imaxdiv.dir/imaxdiv.cpp.obj
[477/2522] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.issignalingf.dir/issignalingf.cpp.obj
[478/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_floor_uc.dir/stdc_bit_floor_uc.cpp.obj
[479/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_floor_us.dir/stdc_bit_floor_us.cpp.obj
[480/2522] Building CXX object libc/src/__support/CMakeFiles/libc.src.__support.freelist.dir/freelist.cpp.obj
[481/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_floor_ul.dir/stdc_bit_floor_ul.cpp.obj
[482/2522] Building CXX object libc/src/stdio/CMakeFiles/libc.src.stdio.sscanf.dir/sscanf.cpp.obj
FAILED: libc/src/stdio/CMakeFiles/libc.src.stdio.sscanf.dir/sscanf.cpp.obj 
/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-hblaw1l3/./bin/clang++ --target=armv8m.main-none-eabi -DLIBC_NAMESPACE=__llvm_libc_22_0_0_git -I/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc -isystem /var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-hblaw1l3/include/armv8m.main-unknown-none-eabi --target=armv8m.main-none-eabi -Wno-atomic-alignment "-Dvfprintf(stream, format, vlist)=vprintf(format, vlist)" "-Dfprintf(stream, format, ...)=printf(format)" "-Dfputs(string, stream)=puts(string)" -D_LIBCPP_PRINT=1 -mthumb -mfloat-abi=softfp -march=armv8m.main+fp+dsp -mcpu=cortex-m33 -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -ffunction-sections -fdata-sections -ffile-prefix-map=/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-hblaw1l3/runtimes/runtimes-armv8m.main-none-eabi-bins=../../../../llvm-project -ffile-prefix-map=/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/= -no-canonical-prefixes -Os -DNDEBUG --target=armv8m.main-none-eabi -DLIBC_QSORT_IMPL=LIBC_QSORT_HEAP_SORT -DLIBC_TYPES_TIME_T_IS_32_BIT -DLIBC_ADD_NULL_CHECKS "-DLIBC_MATH=(LIBC_MATH_SKIP_ACCURATE_PASS | LIBC_MATH_SMALL_TABLES)" -DLIBC_ERRNO_MODE=LIBC_ERRNO_MODE_EXTERNAL -fpie -ffreestanding -DLIBC_FULL_BUILD -nostdlibinc -ffixed-point -fno-builtin -fno-exceptions -fno-lax-vector-conversions -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-rtti -ftrivial-auto-var-init=pattern -fno-omit-frame-pointer -Wall -Wextra -Werror -Wconversion -Wno-sign-conversion -Wdeprecated -Wno-c99-extensions -Wno-gnu-imaginary-constant -Wno-pedantic -Wimplicit-fallthrough -Wwrite-strings -Wextra-semi -Wnewline-eof -Wnonportable-system-include-path -Wstrict-prototypes -Wthread-safety -Wglobal-constructors -DLIBC_COPT_PUBLIC_PACKAGING -MD -MT libc/src/stdio/CMakeFiles/libc.src.stdio.sscanf.dir/sscanf.cpp.obj -MF libc/src/stdio/CMakeFiles/libc.src.stdio.sscanf.dir/sscanf.cpp.obj.d -o libc/src/stdio/CMakeFiles/libc.src.stdio.sscanf.dir/sscanf.cpp.obj -c /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdio/sscanf.cpp
In file included from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdio/sscanf.cpp:14:
In file included from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdio/scanf_core/scanf_main.h:14:
In file included from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdio/scanf_core/converter.h:15:
In file included from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdio/scanf_core/core_structs.h:16:
/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-hblaw1l3/./lib/clang/22/include/inttypes.h:24:15: fatal error: 'inttypes.h' file not found
   24 | #include_next <inttypes.h>
      |               ^~~~~~~~~~~~
1 error generated.
[483/2522] Building CXX object libc/src/inttypes/CMakeFiles/libc.src.inttypes.imaxabs.dir/imaxabs.cpp.obj
[484/2522] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.issignalingl.dir/issignalingl.cpp.obj
[485/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_width_ull.dir/stdc_bit_width_ull.cpp.obj
[486/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_floor_ui.dir/stdc_bit_floor_ui.cpp.obj
[487/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_floor_ull.dir/stdc_bit_floor_ull.cpp.obj
[488/2522] Building CXX object libc/src/__support/CMakeFiles/libc.src.__support.freetrie.dir/freetrie.cpp.obj
[489/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_ceil_uc.dir/stdc_bit_ceil_uc.cpp.obj
[490/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_ceil_ull.dir/stdc_bit_ceil_ull.cpp.obj
[491/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_ceil_ui.dir/stdc_bit_ceil_ui.cpp.obj
[492/2522] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.iscanonicalf.dir/iscanonicalf.cpp.obj
[493/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_zero_uc.dir/stdc_first_leading_zero_uc.cpp.obj
[494/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_zero_us.dir/stdc_first_leading_zero_us.cpp.obj
[495/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_ceil_ul.dir/stdc_bit_ceil_ul.cpp.obj
[496/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_zero_ul.dir/stdc_first_leading_zero_ul.cpp.obj
[497/2522] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.iscanonical.dir/iscanonical.cpp.obj
[498/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_ceil_us.dir/stdc_bit_ceil_us.cpp.obj
[499/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_zero_ui.dir/stdc_first_leading_zero_ui.cpp.obj
[500/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_one_us.dir/stdc_first_leading_one_us.cpp.obj
[501/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_one_ui.dir/stdc_first_leading_one_ui.cpp.obj
[502/2522] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.iscanonicall.dir/iscanonicall.cpp.obj
[503/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_trailing_zero_ui.dir/stdc_first_trailing_zero_ui.cpp.obj
[504/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_one_ull.dir/stdc_first_leading_one_ull.cpp.obj
[505/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_one_uc.dir/stdc_first_leading_one_uc.cpp.obj
[506/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_zero_ull.dir/stdc_first_leading_zero_ull.cpp.obj
[507/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_trailing_zero_us.dir/stdc_first_trailing_zero_us.cpp.obj
[508/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_trailing_zero_uc.dir/stdc_first_trailing_zero_uc.cpp.obj
[509/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_one_ul.dir/stdc_first_leading_one_ul.cpp.obj
[510/2522] Building CXX object libc/src/__support/CMakeFiles/libc.src.__support.freelist_heap.dir/freelist_heap.cpp.obj
[511/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_trailing_one_us.dir/stdc_first_trailing_one_us.cpp.obj
Step 6 (build) failure: build (failure)
...
[473/2522] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.issignaling.dir/issignaling.cpp.obj
[474/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_width_uc.dir/stdc_bit_width_uc.cpp.obj
[475/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_width_ui.dir/stdc_bit_width_ui.cpp.obj
[476/2522] Building CXX object libc/src/inttypes/CMakeFiles/libc.src.inttypes.imaxdiv.dir/imaxdiv.cpp.obj
[477/2522] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.issignalingf.dir/issignalingf.cpp.obj
[478/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_floor_uc.dir/stdc_bit_floor_uc.cpp.obj
[479/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_floor_us.dir/stdc_bit_floor_us.cpp.obj
[480/2522] Building CXX object libc/src/__support/CMakeFiles/libc.src.__support.freelist.dir/freelist.cpp.obj
[481/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_floor_ul.dir/stdc_bit_floor_ul.cpp.obj
[482/2522] Building CXX object libc/src/stdio/CMakeFiles/libc.src.stdio.sscanf.dir/sscanf.cpp.obj
FAILED: libc/src/stdio/CMakeFiles/libc.src.stdio.sscanf.dir/sscanf.cpp.obj 
/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-hblaw1l3/./bin/clang++ --target=armv8m.main-none-eabi -DLIBC_NAMESPACE=__llvm_libc_22_0_0_git -I/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc -isystem /var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-hblaw1l3/include/armv8m.main-unknown-none-eabi --target=armv8m.main-none-eabi -Wno-atomic-alignment "-Dvfprintf(stream, format, vlist)=vprintf(format, vlist)" "-Dfprintf(stream, format, ...)=printf(format)" "-Dfputs(string, stream)=puts(string)" -D_LIBCPP_PRINT=1 -mthumb -mfloat-abi=softfp -march=armv8m.main+fp+dsp -mcpu=cortex-m33 -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -ffunction-sections -fdata-sections -ffile-prefix-map=/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-hblaw1l3/runtimes/runtimes-armv8m.main-none-eabi-bins=../../../../llvm-project -ffile-prefix-map=/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/= -no-canonical-prefixes -Os -DNDEBUG --target=armv8m.main-none-eabi -DLIBC_QSORT_IMPL=LIBC_QSORT_HEAP_SORT -DLIBC_TYPES_TIME_T_IS_32_BIT -DLIBC_ADD_NULL_CHECKS "-DLIBC_MATH=(LIBC_MATH_SKIP_ACCURATE_PASS | LIBC_MATH_SMALL_TABLES)" -DLIBC_ERRNO_MODE=LIBC_ERRNO_MODE_EXTERNAL -fpie -ffreestanding -DLIBC_FULL_BUILD -nostdlibinc -ffixed-point -fno-builtin -fno-exceptions -fno-lax-vector-conversions -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-rtti -ftrivial-auto-var-init=pattern -fno-omit-frame-pointer -Wall -Wextra -Werror -Wconversion -Wno-sign-conversion -Wdeprecated -Wno-c99-extensions -Wno-gnu-imaginary-constant -Wno-pedantic -Wimplicit-fallthrough -Wwrite-strings -Wextra-semi -Wnewline-eof -Wnonportable-system-include-path -Wstrict-prototypes -Wthread-safety -Wglobal-constructors -DLIBC_COPT_PUBLIC_PACKAGING -MD -MT libc/src/stdio/CMakeFiles/libc.src.stdio.sscanf.dir/sscanf.cpp.obj -MF libc/src/stdio/CMakeFiles/libc.src.stdio.sscanf.dir/sscanf.cpp.obj.d -o libc/src/stdio/CMakeFiles/libc.src.stdio.sscanf.dir/sscanf.cpp.obj -c /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdio/sscanf.cpp
In file included from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdio/sscanf.cpp:14:
In file included from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdio/scanf_core/scanf_main.h:14:
In file included from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdio/scanf_core/converter.h:15:
In file included from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdio/scanf_core/core_structs.h:16:
/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-hblaw1l3/./lib/clang/22/include/inttypes.h:24:15: fatal error: 'inttypes.h' file not found
   24 | #include_next <inttypes.h>
      |               ^~~~~~~~~~~~
1 error generated.
[483/2522] Building CXX object libc/src/inttypes/CMakeFiles/libc.src.inttypes.imaxabs.dir/imaxabs.cpp.obj
[484/2522] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.issignalingl.dir/issignalingl.cpp.obj
[485/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_width_ull.dir/stdc_bit_width_ull.cpp.obj
[486/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_floor_ui.dir/stdc_bit_floor_ui.cpp.obj
[487/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_floor_ull.dir/stdc_bit_floor_ull.cpp.obj
[488/2522] Building CXX object libc/src/__support/CMakeFiles/libc.src.__support.freetrie.dir/freetrie.cpp.obj
[489/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_ceil_uc.dir/stdc_bit_ceil_uc.cpp.obj
[490/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_ceil_ull.dir/stdc_bit_ceil_ull.cpp.obj
[491/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_ceil_ui.dir/stdc_bit_ceil_ui.cpp.obj
[492/2522] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.iscanonicalf.dir/iscanonicalf.cpp.obj
[493/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_zero_uc.dir/stdc_first_leading_zero_uc.cpp.obj
[494/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_zero_us.dir/stdc_first_leading_zero_us.cpp.obj
[495/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_ceil_ul.dir/stdc_bit_ceil_ul.cpp.obj
[496/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_zero_ul.dir/stdc_first_leading_zero_ul.cpp.obj
[497/2522] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.iscanonical.dir/iscanonical.cpp.obj
[498/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_bit_ceil_us.dir/stdc_bit_ceil_us.cpp.obj
[499/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_zero_ui.dir/stdc_first_leading_zero_ui.cpp.obj
[500/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_one_us.dir/stdc_first_leading_one_us.cpp.obj
[501/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_one_ui.dir/stdc_first_leading_one_ui.cpp.obj
[502/2522] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.iscanonicall.dir/iscanonicall.cpp.obj
[503/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_trailing_zero_ui.dir/stdc_first_trailing_zero_ui.cpp.obj
[504/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_one_ull.dir/stdc_first_leading_one_ull.cpp.obj
[505/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_one_uc.dir/stdc_first_leading_one_uc.cpp.obj
[506/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_zero_ull.dir/stdc_first_leading_zero_ull.cpp.obj
[507/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_trailing_zero_us.dir/stdc_first_trailing_zero_us.cpp.obj
[508/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_trailing_zero_uc.dir/stdc_first_trailing_zero_uc.cpp.obj
[509/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_leading_one_ul.dir/stdc_first_leading_one_ul.cpp.obj
[510/2522] Building CXX object libc/src/__support/CMakeFiles/libc.src.__support.freelist_heap.dir/freelist_heap.cpp.obj
[511/2522] Building CXX object libc/src/stdbit/CMakeFiles/libc.src.stdbit.stdc_first_trailing_one_us.dir/stdc_first_trailing_one_us.cpp.obj

llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Jul 27, 2025
…parts. (#148817)

This patch adds a new ExtractLane VPInstruction which extracts across
multiple parts using a wide index, to be used in combination with
FirstActiveLane.

The patch updates early-exit codegen to use it instead ExtractElement,
which is only per-part. With this change, interleaving should work
correctly with early-exit loops.

The patch removes the restrictions added in 6f43754 (#145877), but
does not yet automatically select interleave counts > 1 for early-exit
loops.

I'll share a patch as follow-up. The cost of extracting a lane adds
non-trivial overhead in the exit block, so that should be considered
when picking the interleave count.

PR: llvm/llvm-project#148817
mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Jul 28, 2025
…m#148817)

This patch adds a new ExtractLane VPInstruction which extracts across
multiple parts using a wide index, to be used in combination with
FirstActiveLane.

The patch updates early-exit codegen to use it instead ExtractElement,
which is only per-part. With this change, interleaving should work
correctly with early-exit loops.

The patch removes the restrictions added in 6f43754 (llvm#145877), but
does not yet automatically select interleave counts > 1 for early-exit
loops.

I'll share a patch as follow-up. The cost of extracting a lane adds
non-trivial overhead in the exit block, so that should be considered
when picking the interleave count.

PR: llvm#148817
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants