
Conversation

@sjoerdmeijer
Collaborator

@sjoerdmeijer sjoerdmeijer commented May 12, 2025

Skip the Pre-RA MachineScheduler for large hand-written vector intrinsic code when targeting the Neoverse V2. The motivation to skip the scheduler is the same as for this abandoned patch: #127784. To quickly recap, we would like to disable the pre-RA machine scheduler for the Neoverse V2 because we have a key workload that massively benefits from this (a 25% uplift). Despite the machine scheduler being register-pressure aware, for this workload it places spills/reloads in apparently the wrong places.

But this reimplementation is much more focused and fine-grained, and is based on the following heuristic:

  • only skip the pre-RA machine scheduler for large (hand-written) vector intrinsic code,
  • do this only for the Neoverse V2 (a wide micro-architecture).

The intuition behind this patch is that:

  • scheduling based on instruction latency isn't useful for a very wide micro-architecture (which is why GCC also partly stopped doing this),
  • however, the machine scheduler also performs some optimisations: i) load/store clustering, and ii) copy elimination. These are useful optimisations, which is why disabling the machine scheduler in general isn't a good idea, i.e. it results in some regressions.
  • but the function where the machine scheduler and register allocator are not working well together is large, hand-written vector code. Thus, one could argue that scheduling this kind of code goes against the programmer's intent, so let's not do that, which avoids complications further down the optimisation pipeline.

The heuristic tries to recognise large hand-written intrinsic code by calculating the percentage of vector instructions relative to all instructions in a function, and it skips the machine scheduler if certain threshold values are exceeded. I.e., if a function is more than 70% vector code and contains more than 2800 IR instructions and more than 425 intrinsics, don't schedule this function.
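Concretely, the check boils down to the following, condensed from the MachineScheduler.cpp change in the diff below (the cl::opt thresholds default to 2800, 425 and 70):

  // Condensed form of the skip decision; see the full patch below.
  unsigned VecDensity = (VectorTypeCount / (double)InstructionCount) * 100;
  bool SkipPreRASched = InstructionCount > LargeFunctionThreshold &&       // > 2800
                        IntrinsicCount > NbOfIntrinsicsThreshold &&        // > 425
                        VecDensity > VectorCodeDensityPercentageThreshold; // > 70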

This obviously is a heuristic, but it is hopefully narrow enough not to cause regressions (I haven't found any). The alternative is to look into regalloc, which is where the problems with the placement of spill/reload code occur. However, there will be heuristics involved there too, so this seems like a valid heuristic, and looking into regalloc is an orthogonal exercise.

@llvmbot llvmbot added the backend:AArch64 and llvm:analysis labels May 12, 2025
@llvmbot
Member

llvmbot commented May 12, 2025

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-backend-aarch64

Author: Sjoerd Meijer (sjoerdmeijer)


Full diff: https://github.com/llvm/llvm-project/pull/139557.diff

9 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+4)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
  • (modified) llvm/include/llvm/CodeGen/TargetSubtargetInfo.h (+4)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
  • (modified) llvm/lib/CodeGen/MachineScheduler.cpp (+60)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.cpp (+2)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.h (+7)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h (+4)
  • (added) llvm/test/CodeGen/AArch64/skip-misched-large-vec-func.ll (+94)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index f4f66447d1c3d..42a1025e10024 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1019,6 +1019,10 @@ class TargetTransformInfo {
   /// Enable matching of interleaved access groups.
   bool enableInterleavedAccessVectorization() const;
 
+  /// Disable the machine scheduler for a large function with a lot of
+  /// (hand-written) vector code and intrinsics.
+  bool skipPreRASchedLargeVecFunc() const;
+
   /// Enable matching of interleaved access groups that contain predicated
   /// accesses or gaps and therefore vectorized using masked
   /// vector loads/stores.
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 02d6435e61b4d..8d8f02338a3b0 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -499,6 +499,8 @@ class TargetTransformInfoImplBase {
 
   virtual bool enableInterleavedAccessVectorization() const { return false; }
 
+  virtual bool skipPreRASchedLargeVecFunc() const { return false; }
+
   virtual bool enableMaskedInterleavedAccessVectorization() const {
     return false;
   }
diff --git a/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h b/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
index 1230349956973..ab901f969f948 100644
--- a/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
@@ -184,6 +184,10 @@ class TargetSubtargetInfo : public MCSubtargetInfo {
     return false;
   }
 
+  virtual bool enableSkipPreRASchedLargeVecFunc() const {
+    return false;
+  }
+
   /// True if the subtarget should run MachineScheduler after aggressive
   /// coalescing.
   ///
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 3ced70e113bf7..1422cfcdcb762 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -677,6 +677,10 @@ bool TargetTransformInfo::enableInterleavedAccessVectorization() const {
   return TTIImpl->enableInterleavedAccessVectorization();
 }
 
+bool TargetTransformInfo::skipPreRASchedLargeVecFunc() const {
+  return TTIImpl->skipPreRASchedLargeVecFunc();
+}
+
 bool TargetTransformInfo::enableMaskedInterleavedAccessVectorization() const {
   return TTIImpl->enableMaskedInterleavedAccessVectorization();
 }
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index 0c3ffb1bbaa6f..83dc71c880cb8 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -21,6 +21,7 @@
 #include "llvm/ADT/Statistic.h"
 #include "llvm/ADT/iterator_range.h"
 #include "llvm/Analysis/AliasAnalysis.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/CodeGen/LiveInterval.h"
 #include "llvm/CodeGen/LiveIntervals.h"
 #include "llvm/CodeGen/MachineBasicBlock.h"
@@ -49,6 +50,7 @@
 #include "llvm/CodeGen/TargetSubtargetInfo.h"
 #include "llvm/CodeGenTypes/MachineValueType.h"
 #include "llvm/Config/llvm-config.h"
+#include "llvm/IR/IntrinsicInst.h"
 #include "llvm/InitializePasses.h"
 #include "llvm/MC/LaneBitmask.h"
 #include "llvm/Pass.h"
@@ -110,6 +112,21 @@ cl::opt<bool> VerifyScheduling(
     "verify-misched", cl::Hidden,
     cl::desc("Verify machine instrs before and after machine scheduling"));
 
+// Heuristics for skipping pre-RA machine scheduling for large functions
+// containing (hand-written) intrinsic vector code.
+cl::opt<unsigned> LargeFunctionThreshold(
+    "misched-large-func-threshold", cl::Hidden, cl::init(2800),
+    cl::desc("The minimum number of IR instructions in a large (hand-written) "
+             "intrinsic vector code function"));
+cl::opt<unsigned> NbOfIntrinsicsThreshold(
+    "misched-intrinsics-threshold", cl::Hidden, cl::init(425),
+    cl::desc("The minimum number of intrinsic instructions in a large "
+             "(hand-written) intrinsic vector code function"));
+cl::opt<unsigned> VectorCodeDensityPercentageThreshold(
+    "misched-vector-density-threshold", cl::Hidden, cl::init(70),
+    cl::desc("Minimum percentage of vector instructions compared to scalar in "
+             "a large (hand-written) intrinsic vector code function"));
+
 #ifndef NDEBUG
 cl::opt<bool> ViewMISchedDAGs(
     "view-misched-dags", cl::Hidden,
@@ -319,6 +336,7 @@ INITIALIZE_PASS_DEPENDENCY(MachineDominatorTreeWrapperPass)
 INITIALIZE_PASS_DEPENDENCY(MachineLoopInfoWrapperPass)
 INITIALIZE_PASS_DEPENDENCY(SlotIndexesWrapperPass)
 INITIALIZE_PASS_DEPENDENCY(LiveIntervalsWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
 INITIALIZE_PASS_END(MachineSchedulerLegacy, DEBUG_TYPE,
                     "Machine Instruction Scheduler", false, false)
 
@@ -336,6 +354,7 @@ void MachineSchedulerLegacy::getAnalysisUsage(AnalysisUsage &AU) const {
   AU.addPreserved<SlotIndexesWrapperPass>();
   AU.addRequired<LiveIntervalsWrapperPass>();
   AU.addPreserved<LiveIntervalsWrapperPass>();
+  AU.addRequired<TargetTransformInfoWrapperPass>();
   MachineFunctionPass::getAnalysisUsage(AU);
 }
 
@@ -557,6 +576,47 @@ bool MachineSchedulerLegacy::runOnMachineFunction(MachineFunction &MF) {
     return false;
   }
 
+  // Try to recognise large hand-written intrinsic vector code, and skip the
+  // machine scheduler for this function if the target and TTI hook are okay
+  // with this.
+  const TargetSubtargetInfo &STI = MF.getSubtarget();
+  const MCSchedModel &SchedModel = STI.getSchedModel();
+  auto &TTI = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(MF.getFunction());
+
+  if (TTI.skipPreRASchedLargeVecFunc()) {
+    uint64_t InstructionCount = 0;
+    uint64_t IntrinsicCount = 0;
+    uint64_t VectorTypeCount = 0;
+    for (auto &BB : MF.getFunction()) {
+      for (Instruction &I : BB) {
+       InstructionCount++;
+       if (isa<IntrinsicInst>(I))
+         IntrinsicCount++;
+       Type *T = I.getType();
+       if (T && T->isVectorTy())
+         VectorTypeCount++;
+      }
+    }
+
+    unsigned VecDensity = (VectorTypeCount / (double) InstructionCount) * 100;
+
+    LLVM_DEBUG(dbgs() << "Instruction count: " << InstructionCount << ", ";
+               dbgs() << "threshold: " << LargeFunctionThreshold << "\n";
+               dbgs() << "Intrinsic count: " << IntrinsicCount << ", ";
+               dbgs() << "threshold: " << NbOfIntrinsicsThreshold << "\n";
+               dbgs() << "Vector density: " << VecDensity << ", ";
+               dbgs() << "threshold: " << VectorCodeDensityPercentageThreshold
+                      << "\n";);
+
+    if (InstructionCount > LargeFunctionThreshold &&
+        IntrinsicCount > NbOfIntrinsicsThreshold &&
+        VecDensity > VectorCodeDensityPercentageThreshold) {
+      LLVM_DEBUG(
+          dbgs() << "Skipping MISched for very vector and intrinsic heavy code");
+      return false;
+    }
+  }
+
   LLVM_DEBUG(dbgs() << "Before MISched:\n"; MF.print(dbgs()));
 
   auto &MLI = getAnalysis<MachineLoopInfoWrapperPass>().getLI();
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
index 7b4ded6322098..7b1e26ba1fad2 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -268,6 +268,8 @@ void AArch64Subtarget::initializeProperties(bool HasMinSize) {
     MaxBytesForLoopAlignment = 16;
     break;
   case NeoverseV2:
+    SkipPreRASchedLargeVecFunc = true;
+    LLVM_FALLTHROUGH;
   case NeoverseV3:
     EpilogueVectorizationMinVF = 8;
     MaxInterleaveFactor = 4;
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.h b/llvm/lib/Target/AArch64/AArch64Subtarget.h
index f5ffc72cae537..5e1801e821e1b 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.h
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.h
@@ -71,6 +71,7 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
   unsigned MaxBytesForLoopAlignment = 0;
   unsigned MinimumJumpTableEntries = 4;
   unsigned MaxJumpTableSize = 0;
+  bool SkipPreRASchedLargeVecFunc = false;
 
   // ReserveXRegister[i] - X#i is not available as a general purpose register.
   BitVector ReserveXRegister;
@@ -160,6 +161,12 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
   bool enablePostRAScheduler() const override { return usePostRAScheduler(); }
   bool enableSubRegLiveness() const override { return EnableSubregLiveness; }
 
+  /// Returns true if the subtarget should consider skipping the pre-RA
+  /// machine scheduler for large (hand-written) intrinsic vector functions.
+  bool enableSkipPreRASchedLargeVecFunc() const override {
+    return SkipPreRASchedLargeVecFunc;
+  }
+
   bool enableMachinePipeliner() const override;
   bool useDFAforSMS() const override { return false; }
 
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index be6bca2225eac..8d26ec2b6149f 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -118,6 +118,10 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
 
   bool enableInterleavedAccessVectorization() const override { return true; }
 
+  bool skipPreRASchedLargeVecFunc() const override {
+    return ST->enableSkipPreRASchedLargeVecFunc();
+  }
+
   bool enableMaskedInterleavedAccessVectorization() const override {
     return ST->hasSVE();
   }
diff --git a/llvm/test/CodeGen/AArch64/skip-misched-large-vec-func.ll b/llvm/test/CodeGen/AArch64/skip-misched-large-vec-func.ll
new file mode 100644
index 0000000000000..93e9051ade118
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/skip-misched-large-vec-func.ll
@@ -0,0 +1,94 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=aarch64-linux-gnu -mcpu=neoverse-v1 | FileCheck %s --check-prefix=SCHED
+; RUN: llc < %s -mtriple=aarch64-linux-gnu -mcpu=neoverse-v2 | FileCheck %s --check-prefix=SCHED
+; RUN: llc < %s -mtriple=aarch64-linux-gnu -mcpu=neoverse-v2 -misched-large-func-threshold=30 -misched-intrinsics-threshold=2 -misched-vector-density-threshold=31 | FileCheck %s --check-prefix=NOSCHED
+; RUN: llc < %s -mtriple=aarch64-linux-gnu -mcpu=neoverse-v2 -misched-large-func-threshold=31 -misched-intrinsics-threshold=2 -misched-vector-density-threshold=31 | FileCheck %s --check-prefix=SCHED
+; RUN: llc < %s -mtriple=aarch64-linux-gnu -mcpu=neoverse-v2 -misched-large-func-threshold=30 -misched-intrinsics-threshold=3 -misched-vector-density-threshold=31 | FileCheck %s --check-prefix=SCHED
+; RUN: llc < %s -mtriple=aarch64-linux-gnu -mcpu=neoverse-v2 -misched-large-func-threshold=30 -misched-intrinsics-threshold=2 -misched-vector-density-threshold=32 | FileCheck %s --check-prefix=SCHED
+
+define void @test_fma_loop(ptr %ptr_a, ptr %ptr_b, ptr %ptr_c, ptr %ptr_out, i32 %n) {
+; SCHED-LABEL: test_fma_loop:
+; SCHED:       // %bb.0: // %entry
+; SCHED-NEXT:    cbz w4, .LBB0_2
+; SCHED-NEXT:    .p2align 5, , 16
+; SCHED-NEXT:  .LBB0_1: // %loop
+; SCHED-NEXT:    // =>This Inner Loop Header: Depth=1
+; SCHED-NEXT:    ldr q0, [x0], #16
+; SCHED-NEXT:    ldp q1, q2, [x1]
+; SCHED-NEXT:    subs w4, w4, #1
+; SCHED-NEXT:    ldp q3, q4, [x2]
+; SCHED-NEXT:    fmla v3.4s, v1.4s, v0.4s
+; SCHED-NEXT:    ldr q0, [x1, #32]
+; SCHED-NEXT:    ldr q1, [x2, #32]
+; SCHED-NEXT:    add x1, x1, #48
+; SCHED-NEXT:    add x2, x2, #48
+; SCHED-NEXT:    fmla v4.4s, v2.4s, v3.4s
+; SCHED-NEXT:    fmla v1.4s, v0.4s, v4.4s
+; SCHED-NEXT:    str q1, [x3], #16
+; SCHED-NEXT:    b.ne .LBB0_1
+; SCHED-NEXT:  .LBB0_2: // %exit
+; SCHED-NEXT:    ret
+;
+; NOSCHED-LABEL: test_fma_loop:
+; NOSCHED:       // %bb.0: // %entry
+; NOSCHED-NEXT:    cbz w4, .LBB0_2
+; NOSCHED-NEXT:    .p2align 5, , 16
+; NOSCHED-NEXT:  .LBB0_1: // %loop
+; NOSCHED-NEXT:    // =>This Inner Loop Header: Depth=1
+; NOSCHED-NEXT:    ldr q0, [x0], #16
+; NOSCHED-NEXT:    ldr q1, [x1]
+; NOSCHED-NEXT:    ldr q2, [x2]
+; NOSCHED-NEXT:    subs w4, w4, #1
+; NOSCHED-NEXT:    fmla v2.4s, v1.4s, v0.4s
+; NOSCHED-NEXT:    ldp q0, q3, [x1, #16]
+; NOSCHED-NEXT:    ldp q1, q4, [x2, #16]
+; NOSCHED-NEXT:    add x1, x1, #48
+; NOSCHED-NEXT:    add x2, x2, #48
+; NOSCHED-NEXT:    fmla v1.4s, v0.4s, v2.4s
+; NOSCHED-NEXT:    fmla v4.4s, v3.4s, v1.4s
+; NOSCHED-NEXT:    str q4, [x3], #16
+; NOSCHED-NEXT:    b.ne .LBB0_1
+; NOSCHED-NEXT:  .LBB0_2: // %exit
+; NOSCHED-NEXT:    ret
+entry:
+  %cmp = icmp eq i32 %n, 0
+  br i1 %cmp, label %exit, label %loop
+
+loop:
+  %iv = phi i32 [ %n, %entry ], [ %iv.next, %loop ]
+  %ptr_a.addr = phi ptr [ %ptr_a, %entry ], [ %ptr_a.next, %loop ]
+  %ptr_b.addr = phi ptr [ %ptr_b, %entry ], [ %ptr_b.next, %loop ]
+  %ptr_c.addr = phi ptr [ %ptr_c, %entry ], [ %ptr_c.next, %loop ]
+  %ptr_out.addr = phi ptr [ %ptr_out, %entry ], [ %ptr_out.next, %loop ]
+
+  %a = load <4 x float>, ptr %ptr_a.addr
+  %b1 = load <4 x float>, ptr %ptr_b.addr
+  %c1 = load <4 x float>, ptr %ptr_c.addr
+  %res1 = call <4 x float> @llvm.fma.v4f32(<4 x float> %a, <4 x float> %b1, <4 x float> %c1)
+
+  %ptr_b2 = getelementptr <4 x float>, ptr %ptr_b.addr, i64 1
+  %ptr_c2 = getelementptr <4 x float>, ptr %ptr_c.addr, i64 1
+  %b2 = load <4 x float>, ptr %ptr_b2
+  %c2 = load <4 x float>, ptr %ptr_c2
+  %ptr_b3 = getelementptr <4 x float>, ptr %ptr_b.addr, i64 2
+  %ptr_c3 = getelementptr <4 x float>, ptr %ptr_c.addr, i64 2
+  %b3 = load <4 x float>, ptr %ptr_b3
+  %c3 = load <4 x float>, ptr %ptr_c3
+
+  %res2 = call <4 x float> @llvm.fma.v4f32(<4 x float> %res1, <4 x float> %b2, <4 x float> %c2)
+  %res3 = call <4 x float> @llvm.fma.v4f32(<4 x float> %res2, <4 x float> %b3, <4 x float> %c3)
+
+  store <4 x float> %res3, ptr %ptr_out.addr
+
+  %ptr_a.next = getelementptr <4 x float>, ptr %ptr_a.addr, i64 1
+  %ptr_b.next = getelementptr <4 x float>, ptr %ptr_b.addr, i64 3
+  %ptr_c.next = getelementptr <4 x float>, ptr %ptr_c.addr, i64 3
+  %ptr_out.next = getelementptr <4 x float>, ptr %ptr_out.addr, i64 1
+
+  %iv.next = sub i32 %iv, 1
+  %cmp.next = icmp ne i32 %iv.next, 0
+  br i1 %cmp.next, label %loop, label %exit
+
+exit:
+  ret void
+}

@github-actions

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions h,cpp -- llvm/include/llvm/Analysis/TargetTransformInfo.h llvm/include/llvm/Analysis/TargetTransformInfoImpl.h llvm/include/llvm/CodeGen/TargetSubtargetInfo.h llvm/lib/Analysis/TargetTransformInfo.cpp llvm/lib/CodeGen/MachineScheduler.cpp llvm/lib/Target/AArch64/AArch64Subtarget.cpp llvm/lib/Target/AArch64/AArch64Subtarget.h llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
View the diff from clang-format here.
diff --git a/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h b/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
index ab901f969..91530027f 100644
--- a/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
@@ -184,9 +184,7 @@ public:
     return false;
   }
 
-  virtual bool enableSkipPreRASchedLargeVecFunc() const {
-    return false;
-  }
+  virtual bool enableSkipPreRASchedLargeVecFunc() const { return false; }
 
   /// True if the subtarget should run MachineScheduler after aggressive
   /// coalescing.
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index ee1a058aa..901e4be9f 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -680,7 +680,8 @@ bool MachineSchedulerLegacy::runOnMachineFunction(MachineFunction &MF) {
   // with this.
   const TargetSubtargetInfo &STI = MF.getSubtarget();
   const MCSchedModel &SchedModel = STI.getSchedModel();
-  auto &TTI = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(MF.getFunction());
+  auto &TTI =
+      getAnalysis<TargetTransformInfoWrapperPass>().getTTI(MF.getFunction());
 
   if (TTI.skipPreRASchedLargeVecFunc()) {
     uint64_t InstructionCount = 0;
@@ -688,16 +689,16 @@ bool MachineSchedulerLegacy::runOnMachineFunction(MachineFunction &MF) {
     uint64_t VectorTypeCount = 0;
     for (auto &BB : MF.getFunction()) {
       for (Instruction &I : BB) {
-       InstructionCount++;
-       if (isa<IntrinsicInst>(I))
-         IntrinsicCount++;
-       Type *T = I.getType();
-       if (T && T->isVectorTy())
-         VectorTypeCount++;
+        InstructionCount++;
+        if (isa<IntrinsicInst>(I))
+          IntrinsicCount++;
+        Type *T = I.getType();
+        if (T && T->isVectorTy())
+          VectorTypeCount++;
       }
     }
 
-    unsigned VecDensity = (VectorTypeCount / (double) InstructionCount) * 100;
+    unsigned VecDensity = (VectorTypeCount / (double)InstructionCount) * 100;
 
     LLVM_DEBUG(dbgs() << "Instruction count: " << InstructionCount << ", ";
                dbgs() << "threshold: " << LargeFunctionThreshold << "\n";
@@ -711,7 +712,8 @@ bool MachineSchedulerLegacy::runOnMachineFunction(MachineFunction &MF) {
         IntrinsicCount > NbOfIntrinsicsThreshold &&
         VecDensity > VectorCodeDensityPercentageThreshold) {
       LLVM_DEBUG(
-          dbgs() << "Skipping MISched for very vector and intrinsic heavy code");
+          dbgs()
+          << "Skipping MISched for very vector and intrinsic heavy code");
       return false;
     }
   }

Comment on lines +586 to +599
  if (TTI.skipPreRASchedLargeVecFunc()) {
    uint64_t InstructionCount = 0;
    uint64_t IntrinsicCount = 0;
    uint64_t VectorTypeCount = 0;
    for (auto &BB : MF.getFunction()) {
      for (Instruction &I : BB) {
        InstructionCount++;
        if (isa<IntrinsicInst>(I))
          IntrinsicCount++;
        Type *T = I.getType();
        if (T && T->isVectorTy())
          VectorTypeCount++;
      }
    }
Contributor

Really should not be writing an IR-based heuristic in a machine pass. This is baking in a lot of assumptions about the architecture and how the IR will be lowered. You have better information from the current machine instructions.

Collaborator Author

I see what you mean, but I intentionally iterated over the IR to extract high-level information that is not available in MIR, i.e. the vector intrinsics have been lowered (to FMAs) in MIR and are no longer recognisable.
I can calculate the heuristic on the MIR too, but then I will have to change it and drop the number of intrinsics from the calculation, which then becomes "recognising a large and very vector-dense function". That is slightly less specific, but if it is acceptable, it's easy to implement, along the lines of the sketch below.
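For illustration, an MIR-based variant could look something like this rough sketch (untested; treating any register class of 128 bits or wider as "vector" is an assumption, and it reuses the thresholds from this patch minus the intrinsic count):

  // Sketch of an MIR-based density check (an untested assumption, not part
  // of this patch): count instructions that define a value in a wide
  // register class instead of counting IR vector types and intrinsics.
  static bool isLargeVectorDenseFunc(const MachineFunction &MF) {
    const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
    const MachineRegisterInfo &MRI = MF.getRegInfo();
    uint64_t InstCount = 0, VecCount = 0;
    for (const MachineBasicBlock &MBB : MF) {
      for (const MachineInstr &MI : MBB) {
        if (MI.isDebugInstr() || MI.isMetaInstruction())
          continue; // don't let debug info change the decision
        ++InstCount;
        for (const MachineOperand &MO : MI.defs()) {
          if (MO.isReg() && MO.getReg().isVirtual() &&
              TRI->getRegSizeInBits(*MRI.getRegClass(MO.getReg())) >= 128) {
            ++VecCount; // assumption: >= 128 bits means a vector value
            break;
          }
        }
      }
    }
    unsigned VecDensity = InstCount ? (VecCount * 100) / InstCount : 0;
    return InstCount > LargeFunctionThreshold &&
           VecDensity > VectorCodeDensityPercentageThreshold;
  }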

Contributor

That makes more sense to me; this IR processing is extremely vague as written.

        if (isa<IntrinsicInst>(I))
          IntrinsicCount++;
        Type *T = I.getType();
        if (T && T->isVectorTy())
Contributor

Can't be null

    for (auto &BB : MF.getFunction()) {
      for (Instruction &I : BB) {
        InstructionCount++;
        if (isa<IntrinsicInst>(I))
Contributor

Need to skip debug intrinsics too
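A minimal adjustment to the counting loop that covers this (and the getType() point above) could be, as a sketch — DbgInfoIntrinsic comes from llvm/IR/IntrinsicInst.h, which the patch already includes:

  // Sketch: ignore llvm.dbg.* intrinsics so that building with -g cannot
  // change the scheduling decision; Instruction::getType() never returns
  // null, so the null check can go.
  uint64_t InstructionCount = 0, IntrinsicCount = 0, VectorTypeCount = 0;
  for (const BasicBlock &BB : MF.getFunction()) {
    for (const Instruction &I : BB) {
      if (isa<DbgInfoIntrinsic>(I))
        continue; // skip debug intrinsics entirely
      ++InstructionCount;
      if (isa<IntrinsicInst>(I))
        ++IntrinsicCount;
      if (I.getType()->isVectorTy())
        ++VectorTypeCount;
    }
  }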

sjoerdmeijer added a commit to sjoerdmeijer/llvm-project that referenced this pull request May 21, 2025
This adds FeatureDisableLatencySchedHeuristic to the Neoverse V2 core
tuning description. This gives us a 20% improvement on a key workload,
some other minor improvements here and there, and no real regressions;
nothing outside the noise levels.

Earlier attempts to solve this problem included disabling the MI
scheduler entirely (llvm#127784), and llvm#139557 was about a heuristic to
not schedule hand-written vector code. This solution is preferred
because it avoids another heuristic and achieves what we want, and, for
what it's worth, there is a lot of precedent for setting this feature.

Thanks to:
- Ricardo Jesus for pointing out this subtarget feature, and
- Cameron McInally for the extensive performance testing.
@sjoerdmeijer
Collaborator Author

Hi @arsenm, thanks a lot for your review! I am going to abandon this work because we've found a less intrusive way of fixing the issue: the fix in #140897 sets the subtarget feature FeatureDisableLatencySchedHeuristic, which does the trick.

sjoerdmeijer added a commit that referenced this pull request Jun 6, 2025
tomtor pushed a commit to tomtor/llvm-project that referenced this pull request Jun 14, 2025