
Conversation

@sjoerdmeijer
Collaborator

@sjoerdmeijer sjoerdmeijer commented May 12, 2025

Skip the Pre-RA MachineScheduler for large hand-written vector intrinsic code when targeting the Neoverse V2. The motivation to skip the scheduler is the same as for this abandoned patch: #127784. To quickly recap, we would like to disable the pre-RA machine scheduler for the Neoverse V2 because we have a key workload that massively benefits from this (a 25% uplift). Despite the machine scheduler being register-pressure aware, for this workload it places spills/reloads in apparently the wrong places.

But this reimplementation is much more focused and fine-grained, and is based on the following heuristic:

  • only skip the pre-RA machine scheduler for large (hand-written) vector intrinsic code,
  • do this only for the Neoverse V2 (a wide micro-architecture).

The intuition behind this patch is that:

  • scheduling based on instruction latency isn't useful for a very wide micro-architecture (which is why GCC also partly stopped doing this),
  • however, the machine scheduler also performs some optimisations: i) load/store clustering, and ii) copy elimination. These are useful optimisations, which is why disabling the machine scheduler in general isn't a good idea, i.e. it results in some regressions.
  • but the function where the machine scheduler and register allocator are not working well together is large, hand-written vector code. Thus, one could argue that scheduling this kind of code goes against the programmer's intent, so let's not do that, which avoids complications further down the optimisation pipeline.

The heuristic tries to recognise large hand-written intrinsic code by calculating the percentage of vector instructions relative to all instructions in a function, and it skips the machine scheduler if certain threshold values are exceeded. I.e., if a function is more than 70% vector code and contains more than 2800 IR instructions and more than 425 intrinsics, don't schedule this function.
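Concretely, the check boils down to the following, condensed from the MachineScheduler.cpp change in the diff below (the cl::opt thresholds default to 2800, 425 and 70):

  // Condensed form of the skip decision; see the full patch below.
  unsigned VecDensity = (VectorTypeCount / (double)InstructionCount) * 100;
  bool SkipPreRASched = InstructionCount > LargeFunctionThreshold &&       // > 2800
                        IntrinsicCount > NbOfIntrinsicsThreshold &&        // > 425
                        VecDensity > VectorCodeDensityPercentageThreshold; // > 70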

This obviously is a heuristic, but it is hopefully narrow enough not to cause regressions (I haven't found any). The alternative is to look into regalloc, which is where the problems with the placement of spill/reload code occur. However, there will be heuristics involved there too, so this seems like a valid heuristic, and looking into regalloc is an orthogonal exercise.

@llvmbot llvmbot added the backend:AArch64 and llvm:analysis labels May 12, 2025
@llvmbot
Member

llvmbot commented May 12, 2025

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-backend-aarch64

Author: Sjoerd Meijer (sjoerdmeijer)


Full diff: https://github.com/llvm/llvm-project/pull/139557.diff

9 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+4)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
  • (modified) llvm/include/llvm/CodeGen/TargetSubtargetInfo.h (+4)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
  • (modified) llvm/lib/CodeGen/MachineScheduler.cpp (+60)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.cpp (+2)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.h (+7)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h (+4)
  • (added) llvm/test/CodeGen/AArch64/skip-misched-large-vec-func.ll (+94)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index f4f66447d1c3d..42a1025e10024 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1019,6 +1019,10 @@ class TargetTransformInfo {
   /// Enable matching of interleaved access groups.
   bool enableInterleavedAccessVectorization() const;
 
+  /// Disable the machine scheduler for a large function with a lot of
+  /// (hand-written) vector code and intrinsics.
+  bool skipPreRASchedLargeVecFunc() const;
+
   /// Enable matching of interleaved access groups that contain predicated
   /// accesses or gaps and therefore vectorized using masked
   /// vector loads/stores.
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 02d6435e61b4d..8d8f02338a3b0 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -499,6 +499,8 @@ class TargetTransformInfoImplBase {
 
   virtual bool enableInterleavedAccessVectorization() const { return false; }
 
+  virtual bool skipPreRASchedLargeVecFunc() const { return false; }
+
   virtual bool enableMaskedInterleavedAccessVectorization() const {
     return false;
   }
diff --git a/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h b/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
index 1230349956973..ab901f969f948 100644
--- a/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
@@ -184,6 +184,10 @@ class TargetSubtargetInfo : public MCSubtargetInfo {
     return false;
   }
 
+  virtual bool enableSkipPreRASchedLargeVecFunc() const {
+    return false;
+  }
+
   /// True if the subtarget should run MachineScheduler after aggressive
   /// coalescing.
   ///
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 3ced70e113bf7..1422cfcdcb762 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -677,6 +677,10 @@ bool TargetTransformInfo::enableInterleavedAccessVectorization() const {
   return TTIImpl->enableInterleavedAccessVectorization();
 }
 
+bool TargetTransformInfo::skipPreRASchedLargeVecFunc() const {
+  return TTIImpl->skipPreRASchedLargeVecFunc();
+}
+
 bool TargetTransformInfo::enableMaskedInterleavedAccessVectorization() const {
   return TTIImpl->enableMaskedInterleavedAccessVectorization();
 }
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index 0c3ffb1bbaa6f..83dc71c880cb8 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -21,6 +21,7 @@
 #include "llvm/ADT/Statistic.h"
 #include "llvm/ADT/iterator_range.h"
 #include "llvm/Analysis/AliasAnalysis.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/CodeGen/LiveInterval.h"
 #include "llvm/CodeGen/LiveIntervals.h"
 #include "llvm/CodeGen/MachineBasicBlock.h"
@@ -49,6 +50,7 @@
 #include "llvm/CodeGen/TargetSubtargetInfo.h"
 #include "llvm/CodeGenTypes/MachineValueType.h"
 #include "llvm/Config/llvm-config.h"
+#include "llvm/IR/IntrinsicInst.h"
 #include "llvm/InitializePasses.h"
 #include "llvm/MC/LaneBitmask.h"
 #include "llvm/Pass.h"
@@ -110,6 +112,21 @@ cl::opt<bool> VerifyScheduling(
     "verify-misched", cl::Hidden,
     cl::desc("Verify machine instrs before and after machine scheduling"));
 
+// Heuristics for skipping pre-RA machine scheduling for large functions
+// containing (hand-written) intrinsic vector code.
+cl::opt<unsigned> LargeFunctionThreshold(
+    "misched-large-func-threshold", cl::Hidden, cl::init(2800),
+    cl::desc("The minimum number of IR instructions in a large (hand-written) "
+             "intrinsic vector code function"));
+cl::opt<unsigned> NbOfIntrinsicsThreshold(
+    "misched-intrinsics-threshold", cl::Hidden, cl::init(425),
+    cl::desc("The minimum number of intrinsic instructions in a large "
+             "(hand-written) intrinsic vector code function"));
+cl::opt<unsigned> VectorCodeDensityPercentageThreshold(
+    "misched-vector-density-threshold", cl::Hidden, cl::init(70),
+    cl::desc("Minimum percentage of vector instructions compared to scalar in "
+             "a large (hand-written) intrinsic vector code function"));
+
 #ifndef NDEBUG
 cl::opt<bool> ViewMISchedDAGs(
     "view-misched-dags", cl::Hidden,
@@ -319,6 +336,7 @@ INITIALIZE_PASS_DEPENDENCY(MachineDominatorTreeWrapperPass)
 INITIALIZE_PASS_DEPENDENCY(MachineLoopInfoWrapperPass)
 INITIALIZE_PASS_DEPENDENCY(SlotIndexesWrapperPass)
 INITIALIZE_PASS_DEPENDENCY(LiveIntervalsWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
 INITIALIZE_PASS_END(MachineSchedulerLegacy, DEBUG_TYPE,
                     "Machine Instruction Scheduler", false, false)
 
@@ -336,6 +354,7 @@ void MachineSchedulerLegacy::getAnalysisUsage(AnalysisUsage &AU) const {
   AU.addPreserved<SlotIndexesWrapperPass>();
   AU.addRequired<LiveIntervalsWrapperPass>();
   AU.addPreserved<LiveIntervalsWrapperPass>();
+  AU.addRequired<TargetTransformInfoWrapperPass>();
   MachineFunctionPass::getAnalysisUsage(AU);
 }
 
@@ -557,6 +576,47 @@ bool MachineSchedulerLegacy::runOnMachineFunction(MachineFunction &MF) {
     return false;
   }
 
+  // Try to recognise large hand-written intrinsic vector code, and skip the
+  // machine scheduler for this function if the target and TTI hook are okay
+  // with this.
+  const TargetSubtargetInfo &STI = MF.getSubtarget();
+  const MCSchedModel &SchedModel = STI.getSchedModel();
+  auto &TTI = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(MF.getFunction());
+
+  if (TTI.skipPreRASchedLargeVecFunc()) {
+    uint64_t InstructionCount = 0;
+    uint64_t IntrinsicCount = 0;
+    uint64_t VectorTypeCount = 0;
+    for (auto &BB : MF.getFunction()) {
+      for (Instruction &I : BB) {
+       InstructionCount++;
+       if (isa<IntrinsicInst>(I))
+         IntrinsicCount++;
+       Type *T = I.getType();
+       if (T && T->isVectorTy())
+         VectorTypeCount++;
+      }
+    }
+
+    unsigned VecDensity = (VectorTypeCount / (double) InstructionCount) * 100;
+
+    LLVM_DEBUG(dbgs() << "Instruction count: " << InstructionCount << ", ";
+               dbgs() << "threshold: " << LargeFunctionThreshold << "\n";
+               dbgs() << "Intrinsic count: " << IntrinsicCount << ", ";
+               dbgs() << "threshold: " << NbOfIntrinsicsThreshold << "\n";
+               dbgs() << "Vector density: " << VecDensity << ", ";
+               dbgs() << "threshold: " << VectorCodeDensityPercentageThreshold
+                      << "\n";);
+
+    if (InstructionCount > LargeFunctionThreshold &&
+        IntrinsicCount > NbOfIntrinsicsThreshold &&
+        VecDensity > VectorCodeDensityPercentageThreshold) {
+      LLVM_DEBUG(
+          dbgs() << "Skipping MISched for very vector and intrinsic heavy code");
+      return false;
+    }
+  }
+
   LLVM_DEBUG(dbgs() << "Before MISched:\n"; MF.print(dbgs()));
 
   auto &MLI = getAnalysis<MachineLoopInfoWrapperPass>().getLI();
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
index 7b4ded6322098..7b1e26ba1fad2 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -268,6 +268,8 @@ void AArch64Subtarget::initializeProperties(bool HasMinSize) {
     MaxBytesForLoopAlignment = 16;
     break;
   case NeoverseV2:
+    SkipPreRASchedLargeVecFunc = true;
+    LLVM_FALLTHROUGH;
   case NeoverseV3:
     EpilogueVectorizationMinVF = 8;
     MaxInterleaveFactor = 4;
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.h b/llvm/lib/Target/AArch64/AArch64Subtarget.h
index f5ffc72cae537..5e1801e821e1b 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.h
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.h
@@ -71,6 +71,7 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
   unsigned MaxBytesForLoopAlignment = 0;
   unsigned MinimumJumpTableEntries = 4;
   unsigned MaxJumpTableSize = 0;
+  bool SkipPreRASchedLargeVecFunc = false;
 
   // ReserveXRegister[i] - X#i is not available as a general purpose register.
   BitVector ReserveXRegister;
@@ -160,6 +161,12 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
   bool enablePostRAScheduler() const override { return usePostRAScheduler(); }
   bool enableSubRegLiveness() const override { return EnableSubregLiveness; }
 
+  /// Returns true if the subtarget should consider skipping the pre-RA
+  /// machine scheduler for large (hand-written) intrinsic vector functions.
+  bool enableSkipPreRASchedLargeVecFunc() const override {
+    return SkipPreRASchedLargeVecFunc;
+  }
+
   bool enableMachinePipeliner() const override;
   bool useDFAforSMS() const override { return false; }
 
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index be6bca2225eac..8d26ec2b6149f 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -118,6 +118,10 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
 
   bool enableInterleavedAccessVectorization() const override { return true; }
 
+  bool skipPreRASchedLargeVecFunc() const override {
+    return ST->enableSkipPreRASchedLargeVecFunc();
+  }
+
   bool enableMaskedInterleavedAccessVectorization() const override {
     return ST->hasSVE();
   }
diff --git a/llvm/test/CodeGen/AArch64/skip-misched-large-vec-func.ll b/llvm/test/CodeGen/AArch64/skip-misched-large-vec-func.ll
new file mode 100644
index 0000000000000..93e9051ade118
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/skip-misched-large-vec-func.ll
@@ -0,0 +1,94 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=aarch64-linux-gnu -mcpu=neoverse-v1 | FileCheck %s --check-prefix=SCHED
+; RUN: llc < %s -mtriple=aarch64-linux-gnu -mcpu=neoverse-v2 | FileCheck %s --check-prefix=SCHED
+; RUN: llc < %s -mtriple=aarch64-linux-gnu -mcpu=neoverse-v2 -misched-large-func-threshold=30 -misched-intrinsics-threshold=2 -misched-vector-density-threshold=31 | FileCheck %s --check-prefix=NOSCHED
+; RUN: llc < %s -mtriple=aarch64-linux-gnu -mcpu=neoverse-v2 -misched-large-func-threshold=31 -misched-intrinsics-threshold=2 -misched-vector-density-threshold=31 | FileCheck %s --check-prefix=SCHED
+; RUN: llc < %s -mtriple=aarch64-linux-gnu -mcpu=neoverse-v2 -misched-large-func-threshold=30 -misched-intrinsics-threshold=3 -misched-vector-density-threshold=31 | FileCheck %s --check-prefix=SCHED
+; RUN: llc < %s -mtriple=aarch64-linux-gnu -mcpu=neoverse-v2 -misched-large-func-threshold=30 -misched-intrinsics-threshold=2 -misched-vector-density-threshold=32 | FileCheck %s --check-prefix=SCHED
+
+define void @test_fma_loop(ptr %ptr_a, ptr %ptr_b, ptr %ptr_c, ptr %ptr_out, i32 %n) {
+; SCHED-LABEL: test_fma_loop:
+; SCHED:       // %bb.0: // %entry
+; SCHED-NEXT:    cbz w4, .LBB0_2
+; SCHED-NEXT:    .p2align 5, , 16
+; SCHED-NEXT:  .LBB0_1: // %loop
+; SCHED-NEXT:    // =>This Inner Loop Header: Depth=1
+; SCHED-NEXT:    ldr q0, [x0], #16
+; SCHED-NEXT:    ldp q1, q2, [x1]
+; SCHED-NEXT:    subs w4, w4, #1
+; SCHED-NEXT:    ldp q3, q4, [x2]
+; SCHED-NEXT:    fmla v3.4s, v1.4s, v0.4s
+; SCHED-NEXT:    ldr q0, [x1, #32]
+; SCHED-NEXT:    ldr q1, [x2, #32]
+; SCHED-NEXT:    add x1, x1, #48
+; SCHED-NEXT:    add x2, x2, #48
+; SCHED-NEXT:    fmla v4.4s, v2.4s, v3.4s
+; SCHED-NEXT:    fmla v1.4s, v0.4s, v4.4s
+; SCHED-NEXT:    str q1, [x3], #16
+; SCHED-NEXT:    b.ne .LBB0_1
+; SCHED-NEXT:  .LBB0_2: // %exit
+; SCHED-NEXT:    ret
+;
+; NOSCHED-LABEL: test_fma_loop:
+; NOSCHED:       // %bb.0: // %entry
+; NOSCHED-NEXT:    cbz w4, .LBB0_2
+; NOSCHED-NEXT:    .p2align 5, , 16
+; NOSCHED-NEXT:  .LBB0_1: // %loop
+; NOSCHED-NEXT:    // =>This Inner Loop Header: Depth=1
+; NOSCHED-NEXT:    ldr q0, [x0], #16
+; NOSCHED-NEXT:    ldr q1, [x1]
+; NOSCHED-NEXT:    ldr q2, [x2]
+; NOSCHED-NEXT:    subs w4, w4, #1
+; NOSCHED-NEXT:    fmla v2.4s, v1.4s, v0.4s
+; NOSCHED-NEXT:    ldp q0, q3, [x1, #16]
+; NOSCHED-NEXT:    ldp q1, q4, [x2, #16]
+; NOSCHED-NEXT:    add x1, x1, #48
+; NOSCHED-NEXT:    add x2, x2, #48
+; NOSCHED-NEXT:    fmla v1.4s, v0.4s, v2.4s
+; NOSCHED-NEXT:    fmla v4.4s, v3.4s, v1.4s
+; NOSCHED-NEXT:    str q4, [x3], #16
+; NOSCHED-NEXT:    b.ne .LBB0_1
+; NOSCHED-NEXT:  .LBB0_2: // %exit
+; NOSCHED-NEXT:    ret
+entry:
+  %cmp = icmp eq i32 %n, 0
+  br i1 %cmp, label %exit, label %loop
+
+loop:
+  %iv = phi i32 [ %n, %entry ], [ %iv.next, %loop ]
+  %ptr_a.addr = phi ptr [ %ptr_a, %entry ], [ %ptr_a.next, %loop ]
+  %ptr_b.addr = phi ptr [ %ptr_b, %entry ], [ %ptr_b.next, %loop ]
+  %ptr_c.addr = phi ptr [ %ptr_c, %entry ], [ %ptr_c.next, %loop ]
+  %ptr_out.addr = phi ptr [ %ptr_out, %entry ], [ %ptr_out.next, %loop ]
+
+  %a = load <4 x float>, ptr %ptr_a.addr
+  %b1 = load <4 x float>, ptr %ptr_b.addr
+  %c1 = load <4 x float>, ptr %ptr_c.addr
+  %res1 = call <4 x float> @llvm.fma.v4f32(<4 x float> %a, <4 x float> %b1, <4 x float> %c1)
+
+  %ptr_b2 = getelementptr <4 x float>, ptr %ptr_b.addr, i64 1
+  %ptr_c2 = getelementptr <4 x float>, ptr %ptr_c.addr, i64 1
+  %b2 = load <4 x float>, ptr %ptr_b2
+  %c2 = load <4 x float>, ptr %ptr_c2
+  %ptr_b3 = getelementptr <4 x float>, ptr %ptr_b.addr, i64 2
+  %ptr_c3 = getelementptr <4 x float>, ptr %ptr_c.addr, i64 2
+  %b3 = load <4 x float>, ptr %ptr_b3
+  %c3 = load <4 x float>, ptr %ptr_c3
+
+  %res2 = call <4 x float> @llvm.fma.v4f32(<4 x float> %res1, <4 x float> %b2, <4 x float> %c2)
+  %res3 = call <4 x float> @llvm.fma.v4f32(<4 x float> %res2, <4 x float> %b3, <4 x float> %c3)
+
+  store <4 x float> %res3, ptr %ptr_out.addr
+
+  %ptr_a.next = getelementptr <4 x float>, ptr %ptr_a.addr, i64 1
+  %ptr_b.next = getelementptr <4 x float>, ptr %ptr_b.addr, i64 3
+  %ptr_c.next = getelementptr <4 x float>, ptr %ptr_c.addr, i64 3
+  %ptr_out.next = getelementptr <4 x float>, ptr %ptr_out.addr, i64 1
+
+  %iv.next = sub i32 %iv, 1
+  %cmp.next = icmp ne i32 %iv.next, 0
+  br i1 %cmp.next, label %loop, label %exit
+
+exit:
+  ret void
+}

@github-actions

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions h,cpp -- llvm/include/llvm/Analysis/TargetTransformInfo.h llvm/include/llvm/Analysis/TargetTransformInfoImpl.h llvm/include/llvm/CodeGen/TargetSubtargetInfo.h llvm/lib/Analysis/TargetTransformInfo.cpp llvm/lib/CodeGen/MachineScheduler.cpp llvm/lib/Target/AArch64/AArch64Subtarget.cpp llvm/lib/Target/AArch64/AArch64Subtarget.h llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
View the diff from clang-format here.
diff --git a/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h b/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
index ab901f969..91530027f 100644
--- a/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
@@ -184,9 +184,7 @@ public:
     return false;
   }
 
-  virtual bool enableSkipPreRASchedLargeVecFunc() const {
-    return false;
-  }
+  virtual bool enableSkipPreRASchedLargeVecFunc() const { return false; }
 
   /// True if the subtarget should run MachineScheduler after aggressive
   /// coalescing.
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index ee1a058aa..901e4be9f 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -680,7 +680,8 @@ bool MachineSchedulerLegacy::runOnMachineFunction(MachineFunction &MF) {
   // with this.
   const TargetSubtargetInfo &STI = MF.getSubtarget();
   const MCSchedModel &SchedModel = STI.getSchedModel();
-  auto &TTI = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(MF.getFunction());
+  auto &TTI =
+      getAnalysis<TargetTransformInfoWrapperPass>().getTTI(MF.getFunction());
 
   if (TTI.skipPreRASchedLargeVecFunc()) {
     uint64_t InstructionCount = 0;
@@ -688,16 +689,16 @@ bool MachineSchedulerLegacy::runOnMachineFunction(MachineFunction &MF) {
     uint64_t VectorTypeCount = 0;
     for (auto &BB : MF.getFunction()) {
       for (Instruction &I : BB) {
-       InstructionCount++;
-       if (isa<IntrinsicInst>(I))
-         IntrinsicCount++;
-       Type *T = I.getType();
-       if (T && T->isVectorTy())
-         VectorTypeCount++;
+        InstructionCount++;
+        if (isa<IntrinsicInst>(I))
+          IntrinsicCount++;
+        Type *T = I.getType();
+        if (T && T->isVectorTy())
+          VectorTypeCount++;
       }
     }
 
-    unsigned VecDensity = (VectorTypeCount / (double) InstructionCount) * 100;
+    unsigned VecDensity = (VectorTypeCount / (double)InstructionCount) * 100;
 
     LLVM_DEBUG(dbgs() << "Instruction count: " << InstructionCount << ", ";
                dbgs() << "threshold: " << LargeFunctionThreshold << "\n";
@@ -711,7 +712,8 @@ bool MachineSchedulerLegacy::runOnMachineFunction(MachineFunction &MF) {
         IntrinsicCount > NbOfIntrinsicsThreshold &&
         VecDensity > VectorCodeDensityPercentageThreshold) {
       LLVM_DEBUG(
-          dbgs() << "Skipping MISched for very vector and intrinsic heavy code");
+          dbgs()
+          << "Skipping MISched for very vector and intrinsic heavy code");
       return false;
     }
   }

Comment on lines +586 to +599
  if (TTI.skipPreRASchedLargeVecFunc()) {
    uint64_t InstructionCount = 0;
    uint64_t IntrinsicCount = 0;
    uint64_t VectorTypeCount = 0;
    for (auto &BB : MF.getFunction()) {
      for (Instruction &I : BB) {
        InstructionCount++;
        if (isa<IntrinsicInst>(I))
          IntrinsicCount++;
        Type *T = I.getType();
        if (T && T->isVectorTy())
          VectorTypeCount++;
      }
    }
Contributor

Really should not be writing an IR-based heuristic in a machine pass. This is baking in a lot of assumptions about the architecture and how the IR will be lowered. You have better information from the current machine instructions.

Collaborator Author

I see what you mean, but I intentionally iterated over the IR to extract high-level information that is not available in MIR, i.e. the vector intrinsics have been lowered (to FMAs) in MIR and are no longer recognisable.
I can calculate the heuristic on the MIR too, but then I will have to change it and drop the number of intrinsics from the calculation, which then becomes "recognising a large and very vector-dense function". That is slightly less specific, but if it is acceptable, it's easy to implement, along the lines of the sketch below.
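For illustration, an MIR-based variant could look something like this rough sketch (untested; treating any register class of 128 bits or wider as "vector" is an assumption, and it reuses the thresholds from this patch minus the intrinsic count):

  // Sketch of an MIR-based density check (an untested assumption, not part
  // of this patch): count instructions that define a value in a wide
  // register class instead of counting IR vector types and intrinsics.
  static bool isLargeVectorDenseFunc(const MachineFunction &MF) {
    const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
    const MachineRegisterInfo &MRI = MF.getRegInfo();
    uint64_t InstCount = 0, VecCount = 0;
    for (const MachineBasicBlock &MBB : MF) {
      for (const MachineInstr &MI : MBB) {
        if (MI.isDebugInstr() || MI.isMetaInstruction())
          continue; // don't let debug info change the decision
        ++InstCount;
        for (const MachineOperand &MO : MI.defs()) {
          if (MO.isReg() && MO.getReg().isVirtual() &&
              TRI->getRegSizeInBits(*MRI.getRegClass(MO.getReg())) >= 128) {
            ++VecCount; // assumption: >= 128 bits means a vector value
            break;
          }
        }
      }
    }
    unsigned VecDensity = InstCount ? (VecCount * 100) / InstCount : 0;
    return InstCount > LargeFunctionThreshold &&
           VecDensity > VectorCodeDensityPercentageThreshold;
  }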

Contributor

That makes more sense to me; this IR processing is extremely vague as written.

        if (isa<IntrinsicInst>(I))
          IntrinsicCount++;
        Type *T = I.getType();
        if (T && T->isVectorTy())
Contributor

Can't be null

    for (auto &BB : MF.getFunction()) {
      for (Instruction &I : BB) {
        InstructionCount++;
        if (isa<IntrinsicInst>(I))
Contributor

Need to skip debug intrinsics too
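A minimal adjustment to the counting loop that covers this (and the getType() point above) could be, as a sketch — DbgInfoIntrinsic comes from llvm/IR/IntrinsicInst.h, which the patch already includes:

  // Sketch: ignore llvm.dbg.* intrinsics so that building with -g cannot
  // change the scheduling decision; Instruction::getType() never returns
  // null, so the null check can go.
  uint64_t InstructionCount = 0, IntrinsicCount = 0, VectorTypeCount = 0;
  for (const BasicBlock &BB : MF.getFunction()) {
    for (const Instruction &I : BB) {
      if (isa<DbgInfoIntrinsic>(I))
        continue; // skip debug intrinsics entirely
      ++InstructionCount;
      if (isa<IntrinsicInst>(I))
        ++IntrinsicCount;
      if (I.getType()->isVectorTy())
        ++VectorTypeCount;
    }
  }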

sjoerdmeijer added a commit to sjoerdmeijer/llvm-project that referenced this pull request May 21, 2025
This adds FeatureDisableLatencySchedHeuristic to the Neoverse V2 core
tuning description. This gives us a 20% improvement on a key workload,
some other minor improvements here and there, and no real regressions;
nothing outside the noise levels.

Earlier attempts to solve this problem included disabling the MI
scheduler entirely (llvm#127784), and llvm#139557 was about a heuristic to
not schedule hand-written vector code. This solution is preferred
because it avoids another heuristic and achieves what we want, and, for
what it's worth, there is a lot of precedent for setting this feature.

Thanks to:
- Ricardo Jesus for pointing out this subtarget feature, and
- Cameron McInally for the extensive performance testing.
@sjoerdmeijer
Collaborator Author

Hi @arsenm, thanks a lot for your review! I am going to abandon this work because we've found a less intrusive way of fixing the issue: the fix in #140897 sets the subtarget feature FeatureDisableLatencySchedHeuristic, which does the trick.

sjoerdmeijer added a commit that referenced this pull request Jun 6, 2025
tomtor pushed a commit to tomtor/llvm-project that referenced this pull request Jun 14, 2025