Skip to content

Conversation

simonwallis2
Copy link
Contributor

Recently the IssueWidth in the scheduling model was reduced for Neoverse-V2 and N2. This patch does the same for Neoverse-V1, N1 and N3.

On Neoverse-V1, various values of IssueWidth (15, 8, 7, 6, 5) were tried with runs of various workloads.
The highest overall geomean score was achieved with an issue width of 8. No significant regressions were noted.

On Neoverse-N1, various values of IssueWidth (8, 6, 5, 4, 3) were tried with runs of various workloads.
The highest overall geomean score was achieved with an issue width of 3. No significant regressions were noted.

On Neoverse-N3, it makes sense to do exactly the same as was done for N2. It is proposed to use an issue width of 5.

Related V2 PR: #142565
Related N2 PR: #145717

Recently the IssueWidth in the scheduling model was reduced for Neoverse-V2 and N2.
This patch does the same for Neoverse-V1, N1 and N3.

On Neoverse-V1, various values of IssueWidth (15, 8, 7, 6, 5)
were tried with runs of various workloads.
The highest overall geomean score was achieved with an issue width of 8.
No significant regressions were noted.

On Neoverse-N1, various values of IssueWidth (8, 6, 5, 4, 3)
were tried with runs of various workloads.
The highest overall geomean score was achieved with an issue width of 3.
No significant regressions were noted.

On Neoverse-N3, it makes sense to do exactly the same as was done for N2.
It is proposed to use an issue width of 5.

Related V2 PR: llvm#142565
Related N2 PR: llvm#145717

Change-Id: I7d17359e7a4ae7a527e321a208bdf96893fe50f9
@llvmbot
Copy link
Member

llvmbot commented Aug 20, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Simon Wallis (simonwallis2)

Changes

Recently the IssueWidth in the scheduling model was reduced for Neoverse-V2 and N2. This patch does the same for Neoverse-V1, N1 and N3.

On Neoverse-V1, various values of IssueWidth (15, 8, 7, 6, 5) were tried with runs of various workloads.
The highest overall geomean score was achieved with an issue width of 8. No significant regressions were noted.

On Neoverse-N1, various values of IssueWidth (8, 6, 5, 4, 3) were tried with runs of various workloads.
The highest overall geomean score was achieved with an issue width of 3. No significant regressions were noted.

On Neoverse-N3, it makes sense to do exactly the same as was done for N2. It is proposed to use an issue width of 5.

Related V2 PR: #142565
Related N2 PR: #145717


Patch is 964.04 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/154495.diff

12 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64SchedNeoverseN1.td (+1-1)
  • (modified) llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td (+1-1)
  • (modified) llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td (+1-1)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/N1-writeback.s (+2214-2203)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/N3-writeback.s (+1916-1906)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-basic-instructions.s (+1-1)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-clear-upper-regs.s (+16-16)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-forwarding.s (+123-123)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-misc-instructions.s (+15-15)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-sve-instructions.s (+9-9)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-writeback.s (+1457-1456)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-zero-dependency.s (+1-1)
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN1.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN1.td
index 524fa33f498bb..9da42322dd10d 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN1.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN1.td
@@ -15,7 +15,7 @@
 //===----------------------------------------------------------------------===//
 
 def NeoverseN1Model : SchedMachineModel {
-  let IssueWidth            =   8; // Maximum micro-ops dispatch rate.
+  let IssueWidth            =   3; // Maximum micro-ops dispatch rate.
   let MicroOpBufferSize     = 128; // NOTE: Copied from Cortex-A76.
   let LoadLatency           =   4; // Optimistic load latency.
   let MispredictPenalty     =  11; // Cycles cost of branch mispredicted.
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
index e44d40f8d7020..cd0d8a9186d5b 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
@@ -11,7 +11,7 @@
 //===----------------------------------------------------------------------===//
 
 def NeoverseN3Model : SchedMachineModel {
-    let IssueWidth            =  10; // Micro-ops dispatched at a time.
+    let IssueWidth            =   5; // Micro-ops dispatched at a time.
     let MicroOpBufferSize     = 160; // Entries in micro-op re-order buffer. NOTE: Copied from N2.
     let LoadLatency           =   4; // Optimistic load latency.
     let MispredictPenalty     =  10; // Extra cycles for mispredicted branch. NOTE: Copied from N2.
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
index 44625a2034d9d..b78c5e90d6338 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
@@ -19,7 +19,7 @@
 //===----------------------------------------------------------------------===//
 
 def NeoverseV1Model : SchedMachineModel {
-  let IssueWidth            =  15; // Maximum micro-ops dispatch rate.
+  let IssueWidth            =   8; // Maximum micro-ops dispatch rate.
   let MicroOpBufferSize     = 256; // Micro-op re-order buffer.
   let LoadLatency           =   4; // Optimistic load latency.
   let MispredictPenalty     =  11; // Cycles cost of branch mispredicted.
diff --git a/llvm/test/tools/llvm-mca/AArch64/Neoverse/N1-writeback.s b/llvm/test/tools/llvm-mca/AArch64/Neoverse/N1-writeback.s
index 8fe21167a5bd3..127c8c30fc2c6 100644
--- a/llvm/test/tools/llvm-mca/AArch64/Neoverse/N1-writeback.s
+++ b/llvm/test/tools/llvm-mca/AArch64/Neoverse/N1-writeback.s
@@ -1165,10 +1165,10 @@ add x0, x27, 1
 # CHECK-NEXT: Total Cycles:      507
 # CHECK-NEXT: Total uOps:        1500
 
-# CHECK:      Dispatch Width:    8
+# CHECK:      Dispatch Width:    3
 # CHECK-NEXT: uOps Per Cycle:    2.96
 # CHECK-NEXT: IPC:               1.97
-# CHECK-NEXT: Block RThroughput: 3.3
+# CHECK-NEXT: Block RThroughput: 5.0
 
 # CHECK:      Timeline view:
 # CHECK-NEXT:                     01
@@ -1176,14 +1176,14 @@ add x0, x27, 1
 
 # CHECK:      [0,0]     DeeeeeER  ..   ld1	{ v1.1d }, [x27], #8
 # CHECK-NEXT: [0,1]     D=eE---R  ..   add	x0, x27, #1
-# CHECK-NEXT: [0,2]     D=eeeeeER ..   ld1	{ v1.2d }, [x27], #16
-# CHECK-NEXT: [0,3]     D==eE---R ..   add	x0, x27, #1
-# CHECK-NEXT: [0,4]     D==eeeeeER..   ld1	{ v1.2s }, [x27], #8
-# CHECK-NEXT: [0,5]     .D==eE---R..   add	x0, x27, #1
-# CHECK-NEXT: [0,6]     .D==eeeeeER.   ld1	{ v1.4h }, [x27], #8
-# CHECK-NEXT: [0,7]     .D===eE---R.   add	x0, x27, #1
-# CHECK-NEXT: [0,8]     .D===eeeeeER   ld1	{ v1.4s }, [x27], #16
-# CHECK-NEXT: [0,9]     .D====eE---R   add	x0, x27, #1
+# CHECK-NEXT: [0,2]     .DeeeeeER ..   ld1	{ v1.2d }, [x27], #16
+# CHECK-NEXT: [0,3]     .D=eE---R ..   add	x0, x27, #1
+# CHECK-NEXT: [0,4]     . DeeeeeER..   ld1	{ v1.2s }, [x27], #8
+# CHECK-NEXT: [0,5]     . D=eE---R..   add	x0, x27, #1
+# CHECK-NEXT: [0,6]     .  DeeeeeER.   ld1	{ v1.4h }, [x27], #8
+# CHECK-NEXT: [0,7]     .  D=eE---R.   add	x0, x27, #1
+# CHECK-NEXT: [0,8]     .   DeeeeeER   ld1	{ v1.4s }, [x27], #16
+# CHECK-NEXT: [0,9]     .   D=eE---R   add	x0, x27, #1
 
 # CHECK:      Average Wait times (based on the timeline view):
 # CHECK-NEXT: [0]: Executions
@@ -1194,15 +1194,15 @@ add x0, x27, 1
 # CHECK:            [0]    [1]    [2]    [3]
 # CHECK-NEXT: 0.     1     1.0    1.0    0.0       ld1	{ v1.1d }, [x27], #8
 # CHECK-NEXT: 1.     1     2.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 2.     1     2.0    0.0    0.0       ld1	{ v1.2d }, [x27], #16
-# CHECK-NEXT: 3.     1     3.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 4.     1     3.0    0.0    0.0       ld1	{ v1.2s }, [x27], #8
-# CHECK-NEXT: 5.     1     3.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 6.     1     3.0    0.0    0.0       ld1	{ v1.4h }, [x27], #8
-# CHECK-NEXT: 7.     1     4.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 8.     1     4.0    0.0    0.0       ld1	{ v1.4s }, [x27], #16
-# CHECK-NEXT: 9.     1     5.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT:        1     3.0    0.1    1.5       <total>
+# CHECK-NEXT: 2.     1     1.0    0.0    0.0       ld1	{ v1.2d }, [x27], #16
+# CHECK-NEXT: 3.     1     2.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 4.     1     1.0    0.0    0.0       ld1	{ v1.2s }, [x27], #8
+# CHECK-NEXT: 5.     1     2.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 6.     1     1.0    0.0    0.0       ld1	{ v1.4h }, [x27], #8
+# CHECK-NEXT: 7.     1     2.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 8.     1     1.0    0.0    0.0       ld1	{ v1.4s }, [x27], #16
+# CHECK-NEXT: 9.     1     2.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT:        1     1.5    0.1    1.5       <total>
 
 # CHECK:      [1] Code Region - G02
 
@@ -1211,10 +1211,10 @@ add x0, x27, 1
 # CHECK-NEXT: Total Cycles:      507
 # CHECK-NEXT: Total uOps:        1500
 
-# CHECK:      Dispatch Width:    8
+# CHECK:      Dispatch Width:    3
 # CHECK-NEXT: uOps Per Cycle:    2.96
 # CHECK-NEXT: IPC:               1.97
-# CHECK-NEXT: Block RThroughput: 3.3
+# CHECK-NEXT: Block RThroughput: 5.0
 
 # CHECK:      Timeline view:
 # CHECK-NEXT:                     01
@@ -1222,14 +1222,14 @@ add x0, x27, 1
 
 # CHECK:      [0,0]     DeeeeeER  ..   ld1	{ v1.8b }, [x27], #8
 # CHECK-NEXT: [0,1]     D=eE---R  ..   add	x0, x27, #1
-# CHECK-NEXT: [0,2]     D=eeeeeER ..   ld1	{ v1.8h }, [x27], #16
-# CHECK-NEXT: [0,3]     D==eE---R ..   add	x0, x27, #1
-# CHECK-NEXT: [0,4]     D==eeeeeER..   ld1	{ v1.16b }, [x27], #16
-# CHECK-NEXT: [0,5]     .D==eE---R..   add	x0, x27, #1
-# CHECK-NEXT: [0,6]     .D==eeeeeER.   ld1	{ v1.1d }, [x27], x28
-# CHECK-NEXT: [0,7]     .D===eE---R.   add	x0, x27, #1
-# CHECK-NEXT: [0,8]     .D===eeeeeER   ld1	{ v1.2d }, [x27], x28
-# CHECK-NEXT: [0,9]     .D====eE---R   add	x0, x27, #1
+# CHECK-NEXT: [0,2]     .DeeeeeER ..   ld1	{ v1.8h }, [x27], #16
+# CHECK-NEXT: [0,3]     .D=eE---R ..   add	x0, x27, #1
+# CHECK-NEXT: [0,4]     . DeeeeeER..   ld1	{ v1.16b }, [x27], #16
+# CHECK-NEXT: [0,5]     . D=eE---R..   add	x0, x27, #1
+# CHECK-NEXT: [0,6]     .  DeeeeeER.   ld1	{ v1.1d }, [x27], x28
+# CHECK-NEXT: [0,7]     .  D=eE---R.   add	x0, x27, #1
+# CHECK-NEXT: [0,8]     .   DeeeeeER   ld1	{ v1.2d }, [x27], x28
+# CHECK-NEXT: [0,9]     .   D=eE---R   add	x0, x27, #1
 
 # CHECK:      Average Wait times (based on the timeline view):
 # CHECK-NEXT: [0]: Executions
@@ -1240,15 +1240,15 @@ add x0, x27, 1
 # CHECK:            [0]    [1]    [2]    [3]
 # CHECK-NEXT: 0.     1     1.0    1.0    0.0       ld1	{ v1.8b }, [x27], #8
 # CHECK-NEXT: 1.     1     2.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 2.     1     2.0    0.0    0.0       ld1	{ v1.8h }, [x27], #16
-# CHECK-NEXT: 3.     1     3.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 4.     1     3.0    0.0    0.0       ld1	{ v1.16b }, [x27], #16
-# CHECK-NEXT: 5.     1     3.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 6.     1     3.0    0.0    0.0       ld1	{ v1.1d }, [x27], x28
-# CHECK-NEXT: 7.     1     4.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 8.     1     4.0    0.0    0.0       ld1	{ v1.2d }, [x27], x28
-# CHECK-NEXT: 9.     1     5.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT:        1     3.0    0.1    1.5       <total>
+# CHECK-NEXT: 2.     1     1.0    0.0    0.0       ld1	{ v1.8h }, [x27], #16
+# CHECK-NEXT: 3.     1     2.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 4.     1     1.0    0.0    0.0       ld1	{ v1.16b }, [x27], #16
+# CHECK-NEXT: 5.     1     2.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 6.     1     1.0    0.0    0.0       ld1	{ v1.1d }, [x27], x28
+# CHECK-NEXT: 7.     1     2.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 8.     1     1.0    0.0    0.0       ld1	{ v1.2d }, [x27], x28
+# CHECK-NEXT: 9.     1     2.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT:        1     1.5    0.1    1.5       <total>
 
 # CHECK:      [2] Code Region - G03
 
@@ -1257,10 +1257,10 @@ add x0, x27, 1
 # CHECK-NEXT: Total Cycles:      507
 # CHECK-NEXT: Total uOps:        1500
 
-# CHECK:      Dispatch Width:    8
+# CHECK:      Dispatch Width:    3
 # CHECK-NEXT: uOps Per Cycle:    2.96
 # CHECK-NEXT: IPC:               1.97
-# CHECK-NEXT: Block RThroughput: 3.3
+# CHECK-NEXT: Block RThroughput: 5.0
 
 # CHECK:      Timeline view:
 # CHECK-NEXT:                     01
@@ -1268,14 +1268,14 @@ add x0, x27, 1
 
 # CHECK:      [0,0]     DeeeeeER  ..   ld1	{ v1.2s }, [x27], x28
 # CHECK-NEXT: [0,1]     D=eE---R  ..   add	x0, x27, #1
-# CHECK-NEXT: [0,2]     D=eeeeeER ..   ld1	{ v1.4h }, [x27], x28
-# CHECK-NEXT: [0,3]     D==eE---R ..   add	x0, x27, #1
-# CHECK-NEXT: [0,4]     D==eeeeeER..   ld1	{ v1.4s }, [x27], x28
-# CHECK-NEXT: [0,5]     .D==eE---R..   add	x0, x27, #1
-# CHECK-NEXT: [0,6]     .D==eeeeeER.   ld1	{ v1.8b }, [x27], x28
-# CHECK-NEXT: [0,7]     .D===eE---R.   add	x0, x27, #1
-# CHECK-NEXT: [0,8]     .D===eeeeeER   ld1	{ v1.8h }, [x27], x28
-# CHECK-NEXT: [0,9]     .D====eE---R   add	x0, x27, #1
+# CHECK-NEXT: [0,2]     .DeeeeeER ..   ld1	{ v1.4h }, [x27], x28
+# CHECK-NEXT: [0,3]     .D=eE---R ..   add	x0, x27, #1
+# CHECK-NEXT: [0,4]     . DeeeeeER..   ld1	{ v1.4s }, [x27], x28
+# CHECK-NEXT: [0,5]     . D=eE---R..   add	x0, x27, #1
+# CHECK-NEXT: [0,6]     .  DeeeeeER.   ld1	{ v1.8b }, [x27], x28
+# CHECK-NEXT: [0,7]     .  D=eE---R.   add	x0, x27, #1
+# CHECK-NEXT: [0,8]     .   DeeeeeER   ld1	{ v1.8h }, [x27], x28
+# CHECK-NEXT: [0,9]     .   D=eE---R   add	x0, x27, #1
 
 # CHECK:      Average Wait times (based on the timeline view):
 # CHECK-NEXT: [0]: Executions
@@ -1286,42 +1286,42 @@ add x0, x27, 1
 # CHECK:            [0]    [1]    [2]    [3]
 # CHECK-NEXT: 0.     1     1.0    1.0    0.0       ld1	{ v1.2s }, [x27], x28
 # CHECK-NEXT: 1.     1     2.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 2.     1     2.0    0.0    0.0       ld1	{ v1.4h }, [x27], x28
-# CHECK-NEXT: 3.     1     3.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 4.     1     3.0    0.0    0.0       ld1	{ v1.4s }, [x27], x28
-# CHECK-NEXT: 5.     1     3.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 6.     1     3.0    0.0    0.0       ld1	{ v1.8b }, [x27], x28
-# CHECK-NEXT: 7.     1     4.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 8.     1     4.0    0.0    0.0       ld1	{ v1.8h }, [x27], x28
-# CHECK-NEXT: 9.     1     5.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT:        1     3.0    0.1    1.5       <total>
+# CHECK-NEXT: 2.     1     1.0    0.0    0.0       ld1	{ v1.4h }, [x27], x28
+# CHECK-NEXT: 3.     1     2.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 4.     1     1.0    0.0    0.0       ld1	{ v1.4s }, [x27], x28
+# CHECK-NEXT: 5.     1     2.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 6.     1     1.0    0.0    0.0       ld1	{ v1.8b }, [x27], x28
+# CHECK-NEXT: 7.     1     2.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 8.     1     1.0    0.0    0.0       ld1	{ v1.8h }, [x27], x28
+# CHECK-NEXT: 9.     1     2.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT:        1     1.5    0.1    1.5       <total>
 
 # CHECK:      [3] Code Region - G04
 
 # CHECK:      Iterations:        100
 # CHECK-NEXT: Instructions:      1000
-# CHECK-NEXT: Total Cycles:      507
+# CHECK-NEXT: Total Cycles:      906
 # CHECK-NEXT: Total uOps:        1900
 
-# CHECK:      Dispatch Width:    8
-# CHECK-NEXT: uOps Per Cycle:    3.75
-# CHECK-NEXT: IPC:               1.97
-# CHECK-NEXT: Block RThroughput: 4.5
+# CHECK:      Dispatch Width:    3
+# CHECK-NEXT: uOps Per Cycle:    2.10
+# CHECK-NEXT: IPC:               1.10
+# CHECK-NEXT: Block RThroughput: 6.3
 
 # CHECK:      Timeline view:
-# CHECK-NEXT:                     01
+# CHECK-NEXT:                     01234
 # CHECK-NEXT: Index     0123456789
 
-# CHECK:      [0,0]     DeeeeeER  ..   ld1	{ v1.16b }, [x27], x28
-# CHECK-NEXT: [0,1]     D=eE---R  ..   add	x0, x27, #1
-# CHECK-NEXT: [0,2]     D=eeeeeER ..   ld1	{ v1.1d, v2.1d }, [x27], #16
-# CHECK-NEXT: [0,3]     D==eE---R ..   add	x0, x27, #1
-# CHECK-NEXT: [0,4]     .D=eeeeeER..   ld1	{ v1.2d, v2.2d }, [x27], #32
-# CHECK-NEXT: [0,5]     .D==eE---R..   add	x0, x27, #1
-# CHECK-NEXT: [0,6]     .D==eeeeeER.   ld1	{ v1.2s, v2.2s }, [x27], #16
-# CHECK-NEXT: [0,7]     .D===eE---R.   add	x0, x27, #1
-# CHECK-NEXT: [0,8]     . D==eeeeeER   ld1	{ v1.4h, v2.4h }, [x27], #16
-# CHECK-NEXT: [0,9]     . D===eE---R   add	x0, x27, #1
+# CHECK:      [0,0]     DeeeeeER  .   .   ld1	{ v1.16b }, [x27], x28
+# CHECK-NEXT: [0,1]     D=eE---R  .   .   add	x0, x27, #1
+# CHECK-NEXT: [0,2]     .DeeeeeER .   .   ld1	{ v1.1d, v2.1d }, [x27], #16
+# CHECK-NEXT: [0,3]     . DeE---R .   .   add	x0, x27, #1
+# CHECK-NEXT: [0,4]     .  DeeeeeER   .   ld1	{ v1.2d, v2.2d }, [x27], #32
+# CHECK-NEXT: [0,5]     .   DeE---R   .   add	x0, x27, #1
+# CHECK-NEXT: [0,6]     .    DeeeeeER .   ld1	{ v1.2s, v2.2s }, [x27], #16
+# CHECK-NEXT: [0,7]     .    .DeE---R .   add	x0, x27, #1
+# CHECK-NEXT: [0,8]     .    . DeeeeeER   ld1	{ v1.4h, v2.4h }, [x27], #16
+# CHECK-NEXT: [0,9]     .    .  DeE---R   add	x0, x27, #1
 
 # CHECK:      Average Wait times (based on the timeline view):
 # CHECK-NEXT: [0]: Executions
@@ -1332,42 +1332,42 @@ add x0, x27, 1
 # CHECK:            [0]    [1]    [2]    [3]
 # CHECK-NEXT: 0.     1     1.0    1.0    0.0       ld1	{ v1.16b }, [x27], x28
 # CHECK-NEXT: 1.     1     2.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 2.     1     2.0    0.0    0.0       ld1	{ v1.1d, v2.1d }, [x27], #16
-# CHECK-NEXT: 3.     1     3.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 4.     1     2.0    0.0    0.0       ld1	{ v1.2d, v2.2d }, [x27], #32
-# CHECK-NEXT: 5.     1     3.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 6.     1     3.0    0.0    0.0       ld1	{ v1.2s, v2.2s }, [x27], #16
-# CHECK-NEXT: 7.     1     4.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 8.     1     3.0    0.0    0.0       ld1	{ v1.4h, v2.4h }, [x27], #16
-# CHECK-NEXT: 9.     1     4.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT:        1     2.7    0.1    1.5       <total>
+# CHECK-NEXT: 2.     1     1.0    0.0    0.0       ld1	{ v1.1d, v2.1d }, [x27], #16
+# CHECK-NEXT: 3.     1     1.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 4.     1     1.0    1.0    0.0       ld1	{ v1.2d, v2.2d }, [x27], #32
+# CHECK-NEXT: 5.     1     1.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 6.     1     1.0    1.0    0.0       ld1	{ v1.2s, v2.2s }, [x27], #16
+# CHECK-NEXT: 7.     1     1.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 8.     1     1.0    1.0    0.0       ld1	{ v1.4h, v2.4h }, [x27], #16
+# CHECK-NEXT: 9.     1     1.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT:        1     1.1    0.4    1.5       <total>
 
 # CHECK:      [4] Code Region - G05
 
 # CHECK:      Iterations:        100
 # CHECK-NEXT: Instructions:      1000
-# CHECK-NEXT: Total Cycles:      507
+# CHECK-NEXT: Total Cycles:      1006
 # CHECK-NEXT: Total uOps:        2000
 
-# CHECK:      Dispatch Width:    8
-# CHECK-NEXT: uOps Per Cycle:    3.94
-# CHECK-NEXT: IPC:               1.97
-# CHECK-NEXT: Block RThroughput: 5.0
+# CHECK:      Dispatch Width:    3
+# CHECK-NEXT: uOps Per Cycle:    1.99
+# CHECK-NEXT: IPC:               0.99
+# CHECK-NEXT: Block RThroughput: 6.7
 
 # CHECK:      Timeline view:
-# CHECK-NEXT:                     01
+# CHECK-NEXT:                     012345
 # CHECK-NEXT: Index     0123456789
 
-# CHECK:      [0,0]     DeeeeeER  ..   ld1	{ v1.4s, v2.4s }, [x27], #32
-# CHECK-NEXT: [0,1]     D=eE---R  ..   add	x0, x27, #1
-# CHECK-NEXT: [0,2]     D=eeeeeER ..   ld1	{ v1.8b, v2.8b }, [x27], #16
-# CHECK-NEXT: [0,3]     D==eE---R ..   add	x0, x27, #1
-# CHECK-NEXT: [0,4]     .D=eeeeeER..   ld1	{ v1.8h, v2.8h }, [x27], #32
-# CHECK-NEXT: [0,5]     .D==eE---R..   add	x0, x27, #1
-# CHECK-NEXT: [0,6]     .D==eeeeeER.   ld1	{ v1.16b, v2.16b }, [x27], #32
-# CHECK-NEXT: [0,7]     .D===eE---R.   add	x0, x27, #1
-# CHECK-NEXT: [0,8]     . D==eeeeeER   ld1	{ v1.1d, v2.1d }, [x27], x28
-# CHECK-NEXT: [0,9]     . D===eE---R   add	x0, x27, #1
+# CHECK:      [0,0]     DeeeeeER  .    .   ld1	{ v1.4s, v2.4s }, [x27], #32
+# CHECK-NEXT: [0,1]     .DeE---R  .    .   add	x0, x27, #1
+# CHECK-NEXT: [0,2]     . DeeeeeER.    .   ld1	{ v1.8b, v2.8b }, [x27], #16
+# CHECK-NEXT: [0,3]     .  DeE---R.    .   add	x0, x27, #1
+# CHECK-NEXT: [0,4]     .   DeeeeeER   .   ld1	{ v1.8h, v2.8h }, [x27], #32
+# CHECK-NEXT: [0,5]     .    DeE---R   .   add	x0, x27, #1
+# CHECK-NEXT: [0,6]     .    .DeeeeeER .   ld1	{ v1.16b, v2.16b }, [x27], #32
+# CHECK-NEXT: [0,7]     .    . DeE---R .   add	x0, x27, #1
+# CHECK-NEXT: [0,8]     .    .  DeeeeeER   ld1	{ v1.1d, v2.1d }, [x27], x28
+# CHECK-NEXT: [0,9]     .    .   DeE---R   add	x0, x27, #1
 
 # CHECK:      Average Wait times (based on the timeline view):
 # CHECK-NEXT: [0]: Executions
@@ -1377,43 +1377,43 @@ add x0, x27, 1
 
 # CHECK:            [0]    [1]    [2]    [3]
 # CHECK-NEXT: 0.     1     1.0    1.0    0.0       ld1	{ v1.4s, v2.4s }, [x27], #32
-# CHECK-NEXT: 1.     1     2.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 2.     1     2.0    0.0    0.0       ld1	{ v1.8b, v2.8b }, [x27], #16
-# CHECK-NEXT: 3.     1     3.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 4.     1     2.0    0.0    0.0       ld1	{ v1.8h, v2.8h }, [x27], #32
-# CHECK-NEXT: 5.     1     3.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 6.     1     3.0    0.0    0.0       ld1	{ v1.16b, v2.16b }, [x27], #32
-# CHECK-NEXT: 7.     1     4.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT: 8.     1     3.0    0.0    0.0       ld1	{ v1.1d, v2.1d }, [x27], x28
-# CHECK-NEXT: 9.     1     4.0    0.0    3.0       add	x0, x27, #1
-# CHECK-NEXT:        1     2.7    0.1    1.5       <total>
+# CHECK-NEXT: 1.     1     1.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 2.     1     1.0    1.0    0.0       ld1	{ v1.8b, v2.8b }, [x27], #16
+# CHECK-NEXT: 3.     1     1.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 4.     1     1.0    1.0    0.0       ld1	{ v1.8h, v2.8h }, [x27], #32
+# CHECK-NEXT: 5.     1     1.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 6.     1     1.0    1.0    0.0       ld1	{ v1.16b, v2.16b }, [x27], #32
+# CHECK-NEXT: 7.     1     1.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT: 8.     1     1.0    1.0    0.0       ld1	{ v1.1d, v2.1d }, [x27], x28
+# CHECK-NEXT: 9.     1     1.0    0.0    3.0       add	x0, x27, #1
+# CHECK-NEXT:        1     1.0    0.5    1.5       <total>
 
 # CHECK:      [5] Code Region - G06
 
 # CHECK:      Iterations:        100
 # CHECK-NEXT: Instructions:      1000
-# CHECK-NEXT: Total Cycles:      507
+# CHECK-NEXT: Total Cycles:      1006
 # CHECK-NEXT: Total uOps:        2000
 
-# CHECK:      Dispatch Width:    8
-# CHECK-NEXT: uOps Per Cycle:    3.94
-# CHECK-NEXT: IPC:               1.97
-# CHECK-NEXT: Block RThroughput: 5.0
+# CHECK:      Dispatch Width:    3
+# CHECK-NEXT: uOps Per Cycle:    1.99
+# CHECK-NEXT: IPC:               0.99
+# CHECK-NEXT: Block RThroughput: 6.7
 
 # CHECK:      Timeline view:
-# CHECK-NEXT:                     01
+# CHECK-NEXT:                     012345
 # CHECK-NEXT: Index     0123456789
 
-# CHECK:      [0,0]     DeeeeeER  ..   ld1	{ v1.2d, v2.2d }, [x27], x28
-# CHECK-NEXT: [0,1]     D=eE---R  .. ...
[truncated]

@davemgreen
Copy link
Collaborator

This sounds sensible. What was the difference between 3 and 4 for neoverse-n1? A value of 4 would help match the SWOG (and #136374), and if there isn't much in it would make a better value I believe.

@simonwallis2
Copy link
Contributor Author

This sounds sensible. What was the difference between 3 and 4 for neoverse-n1? A value of 4 would help match the SWOG (and #136374), and if there isn't much in it would make a better value I believe.

For neoverse-n1, with a value of 4 I saw some small amount of noise but no overall improvement over the original value of 8.


def NeoverseN1Model : SchedMachineModel {
let IssueWidth = 8; // Maximum micro-ops dispatch rate.
let IssueWidth = 3; // Maximum micro-ops dispatch rate.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we update these comments to match the v2 scheduler following this discussion: #142565 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Done for AArch64SchedNeoverseN1.td and AArch64SchedNeoverseV1.td.

Recently the IssueWidth in the scheduling model was reduced for Neoverse-V2 and N2.
This patch does the same for Neoverse-V1, N1 and N3.

On Neoverse-V1, various values of IssueWidth (15, 8, 7, 6, 5)
were tried with runs of various workloads.
The highest overall geomean score was achieved with an issue width of 8.
No significant regressions were noted.

On Neoverse-N1, various values of IssueWidth (8, 6, 5, 4, 3)
were tried with runs of various workloads.
The highest overall geomean score was achieved with an issue width of 3.
No significant regressions were noted.

On Neoverse-N3, it makes sense to do exactly the same as was done for N2.
It is proposed to use an issue width of 5.

Related V2 PR: llvm#142565
Related N2 PR: llvm#145717

Change-Id: I7d1e71e0470407bfd49b535a1068ff52235e9792
@david-arm
Copy link
Contributor

Thanks for this! In the commit message it says The highest overall geomean score was achieved, but I think it might be helpful for anyone attempting to reproduce this to know which score or benchmarks were analysed.

Copy link
Collaborator

@davemgreen davemgreen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would go with a IssueWidth of 4 for NeoverseN1, as I believe that matches reality better, but either way this LGTM.

Copy link
Collaborator

@c-rhodes c-rhodes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also LGTM, cheers

@simonwallis2 simonwallis2 merged commit a044d61 into llvm:main Sep 17, 2025
9 checks passed
@vvereschaka
Copy link
Contributor

@simonwallis2 ,

https://lab.llvm.org/buildbot/#/builders/193/builds/10666

broken tests on the cross builder:

******************** TEST 'LLVM :: tools/llvm-mca/AArch64/Neoverse/V1-misc-instructions.s' FAILED ********************
Exit Code: 1
Command Output (stdout):
--
# RUN: at line 2
c:\buildbot\as-builder-2\x-aarch64\build\bin\llvm-mca.exe -mtriple=aarch64 -mcpu=neoverse-v1 -instruction-tables < C:\buildbot\as-builder-2\x-aarch64\llvm-project\llvm\test\tools\llvm-mca\AArch64\Neoverse\V1-misc-instructions.s | c:\buildbot\as-builder-2\x-aarch64\build\bin\filecheck.exe C:\buildbot\as-builder-2\x-aarch64\llvm-project\llvm\test\tools\llvm-mca\AArch64\Neoverse\V1-misc-instructions.s
# executed command: 'c:\buildbot\as-builder-2\x-aarch64\build\bin\llvm-mca.exe' -mtriple=aarch64 -mcpu=neoverse-v1 -instruction-tables
# executed command: 'c:\buildbot\as-builder-2\x-aarch64\build\bin\filecheck.exe' 'C:\buildbot\as-builder-2\x-aarch64\llvm-project\llvm\test\tools\llvm-mca\AArch64\Neoverse\V1-misc-instructions.s'
# .---command stderr------------
# | C:\buildbot\as-builder-2\x-aarch64\llvm-project\llvm\test\tools\llvm-mca\AArch64\Neoverse\V1-misc-instructions.s:33:15: error: CHECK-NEXT: expected string not found in input
# | # CHECK-NEXT: 1 1 0.12 U at s12e1r, x28
# |               ^
# | <stdin>:11:38: note: scanning from here
# | [1] [2] [3] [4] [5] [6] Instructions:
# |                                      ^
# | <stdin>:12:2: note: possible intended match here
# |  1 1 0.13 U at s12e1r, x28
# |  ^
# | 
# | Input file: <stdin>
# | Check file: C:\buildbot\as-builder-2\x-aarch64\llvm-project\llvm\test\tools\llvm-mca\AArch64\Neoverse\V1-misc-instructions.s
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |            .
# |            .
# |            .
# |            6: [3]: RThroughput 
# |            7: [4]: MayLoad 
# |            8: [5]: MayStore 
# |            9: [6]: HasSideEffects (U) 
# |           10:  
# |           11: [1] [2] [3] [4] [5] [6] Instructions: 
# | next:33'0                                          X error: no match found
# |           12:  1 1 0.13 U at s12e1r, x28 
# | next:33'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~
# | next:33'1      ?                          possible intended match
# |           13:  1 1 0.13 U brk #0x8415 
# | next:33'0     ~~~~~~~~~~~~~~~~~~~~~~~~
# |           14:  1 1 0.13 * * U clrex 
# | next:33'0     ~~~~~~~~~~~~~~~~~~~~~~
# |           15:  1 1 0.13 * * U csdb 
# | next:33'0     ~~~~~~~~~~~~~~~~~~~~~
# |           16:  1 1 0.13 U dcps1 
# | next:33'0     ~~~~~~~~~~~~~~~~~~
# |           17:  1 1 0.13 U dcps2 
# | next:33'0     ~~~~~~~~~~~~~~~~~~
# |            .
# |            .
# |            .
# | >>>>>>
# `-----------------------------
# error: command failed with exit status: 1
--
********************

would you take care of it?

@simonwallis2
Copy link
Contributor Author

@vvereschaka Thanks for highlighting this.

All 3 of the failures reported by buildbot 193 in builds 10666 onwards are similar:
#CHECK-NEXT: 1 1 0.12 U 1 MRS mrs x3, ID_AA64ZRF)_EL1
Note: possible intended match here
#CHECK-NEXT: 1 1 0.13 U 1 MRS mrs x3, ID_AA64ZRF)_EL1

For builds 10665 and earlier, the IssueWidth for V1 was 15, so the reciprocal throughput 1/15 was 0.07 to 2 decimal places.
From build 10666, after PR154495 changed the IssueWidth to 8, the reciprocal throughput is 1/8 or 0.125.

The test case looks for 0.12. It has rounded 0.125 down. This is what I see on various linux hosts I tried.
The lllcm-clang-win-x-aarch64 build bot sees 0.13. It has rounded 0.125 up.

So we have inconsistent llvm-mca behaviour between this buildbot and other machines. I did not expect that.

As a short term hack to get this buildbot green again, it would be possible to modify these 3 test cases to expect rounding up when the host OS is windows, if that is the root of the differences.
But it would be better to make llvm-mca behave consistently.

@simonwallis2
Copy link
Contributor Author

In llvm/tools/llvm-mca, there are 10 instances of
format("%.2f", …)

9 of these instances explicitly round up, using
floor((X * 100) + 0.5) / 100)

This RT output is the 10th instance.
llvm-mca/InstructionInfoView.cpp should be modified to match the other 9, and the test cases updated to match the new rounded-up value.

simonwallis2 added a commit that referenced this pull request Sep 19, 2025
Explicitly round up the reciprocal calculation,
so that .125 is displayed as 0.13 consistently across all hosts.

Fix buildbot failure
https://lab.llvm.org/buildbot/#/builders/193/builds/10666
since #154495
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Sep 19, 2025
…(#159544)

Explicitly round up the reciprocal calculation,
so that .125 is displayed as 0.13 consistently across all hosts.

Fix buildbot failure
https://lab.llvm.org/buildbot/#/builders/193/builds/10666
since llvm/llvm-project#154495
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants