[AArch64] Update IssueWidth for Neoverse V1, N1, N3 #154495
Conversation
Recently the IssueWidth in the scheduling model was reduced for Neoverse-V2 and N2. This patch does the same for Neoverse-V1, N1 and N3.

On Neoverse-V1, various values of IssueWidth (15, 8, 7, 6, 5) were tried with runs of various workloads. The highest overall geomean score was achieved with an issue width of 8. No significant regressions were noted.

On Neoverse-N1, various values of IssueWidth (8, 6, 5, 4, 3) were tried with runs of various workloads. The highest overall geomean score was achieved with an issue width of 3. No significant regressions were noted.

On Neoverse-N3, it makes sense to do exactly the same as was done for N2. It is proposed to use an issue width of 5.

Related V2 PR: llvm#142565
Related N2 PR: llvm#145717

Change-Id: I7d17359e7a4ae7a527e321a208bdf96893fe50f9
@llvm/pr-subscribers-backend-aarch64

Author: Simon Wallis (simonwallis2)

Changes

Recently the IssueWidth in the scheduling model was reduced for Neoverse-V2 and N2. This patch does the same for Neoverse-V1, N1 and N3.

On Neoverse-V1, various values of IssueWidth (15, 8, 7, 6, 5) were tried with runs of various workloads.

On Neoverse-N1, various values of IssueWidth (8, 6, 5, 4, 3) were tried with runs of various workloads.

On Neoverse-N3, it makes sense to do exactly the same as was done for N2. It is proposed to use an issue width of 5.

Related V2 PR: #142565

Patch is 964.04 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/154495.diff

12 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN1.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN1.td
index 524fa33f498bb..9da42322dd10d 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN1.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN1.td
@@ -15,7 +15,7 @@
//===----------------------------------------------------------------------===//
def NeoverseN1Model : SchedMachineModel {
- let IssueWidth = 8; // Maximum micro-ops dispatch rate.
+ let IssueWidth = 3; // Maximum micro-ops dispatch rate.
let MicroOpBufferSize = 128; // NOTE: Copied from Cortex-A76.
let LoadLatency = 4; // Optimistic load latency.
let MispredictPenalty = 11; // Cycles cost of branch mispredicted.
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
index e44d40f8d7020..cd0d8a9186d5b 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
@@ -11,7 +11,7 @@
//===----------------------------------------------------------------------===//
def NeoverseN3Model : SchedMachineModel {
- let IssueWidth = 10; // Micro-ops dispatched at a time.
+ let IssueWidth = 5; // Micro-ops dispatched at a time.
let MicroOpBufferSize = 160; // Entries in micro-op re-order buffer. NOTE: Copied from N2.
let LoadLatency = 4; // Optimistic load latency.
let MispredictPenalty = 10; // Extra cycles for mispredicted branch. NOTE: Copied from N2.
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
index 44625a2034d9d..b78c5e90d6338 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
@@ -19,7 +19,7 @@
//===----------------------------------------------------------------------===//
def NeoverseV1Model : SchedMachineModel {
- let IssueWidth = 15; // Maximum micro-ops dispatch rate.
+ let IssueWidth = 8; // Maximum micro-ops dispatch rate.
let MicroOpBufferSize = 256; // Micro-op re-order buffer.
let LoadLatency = 4; // Optimistic load latency.
let MispredictPenalty = 11; // Cycles cost of branch mispredicted.
diff --git a/llvm/test/tools/llvm-mca/AArch64/Neoverse/N1-writeback.s b/llvm/test/tools/llvm-mca/AArch64/Neoverse/N1-writeback.s
index 8fe21167a5bd3..127c8c30fc2c6 100644
--- a/llvm/test/tools/llvm-mca/AArch64/Neoverse/N1-writeback.s
+++ b/llvm/test/tools/llvm-mca/AArch64/Neoverse/N1-writeback.s
@@ -1165,10 +1165,10 @@ add x0, x27, 1
# CHECK-NEXT: Total Cycles: 507
# CHECK-NEXT: Total uOps: 1500
-# CHECK: Dispatch Width: 8
+# CHECK: Dispatch Width: 3
# CHECK-NEXT: uOps Per Cycle: 2.96
# CHECK-NEXT: IPC: 1.97
-# CHECK-NEXT: Block RThroughput: 3.3
+# CHECK-NEXT: Block RThroughput: 5.0
# CHECK: Timeline view:
# CHECK-NEXT: 01
@@ -1176,14 +1176,14 @@ add x0, x27, 1
# CHECK: [0,0] DeeeeeER .. ld1 { v1.1d }, [x27], #8
# CHECK-NEXT: [0,1] D=eE---R .. add x0, x27, #1
-# CHECK-NEXT: [0,2] D=eeeeeER .. ld1 { v1.2d }, [x27], #16
-# CHECK-NEXT: [0,3] D==eE---R .. add x0, x27, #1
-# CHECK-NEXT: [0,4] D==eeeeeER.. ld1 { v1.2s }, [x27], #8
-# CHECK-NEXT: [0,5] .D==eE---R.. add x0, x27, #1
-# CHECK-NEXT: [0,6] .D==eeeeeER. ld1 { v1.4h }, [x27], #8
-# CHECK-NEXT: [0,7] .D===eE---R. add x0, x27, #1
-# CHECK-NEXT: [0,8] .D===eeeeeER ld1 { v1.4s }, [x27], #16
-# CHECK-NEXT: [0,9] .D====eE---R add x0, x27, #1
+# CHECK-NEXT: [0,2] .DeeeeeER .. ld1 { v1.2d }, [x27], #16
+# CHECK-NEXT: [0,3] .D=eE---R .. add x0, x27, #1
+# CHECK-NEXT: [0,4] . DeeeeeER.. ld1 { v1.2s }, [x27], #8
+# CHECK-NEXT: [0,5] . D=eE---R.. add x0, x27, #1
+# CHECK-NEXT: [0,6] . DeeeeeER. ld1 { v1.4h }, [x27], #8
+# CHECK-NEXT: [0,7] . D=eE---R. add x0, x27, #1
+# CHECK-NEXT: [0,8] . DeeeeeER ld1 { v1.4s }, [x27], #16
+# CHECK-NEXT: [0,9] . D=eE---R add x0, x27, #1
# CHECK: Average Wait times (based on the timeline view):
# CHECK-NEXT: [0]: Executions
@@ -1194,15 +1194,15 @@ add x0, x27, 1
# CHECK: [0] [1] [2] [3]
# CHECK-NEXT: 0. 1 1.0 1.0 0.0 ld1 { v1.1d }, [x27], #8
# CHECK-NEXT: 1. 1 2.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 2. 1 2.0 0.0 0.0 ld1 { v1.2d }, [x27], #16
-# CHECK-NEXT: 3. 1 3.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 4. 1 3.0 0.0 0.0 ld1 { v1.2s }, [x27], #8
-# CHECK-NEXT: 5. 1 3.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 6. 1 3.0 0.0 0.0 ld1 { v1.4h }, [x27], #8
-# CHECK-NEXT: 7. 1 4.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 8. 1 4.0 0.0 0.0 ld1 { v1.4s }, [x27], #16
-# CHECK-NEXT: 9. 1 5.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 1 3.0 0.1 1.5 <total>
+# CHECK-NEXT: 2. 1 1.0 0.0 0.0 ld1 { v1.2d }, [x27], #16
+# CHECK-NEXT: 3. 1 2.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 4. 1 1.0 0.0 0.0 ld1 { v1.2s }, [x27], #8
+# CHECK-NEXT: 5. 1 2.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 6. 1 1.0 0.0 0.0 ld1 { v1.4h }, [x27], #8
+# CHECK-NEXT: 7. 1 2.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 8. 1 1.0 0.0 0.0 ld1 { v1.4s }, [x27], #16
+# CHECK-NEXT: 9. 1 2.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 1 1.5 0.1 1.5 <total>
# CHECK: [1] Code Region - G02
@@ -1211,10 +1211,10 @@ add x0, x27, 1
# CHECK-NEXT: Total Cycles: 507
# CHECK-NEXT: Total uOps: 1500
-# CHECK: Dispatch Width: 8
+# CHECK: Dispatch Width: 3
# CHECK-NEXT: uOps Per Cycle: 2.96
# CHECK-NEXT: IPC: 1.97
-# CHECK-NEXT: Block RThroughput: 3.3
+# CHECK-NEXT: Block RThroughput: 5.0
# CHECK: Timeline view:
# CHECK-NEXT: 01
@@ -1222,14 +1222,14 @@ add x0, x27, 1
# CHECK: [0,0] DeeeeeER .. ld1 { v1.8b }, [x27], #8
# CHECK-NEXT: [0,1] D=eE---R .. add x0, x27, #1
-# CHECK-NEXT: [0,2] D=eeeeeER .. ld1 { v1.8h }, [x27], #16
-# CHECK-NEXT: [0,3] D==eE---R .. add x0, x27, #1
-# CHECK-NEXT: [0,4] D==eeeeeER.. ld1 { v1.16b }, [x27], #16
-# CHECK-NEXT: [0,5] .D==eE---R.. add x0, x27, #1
-# CHECK-NEXT: [0,6] .D==eeeeeER. ld1 { v1.1d }, [x27], x28
-# CHECK-NEXT: [0,7] .D===eE---R. add x0, x27, #1
-# CHECK-NEXT: [0,8] .D===eeeeeER ld1 { v1.2d }, [x27], x28
-# CHECK-NEXT: [0,9] .D====eE---R add x0, x27, #1
+# CHECK-NEXT: [0,2] .DeeeeeER .. ld1 { v1.8h }, [x27], #16
+# CHECK-NEXT: [0,3] .D=eE---R .. add x0, x27, #1
+# CHECK-NEXT: [0,4] . DeeeeeER.. ld1 { v1.16b }, [x27], #16
+# CHECK-NEXT: [0,5] . D=eE---R.. add x0, x27, #1
+# CHECK-NEXT: [0,6] . DeeeeeER. ld1 { v1.1d }, [x27], x28
+# CHECK-NEXT: [0,7] . D=eE---R. add x0, x27, #1
+# CHECK-NEXT: [0,8] . DeeeeeER ld1 { v1.2d }, [x27], x28
+# CHECK-NEXT: [0,9] . D=eE---R add x0, x27, #1
# CHECK: Average Wait times (based on the timeline view):
# CHECK-NEXT: [0]: Executions
@@ -1240,15 +1240,15 @@ add x0, x27, 1
# CHECK: [0] [1] [2] [3]
# CHECK-NEXT: 0. 1 1.0 1.0 0.0 ld1 { v1.8b }, [x27], #8
# CHECK-NEXT: 1. 1 2.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 2. 1 2.0 0.0 0.0 ld1 { v1.8h }, [x27], #16
-# CHECK-NEXT: 3. 1 3.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 4. 1 3.0 0.0 0.0 ld1 { v1.16b }, [x27], #16
-# CHECK-NEXT: 5. 1 3.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 6. 1 3.0 0.0 0.0 ld1 { v1.1d }, [x27], x28
-# CHECK-NEXT: 7. 1 4.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 8. 1 4.0 0.0 0.0 ld1 { v1.2d }, [x27], x28
-# CHECK-NEXT: 9. 1 5.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 1 3.0 0.1 1.5 <total>
+# CHECK-NEXT: 2. 1 1.0 0.0 0.0 ld1 { v1.8h }, [x27], #16
+# CHECK-NEXT: 3. 1 2.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 4. 1 1.0 0.0 0.0 ld1 { v1.16b }, [x27], #16
+# CHECK-NEXT: 5. 1 2.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 6. 1 1.0 0.0 0.0 ld1 { v1.1d }, [x27], x28
+# CHECK-NEXT: 7. 1 2.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 8. 1 1.0 0.0 0.0 ld1 { v1.2d }, [x27], x28
+# CHECK-NEXT: 9. 1 2.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 1 1.5 0.1 1.5 <total>
# CHECK: [2] Code Region - G03
@@ -1257,10 +1257,10 @@ add x0, x27, 1
# CHECK-NEXT: Total Cycles: 507
# CHECK-NEXT: Total uOps: 1500
-# CHECK: Dispatch Width: 8
+# CHECK: Dispatch Width: 3
# CHECK-NEXT: uOps Per Cycle: 2.96
# CHECK-NEXT: IPC: 1.97
-# CHECK-NEXT: Block RThroughput: 3.3
+# CHECK-NEXT: Block RThroughput: 5.0
# CHECK: Timeline view:
# CHECK-NEXT: 01
@@ -1268,14 +1268,14 @@ add x0, x27, 1
# CHECK: [0,0] DeeeeeER .. ld1 { v1.2s }, [x27], x28
# CHECK-NEXT: [0,1] D=eE---R .. add x0, x27, #1
-# CHECK-NEXT: [0,2] D=eeeeeER .. ld1 { v1.4h }, [x27], x28
-# CHECK-NEXT: [0,3] D==eE---R .. add x0, x27, #1
-# CHECK-NEXT: [0,4] D==eeeeeER.. ld1 { v1.4s }, [x27], x28
-# CHECK-NEXT: [0,5] .D==eE---R.. add x0, x27, #1
-# CHECK-NEXT: [0,6] .D==eeeeeER. ld1 { v1.8b }, [x27], x28
-# CHECK-NEXT: [0,7] .D===eE---R. add x0, x27, #1
-# CHECK-NEXT: [0,8] .D===eeeeeER ld1 { v1.8h }, [x27], x28
-# CHECK-NEXT: [0,9] .D====eE---R add x0, x27, #1
+# CHECK-NEXT: [0,2] .DeeeeeER .. ld1 { v1.4h }, [x27], x28
+# CHECK-NEXT: [0,3] .D=eE---R .. add x0, x27, #1
+# CHECK-NEXT: [0,4] . DeeeeeER.. ld1 { v1.4s }, [x27], x28
+# CHECK-NEXT: [0,5] . D=eE---R.. add x0, x27, #1
+# CHECK-NEXT: [0,6] . DeeeeeER. ld1 { v1.8b }, [x27], x28
+# CHECK-NEXT: [0,7] . D=eE---R. add x0, x27, #1
+# CHECK-NEXT: [0,8] . DeeeeeER ld1 { v1.8h }, [x27], x28
+# CHECK-NEXT: [0,9] . D=eE---R add x0, x27, #1
# CHECK: Average Wait times (based on the timeline view):
# CHECK-NEXT: [0]: Executions
@@ -1286,42 +1286,42 @@ add x0, x27, 1
# CHECK: [0] [1] [2] [3]
# CHECK-NEXT: 0. 1 1.0 1.0 0.0 ld1 { v1.2s }, [x27], x28
# CHECK-NEXT: 1. 1 2.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 2. 1 2.0 0.0 0.0 ld1 { v1.4h }, [x27], x28
-# CHECK-NEXT: 3. 1 3.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 4. 1 3.0 0.0 0.0 ld1 { v1.4s }, [x27], x28
-# CHECK-NEXT: 5. 1 3.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 6. 1 3.0 0.0 0.0 ld1 { v1.8b }, [x27], x28
-# CHECK-NEXT: 7. 1 4.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 8. 1 4.0 0.0 0.0 ld1 { v1.8h }, [x27], x28
-# CHECK-NEXT: 9. 1 5.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 1 3.0 0.1 1.5 <total>
+# CHECK-NEXT: 2. 1 1.0 0.0 0.0 ld1 { v1.4h }, [x27], x28
+# CHECK-NEXT: 3. 1 2.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 4. 1 1.0 0.0 0.0 ld1 { v1.4s }, [x27], x28
+# CHECK-NEXT: 5. 1 2.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 6. 1 1.0 0.0 0.0 ld1 { v1.8b }, [x27], x28
+# CHECK-NEXT: 7. 1 2.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 8. 1 1.0 0.0 0.0 ld1 { v1.8h }, [x27], x28
+# CHECK-NEXT: 9. 1 2.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 1 1.5 0.1 1.5 <total>
# CHECK: [3] Code Region - G04
# CHECK: Iterations: 100
# CHECK-NEXT: Instructions: 1000
-# CHECK-NEXT: Total Cycles: 507
+# CHECK-NEXT: Total Cycles: 906
# CHECK-NEXT: Total uOps: 1900
-# CHECK: Dispatch Width: 8
-# CHECK-NEXT: uOps Per Cycle: 3.75
-# CHECK-NEXT: IPC: 1.97
-# CHECK-NEXT: Block RThroughput: 4.5
+# CHECK: Dispatch Width: 3
+# CHECK-NEXT: uOps Per Cycle: 2.10
+# CHECK-NEXT: IPC: 1.10
+# CHECK-NEXT: Block RThroughput: 6.3
# CHECK: Timeline view:
-# CHECK-NEXT: 01
+# CHECK-NEXT: 01234
# CHECK-NEXT: Index 0123456789
-# CHECK: [0,0] DeeeeeER .. ld1 { v1.16b }, [x27], x28
-# CHECK-NEXT: [0,1] D=eE---R .. add x0, x27, #1
-# CHECK-NEXT: [0,2] D=eeeeeER .. ld1 { v1.1d, v2.1d }, [x27], #16
-# CHECK-NEXT: [0,3] D==eE---R .. add x0, x27, #1
-# CHECK-NEXT: [0,4] .D=eeeeeER.. ld1 { v1.2d, v2.2d }, [x27], #32
-# CHECK-NEXT: [0,5] .D==eE---R.. add x0, x27, #1
-# CHECK-NEXT: [0,6] .D==eeeeeER. ld1 { v1.2s, v2.2s }, [x27], #16
-# CHECK-NEXT: [0,7] .D===eE---R. add x0, x27, #1
-# CHECK-NEXT: [0,8] . D==eeeeeER ld1 { v1.4h, v2.4h }, [x27], #16
-# CHECK-NEXT: [0,9] . D===eE---R add x0, x27, #1
+# CHECK: [0,0] DeeeeeER . . ld1 { v1.16b }, [x27], x28
+# CHECK-NEXT: [0,1] D=eE---R . . add x0, x27, #1
+# CHECK-NEXT: [0,2] .DeeeeeER . . ld1 { v1.1d, v2.1d }, [x27], #16
+# CHECK-NEXT: [0,3] . DeE---R . . add x0, x27, #1
+# CHECK-NEXT: [0,4] . DeeeeeER . ld1 { v1.2d, v2.2d }, [x27], #32
+# CHECK-NEXT: [0,5] . DeE---R . add x0, x27, #1
+# CHECK-NEXT: [0,6] . DeeeeeER . ld1 { v1.2s, v2.2s }, [x27], #16
+# CHECK-NEXT: [0,7] . .DeE---R . add x0, x27, #1
+# CHECK-NEXT: [0,8] . . DeeeeeER ld1 { v1.4h, v2.4h }, [x27], #16
+# CHECK-NEXT: [0,9] . . DeE---R add x0, x27, #1
# CHECK: Average Wait times (based on the timeline view):
# CHECK-NEXT: [0]: Executions
@@ -1332,42 +1332,42 @@ add x0, x27, 1
# CHECK: [0] [1] [2] [3]
# CHECK-NEXT: 0. 1 1.0 1.0 0.0 ld1 { v1.16b }, [x27], x28
# CHECK-NEXT: 1. 1 2.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 2. 1 2.0 0.0 0.0 ld1 { v1.1d, v2.1d }, [x27], #16
-# CHECK-NEXT: 3. 1 3.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 4. 1 2.0 0.0 0.0 ld1 { v1.2d, v2.2d }, [x27], #32
-# CHECK-NEXT: 5. 1 3.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 6. 1 3.0 0.0 0.0 ld1 { v1.2s, v2.2s }, [x27], #16
-# CHECK-NEXT: 7. 1 4.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 8. 1 3.0 0.0 0.0 ld1 { v1.4h, v2.4h }, [x27], #16
-# CHECK-NEXT: 9. 1 4.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 1 2.7 0.1 1.5 <total>
+# CHECK-NEXT: 2. 1 1.0 0.0 0.0 ld1 { v1.1d, v2.1d }, [x27], #16
+# CHECK-NEXT: 3. 1 1.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 4. 1 1.0 1.0 0.0 ld1 { v1.2d, v2.2d }, [x27], #32
+# CHECK-NEXT: 5. 1 1.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 6. 1 1.0 1.0 0.0 ld1 { v1.2s, v2.2s }, [x27], #16
+# CHECK-NEXT: 7. 1 1.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 8. 1 1.0 1.0 0.0 ld1 { v1.4h, v2.4h }, [x27], #16
+# CHECK-NEXT: 9. 1 1.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 1 1.1 0.4 1.5 <total>
# CHECK: [4] Code Region - G05
# CHECK: Iterations: 100
# CHECK-NEXT: Instructions: 1000
-# CHECK-NEXT: Total Cycles: 507
+# CHECK-NEXT: Total Cycles: 1006
# CHECK-NEXT: Total uOps: 2000
-# CHECK: Dispatch Width: 8
-# CHECK-NEXT: uOps Per Cycle: 3.94
-# CHECK-NEXT: IPC: 1.97
-# CHECK-NEXT: Block RThroughput: 5.0
+# CHECK: Dispatch Width: 3
+# CHECK-NEXT: uOps Per Cycle: 1.99
+# CHECK-NEXT: IPC: 0.99
+# CHECK-NEXT: Block RThroughput: 6.7
# CHECK: Timeline view:
-# CHECK-NEXT: 01
+# CHECK-NEXT: 012345
# CHECK-NEXT: Index 0123456789
-# CHECK: [0,0] DeeeeeER .. ld1 { v1.4s, v2.4s }, [x27], #32
-# CHECK-NEXT: [0,1] D=eE---R .. add x0, x27, #1
-# CHECK-NEXT: [0,2] D=eeeeeER .. ld1 { v1.8b, v2.8b }, [x27], #16
-# CHECK-NEXT: [0,3] D==eE---R .. add x0, x27, #1
-# CHECK-NEXT: [0,4] .D=eeeeeER.. ld1 { v1.8h, v2.8h }, [x27], #32
-# CHECK-NEXT: [0,5] .D==eE---R.. add x0, x27, #1
-# CHECK-NEXT: [0,6] .D==eeeeeER. ld1 { v1.16b, v2.16b }, [x27], #32
-# CHECK-NEXT: [0,7] .D===eE---R. add x0, x27, #1
-# CHECK-NEXT: [0,8] . D==eeeeeER ld1 { v1.1d, v2.1d }, [x27], x28
-# CHECK-NEXT: [0,9] . D===eE---R add x0, x27, #1
+# CHECK: [0,0] DeeeeeER . . ld1 { v1.4s, v2.4s }, [x27], #32
+# CHECK-NEXT: [0,1] .DeE---R . . add x0, x27, #1
+# CHECK-NEXT: [0,2] . DeeeeeER. . ld1 { v1.8b, v2.8b }, [x27], #16
+# CHECK-NEXT: [0,3] . DeE---R. . add x0, x27, #1
+# CHECK-NEXT: [0,4] . DeeeeeER . ld1 { v1.8h, v2.8h }, [x27], #32
+# CHECK-NEXT: [0,5] . DeE---R . add x0, x27, #1
+# CHECK-NEXT: [0,6] . .DeeeeeER . ld1 { v1.16b, v2.16b }, [x27], #32
+# CHECK-NEXT: [0,7] . . DeE---R . add x0, x27, #1
+# CHECK-NEXT: [0,8] . . DeeeeeER ld1 { v1.1d, v2.1d }, [x27], x28
+# CHECK-NEXT: [0,9] . . DeE---R add x0, x27, #1
# CHECK: Average Wait times (based on the timeline view):
# CHECK-NEXT: [0]: Executions
@@ -1377,43 +1377,43 @@ add x0, x27, 1
# CHECK: [0] [1] [2] [3]
# CHECK-NEXT: 0. 1 1.0 1.0 0.0 ld1 { v1.4s, v2.4s }, [x27], #32
-# CHECK-NEXT: 1. 1 2.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 2. 1 2.0 0.0 0.0 ld1 { v1.8b, v2.8b }, [x27], #16
-# CHECK-NEXT: 3. 1 3.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 4. 1 2.0 0.0 0.0 ld1 { v1.8h, v2.8h }, [x27], #32
-# CHECK-NEXT: 5. 1 3.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 6. 1 3.0 0.0 0.0 ld1 { v1.16b, v2.16b }, [x27], #32
-# CHECK-NEXT: 7. 1 4.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 8. 1 3.0 0.0 0.0 ld1 { v1.1d, v2.1d }, [x27], x28
-# CHECK-NEXT: 9. 1 4.0 0.0 3.0 add x0, x27, #1
-# CHECK-NEXT: 1 2.7 0.1 1.5 <total>
+# CHECK-NEXT: 1. 1 1.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 2. 1 1.0 1.0 0.0 ld1 { v1.8b, v2.8b }, [x27], #16
+# CHECK-NEXT: 3. 1 1.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 4. 1 1.0 1.0 0.0 ld1 { v1.8h, v2.8h }, [x27], #32
+# CHECK-NEXT: 5. 1 1.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 6. 1 1.0 1.0 0.0 ld1 { v1.16b, v2.16b }, [x27], #32
+# CHECK-NEXT: 7. 1 1.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 8. 1 1.0 1.0 0.0 ld1 { v1.1d, v2.1d }, [x27], x28
+# CHECK-NEXT: 9. 1 1.0 0.0 3.0 add x0, x27, #1
+# CHECK-NEXT: 1 1.0 0.5 1.5 <total>
# CHECK: [5] Code Region - G06
# CHECK: Iterations: 100
# CHECK-NEXT: Instructions: 1000
-# CHECK-NEXT: Total Cycles: 507
+# CHECK-NEXT: Total Cycles: 1006
# CHECK-NEXT: Total uOps: 2000
-# CHECK: Dispatch Width: 8
-# CHECK-NEXT: uOps Per Cycle: 3.94
-# CHECK-NEXT: IPC: 1.97
-# CHECK-NEXT: Block RThroughput: 5.0
+# CHECK: Dispatch Width: 3
+# CHECK-NEXT: uOps Per Cycle: 1.99
+# CHECK-NEXT: IPC: 0.99
+# CHECK-NEXT: Block RThroughput: 6.7
# CHECK: Timeline view:
-# CHECK-NEXT: 01
+# CHECK-NEXT: 012345
# CHECK-NEXT: Index 0123456789
-# CHECK: [0,0] DeeeeeER .. ld1 { v1.2d, v2.2d }, [x27], x28
-# CHECK-NEXT: [0,1] D=eE---R .. ...
[truncated]
This sounds sensible. What was the difference between 3 and 4 for Neoverse-N1? A value of 4 would help match the SWOG (and #136374), and if there isn't much in it, I believe it would be the better value.
For Neoverse-N1, with a value of 4 I saw a small amount of noise but no overall improvement over the original value of 8.
def NeoverseN1Model : SchedMachineModel {
- let IssueWidth = 8; // Maximum micro-ops dispatch rate.
+ let IssueWidth = 3; // Maximum micro-ops dispatch rate.
Can we update these comments to match the V2 scheduler, following this discussion: #142565 (comment)?
Sure. Done for AArch64SchedNeoverseN1.td and AArch64SchedNeoverseV1.td.
Recently the IssueWidth in the scheduling model was reduced for Neoverse-V2 and N2. This patch does the same for Neoverse-V1, N1 and N3.

On Neoverse-V1, various values of IssueWidth (15, 8, 7, 6, 5) were tried with runs of various workloads. The highest overall geomean score was achieved with an issue width of 8. No significant regressions were noted.

On Neoverse-N1, various values of IssueWidth (8, 6, 5, 4, 3) were tried with runs of various workloads. The highest overall geomean score was achieved with an issue width of 3. No significant regressions were noted.

On Neoverse-N3, it makes sense to do exactly the same as was done for N2. It is proposed to use an issue width of 5.

Related V2 PR: llvm#142565
Related N2 PR: llvm#145717

Change-Id: I7d1e71e0470407bfd49b535a1068ff52235e9792
Thanks for this! In the commit message it says …
I would go with an IssueWidth of 4 for Neoverse-N1, as I believe that matches reality better, but either way this LGTM.
Also LGTM, cheers
Broken tests on the cross builder: https://lab.llvm.org/buildbot/#/builders/193/builds/10666. Would you take care of it?
@vvereschaka Thanks for highlighting this. All 3 of the failures reported by buildbot 193 in builds 10666 onwards are similar. For builds 10665 and earlier, the IssueWidth for V1 was 15, so the reciprocal throughput 1/15 printed as 0.07 to 2 decimal places. With the new IssueWidth of 8, the reciprocal throughput is 1/8 = 0.125, and the test cases expect 0.12, i.e. 0.125 rounded down. That is what I see on the various Linux hosts I tried, so we have inconsistent llvm-mca behaviour between this buildbot and other machines, which I did not expect. As a short-term hack to get this buildbot green again, it would be possible to modify these 3 test cases to expect rounding up when the host OS is Windows, if that is the root of the difference.
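For concreteness: in the failing tests the block reciprocal throughput is dispatch-limited, so with the new IssueWidth of 8 it is 1/8 = 0.125, exactly halfway between 0.12 and 0.13 at two decimal places. A minimal standalone C++ sketch (not llvm-mca source) showing why the printed value can differ between hosts:

```cpp
// Standalone sketch (not llvm-mca source): 1/8 is exactly representable in
// binary floating point, so printing it to two decimals is purely a
// tie-breaking decision made by the host's formatting routine.
#include <cstdio>

int main() {
  double RThroughput = 1.0 / 8.0; // exactly 0.125
  // Ties-to-even prints "0.12"; rounding the tie upward prints "0.13".
  // Which one you get depends on the host C library, which would explain
  // the 0.12 vs 0.13 mismatch between hosts described above.
  std::printf("%.2f\n", RThroughput);
  return 0;
}
```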
In llvm/tools/llvm-mca, there are 10 instances of this reciprocal calculation. 9 of these instances explicitly round up; this RT output is the 10th instance.
Explicitly round up the reciprocal calculation, so that .125 is displayed as 0.13 consistently across all hosts. Fix buildbot failure https://lab.llvm.org/buildbot/#/builders/193/builds/10666 since #154495
…(#159544) Explicitly round up the reciprocal calculation, so that .125 is displayed as 0.13 consistently across all hosts. Fix buildbot failure https://lab.llvm.org/buildbot/#/builders/193/builds/10666 since llvm/llvm-project#154495
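A minimal sketch of the explicit round-up approach described in the follow-up, assuming the value is rounded to two decimal places before formatting; the helper name is hypothetical and this is not the actual llvm-mca change:

```cpp
// Illustrative only: break the half-way case upward before printing, so
// 0.125 renders as 0.13 on every host instead of depending on the host
// printf's tie-breaking.
#include <cmath>
#include <cstdio>

// Hypothetical helper, not the actual llvm-mca function.
static double roundHalfUpTwoDecimals(double Value) {
  return std::floor(Value * 100.0 + 0.5) / 100.0;
}

int main() {
  std::printf("%.2f\n", roundHalfUpTwoDecimals(1.0 / 8.0));  // prints 0.13
  std::printf("%.2f\n", roundHalfUpTwoDecimals(1.0 / 15.0)); // prints 0.07
  return 0;
}
```

Rounding before formatting removes the dependence on how the host library treats ties, which is the inconsistency the buildbot exposed.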