[AArch64] Fix throughput of 64-bit SVE gather loads #168572
@llvm/pr-subscribers-backend-aarch64
Author: Asher Dobrescu (Asher8118)
Changes
In the Neoverse N3 Software Optimisation Guide, the SVE entries "Non temporal gather load, vector + scalar 64-bit element size" and "Gather load, vector + imm, 64-bit element size" both show a throughput of 4/5. However, the current scheduling model shows 2/3. This patch adds a new resource in order to show the correct throughput.
Patch is 1.63 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/168572.diff
5 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
index c73f60a1a7741b..13f8c1be0a9dd7 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
@@ -40,6 +40,7 @@ def N3UnitM0 : ProcResource<1>; // Integer Single/Multi-Cycle 0
def N3UnitM1 : ProcResource<1>; // Integer Single/Multi-Cycle 1
def N3UnitL01 : ProcResource<2>; // Load/Store 0/1
def N3UnitL2 : ProcResource<1>; // Load 2
+def N3UnitGL : ProcResource<4>; // Gather Load
def N3UnitD : ProcResource<2>; // Integer Store data 0/1
def N3UnitV0 : ProcResource<1>; // FP/ASIMD 0
def N3UnitV1 : ProcResource<1>; // FP/ASIMD 1
@@ -160,6 +161,12 @@ def N3Write_6c_2L : SchedWriteRes<[N3UnitL, N3UnitL]> {
let NumMicroOps = 2;
}
+def N3Write_6c_2GL : SchedWriteRes<[N3UnitL, N3UnitGL]> {
+ let Latency = 6;
+ let NumMicroOps = 2;
+ let ReleaseAtCycles = [3, 5];
+}
+
def N3Write_2c_1L01_1V : SchedWriteRes<[N3UnitL01, N3UnitV]> {
let Latency = 2;
let NumMicroOps = 2;
@@ -2243,8 +2250,8 @@ def : InstRW<[N3Write_7c_4L], (instregex "^LDNT1[BHW]_ZZR_S$",
"^LDNT1S[BH]_ZZR_S$")>;
// Non temporal gather load, vector + scalar 64-bit element size
-def : InstRW<[N3Write_6c_2L], (instregex "^LDNT1S?[BHW]_ZZR_D$")>;
-def : InstRW<[N3Write_6c_2L], (instrs LDNT1D_ZZR_D)>;
+def : InstRW<[N3Write_6c_2GL], (instregex "^LDNT1S?[BHW]_ZZR_D$")>;
+def : InstRW<[N3Write_6c_2GL], (instrs LDNT1D_ZZR_D)>;
// Contiguous first faulting load, scalar + scalar
def : InstRW<[N3Write_6c_1L], (instregex "^LDFF1[BHWD]$",
@@ -2293,11 +2300,11 @@ def : InstRW<[N3Write_7c_4L], (instregex "^GLD(FF)?1S?[BH]_S_IMM$",
"^GLD(FF)?1W_IMM$")>;
// Gather load, vector + imm, 64-bit element size
-def : InstRW<[N3Write_6c_2L], (instregex "^GLD(FF)?1S?[BHW]_D_IMM$",
+def : InstRW<[N3Write_6c_2GL], (instregex "^GLD(FF)?1S?[BHW]_D_IMM$",
"^GLD(FF)?1D_IMM$")>;
// Gather load, 64-bit element size
-def : InstRW<[N3Write_6c_2L],
+def : InstRW<[N3Write_6c_2GL],
(instregex "^GLD(FF)?1S?[BHW]_D_[SU]XTW(_SCALED)?$",
"^GLD(FF)?1S?[BHW]_D(_SCALED)?$",
"^GLD(FF)?1D_[SU]XTW(_SCALED)?$",
diff --git a/llvm/test/tools/llvm-mca/AArch64/Neoverse/N3-basic-instructions.s b/llvm/test/tools/llvm-mca/AArch64/Neoverse/N3-basic-instructions.s
index b9758280e2491e..1767d15d862ad6 100644
--- a/llvm/test/tools/llvm-mca/AArch64/Neoverse/N3-basic-instructions.s
+++ b/llvm/test/tools/llvm-mca/AArch64/Neoverse/N3-basic-instructions.s
@@ -2545,1181 +2545,1185 @@ drps
# CHECK-NEXT: [0.1] - N3UnitB
# CHECK-NEXT: [1.0] - N3UnitD
# CHECK-NEXT: [1.1] - N3UnitD
-# CHECK-NEXT: [2] - N3UnitL2
-# CHECK-NEXT: [3.0] - N3UnitL01
-# CHECK-NEXT: [3.1] - N3UnitL01
-# CHECK-NEXT: [4] - N3UnitM0
-# CHECK-NEXT: [5] - N3UnitM1
-# CHECK-NEXT: [6.0] - N3UnitS
-# CHECK-NEXT: [6.1] - N3UnitS
-# CHECK-NEXT: [7] - N3UnitV0
-# CHECK-NEXT: [8] - N3UnitV1
+# CHECK-NEXT: [2.0] - N3UnitGL
+# CHECK-NEXT: [2.1] - N3UnitGL
+# CHECK-NEXT: [2.2] - N3UnitGL
+# CHECK-NEXT: [2.3] - N3UnitGL
+# CHECK-NEXT: [3] - N3UnitL2
+# CHECK-NEXT: [4.0] - N3UnitL01
+# CHECK-NEXT: [4.1] - N3UnitL01
+# CHECK-NEXT: [5] - N3UnitM0
+# CHECK-NEXT: [6] - N3UnitM1
+# CHECK-NEXT: [7.0] - N3UnitS
+# CHECK-NEXT: [7.1] - N3UnitS
+# CHECK-NEXT: [8] - N3UnitV0
+# CHECK-NEXT: [9] - N3UnitV1
# CHECK: Resource pressure per iteration:
-# CHECK-NEXT: [0.0] [0.1] [1.0] [1.1] [2] [3.0] [3.1] [4] [5] [6.0] [6.1] [7] [8]
-# CHECK-NEXT: 11.00 11.00 33.00 33.00 99.33 163.33 163.33 357.75 212.75 156.25 156.25 184.50 64.50
+# CHECK-NEXT: [0.0] [0.1] [1.0] [1.1] [2.0] [2.1] [2.2] [2.3] [3] [4.0] [4.1] [5] [6] [7.0] [7.1] [8] [9]
+# CHECK-NEXT: 11.00 11.00 33.00 33.00 - - - - 99.33 163.33 163.33 357.75 212.75 156.25 156.25 184.50 64.50
# CHECK: Resource pressure by instruction:
-# CHECK-NEXT: [0.0] [0.1] [1.0] [1.1] [2] [3.0] [3.1] [4] [5] [6.0] [6.1] [7] [8] Instructions:
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add w2, w3, #4095
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add w30, w29, #1, lsl #12
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add w13, w5, #4095, lsl #12
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add x5, x7, #1638
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add w20, wsp, #801
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add wsp, wsp, #1104
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add wsp, w30, #4084
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add x0, x24, #291
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add x3, x24, #4095, lsl #12
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add x8, sp, #1074
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add sp, x29, #3816
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - sub w0, wsp, #4077
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - sub w4, w20, #546, lsl #12
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - sub sp, sp, #288
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - sub wsp, w19, #16
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - adds w13, w23, #291, lsl #12
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - cmn w2, #4095
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - adds w20, wsp, #0
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - cmn x3, #1, lsl #12
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - cmp sp, #20, lsl #12
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - cmp x30, #4095
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - subs x4, sp, #3822
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - cmn w3, #291, lsl #12
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - cmn wsp, #1365
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - cmn sp, #1092, lsl #12
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - mov sp, x30
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - mov wsp, w20
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - mov x11, sp
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - mov w24, wsp
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add w3, w5, w7
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add wzr, w3, w5
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add w20, wzr, w4
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add w4, w6, wzr
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add w11, w13, w15
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add w9, w3, wzr, lsl #10
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add w17, w29, w20, lsl #31
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add w21, w22, w23, lsr #0
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add w24, w25, w26, lsr #18
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add w27, w28, w29, lsr #31
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add w2, w3, w4, asr #0
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add w5, w6, w7, asr #21
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add w8, w9, w10, asr #31
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add x3, x5, x7
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add xzr, x3, x5
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add x20, xzr, x4
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add x4, x6, xzr
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - add x11, x13, x15
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add x9, x3, xzr, lsl #10
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add x17, x29, x20, lsl #63
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add x21, x22, x23, lsr #0
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add x24, x25, x26, lsr #18
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add x27, x28, x29, lsr #63
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add x2, x3, x4, asr #0
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add x5, x6, x7, asr #21
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - add x8, x9, x10, asr #63
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - adds w3, w5, w7
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - cmn w3, w5
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - adds w20, wzr, w4
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - adds w4, w6, wzr
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - adds w11, w13, w15
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds w9, w3, wzr, lsl #10
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds w17, w29, w20, lsl #31
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds w21, w22, w23, lsr #0
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds w24, w25, w26, lsr #18
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds w27, w28, w29, lsr #31
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds w2, w3, w4, asr #0
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds w5, w6, w7, asr #21
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds w8, w9, w10, asr #31
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - adds x3, x5, x7
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - cmn x3, x5
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - adds x20, xzr, x4
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - adds x4, x6, xzr
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - adds x11, x13, x15
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds x9, x3, xzr, lsl #10
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds x17, x29, x20, lsl #63
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds x21, x22, x23, lsr #0
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds x24, x25, x26, lsr #18
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds x27, x28, x29, lsr #63
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds x2, x3, x4, asr #0
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds x5, x6, x7, asr #21
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - adds x8, x9, x10, asr #63
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - sub w3, w5, w7
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - sub wzr, w3, w5
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - sub w4, w6, wzr
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - sub w11, w13, w15
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub w9, w3, wzr, lsl #10
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub w17, w29, w20, lsl #31
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub w21, w22, w23, lsr #0
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub w24, w25, w26, lsr #18
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub w27, w28, w29, lsr #31
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub w2, w3, w4, asr #0
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub w5, w6, w7, asr #21
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub w8, w9, w10, asr #31
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - sub x3, x5, x7
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - sub xzr, x3, x5
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - sub x4, x6, xzr
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - sub x11, x13, x15
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub x9, x3, xzr, lsl #10
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub x17, x29, x20, lsl #63
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub x21, x22, x23, lsr #0
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub x24, x25, x26, lsr #18
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub x27, x28, x29, lsr #63
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub x2, x3, x4, asr #0
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub x5, x6, x7, asr #21
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - sub x8, x9, x10, asr #63
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - subs w3, w5, w7
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - cmp w3, w5
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - subs w4, w6, wzr
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - subs w11, w13, w15
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - subs w9, w3, wzr, lsl #10
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - subs w17, w29, w20, lsl #31
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - subs w21, w22, w23, lsr #0
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - subs w24, w25, w26, lsr #18
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - subs w27, w28, w29, lsr #31
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - subs w2, w3, w4, asr #0
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - subs w5, w6, w7, asr #21
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - subs w8, w9, w10, asr #31
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - subs x3, x5, x7
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - cmp x3, x5
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - subs x4, x6, xzr
-# CHECK-NEXT: - - - - - - - 0.25 0.25 0.25 0.25 - - subs x11, x13, x15
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - subs x9, x3, xzr, lsl #10
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - subs x17, x29, x20, lsl #63
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - subs x21, x22, x23, lsr #0
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - subs x24, x25, x26, lsr #18
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - subs x27, x28, x29, lsr #63
-# CHECK-NEXT: - - - - - - - 0.50 0.50 - - - - subs x2, x3, x4, asr #0
-# CHECK-NEXT: - - -...
[truncated]
Why isn't it possible to get the correct throughput with the existing resources?
Because gather loads use the L pipeline (unit L), which has 3 resources. As a result, the throughput is always the result of a division by 3.
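As a rough illustration of the point above, here is a minimal Python sketch. It assumes that the reciprocal throughput contributed by a single resource is its ReleaseAtCycles value divided by its number of units, and that ReleaseAtCycles must be an integer. The constant and variable names are hypothetical; the unit count of 3 is taken from the comment above.

```python
from fractions import Fraction

L_UNITS = 3  # units in the L pipeline, per the comment above

# Reciprocal throughput if a gather load holds unit L for `cycles` cycles.
# ReleaseAtCycles is an integer, so only these values are reachable.
reachable = {cycles: Fraction(cycles, L_UNITS) for cycles in range(1, 7)}

# A throughput of 4/5 per cycle corresponds to a reciprocal throughput of 5/4.
target = Fraction(5, 4)

for cycles, rthroughput in reachable.items():
    print(f"ReleaseAtCycles={cycles}: rthroughput={float(rthroughput):.2f}")

print("target reachable:", target in reachable.values())
```

Under these assumptions, holding unit L for 2 cycles gives the 2/3 currently shown, and no integer cycle count gives 5/4.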
Doesn't that imply a bug somewhere? I don't understand why the throughput is 4/5 if it's not possible to get that with the resources as documented in the SWOG. Looking at the other Neoverse cores, they all use some of the vector pipes for these gathers. Are we sure the SWOG is correct? Also, "Non temporal gather load, vector + scalar 32-bit element size" is 4 micro-ops whereas the 64-bit element size is 2 micro-ops, which doesn't make sense.
I reasoned it would be similar to the flag-setting instructions on the V cores, where we use V#UnitFlg, which is also a resource that does not appear in the SWOG.
That is odd; I think the number of micro-ops should be the same for both 32-bit and 64-bit. I can change that as part of this patch. Done.
I think there are instances for the other Neoverse cores where 64-bit gather loads show incorrect throughput when compared to the SWOG, e.g. this load in V3.
I had a closer look at this and understand better now. I checked with the CPU folks and these instructions are 4 mops: a pair of loads and a pair of FMOVs. The throughput (4/5) in the SWOG is correct, but the vector pipes are missing from the utilized pipelines. If you look at the SWOGs of the other Neoverse cores, they all utilize the vector pipes.
Throughput is calculated here: llvm-project/llvm/lib/MC/MCSchedule.cpp Line 98 in f8eca64
so for this you added as an example:
it's roughly doing the following to get the right throughput:
I can see it's not possible to get the correct throughput with the existing resources, as the maximum number of units for all resources is 2, so to get rthroughput=1.25 would mean 1.0 / (2/2.5), i.e. a fractional ReleaseAtCycles, which isn't possible. So ultimately a resource with 4 units is required.
I did have a look and there is an alternative that would work with the existing resources:
It's not perfect, but I think it's a bit more constrained and will cause less churn in the tests at least. Not sure if this has been considered before or what others think.
As an aside, it would be good if we could just explicitly set the throughput where we can't realistically model it, instead of having to hack our way to it and potentially confuse people looking at this in the future into thinking it's rooted in reality. Not sure if that's even possible, but I think the least we could do today is make it clear in such cases and strip it back to the absolute bare minimum required to get the right value. I did see V#UnitFlg when reviewing the Neoverse V3 model recently and was a bit confused trying to understand where it came from in the SWOG until I looked at previous PRs.
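The arithmetic in the comment above can be sketched in Python. This is a simplified model, not the LLVM code: it assumes the reciprocal throughput of a write is the largest ReleaseAtCycles / NumUnits ratio over the resources it consumes. The function name and the tuple layout are hypothetical. The unit counts and cycle values come from the N3Write_6c_2GL definition in the diff (N3UnitGL has 4 units) and from the earlier remark that unit L has 3.

```python
from fractions import Fraction


def reciprocal_throughput(resources):
    """Return the reciprocal throughput for a list of
    (num_units, release_at_cycles) pairs (hypothetical layout)."""
    ratios = [Fraction(cycles, units) for units, cycles in resources if cycles > 0]
    if not ratios:
        raise ValueError("no resource consumption to derive a throughput from")
    # The most heavily loaded resource is the bottleneck.
    return max(ratios)


# N3Write_6c_2GL from the diff: unit L (3 units) held for 3 cycles,
# N3UnitGL (4 units) held for 5 cycles.
patched = reciprocal_throughput([(3, 3), (4, 5)])
print(float(patched))  # 1.25

# With a 2-unit resource, reaching 1.25 would need a fractional cycle count.
needed_cycles = Fraction(5, 4) * 2
print(needed_cycles)  # 5/2
```

In this model the 4-unit resource held for 5 cycles is the bottleneck, which is where the 1.25 comes from; unit L held for 3 cycles contributes only 1.0.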
I took a look and you're right, it makes sense to add the vector pipes.
I am happy to go with your suggestion, since I agree that it causes fewer changes in the tests.
I think a way to explicitly set the throughput would be very useful, since I found a few cases where modelling throughput using the existing resources has not been straightforward, e.g. this note:
However, I am also wary of allowing throughput to be set rather than calculated based on resources. Ideally we should be able to model behaviour according to resources, though this is not always possible, as in this example. Not sure what the best option is in such cases. Perhaps allowing some exceptions with minimal disruption, as you suggested, is our best approach.
c-rhodes
left a comment
LGTM cheers