Commit d9a4851
[libc][SVE] add sve handling for memcpy with count less than 32b (llvm#167446)
Add SVE optimization for AArch64 architectures. The idea is to use
predicate registers to avoid branching.
The microbenchmarks in the repo show considerable improvements on an NVIDIA GB10
(pinned to the largest X925 core):
```
======================================================================
BENCHMARK STATISTICS (time in nanoseconds)
======================================================================
memcpy_Google_A:
Old - Mean: 3.1257 ns, Median: 3.1162 ns
New - Mean: 2.8402 ns, Median: 2.8265 ns
Improvement: +9.14% (mean), +9.30% (median)
memcpy_Google_B:
Old - Mean: 2.3171 ns, Median: 2.3159 ns
New - Mean: 1.6589 ns, Median: 1.6593 ns
Improvement: +28.40% (mean), +28.35% (median)
memcpy_Google_D:
Old - Mean: 8.7602 ns, Median: 8.7645 ns
New - Mean: 8.4307 ns, Median: 8.4308 ns
Improvement: +3.76% (mean), +3.81% (median)
memcpy_Google_L:
Old - Mean: 1.7137 ns, Median: 1.7091 ns
New - Mean: 1.4530 ns, Median: 1.4553 ns
Improvement: +15.22% (mean), +14.85% (median)
memcpy_Google_M:
Old - Mean: 1.9823 ns, Median: 1.9825 ns
New - Mean: 1.4826 ns, Median: 1.4840 ns
Improvement: +25.20% (mean), +25.15% (median)
memcpy_Google_Q:
Old - Mean: 1.6812 ns, Median: 1.6784 ns
New - Mean: 1.1538 ns, Median: 1.1517 ns
Improvement: +31.37% (mean), +31.38% (median)
memcpy_Google_S:
Old - Mean: 2.1816 ns, Median: 2.1786 ns
New - Mean: 1.6297 ns, Median: 1.6287 ns
Improvement: +25.29% (mean), +25.24% (median)
memcpy_Google_U:
Old - Mean: 2.2851 ns, Median: 2.2825 ns
New - Mean: 1.7219 ns, Median: 1.7187 ns
Improvement: +24.65% (mean), +24.70% (median)
memcpy_Google_W:
Old - Mean: 2.0408 ns, Median: 2.0361 ns
New - Mean: 1.5260 ns, Median: 1.5252 ns
Improvement: +25.23% (mean), +25.09% (median)
uniform_384_to_4096:
Old - Mean: 26.9067 ns, Median: 26.8845 ns
New - Mean: 26.8083 ns, Median: 26.8149 ns
Improvement: +0.37% (mean), +0.26% (median)
```
The beginning of the memcpy function looks like the following:
```
Dump of assembler code for function _ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm:
0x0000000000001340 <+0>: cbz x2, 0x143c <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+252>
0x0000000000001344 <+4>: cbz x0, 0x1440 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+256>
0x0000000000001348 <+8>: cbz x1, 0x1444 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+260>
0x000000000000134c <+12>: subs x8, x2, #0x20
0x0000000000001350 <+16>: b.hi 0x1374 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+52> // b.pmore
0x0000000000001354 <+20>: rdvl x8, #1
0x0000000000001358 <+24>: whilelo p0.b, xzr, x2
0x000000000000135c <+28>: ld1b {z0.b}, p0/z, [x1]
0x0000000000001360 <+32>: whilelo p1.b, x8, x2
0x0000000000001364 <+36>: ld1b {z1.b}, p1/z, [x1, #1, mul vl]
0x0000000000001368 <+40>: st1b {z0.b}, p0, [x0]
0x000000000000136c <+44>: st1b {z1.b}, p1, [x0, #1, mul vl]
0x0000000000001370 <+48>: ret
```
---------
Co-authored-by: Guillaume Chatelet <chatelet.guillaume@gmail.com>
1 parent 989f736 commit d9a4851
1 file changed, +25 -1 lines changed