
Conversation

@SchrodingerZhu
Contributor

@SchrodingerZhu SchrodingerZhu commented Nov 11, 2025

Add an SVE optimization for AArch64 architectures. The idea is to use predicate registers to avoid branching.
The microbenchmark in the repo shows considerable improvements on an NV GB10 (locked to the largest X925 core):

======================================================================
BENCHMARK STATISTICS (time in nanoseconds)
======================================================================

memcpy_Google_A:
  Old - Mean: 3.1257 ns, Median: 3.1162 ns
  New - Mean: 2.8402 ns, Median: 2.8265 ns
  Improvement: +9.14% (mean), +9.30% (median)

memcpy_Google_B:
  Old - Mean: 2.3171 ns, Median: 2.3159 ns
  New - Mean: 1.6589 ns, Median: 1.6593 ns
  Improvement: +28.40% (mean), +28.35% (median)

memcpy_Google_D:
  Old - Mean: 8.7602 ns, Median: 8.7645 ns
  New - Mean: 8.4307 ns, Median: 8.4308 ns
  Improvement: +3.76% (mean), +3.81% (median)

memcpy_Google_L:
  Old - Mean: 1.7137 ns, Median: 1.7091 ns
  New - Mean: 1.4530 ns, Median: 1.4553 ns
  Improvement: +15.22% (mean), +14.85% (median)

memcpy_Google_M:
  Old - Mean: 1.9823 ns, Median: 1.9825 ns
  New - Mean: 1.4826 ns, Median: 1.4840 ns
  Improvement: +25.20% (mean), +25.15% (median)

memcpy_Google_Q:
  Old - Mean: 1.6812 ns, Median: 1.6784 ns
  New - Mean: 1.1538 ns, Median: 1.1517 ns
  Improvement: +31.37% (mean), +31.38% (median)

memcpy_Google_S:
  Old - Mean: 2.1816 ns, Median: 2.1786 ns
  New - Mean: 1.6297 ns, Median: 1.6287 ns
  Improvement: +25.29% (mean), +25.24% (median)

memcpy_Google_U:
  Old - Mean: 2.2851 ns, Median: 2.2825 ns
  New - Mean: 1.7219 ns, Median: 1.7187 ns
  Improvement: +24.65% (mean), +24.70% (median)

memcpy_Google_W:
  Old - Mean: 2.0408 ns, Median: 2.0361 ns
  New - Mean: 1.5260 ns, Median: 1.5252 ns
  Improvement: +25.23% (mean), +25.09% (median)

uniform_384_to_4096:
  Old - Mean: 26.9067 ns, Median: 26.8845 ns
  New - Mean: 26.8083 ns, Median: 26.8149 ns
  Improvement: +0.37% (mean), +0.26% (median)

The beginning of the memcpy function looks like the following:

Dump of assembler code for function _ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm:
   0x0000000000001340 <+0>:     cbz     x2, 0x143c <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+252>
   0x0000000000001344 <+4>:     cbz     x0, 0x1440 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+256>
   0x0000000000001348 <+8>:     cbz     x1, 0x1444 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+260>
   0x000000000000134c <+12>:    subs    x8, x2, #0x20
   0x0000000000001350 <+16>:    b.hi    0x1374 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+52>  // b.pmore
   0x0000000000001354 <+20>:    rdvl    x8, #1
   0x0000000000001358 <+24>:    whilelo p0.b, xzr, x2
   0x000000000000135c <+28>:    ld1b    {z0.b}, p0/z, [x1]
   0x0000000000001360 <+32>:    whilelo p1.b, x8, x2
   0x0000000000001364 <+36>:    ld1b    {z1.b}, p1/z, [x1, #1, mul vl]
   0x0000000000001368 <+40>:    st1b    {z0.b}, p0, [x0]
   0x000000000000136c <+44>:    st1b    {z1.b}, p1, [x0, #1, mul vl]
   0x0000000000001370 <+48>:    ret
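
To make the predication idea above concrete, here is a minimal stand-alone sketch (not part of the patch; assumes an SVE-enabled toolchain, e.g. -march=armv8-a+sve) of what the whilelo/ld1b/st1b pattern does: the predicate disables every lane at or beyond count, so even a count of zero needs no branch.

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

// Copies min(count, svcntb()) bytes without branching on count: lanes with
// index >= count are inactive, so they are neither loaded nor stored.
void predicated_copy_one_vector(uint8_t *dst, const uint8_t *src, size_t count) {
  svbool_t active = svwhilelt_b8_u64(0, count); // lane i is active iff i < count
  svuint8_t data = svld1_u8(active, src);       // inactive lanes are zeroed, not read
  svst1_u8(active, dst, data);                  // inactive lanes are not written
}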

@llvmbot llvmbot added the libc label Nov 11, 2025
@llvmbot
Member

llvmbot commented Nov 11, 2025

@llvm/pr-subscribers-libc

Author: Schrodinger ZHU Yifan (SchrodingerZhu)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/167446.diff

1 file affected:

  • (modified) libc/src/string/memory_utils/aarch64/inline_memcpy.h (+27-1)
diff --git a/libc/src/string/memory_utils/aarch64/inline_memcpy.h b/libc/src/string/memory_utils/aarch64/inline_memcpy.h
index 11cf022e12b1f..7af0963c7e11e 100644
--- a/libc/src/string/memory_utils/aarch64/inline_memcpy.h
+++ b/libc/src/string/memory_utils/aarch64/inline_memcpy.h
@@ -9,15 +9,40 @@
 #define LLVM_LIBC_SRC_STRING_MEMORY_UTILS_AARCH64_INLINE_MEMCPY_H
 
 #include "src/__support/macros/attributes.h" // LIBC_INLINE
+#include "src/__support/macros/properties/cpu_features.h"
 #include "src/string/memory_utils/op_builtin.h"
 #include "src/string/memory_utils/utils.h"
 
 #include <stddef.h> // size_t
 
+#if defined(LIBC_TARGET_CPU_HAS_SVE)
+#include <arm_sve.h>
+#endif
 namespace LIBC_NAMESPACE_DECL {
-
+#if defined(LIBC_TARGET_CPU_HAS_SVE)
+[[maybe_unused, gnu::always_inline]] LIBC_INLINE void
+inline_memcpy_aarch64_sve_32_bytes(Ptr __restrict dst, CPtr __restrict src,
+                                   size_t count) {
+  auto src_ptr = reinterpret_cast<const uint8_t *>(src);
+  auto dst_ptr = reinterpret_cast<uint8_t *>(dst);
+  const size_t vlen = svcntb();
+  svbool_t less_than_count_fst = svwhilelt_b8_u64(0, count);
+  svbool_t less_than_count_snd = svwhilelt_b8_u64(vlen, count);
+  svuint8_t fst = svld1_u8(less_than_count_fst, &src_ptr[0]);
+  svuint8_t snd = svld1_u8(less_than_count_snd, &src_ptr[vlen]);
+  svst1_u8(less_than_count_fst, &dst_ptr[0], fst);
+  svst1_u8(less_than_count_snd, &dst_ptr[vlen], snd);
+}
+#endif
 [[maybe_unused]] LIBC_INLINE void
 inline_memcpy_aarch64(Ptr __restrict dst, CPtr __restrict src, size_t count) {
+#if defined(LIBC_TARGET_CPU_HAS_SVE)
+  // SVE register is at least 16 bytes, we can use it to avoid branching in
+  // small cases. Here we use 2 SVE registers to always cover cases where count
+  // <= 32.
+  if (count <= 32)
+    return inline_memcpy_aarch64_sve_32_bytes(dst, src, count);
+#else
   if (count == 0)
     return;
   if (count == 1)
@@ -34,6 +59,7 @@ inline_memcpy_aarch64(Ptr __restrict dst, CPtr __restrict src, size_t count) {
     return builtin::Memcpy<8>::head_tail(dst, src, count);
   if (count < 32)
     return builtin::Memcpy<16>::head_tail(dst, src, count);
+#endif
   if (count < 64)
     return builtin::Memcpy<32>::head_tail(dst, src, count);
   if (count < 128)

@SchrodingerZhu
Contributor Author

SchrodingerZhu commented Nov 11, 2025

[Image: benchmark_comparison chart]

@michaelrj-google
Contributor

Adding @gchatelet as the expert on memory optimizations

// SVE register is at least 16 bytes, we can use it to avoid branching in
// small cases. Here we use 2 SVE registers to always cover cases where count
// <= 32.
if (count <= 32)

@Sterling-Augustine
Contributor

Not sure how common the zero-size case is, but this introduces four memory-subsystem accesses for it. I know they won't fault, but I believe something does go out and do some work outside the processor. I would need to review how this affects cache behavior. Programmatically it should be (and is) a no-op, and obviously won't affect the abstract machine's state, but I'm less confident about the memory subsystem.

Not sure it is worth changing.

@SchrodingerZhu
Contributor Author

SchrodingerZhu commented Nov 12, 2025

@Sterling-Augustine
BTW, I know that, except for Neoverse V1 and some Fujitsu implementations, SVE implementations typically stick to 32b.

However, since much longer vectors are possible, do you think it would be useful to change this to if (count <= svcntb() * 2)?

@Sterling-Augustine
Contributor

I'm in favor. Makes for a free performance bonus when new hardware appears.
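
For illustration, a minimal stand-alone sketch of the guard being discussed (hypothetical; assumes an SVE-enabled AArch64 toolchain and is not the code in this PR):

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

// Returns true if the copy was handled by the two predicated vector copies,
// i.e. whenever count <= 2 * runtime vector length.
bool try_copy_two_vectors(uint8_t *dst, const uint8_t *src, size_t count) {
  const size_t vlen = svcntb();               // vector length in bytes (>= 16)
  if (count > vlen * 2)
    return false;                             // caller falls back to the existing path
  svbool_t p_lo = svwhilelt_b8_u64(0, count);
  svbool_t p_hi = svwhilelt_b8_u64(vlen, count);
  svuint8_t lo = svld1_u8(p_lo, src);
  svuint8_t hi = svld1_u8(p_hi, src + vlen);
  svst1_u8(p_lo, dst, lo);
  svst1_u8(p_hi, dst + vlen, hi);
  return true;
}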

@gchatelet
Contributor

I'm just seeing this patch; please give me one more day so I can review it properly 🙏

@gchatelet
Contributor

SVE vectors are at least 128-bit wide (16B).

If the CPU implementation offers wider vectors, the code becomes suboptimal, as it presumably goes through two (or more) predicated loads and stores, which would extend the access to 64B (or more).

As @Sterling-Augustine mentioned, it seems wasteful to "load"/"store" 32B if we copy 0 bytes; the operations are predicated, granted, but that doesn't mean they are free. In data centers we value latency for small sizes, so it's important to handle them swiftly.

To me it seems that SVE becomes increasingly interesting for large sizes, not for small ones.

Now an interesting experiment would be to fully write memcpy in SVE for platforms that support it.

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

void memcpy(void* dest, const void* src, size_t n) {
    uint8_t* dest_b = (uint8_t*)dest;
    const uint8_t* src_b = (const uint8_t*)src;
    size_t i = 0;
    const uint64_t vl = svcntb();
    svbool_t p;
    do {
        p = svwhilelt_b8(i, n);
        const svuint8_t vec_data = svld1_u8(p, src_b + i);
        svst1_u8(p, dest_b + i, vec_data);
        i += vl;
    } while (svptest_first(svptrue_b8(), p));
}
memcpy(void*, void const*, unsigned long):
        mov     x8, xzr
.LBB0_1:
        whilelo p0.b, x8, x2
        ld1b    { z0.b }, p0/z, [x1, x8]
        st1b    { z0.b }, p0, [x0, x8]
        incb    x8
        b.mi    .LBB0_1
        ret

My gut feeling is that it would still be beneficial to handle small sizes (up to 16B) through GPRs, as is done today, and only then switch to SVE. This way we get the best of both worlds.

I think this deserves a few experiments on different ARM platforms (Neoverse N1 and N2) before landing anyway. I've seen implementations that perform great in microbenchmarks but end up being neutral or negative in production, especially on different micro-architectures.
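
To make that suggestion concrete, a rough sketch of such a hybrid dispatch (hypothetical; the byte loop stands in for the existing GPR if-cascade, and the SVE loop mirrors the example above):

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

void hybrid_copy(uint8_t *dst, const uint8_t *src, size_t count) {
  if (count <= 16) {
    // Small sizes stay on the scalar path (the real code uses fixed-size
    // head/tail copies rather than a byte loop).
    for (size_t i = 0; i < count; ++i)
      dst[i] = src[i];
    return;
  }
  // Larger sizes go through the predicated SVE loop.
  const uint64_t vl = svcntb();
  uint64_t i = 0;
  svbool_t p = svwhilelt_b8_u64(i, count);
  do {
    svst1_u8(p, dst + i, svld1_u8(p, src + i));
    i += vl;
    p = svwhilelt_b8_u64(i, count);
  } while (svptest_first(svptrue_b8(), p));
}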

@SchrodingerZhu
Contributor Author

SchrodingerZhu commented Nov 13, 2025

For the SVE long loop, according to my observations in #167624, we still need to parallelize the loop with two or more whilelo predicates incremented by the vector length to get optimal performance in the microbenchmark.

Would it be hard to find a sweet spot for the amount of parallelism? For now, it seems that 16B vectors favor more (maybe 4-way) explicit parallelism. However, I would expect longer vectors to impose longer latency, and hence the number would change.
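
As an illustration of that kind of unrolling, a hedged sketch of a 2-way "parallel whilelo" copy loop (hypothetical; not the code from this PR or from #167624):

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

// Two independent predicate/load/store chains per iteration give the core
// more independent work to overlap; the second predicate also covers the
// tail, so no scalar remainder loop is needed.
void sve_copy_unrolled2(uint8_t *dst, const uint8_t *src, size_t n) {
  const uint64_t vl = svcntb();
  uint64_t i = 0;
  svbool_t p0 = svwhilelt_b8_u64(i, n);
  svbool_t p1 = svwhilelt_b8_u64(i + vl, n);
  while (svptest_first(svptrue_b8(), p0)) {
    svuint8_t a = svld1_u8(p0, src + i);
    svuint8_t b = svld1_u8(p1, src + i + vl);
    svst1_u8(p0, dst + i, a);
    svst1_u8(p1, dst + i + vl, b);
    i += 2 * vl;
    p0 = svwhilelt_b8_u64(i, n);
    p1 = svwhilelt_b8_u64(i + vl, n);
  }
}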

@gchatelet
Contributor

I agree. I think a better scheme is to keep the if-cascade for sizes <= 16, then use SVE to reach the next cache line, copy cache-line-wise as much as possible (possibly unrolling), and handle the remainder with SVE. This way we keep latency low for small sizes and prevent regressions when SVE moves to wider registers.

Another track to explore is the performance of MOPS on architectures that support it (I believe Armv8.8):
https://godbolt.org/z/Y5ns8xs6h
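
For reference, a rough sketch of what the MOPS path looks like (hypothetical and untested here; requires a FEAT_MOPS core and an assembler accepting -march=armv8.8-a+mops). The prologue/main/epilogue triple lets the hardware pick its own copy strategy:

#include <cstddef>

void mops_memcpy(void *dst, const void *src, size_t n) {
  // CPYFP/CPYFM/CPYFE must be issued as a sequence; all three update the
  // destination, source, and size registers in place.
  asm volatile("cpyfp [%0]!, [%1]!, %2!\n\t"
               "cpyfm [%0]!, [%1]!, %2!\n\t"
               "cpyfe [%0]!, [%1]!, %2!"
               : "+r"(dst), "+r"(src), "+r"(n)
               :
               : "memory", "cc");
}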

@gchatelet
Contributor

There are internal discussions going on about mem-functions for aarch64, can we postpone this patch until we reach a conclusion?

@SchrodingerZhu
Contributor Author

Sure
