
Conversation

@SchrodingerZhu
Contributor

@SchrodingerZhu SchrodingerZhu commented Nov 11, 2025

Add an SVE optimization for AArch64 architectures. The idea is to use predicate registers to avoid branching.
The microbenchmark in the repo shows considerable improvements on an NV GB10 (locked to the largest X925 core):

======================================================================
BENCHMARK STATISTICS (time in nanoseconds)
======================================================================

memcpy_Google_A:
  Old - Mean: 3.1257 ns, Median: 3.1162 ns
  New - Mean: 2.8402 ns, Median: 2.8265 ns
  Improvement: +9.14% (mean), +9.30% (median)

memcpy_Google_B:
  Old - Mean: 2.3171 ns, Median: 2.3159 ns
  New - Mean: 1.6589 ns, Median: 1.6593 ns
  Improvement: +28.40% (mean), +28.35% (median)

memcpy_Google_D:
  Old - Mean: 8.7602 ns, Median: 8.7645 ns
  New - Mean: 8.4307 ns, Median: 8.4308 ns
  Improvement: +3.76% (mean), +3.81% (median)

memcpy_Google_L:
  Old - Mean: 1.7137 ns, Median: 1.7091 ns
  New - Mean: 1.4530 ns, Median: 1.4553 ns
  Improvement: +15.22% (mean), +14.85% (median)

memcpy_Google_M:
  Old - Mean: 1.9823 ns, Median: 1.9825 ns
  New - Mean: 1.4826 ns, Median: 1.4840 ns
  Improvement: +25.20% (mean), +25.15% (median)

memcpy_Google_Q:
  Old - Mean: 1.6812 ns, Median: 1.6784 ns
  New - Mean: 1.1538 ns, Median: 1.1517 ns
  Improvement: +31.37% (mean), +31.38% (median)

memcpy_Google_S:
  Old - Mean: 2.1816 ns, Median: 2.1786 ns
  New - Mean: 1.6297 ns, Median: 1.6287 ns
  Improvement: +25.29% (mean), +25.24% (median)

memcpy_Google_U:
  Old - Mean: 2.2851 ns, Median: 2.2825 ns
  New - Mean: 1.7219 ns, Median: 1.7187 ns
  Improvement: +24.65% (mean), +24.70% (median)

memcpy_Google_W:
  Old - Mean: 2.0408 ns, Median: 2.0361 ns
  New - Mean: 1.5260 ns, Median: 1.5252 ns
  Improvement: +25.23% (mean), +25.09% (median)

uniform_384_to_4096:
  Old - Mean: 26.9067 ns, Median: 26.8845 ns
  New - Mean: 26.8083 ns, Median: 26.8149 ns
  Improvement: +0.37% (mean), +0.26% (median)

The beginning of the memcpy function looks like the following:

Dump of assembler code for function _ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm:
   0x0000000000001340 <+0>:     cbz     x2, 0x143c <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+252>
   0x0000000000001344 <+4>:     cbz     x0, 0x1440 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+256>
   0x0000000000001348 <+8>:     cbz     x1, 0x1444 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+260>
   0x000000000000134c <+12>:    subs    x8, x2, #0x20
   0x0000000000001350 <+16>:    b.hi    0x1374 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+52>  // b.pmore
   0x0000000000001354 <+20>:    rdvl    x8, #1
   0x0000000000001358 <+24>:    whilelo p0.b, xzr, x2
   0x000000000000135c <+28>:    ld1b    {z0.b}, p0/z, [x1]
   0x0000000000001360 <+32>:    whilelo p1.b, x8, x2
   0x0000000000001364 <+36>:    ld1b    {z1.b}, p1/z, [x1, #1, mul vl]
   0x0000000000001368 <+40>:    st1b    {z0.b}, p0, [x0]
   0x000000000000136c <+44>:    st1b    {z1.b}, p1, [x0, #1, mul vl]
   0x0000000000001370 <+48>:    ret
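
To make the predication idea above concrete, here is a minimal stand-alone sketch (not part of the patch; assumes an SVE-enabled toolchain, e.g. -march=armv8-a+sve) of what the whilelo/ld1b/st1b pattern does: the predicate disables every lane at or beyond count, so even a count of zero needs no branch.

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

// Copies min(count, svcntb()) bytes without branching on count: lanes with
// index >= count are inactive, so they are neither loaded nor stored.
void predicated_copy_one_vector(uint8_t *dst, const uint8_t *src, size_t count) {
  svbool_t active = svwhilelt_b8_u64(0, count); // lane i is active iff i < count
  svuint8_t data = svld1_u8(active, src);       // inactive lanes are zeroed, not read
  svst1_u8(active, dst, data);                  // inactive lanes are not written
}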

@llvmbot llvmbot added the libc label Nov 11, 2025
@llvmbot
Member

llvmbot commented Nov 11, 2025

@llvm/pr-subscribers-libc

Author: Schrodinger ZHU Yifan (SchrodingerZhu)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/167446.diff

1 file affected:

  • (modified) libc/src/string/memory_utils/aarch64/inline_memcpy.h (+27-1)
diff --git a/libc/src/string/memory_utils/aarch64/inline_memcpy.h b/libc/src/string/memory_utils/aarch64/inline_memcpy.h
index 11cf022e12b1f..7af0963c7e11e 100644
--- a/libc/src/string/memory_utils/aarch64/inline_memcpy.h
+++ b/libc/src/string/memory_utils/aarch64/inline_memcpy.h
@@ -9,15 +9,40 @@
 #define LLVM_LIBC_SRC_STRING_MEMORY_UTILS_AARCH64_INLINE_MEMCPY_H
 
 #include "src/__support/macros/attributes.h" // LIBC_INLINE
+#include "src/__support/macros/properties/cpu_features.h"
 #include "src/string/memory_utils/op_builtin.h"
 #include "src/string/memory_utils/utils.h"
 
 #include <stddef.h> // size_t
 
+#if defined(LIBC_TARGET_CPU_HAS_SVE)
+#include <arm_sve.h>
+#endif
 namespace LIBC_NAMESPACE_DECL {
-
+#if defined(LIBC_TARGET_CPU_HAS_SVE)
+[[maybe_unused, gnu::always_inline]] LIBC_INLINE void
+inline_memcpy_aarch64_sve_32_bytes(Ptr __restrict dst, CPtr __restrict src,
+                                   size_t count) {
+  auto src_ptr = reinterpret_cast<const uint8_t *>(src);
+  auto dst_ptr = reinterpret_cast<uint8_t *>(dst);
+  const size_t vlen = svcntb();
+  svbool_t less_than_count_fst = svwhilelt_b8_u64(0, count);
+  svbool_t less_than_count_snd = svwhilelt_b8_u64(vlen, count);
+  svuint8_t fst = svld1_u8(less_than_count_fst, &src_ptr[0]);
+  svuint8_t snd = svld1_u8(less_than_count_snd, &src_ptr[vlen]);
+  svst1_u8(less_than_count_fst, &dst_ptr[0], fst);
+  svst1_u8(less_than_count_snd, &dst_ptr[vlen], snd);
+}
+#endif
 [[maybe_unused]] LIBC_INLINE void
 inline_memcpy_aarch64(Ptr __restrict dst, CPtr __restrict src, size_t count) {
+#if defined(LIBC_TARGET_CPU_HAS_SVE)
+  // SVE register is at least 16 bytes, we can use it to avoid branching in
+  // small cases. Here we use 2 SVE registers to always cover cases where count
+  // <= 32.
+  if (count <= 32)
+    return inline_memcpy_aarch64_sve_32_bytes(dst, src, count);
+#else
   if (count == 0)
     return;
   if (count == 1)
@@ -34,6 +59,7 @@ inline_memcpy_aarch64(Ptr __restrict dst, CPtr __restrict src, size_t count) {
     return builtin::Memcpy<8>::head_tail(dst, src, count);
   if (count < 32)
     return builtin::Memcpy<16>::head_tail(dst, src, count);
+#endif
   if (count < 64)
     return builtin::Memcpy<32>::head_tail(dst, src, count);
   if (count < 128)

@SchrodingerZhu
Contributor Author

SchrodingerZhu commented Nov 11, 2025

[Image: benchmark_comparison chart]

@michaelrj-google
Contributor

Adding @gchatelet as the expert on memory optimizations

// SVE register is at least 16 bytes, we can use it to avoid branching in
// small cases. Here we use 2 SVE registers to always cover cases where count
// <= 32.
if (count <= 32)

@Sterling-Augustine
Contributor

Not sure how common the zero-size case is, but this introduces four memory-subsystem accesses for it. I know they won't fault, but I believe something does go out and do some work outside the processor. I would need to review how this affects cache behavior. Programmatically it should be (and is) a no-op, and obviously won't affect the abstract machine's state, but I'm less confident about the memory subsystem.

Not sure it is worth changing.

@SchrodingerZhu
Contributor Author

SchrodingerZhu commented Nov 12, 2025

@Sterling-Augustine
BTW, I know that, except for Neoverse V1 and some Fujitsu implementations, SVE implementations typically stick to 32b.

However, since much longer vectors are possible, do you think it would be useful to change this to if (count <= svcntb() * 2)?

@Sterling-Augustine
Contributor

I'm in favor. Makes for a free performance bonus when new hardware appears.
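
For illustration, a minimal stand-alone sketch of the guard being discussed (hypothetical; assumes an SVE-enabled AArch64 toolchain and is not the code in this PR):

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

// Returns true if the copy was handled by the two predicated vector copies,
// i.e. whenever count <= 2 * runtime vector length.
bool try_copy_two_vectors(uint8_t *dst, const uint8_t *src, size_t count) {
  const size_t vlen = svcntb();               // vector length in bytes (>= 16)
  if (count > vlen * 2)
    return false;                             // caller falls back to the existing path
  svbool_t p_lo = svwhilelt_b8_u64(0, count);
  svbool_t p_hi = svwhilelt_b8_u64(vlen, count);
  svuint8_t lo = svld1_u8(p_lo, src);
  svuint8_t hi = svld1_u8(p_hi, src + vlen);
  svst1_u8(p_lo, dst, lo);
  svst1_u8(p_hi, dst + vlen, hi);
  return true;
}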

@gchatelet
Contributor

I'm just seeing this patch; please give me one more day so I can review it properly 🙏

@gchatelet
Contributor

SVE vectors are at least 128-bit wide (16B).

If the CPU implementation offers wider vectors, the code becomes suboptimal, as it presumably goes through two (or more) predicated loads and stores, which would extend the access to 64B (or more).

As @Sterling-Augustine mentioned, it seems wasteful to "load"/"store" 32B if we copy 0 bytes; the operations are predicated, granted, but that doesn't mean they are free. In data centers we value latency for small sizes, so it's important to handle them swiftly.

To me it seems that SVE becomes increasingly interesting for large sizes, not for small ones.

Now an interesting experiment would be to fully write memcpy in SVE for platforms that support it.

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

void memcpy(void* dest, const void* src, size_t n) {
    uint8_t* dest_b = (uint8_t*)dest;
    const uint8_t* src_b = (const uint8_t*)src;
    size_t i = 0;
    const uint64_t vl = svcntb();
    svbool_t p;
    do {
        p = svwhilelt_b8(i, n);
        const svuint8_t vec_data = svld1_u8(p, src_b + i);
        svst1_u8(p, dest_b + i, vec_data);
        i += vl;
    } while (svptest_first(svptrue_b8(), p));
}
memcpy(void*, void const*, unsigned long):
        mov     x8, xzr
.LBB0_1:
        whilelo p0.b, x8, x2
        ld1b    { z0.b }, p0/z, [x1, x8]
        st1b    { z0.b }, p0, [x0, x8]
        incb    x8
        b.mi    .LBB0_1
        ret

My gut feeling is that it would still be beneficial to handle small sizes (up to 16B) through GPRs, as is done today, and only then switch to SVE. This way we get the best of both worlds.

I think this deserves a few experiments on different ARM platforms (Neoverse N1 and N2) before landing anyway. I've seen implementations that perform great in microbenchmarks but end up being neutral or negative in production, especially on different micro-architectures.
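
To make that suggestion concrete, a rough sketch of such a hybrid dispatch (hypothetical; the byte loop stands in for the existing GPR if-cascade, and the SVE loop mirrors the example above):

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

void hybrid_copy(uint8_t *dst, const uint8_t *src, size_t count) {
  if (count <= 16) {
    // Small sizes stay on the scalar path (the real code uses fixed-size
    // head/tail copies rather than a byte loop).
    for (size_t i = 0; i < count; ++i)
      dst[i] = src[i];
    return;
  }
  // Larger sizes go through the predicated SVE loop.
  const uint64_t vl = svcntb();
  uint64_t i = 0;
  svbool_t p = svwhilelt_b8_u64(i, count);
  do {
    svst1_u8(p, dst + i, svld1_u8(p, src + i));
    i += vl;
    p = svwhilelt_b8_u64(i, count);
  } while (svptest_first(svptrue_b8(), p));
}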

@SchrodingerZhu
Contributor Author

SchrodingerZhu commented Nov 13, 2025

For the SVE long loop, according to my observations in #167624, we still need to parallelize the loop with two or more whilelo predicates incremented by the vector length to get optimal performance in the microbenchmark.

Would it be hard to find a sweet spot for the amount of parallelism? For now, it seems that 16B vectors favor more (maybe 4-way) explicit parallelism. However, I would expect longer vectors to impose longer latency, and hence the number would change.
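
As an illustration of that kind of unrolling, a hedged sketch of a 2-way "parallel whilelo" copy loop (hypothetical; not the code from this PR or from #167624):

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

// Two independent predicate/load/store chains per iteration give the core
// more independent work to overlap; the second predicate also covers the
// tail, so no scalar remainder loop is needed.
void sve_copy_unrolled2(uint8_t *dst, const uint8_t *src, size_t n) {
  const uint64_t vl = svcntb();
  uint64_t i = 0;
  svbool_t p0 = svwhilelt_b8_u64(i, n);
  svbool_t p1 = svwhilelt_b8_u64(i + vl, n);
  while (svptest_first(svptrue_b8(), p0)) {
    svuint8_t a = svld1_u8(p0, src + i);
    svuint8_t b = svld1_u8(p1, src + i + vl);
    svst1_u8(p0, dst + i, a);
    svst1_u8(p1, dst + i + vl, b);
    i += 2 * vl;
    p0 = svwhilelt_b8_u64(i, n);
    p1 = svwhilelt_b8_u64(i + vl, n);
  }
}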

@gchatelet
Contributor

I agree. I think a better scheme is to keep the if-cascade for sizes <= 16, then use SVE to reach the next cache line, copy cache-line-wise as much as possible (possibly unrolling), and handle the remainder with SVE. This way we keep latency low for small sizes and prevent regressions when SVE moves to wider registers.

Another track to explore is the performance of MOPS on architectures that support it (I believe Armv8.8):
https://godbolt.org/z/Y5ns8xs6h
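
For reference, a rough sketch of what the MOPS path looks like (hypothetical and untested here; requires a FEAT_MOPS core and an assembler accepting -march=armv8.8-a+mops). The prologue/main/epilogue triple lets the hardware pick its own copy strategy:

#include <cstddef>

void mops_memcpy(void *dst, const void *src, size_t n) {
  // CPYFP/CPYFM/CPYFE must be issued as a sequence; all three update the
  // destination, source, and size registers in place.
  asm volatile("cpyfp [%0]!, [%1]!, %2!\n\t"
               "cpyfm [%0]!, [%1]!, %2!\n\t"
               "cpyfe [%0]!, [%1]!, %2!"
               : "+r"(dst), "+r"(src), "+r"(n)
               :
               : "memory", "cc");
}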

@gchatelet
Contributor

There are internal discussions going on about mem-functions for aarch64, can we postpone this patch until we reach a conclusion?

@SchrodingerZhu
Contributor Author

Sure
