[libc][SVE] add sve handling for memcpy with count less than 32b #167446
Conversation
@llvm/pr-subscribers-libc
Author: Schrodinger ZHU Yifan (SchrodingerZhu)
Changes
Full diff: https://github.com/llvm/llvm-project/pull/167446.diff (1 file affected):
diff --git a/libc/src/string/memory_utils/aarch64/inline_memcpy.h b/libc/src/string/memory_utils/aarch64/inline_memcpy.h
index 11cf022e12b1f..7af0963c7e11e 100644
--- a/libc/src/string/memory_utils/aarch64/inline_memcpy.h
+++ b/libc/src/string/memory_utils/aarch64/inline_memcpy.h
@@ -9,15 +9,40 @@
#define LLVM_LIBC_SRC_STRING_MEMORY_UTILS_AARCH64_INLINE_MEMCPY_H
#include "src/__support/macros/attributes.h" // LIBC_INLINE
+#include "src/__support/macros/properties/cpu_features.h"
#include "src/string/memory_utils/op_builtin.h"
#include "src/string/memory_utils/utils.h"
#include <stddef.h> // size_t
+#if defined(LIBC_TARGET_CPU_HAS_SVE)
+#include <arm_sve.h>
+#endif
namespace LIBC_NAMESPACE_DECL {
-
+#if defined(LIBC_TARGET_CPU_HAS_SVE)
+[[maybe_unused, gnu::always_inline]] LIBC_INLINE void
+inline_memcpy_aarch64_sve_32_bytes(Ptr __restrict dst, CPtr __restrict src,
+ size_t count) {
+ auto src_ptr = reinterpret_cast<const uint8_t *>(src);
+ auto dst_ptr = reinterpret_cast<uint8_t *>(dst);
+ const size_t vlen = svcntb();
+ svbool_t less_than_count_fst = svwhilelt_b8_u64(0, count);
+ svbool_t less_than_count_snd = svwhilelt_b8_u64(vlen, count);
+ svuint8_t fst = svld1_u8(less_than_count_fst, &src_ptr[0]);
+ svuint8_t snd = svld1_u8(less_than_count_snd, &src_ptr[vlen]);
+ svst1_u8(less_than_count_fst, &dst_ptr[0], fst);
+ svst1_u8(less_than_count_snd, &dst_ptr[vlen], snd);
+}
+#endif
[[maybe_unused]] LIBC_INLINE void
inline_memcpy_aarch64(Ptr __restrict dst, CPtr __restrict src, size_t count) {
+#if defined(LIBC_TARGET_CPU_HAS_SVE)
+ // SVE register is at least 16 bytes, we can use it to avoid branching in
+ // small cases. Here we use 2 SVE registers to always cover cases where count
+ // <= 32.
+ if (count <= 32)
+ return inline_memcpy_aarch64_sve_32_bytes(dst, src, count);
+#else
if (count == 0)
return;
if (count == 1)
@@ -34,6 +59,7 @@ inline_memcpy_aarch64(Ptr __restrict dst, CPtr __restrict src, size_t count) {
return builtin::Memcpy<8>::head_tail(dst, src, count);
if (count < 32)
return builtin::Memcpy<16>::head_tail(dst, src, count);
+#endif
if (count < 64)
return builtin::Memcpy<32>::head_tail(dst, src, count);
if (count < 128)
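For readers unfamiliar with SVE predication, a minimal standalone sketch of the idea the patch relies on (illustrative only; the function name and scalar wrapper below are not from the patch):

  #include <arm_sve.h>
  #include <cstddef>
  #include <cstdint>

  // Copies up to two SVE vectors' worth of bytes without branches: svwhilelt
  // builds predicates that enable only the first `count` byte lanes, and
  // inactive lanes are neither loaded nor stored.
  void copy_up_to_two_vectors(uint8_t *dst, const uint8_t *src, size_t count) {
    const size_t vlen = svcntb(); // bytes per SVE vector, 16 on a 128-bit part
    // Example with vlen == 16 and count == 20: p0 enables lanes 0..15,
    // p1 enables lanes 16..19 only; a zero count leaves both all-false.
    svbool_t p0 = svwhilelt_b8_u64(0, count);
    svbool_t p1 = svwhilelt_b8_u64(vlen, count);
    svuint8_t lo = svld1_u8(p0, &src[0]);
    svuint8_t hi = svld1_u8(p1, &src[vlen]);
    svst1_u8(p0, &dst[0], lo);
    svst1_u8(p1, &dst[vlen], hi);
  }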
Adding @gchatelet as the expert on memory optimizations.
// SVE register is at least 16 bytes, we can use it to avoid branching in
// small cases. Here we use 2 SVE registers to always cover cases where count
// <= 32.
if (count <= 32)
Not sure how common the zero-size case is, but this introduces four memory-subsystem accesses for it. I know they won't fault, but I believe something does go out and do some work outside the processor; I would need to review how this affects cache behavior. Programmatically it should be (and is) a no-op, and it obviously won't affect the abstract machine's state, but I'm less confident about the memory subsystem.
Not sure it is worth changing.
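A minimal sketch of one way to address that concern, assuming an explicit early return is added in front of the SVE path (hypothetical; the patch as posted does not do this):

  // Hypothetical variant: bail out before the SVE path so a zero-length copy
  // issues no predicated loads or stores at all.
  if (count == 0)
    return;
  if (count <= 32)
    return inline_memcpy_aarch64_sve_32_bytes(dst, src, count);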
@Sterling-Augustine However, since much longer vectors are possible, do you think it may be useful to change this to ...
I'm in favor. Makes for a free performance bonus when new hardware appears.
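One possible reading of that suggestion, sketched under the assumption that the cutoff is tied to the runtime vector length via svcntb() rather than hard-coded (hypothetical; the exact change being discussed is not visible above):

  // Hypothetical variant: scale the cutoff with the hardware vector length.
  // Two whilelt-predicated vectors cover exactly 2 * svcntb() bytes, so wider
  // SVE implementations would handle larger copies branch-free.
  const size_t vlen = svcntb(); // 16 on 128-bit SVE, more on wider parts
  if (count <= 2 * vlen)
    return inline_memcpy_aarch64_sve_32_bytes(dst, src, count);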
I'm just seeing this patch; please give me one more day so I can review it properly 🙏
SVE vectors are at least 128-bit wide (16B). In case the CPU implementation offers wider vectors, the code becomes suboptimal, as it presumably still goes through two (or more) predicated loads and stores, which would extend the access to 64B (or more). As @Sterling-Augustine mentioned, it seems wasteful to "load"/"store" 32B if we copy only a few bytes. To me it seems that SVE becomes increasingly interesting for large sizes, not for small ones.

Now an interesting experiment would be to fully write memcpy with SVE:

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

void memcpy(void* dest, const void* src, size_t n) {
  uint8_t* dest_b = (uint8_t*)dest;
  const uint8_t* src_b = (const uint8_t*)src;
  size_t i = 0;
  const uint64_t vl = svcntb();
  svbool_t p;
  do {
    p = svwhilelt_b8(i, n);
    const svuint8_t vec_data = svld1_u8(p, src_b + i);
    svst1_u8(p, dest_b + i, vec_data);
    i += vl;
  } while (svptest_first(svptrue_b8(), p));
}

which compiles down to:

memcpy(void*, void const*, unsigned long):
        mov     x8, xzr
.LBB0_1:
        whilelo p0.b, x8, x2
        ld1b    { z0.b }, p0/z, [x1, x8]
        st1b    { z0.b }, p0, [x0, x8]
        incb    x8
        b.mi    .LBB0_1
        ret

My gut feeling is that it would still be beneficial to handle small sizes (up to 16B) through GPRs as is done today, and only then to switch to SVE. This way we get the best of both worlds. I think this deserves a few experiments on different ARM platforms (Neoverse N1 and N2) before landing anyway. I've seen implementations that perform great on microbenchmarks but end up being neutral or negative in production, especially on different micro-architectures.
For the SVE long loop, according to my observations in #167624, we still need to parallelize the loop with two or more whilelo predicates incremented by vlen to get optimal performance in the microbenchmark. Would it be hard to find a sweet spot for the amount of parallelism? For now, it seems that 16B vectors favor more (maybe 4-way) explicit parallelism; however, I would expect longer vectors to impose longer latencies, and hence the number would change.
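A rough sketch of what such a two-way parallelized predicated loop could look like (hypothetical function name; this is not the code from #167624, just an illustration of the idea):

  #include <arm_sve.h>
  #include <cstddef>
  #include <cstdint>

  // Two independent predicate/load/store chains per iteration to expose more
  // memory-level parallelism; inactive lanes in the tail are simply skipped.
  void sve_copy_unrolled2(uint8_t *dst, const uint8_t *src, size_t n) {
    const size_t vl = svcntb();
    size_t i = 0;
    while (i < n) {
      svbool_t p0 = svwhilelt_b8_u64(i, n);
      svbool_t p1 = svwhilelt_b8_u64(i + vl, n);
      svuint8_t v0 = svld1_u8(p0, src + i);
      svuint8_t v1 = svld1_u8(p1, src + i + vl);
      svst1_u8(p0, dst + i, v0);
      svst1_u8(p1, dst + i + vl, v1);
      i += 2 * vl;
    }
  }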
I agree. I think a better scheme is to keep the if-cascade for small sizes. Another track to explore is the performance of MOPS for architectures that support it (I believe ARMv8.8).
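For reference, a sketch of what a MOPS-based copy could look like on FEAT_MOPS hardware, using inline assembly for the CPYFP/CPYFM/CPYFE sequence (assumed details: building with MOPS enabled, e.g. -march=armv8.8-a+mops; untested illustration, not part of this patch):

  #include <cstddef>

  // Hypothetical MOPS-based memcpy: the prologue/main/epilogue instruction
  // triple lets the CPU choose its own copy strategy and size handling.
  inline void copy_with_mops(void *dst, const void *src, size_t n) {
    asm volatile("cpyfp [%0]!, [%1]!, %2!\n\t"
                 "cpyfm [%0]!, [%1]!, %2!\n\t"
                 "cpyfe [%0]!, [%1]!, %2!"
                 : "+r"(dst), "+r"(src), "+r"(n)
                 :
                 : "memory", "cc");
  }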
There are internal discussions going on about mem-functions for aarch64; can we postpone this patch until we reach a conclusion?
Sure

Add an SVE optimization for AArch64 architectures. The idea is to use predicate registers to avoid branching.
The microbenchmark in the repo shows considerable improvements on NV GB10 (locked to the largest X925 core):
The beginning of the memcpy function looks like the following: