AArch64: Add SVE2 path for convertSequences_noRepcodes #4440
lib/compress/zstd_compress.c
Outdated
#include "zstd_ldm.h"
#include "zstd_compress_superblock.h"
#include "../common/bits.h" /* ZSTD_highbit32, ZSTD_rotateRight_U64 */
#include <stdbool.h>
`<stdbool.h>` is not part of standard C90 (it was introduced in C99), so we shouldn't use this #include in this library.
One could locally define boolean values,
or simply rely on the known convention that int 0 == false and any other value == true.
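The two alternatives suggested above could look roughly like the following. This is only a sketch: `my_bool`, `any_above`, and `is_nonzero` are hypothetical names chosen for illustration and do not come from the zstd code base.

```c
#include <stddef.h>

/* Option 1: a locally defined boolean type (hypothetical names). */
typedef enum { my_false = 0, my_true = 1 } my_bool;

/* Returns my_true if any of the n values exceeds limit. */
static my_bool any_above(const unsigned* v, size_t n, unsigned limit)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (v[i] > limit) return my_true;
    }
    return my_false;
}

/* Option 2: plain int, where 0 means false and any other value means true. */
static int is_nonzero(unsigned x)
{
    return x != 0;
}
```

Both variants compile as C90: declarations sit at the start of each block, and no header beyond `<stddef.h>` is required.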
Indeed, thanks for highlighting it for me, fixed in 2849f3a.
Thanks for the submission, @arpadpanyik-arm. It's also validated via CI, using this test (which you added in a previous PR): https://github.com/facebook/zstd/blob/dev/.github/workflows/dev-short-tests.yml#L437-L438. So it's a pretty good start. My only comment here is about the selection of But is that the case? For reference, it has been my experience (through I was wondering if this experience extends to
Add an 8-way vector length agnostic (VLA) SVE2 code path for
convertSequences_noRepcodes. It works with any SVE vector length.
Performance relative to the GCC-13 Neon build, measured with `./fullbench -b18 -l5 enwik5`:
               Neon       SVE2
Neoverse-V2    before     after      uplift
GCC-13:        100.000%   103.209%   1.032x
GCC-14:        100.309%   134.872%   1.344x
GCC-15:        100.355%   134.827%   1.343x
Clang-18:      123.614%   128.565%   1.040x
Clang-19:      123.587%   132.984%   1.076x
Clang-20:      123.629%   133.023%   1.075x

               Neon       SVE2
Cortex-A720    before     after      uplift
GCC-13:        100.000%   116.032%   1.160x
GCC-14:         99.700%   116.648%   1.169x
GCC-15:        100.354%   117.047%   1.166x
Clang-18:      100.447%   116.762%   1.162x
Clang-19:      100.454%   116.627%   1.160x
Clang-20:      100.452%   116.649%   1.161x
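For readers checking the numbers, the sketch below shows how the columns appear to relate. It assumes that each percentage is a throughput normalized to the GCC-13 Neon result on the same CPU, and that the uplift is the SVE2 figure divided by the Neon figure for the same compiler. The rows above are consistent with that reading, but the helper names `relative_pct` and `uplift` are hypothetical.

```c
/* Throughput expressed as a percentage of a baseline throughput. */
static double relative_pct(double throughput, double baseline)
{
    return 100.0 * throughput / baseline;
}

/* Speed-up of the SVE2 path over the Neon path for one compiler. */
static double uplift(double neon_pct, double sve2_pct)
{
    return sve2_pct / neon_pct;
}
```

For example, `uplift(123.614, 128.565)` is about 1.040, matching the Clang-18 row for Neoverse-V2.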
36ccef5 to 2849f3a
The proposed SVE implementation relies on the two-register TBL instruction and therefore requires SVE2. It also differs from the Neon version: the SVE code performs efficient vector comparisons and branches on the results, whereas the Neon implementation updates indices in a fully vectorized manner. With a well-predicted branch, the SVE2 variant achieves the same effect with fewer instructions and scales naturally to wider vector lengths.

Some benchmarks on client CPUs normalized to their Neon code path compiled with GCC-15:

GCC-15 emits suboptimal Neon code: it inserts extra vector moves to make TBL operands consecutive. This penalizes Cortex-A710, which lacks zero-latency vector register copies; newer Cortex-A cores mitigate this. Clang-20 produces markedly better Neon code in this case. By contrast, SVE2 code generation is strong in both compilers. We also applied aggressive unrolling, since on Cortex-X and Neoverse-V, SVE comparisons can issue on a single vector pipeline.

In a 128-bit configuration, SVE is expected to deliver performance similar to Neon. Where the algorithm can take advantage of SVE's extended instruction set and capabilities, it can readily outperform Neon. I think SVE2 is at least as good as Neon in this case.

You can learn more about the aforementioned CPUs and their features in the Software Optimisation Guides here: https://developer.arm.com/documentation#f-navigationhierarchiescontenttype=Software%20Optimization%20Guide&numberOfResults=48
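To make the "compare, then branch on the result" strategy concrete, here is a scalar C sketch of the idea. It is not the SVE2 code from this PR and does not use the zstd types: `InSeq`, `OutSeq`, `convert_one`, `convert_all`, and the constants `OFF_BIAS` and `MIN_MATCH` are simplified assumptions made for illustration, and `unsigned` is assumed to be at least 32 bits wide. Each group of 8 sequences is converted unconditionally while an overflow flag is accumulated; only when that flag is set (expected to be rare, hence well predicted) does the code take the branch that locates the oversized length.

```c
#include <stddef.h>

/* Hypothetical, simplified types: NOT the zstd definitions. */
typedef struct { unsigned offset, litLength, matchLength; } InSeq;
typedef struct { unsigned offBase; unsigned short litLength, mlBase; } OutSeq;

#define LANES     8u   /* scalar stand-in for one 8-way vector step */
#define OFF_BIAS  3u   /* assumed offset bias, for illustration only */
#define MIN_MATCH 3u   /* assumed minimum match length (matchLength >= 3) */

/* Convert one sequence; returns nonzero if a length does not fit in 16 bits. */
static unsigned convert_one(OutSeq* dst, const InSeq* src)
{
    unsigned const ml = src->matchLength - MIN_MATCH;
    dst->offBase   = src->offset + OFF_BIAS;
    dst->litLength = (unsigned short)src->litLength;
    dst->mlBase    = (unsigned short)ml;
    return (src->litLength | ml) >> 16;
}

/* Returns 1 + index of the last sequence with an oversized length, or 0. */
static size_t convert_all(OutSeq* dst, const InSeq* src, size_t n)
{
    size_t longLen = 0;
    size_t i = 0;
    size_t k;

    for (; i + LANES <= n; i += LANES) {
        unsigned overflow = 0;
        /* Fast path: convert the whole group, accumulate one flag. */
        for (k = 0; k < LANES; k++)
            overflow |= convert_one(&dst[i + k], &src[i + k]);
        /* Rare path: branch only when some lane overflowed. */
        if (overflow) {
            for (k = 0; k < LANES; k++) {
                unsigned const ml = src[i + k].matchLength - MIN_MATCH;
                if ((src[i + k].litLength | ml) >> 16)
                    longLen = i + k + 1;
            }
        }
    }
    /* Tail: fewer than LANES sequences remain. */
    for (; i < n; i++) {
        if (convert_one(&dst[i], &src[i]))
            longLen = i + 1;
    }
    return longLen;
}
```

In a real SVE2 version the inner group loop would be a handful of vector instructions and the flag test a predicate check; the fully vectorized Neon approach described above instead tracks the index without branching.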