AArch64: Add SVE2 path for convertSequences_noRepcodes #4440

arpadpanyik-arm · 2025-07-17T09:07:38Z

Add an 8-way vector length agnostic (VLA) SVE2 code path for convertSequences_noRepcodes. It works with any SVE vector length.

Performance relative to the GCC-13 Neon baseline, measured with: ./fullbench -b18 -l5 enwik5

               Neon      SVE2
Neoverse-V2   before     after    uplift
GCC-13:      100.000%  103.209%   1.032x
GCC-14:      100.309%  134.872%   1.344x
GCC-15:      100.355%  134.827%   1.343x
Clang-18:    123.614%  128.565%   1.040x
Clang-19:    123.587%  132.984%   1.076x
Clang-20:    123.629%  133.023%   1.075x

               Neon      SVE2
Cortex-A720   before     after    uplift
GCC-13:      100.000%  116.032%   1.160x
GCC-14:       99.700%  116.648%   1.169x
GCC-15:      100.354%  117.047%   1.166x
Clang-18:    100.447%  116.762%   1.162x
Clang-19:    100.454%  116.627%   1.160x
Clang-20:    100.452%  116.649%   1.161x
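
As a rough illustration of what "vector length agnostic" means in the description above, the following scalar C model processes `lanes` elements per iteration and masks off the tail with a per-lane predicate, so the same loop gives the same result for any lane count. It is only a sketch: `addDeltaVLA` and its parameters are hypothetical names, the operation is a placeholder, and none of it is taken from the PR.

```c
#include <stddef.h>

/* Scalar model of a vector-length-agnostic loop (hypothetical helper).
 * `lanes` stands in for the hardware vector length in elements and must be > 0. */
static void addDeltaVLA(unsigned* data, size_t n, unsigned delta, size_t lanes)
{
    size_t base;
    for (base = 0; base < n; base += lanes) {
        size_t lane;
        for (lane = 0; lane < lanes; lane++) {
            /* Per-lane predicate: a lane is active only while base + lane < n,
             * so the final partial "vector" needs no separate scalar tail. */
            if (base + lane < n) data[base + lane] += delta;
        }
    }
}
```

Calling it with `lanes` set to 4 or to 16 on the same input yields identical output, which is the property a VLA code path relies on.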

Cyan4973 · 2025-08-19T21:37:09Z

lib/compress/zstd_compress.c

 #include "zstd_ldm.h"
 #include "zstd_compress_superblock.h"
 #include  "../common/bits.h"      /* ZSTD_highbit32, ZSTD_rotateRight_U64 */
+#include <stdbool.h>


<stdbool.h> is not part of standard C90, so we shouldn't use this #include in this library.
One could locally define boolean values,
or simply rely on the known convention that int 0 == false and any other value == true.

Indeed, thanks for highlighting it for me, fixed in 2849f3a.
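
A minimal sketch of the C90-compatible convention suggested in the review, using a plain int as the truth value. The function name and the 16-bit check are hypothetical and chosen only for illustration; the actual change in 2849f3a is not shown in this conversation.

```c
#include <stddef.h>

/* C90-compatible: no <stdbool.h>, declarations at block start, int as boolean.
 * allFitIn16Bits is a hypothetical helper, not a zstd function.
 * Returns 1 (true) when every value fits in 16 bits, 0 (false) otherwise. */
static int allFitIn16Bits(const unsigned* values, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        if (values[i] > 0xFFFFu) return 0;   /* false */
    }
    return 1;                                /* true */
}
```

Callers test the result directly, for example `if (!allFitIn16Bits(v, n)) { ... }`, which works in every C dialect.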

Cyan4973 · 2025-08-19T22:04:27Z

Thanks for the submission @arpadpanyik-arm,
the code is well located, in a single #ifdef section, making it easy to maintain.

It's also validated via CI, using this test (that you added in a previous PR): https://github.com/facebook/zstd/blob/dev/.github/workflows/dev-short-tests.yml#L437-L438

So it's a pretty good start.

My only comment here is about the selection of SVE2 as opposed to NEON:
in the proposed code, if the compiler sees both SVE2 and NEON enabled for the target architecture,
it always selects SVE2.
This strategy implies that SVE2 is expected to be always better, or at least never worse, than NEON.

But is that the case?

For reference, it has been my experience (through libxxhash) that comparing SVE and NEON performance does not lead to a straightforward conclusion.
SVE runs faster on CPUs designed with wide vector units in mind,
but a significant portion of consumer-grade aarch64 CPUs, and even a decent number of server ones,
support NEON and also advertise support for SVE.
On those CPUs, SVE is in fact just a shim layer on top of the same 128-bit NEON vector unit,
and as a consequence, SVE code actually runs slower than the NEON equivalent.

I was wondering if this experience extends to SVE2.
Is there any reason to believe that SVE2 is never worse than NEON?
Or does it depend on the specific hardware?
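
The selection strategy questioned here (SVE2 whenever the compiler enables it, NEON otherwise) can be sketched with the ACLE feature-test macros `__ARM_FEATURE_SVE2` and `__ARM_NEON`. The guards actually used in the PR are not shown in this conversation, so the macro chain, `CONVERT_PATH`, and `selectedPath` below are assumptions made for illustration.

```c
/* Compile-time path selection sketch (hypothetical names, not zstd code).
 * SVE2 is tested first, so it wins whenever both SVE2 and NEON are enabled. */
#if defined(__ARM_FEATURE_SVE2)
#  define CONVERT_PATH "sve2"
#elif defined(__ARM_NEON)
#  define CONVERT_PATH "neon"
#else
#  define CONVERT_PATH "scalar"   /* portable fallback for other targets */
#endif

/* Reports which code path this translation unit was compiled for. */
static const char* selectedPath(void)
{
    return CONVERT_PATH;
}
```

Because the choice is made by the preprocessor, it depends only on the compilation target flags and not on the CPU the binary later runs on, which is why the "never worse than NEON" question matters.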

arpadpanyik-arm · 2025-08-21T19:21:55Z

The proposed SVE implementation relies on the two-register TBL instruction and therefore requires SVE2. It also differs from the Neon version: SVE performs efficient vector comparisons and branches on the results, whereas the Neon implementation updates indices in a fully vectorized manner. With a well-predicted branch, the SVE2 variant achieves the same effect with fewer instructions and scales naturally to wider vector lengths.
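
For readers unfamiliar with the two-register TBL form mentioned above, here is a scalar C model of its behaviour: each index selects an element from the concatenation of two table registers, and an out-of-range index produces zero. This is only a sketch under stated assumptions. `tbl2Model` is a hypothetical name, byte-sized elements are assumed for simplicity, and `vl` stands in for the vector length in bytes; the element size the PR uses is not stated in this conversation.

```c
#include <stddef.h>

/* Scalar model of a two-register byte table lookup (hypothetical helper).
 * The table is tableLo followed by tableHi, each `vl` bytes long. */
static void tbl2Model(unsigned char* dst,
                      const unsigned char* tableLo,
                      const unsigned char* tableHi,
                      const unsigned char* indices,
                      size_t vl)
{
    size_t i;
    for (i = 0; i < vl; i++) {
        size_t const idx = indices[i];
        if (idx < vl)          dst[i] = tableLo[idx];        /* first register  */
        else if (idx < 2 * vl) dst[i] = tableHi[idx - vl];   /* second register */
        else                   dst[i] = 0;                   /* out of range    */
    }
}
```

With `vl` = 4, index 5 selects element 1 of the second table and index 8 yields zero.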

Some benchmarks on client CPUs, each normalized to its Neon code path compiled with GCC-15:

               GCC-15  Clang-20    GCC-15  Clang-20
                 Neon      Neon      SVE2      SVE2
Cortex-A510:  100.00%   104.12%   102.13%   122.77%
Cortex-A520:  100.00%   106.06%   115.31%   145.24%

Cortex-A710:  100.00%   128.36%   149.31%   145.17%
Cortex-A715:  100.00%   101.16%   114.07%   114.18%
Cortex-A720:  100.00%   100.16%   116.59%   116.44%

Cortex-X2:    100.00%   120.45%   122.22%   124.58%
Cortex-X3:    100.00%   123.91%   134.91%   133.32%
Cortex-X4:    100.00%   124.19%   140.32%   141.15%

GCC-15 emits suboptimal Neon code: it inserts extra vector moves to make TBL operands consecutive. This penalizes Cortex-A710, which lacks zero-latency vector register copies; newer Cortex-A cores mitigate this. Clang-20 produces markedly better Neon code in this case. By contrast, SVE2 code generation is strong in both compilers. We also applied aggressive unrolling, since on Cortex-X and Neoverse-V, SVE comparisons can issue on a single vector pipeline.

In a 128-bit configuration, SVE is expected to deliver performance similar to Neon. Where the algorithm can take advantage of SVE’s extended instruction set and capabilities, it can readily outperform Neon. I think SVE2 is at least as good as Neon in this case.

You can learn more about the aforementioned CPUs and their features in the Software Optimisation Guides here: https://developer.arm.com/documentation#f-navigationhierarchiescontenttype=Software%20Optimization%20Guide&numberOfResults=48

meta-cla bot added the CLA Signed label Jul 17, 2025

Cyan4973 self-assigned this Jul 17, 2025

Cyan4973 reviewed Aug 19, 2025

arpadpanyik-arm force-pushed the convert_seq_sve2 branch from 36ccef5 to 2849f3a on August 21, 2025 17:38

Cyan4973 approved these changes Aug 21, 2025

Cyan4973 merged commit b5c294e into facebook:dev Aug 22, 2025
104 checks passed
