@arpadpanyik-arm
Contributor

Add an 8-way vector length agnostic (VLA) SVE2 code path for convertSequences_noRepcodes. It works with any SVE vector length.

Performance relative to the GCC-13 Neon build, measured with: ./fullbench -b18 -l5 enwik5

               Neon      SVE2
Neoverse-V2   before     after    uplift
GCC-13:      100.000%  103.209%   1.032x
GCC-14:      100.309%  134.872%   1.344x
GCC-15:      100.355%  134.827%   1.343x
Clang-18:    123.614%  128.565%   1.040x
Clang-19:    123.587%  132.984%   1.076x
Clang-20:    123.629%  133.023%   1.075x

               Neon      SVE2
Cortex-A720   before     after    uplift
GCC-13:      100.000%  116.032%   1.160x
GCC-14:       99.700%  116.648%   1.169x
GCC-15:      100.354%  117.047%   1.166x
Clang-18:    100.447%  116.762%   1.162x
Clang-19:    100.454%  116.627%   1.160x
Clang-20:    100.452%  116.649%   1.161x

@meta-cla meta-cla bot added the CLA Signed label Jul 17, 2025
@Cyan4973 Cyan4973 self-assigned this Jul 17, 2025
#include "zstd_ldm.h"
#include "zstd_compress_superblock.h"
#include "../common/bits.h" /* ZSTD_highbit32, ZSTD_rotateRight_U64 */
#include <stdbool.h>
Contributor

This is not part of standard C90, so we shouldn't use this #include in this library.
One could locally define boolean values,
or simply rely on the C convention that an int value of 0 is false and any other value is true.
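
Both options mentioned in this comment can be sketched in a few lines of C90. The names `zs_bool`, `ZS_FALSE`, `ZS_TRUE`, `exceeds_u16` and `count_oversized` are hypothetical and chosen only for illustration; they are not taken from the zstd sources, and the actual fix in the PR may look different.

```c
/* Hypothetical names for illustration only; not from the zstd sources. */

/* Option 1: a locally defined boolean type that is valid in C90. */
typedef int zs_bool;
#define ZS_FALSE 0
#define ZS_TRUE  1

/* Returns ZS_TRUE when value does not fit in 16 bits. */
static zs_bool exceeds_u16(unsigned long value)
{
    return (value > 0xFFFFUL) ? ZS_TRUE : ZS_FALSE;
}

/* Option 2: the plain-int convention. The result is tested directly
 * (0 is false, anything else is true) and never compared to ZS_TRUE. */
static int count_oversized(const unsigned long* values, int n)
{
    int i;
    int count = 0;
    for (i = 0; i < n; i++) {
        if (exceeds_u16(values[i])) count++;
    }
    return count;
}
```

Either form avoids `<stdbool.h>`, which only became available with C99.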

Contributor Author

Indeed, thanks for highlighting it for me, fixed in 2849f3a.

@Cyan4973
Contributor

Cyan4973 commented Aug 19, 2025

Thanks for the submission @arpadpanyik-arm,
the code is well contained in a single #ifdef section, which makes it easy to maintain.

It's also validated via CI, using this test (that you added in a previous PR): https://github.com/facebook/zstd/blob/dev/.github/workflows/dev-short-tests.yml#L437-L438

So it's a pretty good start.

My only comment here is about the selection of SVE2 as opposed to NEON:
in the proposed code, if the compiler sees both SVE2 and NEON enabled for the target architecture,
it always selects SVE2.
This strategy implies that SVE2 is expected to be always better, or at least never worse, than NEON.

But is that the case?

For reference, my experience (through libxxhash) is that comparing the performance of SVE and NEON code does not lead to a straightforward conclusion.
SVE runs faster on CPUs designed with wide vector units in mind,
but a significant portion of consumer-grade aarch64 CPUs, and even a decent number of server ones,
support NEON and also advertise support for SVE.
On those CPUs, SVE is in fact just a shim layer on top of the same 128-bit NEON vector unit,
and as a consequence, SVE code actually runs slower than its NEON counterpart.

I was wondering if this experience extends to SVE2.
Is there any reason to believe that SVE2 is never worse than NEON?
Or does it depend on the specific hardware?
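
The compile-time priority described in this comment (SVE2 first, then NEON, then a scalar fallback) can be sketched as follows. This is a minimal illustration that assumes the standard ACLE feature macros `__ARM_FEATURE_SVE2` and `__ARM_NEON`; the macro `EXAMPLE_SEQ_PATH` and the function `example_selected_path` are hypothetical, and the actual guards in zstd may differ.

```c
/* Assumption: the compiler defines the ACLE feature macros according to
 * the target flags (for example -march=armv9-a enables SVE2).
 * EXAMPLE_SEQ_PATH and example_selected_path are hypothetical names. */
#if defined(__ARM_FEATURE_SVE2)
#  define EXAMPLE_SEQ_PATH "sve2"    /* preferred whenever it is enabled */
#elif defined(__ARM_NEON)
#  define EXAMPLE_SEQ_PATH "neon"    /* used only when SVE2 is not enabled */
#else
#  define EXAMPLE_SEQ_PATH "scalar"  /* portable fallback */
#endif

/* Reports which code path the preprocessor selected for this build. */
static const char* example_selected_path(void)
{
    return EXAMPLE_SEQ_PATH;
}
```

Because the choice is made by the preprocessor, a build targeting a CPU with both extensions never exercises the NEON branch, which is why the question of SVE2 being "never worse" matters.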

@arpadpanyik-arm
Contributor Author

The proposed SVE implementation relies on the two-register TBL instruction and therefore requires SVE2. It also differs from the Neon version: SVE performs efficient vector comparisons and branches on the results, whereas the Neon implementation updates indices in a fully vectorized manner. With a well-predicted branch, the SVE2 variant achieves the same effect with fewer instructions and scales naturally to wider vector lengths.
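
As a rough illustration of what a two-register table lookup computes, here is a scalar C model. It assumes byte-sized elements and a caller-supplied vector length `vl` for simplicity; `model_tbl2_u8` is a hypothetical helper written for this explanation, not code from the PR, and the element size used by the real implementation may differ.

```c
#include <stddef.h>

/* Scalar model of a two-register table lookup on vectors of `vl` bytes.
 * Each index selects an element from the concatenation {tbl0, tbl1};
 * an index outside that range produces 0.
 * Hypothetical helper for illustration only. */
static void model_tbl2_u8(unsigned char* dst,
                          const unsigned char* tbl0,
                          const unsigned char* tbl1,
                          const unsigned char* idx,
                          size_t vl)
{
    size_t i;
    for (i = 0; i < vl; i++) {
        size_t k = idx[i];
        if (k < vl) {
            dst[i] = tbl0[k];        /* first table register */
        } else if (k < 2 * vl) {
            dst[i] = tbl1[k - vl];   /* second table register */
        } else {
            dst[i] = 0;              /* out-of-range index */
        }
    }
}
```

Since the table spans two registers, a single lookup can gather elements from twice as much source data as a one-register lookup, and the same logic holds for any vector length.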

Some benchmarks on client CPUs, each normalized to its Neon code path compiled with GCC-15:

               GCC-15  Clang-20    GCC-15  Clang-20
                 Neon      Neon      SVE2      SVE2
Cortex-A510:  100.00%   104.12%   102.13%   122.77%
Cortex-A520:  100.00%   106.06%   115.31%   145.24%

Cortex-A710:  100.00%   128.36%   149.31%   145.17%
Cortex-A715:  100.00%   101.16%   114.07%   114.18%
Cortex-A720:  100.00%   100.16%   116.59%   116.44%

Cortex-X2:    100.00%   120.45%   122.22%   124.58%
Cortex-X3:    100.00%   123.91%   134.91%   133.32%
Cortex-X4:    100.00%   124.19%   140.32%   141.15%

GCC-15 emits suboptimal Neon code: it inserts extra vector moves to make TBL operands consecutive. This penalizes Cortex-A710, which lacks zero-latency vector register copies; newer Cortex-A cores mitigate this. Clang-20 produces markedly better Neon code in this case. By contrast, SVE2 code generation is strong in both compilers. We also applied aggressive unrolling, since on Cortex-X and Neoverse-V, SVE comparisons can issue on a single vector pipeline.

In a 128-bit configuration, SVE is expected to deliver performance similar to Neon. Where the algorithm can take advantage of SVE’s extended instruction set and capabilities, it can readily outperform Neon. I think SVE2 is at least as good as Neon in this case.

You can learn more about the aforementioned CPUs and their features in the Software Optimisation Guides here: https://developer.arm.com/documentation#f-navigationhierarchiescontenttype=Software%20Optimization%20Guide&numberOfResults=48

@Cyan4973 Cyan4973 merged commit b5c294e into facebook:dev Aug 22, 2025
104 checks passed