Improve speed of ZSTD_compressSequencesAndLiterals using Neon #4429

arpadpanyik-arm · 2025-07-09T14:15:12Z

This PR improves the generic implementation of ZSTD_get1BlockSummary and adds a Neon implementation of convertSequences_noRepcodes to speed up ZSTD_compressSequencesAndLiterals. Details are in the commit messages. Unit tests included.

Relative performance of ZSTD_get1BlockSummary with ./fullbench -b19 -l5 enwik5 using GCC-13 as baseline:

    Neoverse-V2   before     after    uplift
    GCC-13:      100.000%  290.527%   2.905x
    GCC-14:      100.000%  291.714%   2.917x
    GCC-15:       99.914%  291.495%   2.917x
    Clang-18:    148.072%  264.524%   1.786x
    Clang-19:    148.075%  264.512%   1.786x
    Clang-20:    148.062%  264.490%   1.786x

    Cortex-A720   before     after    uplift
    GCC-13:      100.000%  235.261%   2.352x
    GCC-14:      101.064%  234.903%   2.324x
    GCC-15:      112.977%  218.547%   1.934x
    Clang-18:    127.135%  180.359%   1.418x
    Clang-19:    127.149%  180.297%   1.417x
    Clang-20:    127.154%  180.260%   1.417x

Relative performance of convertSequences_noRepcodes with ./fullbench -b18 -l5 enwik5 using Clang-18 as baseline:

    Neoverse-V2   before     after    uplift
    Clang-18:    100.000%  311.703%   3.117x
    Clang-19:    100.191%  311.714%   3.111x
    Clang-20:    100.181%  311.723%   3.111x
    GCC-13:      107.520%  252.309%   2.346x
    GCC-14:      107.652%  253.158%   2.351x
    GCC-15:      107.674%  253.168%   2.351x

    Cortex-A720   before     after    uplift
    Clang-18:    100.000%  204.512%   2.045x
    Clang-19:    102.825%  204.600%   1.989x
    Clang-20:    102.807%  204.558%   1.989x
    GCC-13:      110.668%  203.594%   1.839x
    GCC-14:      110.684%  203.978%   1.842x
    GCC-15:      102.864%  204.299%   1.986x

Relative performance of ZSTD_compressSequencesAndLiterals with ./fullbench -b17 -l5 enwik5 using GCC-13 as baseline:

    Neoverse-V2   before     after    uplift
    GCC-13:      100.000%  108.730%   1.087x
    GCC-14:      100.428%  109.235%   1.087x
    GCC-15:       98.730%  106.345%   1.077x
    Clang-18:    107.889%  117.064%   1.085x
    Clang-19:    108.088%  117.354%   1.085x
    Clang-20:    108.027%  117.614%   1.088x

    Cortex-A720   before     after    uplift
    GCC-13:      100.000%  108.104%   1.081x
    GCC-14:       99.860%  108.164%   1.083x
    GCC-15:       98.028%  105.675%   1.078x
    Clang-18:    104.380%  113.002%   1.082x
    Clang-19:    105.117%  113.998%   1.084x
    Clang-20:    105.276%  114.038%   1.083x

An additional commit enables optimized (`-O3`) AArch64 SVE2 builds for the QEMU CI job, reducing its runtime from ~27 minutes to ~14 minutes.

Co-authored by Thomas Daubney <Thomas.Daubney@arm.com>

Cyan4973 · 2025-07-10T02:19:05Z

This looks good to me.
Performance impact is well documented.
There is a relevant test added to the existing framework.
The modified code is rather well concentrated (notably the Neon-specific variant).
I think the optimization of ZSTD_get1BlockSummary() is pretty clever, on top of being generic, which is very welcome.
However, it's not so obvious to the reader, and a short explanation would have been welcome for future reviewers.

Add a faster scalar implementation of ZSTD_get1BlockSummary which removes the data dependency of the accumulators in the hot loop to leverage the superscalar potential of recent out-of-order CPUs.

The new algorithm leverages SWAR (SIMD Within A Register) methodology to exploit the capabilities of 64-bit architectures. It achieves this by packing two 32-bit data elements into a single 64-bit register, enabling parallel operations on these subcomponents while ensuring that the 32-bit boundaries prevent overflow, thereby optimizing computational efficiency.

Corresponding unit tests are included.

Relative performance to GCC-13 using: `./fullbench -b19 -l5 enwik5`

    Neoverse-V2   before     after
    GCC-13:      100.000%  290.527%
    GCC-14:      100.000%  291.714%
    GCC-15:       99.914%  291.495%
    Clang-18:    148.072%  264.524%
    Clang-19:    148.075%  264.512%
    Clang-20:    148.062%  264.490%

    Cortex-A720   before     after
    GCC-13:      100.000%  235.261%
    GCC-14:      101.064%  234.903%
    GCC-15:      112.977%  218.547%
    Clang-18:    127.135%  180.359%
    Clang-19:    127.149%  180.297%
    Clang-20:    127.154%  180.260%

Co-authored by, Thomas Daubney <Thomas.Daubney@arm.com>
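The SWAR idea in this commit message can be illustrated with a small sketch. This is not the code from the PR: `Seq`, `Totals`, `pack_lengths` and `swar_sum_lengths` are hypothetical names, and `Seq` is only a stand-in for the public `ZSTD_Sequence` layout. The sketch assumes that the sum of all literal lengths and the sum of all match lengths each stay below 2^32, so the low 32-bit lane never carries into the high lane. Two independent accumulators are used so that consecutive additions do not depend on each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ZSTD_Sequence (four 32-bit fields). */
typedef struct {
    uint32_t offset;
    uint32_t litLength;
    uint32_t matchLength;
    uint32_t rep;
} Seq;

/* Hypothetical result type: total literal bytes and total match bytes. */
typedef struct {
    uint64_t litSize;
    uint64_t matchSize;
} Totals;

/* Pack both 32-bit lengths into one 64-bit word:
 * litLength in the low lane, matchLength in the high lane. */
static uint64_t pack_lengths(const Seq* s)
{
    return ((uint64_t)s->matchLength << 32) | s->litLength;
}

/* Sum litLength and matchLength with one 64-bit addition per sequence.
 * Assumption: each per-lane sum stays below 2^32, so no carry crosses
 * from the low lane into the high lane. */
Totals swar_sum_lengths(const Seq* seqs, size_t n)
{
    uint64_t acc0 = 0, acc1 = 0;  /* independent accumulators */
    size_t i = 0;
    Totals t;

    assert(seqs != NULL || n == 0);

    for (; i + 2 <= n; i += 2) {
        acc0 += pack_lengths(&seqs[i]);
        acc1 += pack_lengths(&seqs[i + 1]);
    }
    if (i < n) acc0 += pack_lengths(&seqs[i]);  /* odd tail element */

    /* Split the lanes only once, after the loop. */
    t.litSize   = (acc0 & 0xFFFFFFFFu) + (acc1 & 0xFFFFFFFFu);
    t.matchSize = (acc0 >> 32) + (acc1 >> 32);
    return t;
}
```

For three sequences with literal lengths 3, 10, 100 and match lengths 5, 20, 200, `swar_sum_lengths` returns `litSize == 113` and `matchSize == 225`. Packing explicitly with shifts keeps the sketch independent of byte order; a real implementation might instead load the two adjacent fields with a single 64-bit read.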

Add a 4-way Neon implementation for the convertSequences_noRepcodes function. Remove 'static' keywords from all of its implementations to be able to add unit tests.

Relative performance to Clang-18 using: `./fullbench -b18 -l5 enwik5`

    Neoverse-V2   before     after
    Clang-18:    100.000%  311.703%
    Clang-19:    100.191%  311.714%
    Clang-20:    100.181%  311.723%
    GCC-13:      107.520%  252.309%
    GCC-14:      107.652%  253.158%
    GCC-15:      107.674%  253.168%

    Cortex-A720   before     after
    Clang-18:    100.000%  204.512%
    Clang-19:    102.825%  204.600%
    Clang-20:    102.807%  204.558%
    GCC-13:      110.668%  203.594%
    GCC-14:      110.684%  203.978%
    GCC-15:      102.864%  204.299%

Co-authored by, Thomas Daubney <Thomas.Daubney@arm.com>
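To give a rough idea of what a 4-way Neon conversion might look like, here is a hedged sketch, not the code from the PR. `InSeq`, `OutSeq`, `REP_NUM`, `MIN_MATCH`, `convert_one` and `convert_no_repcodes` are hypothetical names. The sketch assumes the output record is a 32-bit offset base followed by two 16-bit lengths, that the offset base is `offset + 3`, that the stored match length is `matchLength - 3`, and that both lengths fit in 16 bits. The Neon path is only compiled on little-endian Arm targets; elsewhere the scalar loop handles everything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) && defined(__BYTE_ORDER__) && \
    (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#include <arm_neon.h>
#define SKETCH_USE_NEON 1
#endif

/* Hypothetical stand-ins for the input and output sequence records. */
typedef struct { uint32_t offset, litLength, matchLength, rep; } InSeq;
typedef struct { uint32_t offBase; uint16_t litLength; uint16_t mlBase; } OutSeq;

enum { REP_NUM = 3, MIN_MATCH = 3 };  /* assumed constants */

_Static_assert(sizeof(OutSeq) == 8, "sketch assumes an 8-byte output record");

/* Scalar conversion of a single sequence (no repcode handling). */
static void convert_one(OutSeq* dst, const InSeq* src)
{
    dst->offBase   = src->offset + REP_NUM;
    dst->litLength = (uint16_t)src->litLength;
    dst->mlBase    = (uint16_t)(src->matchLength - MIN_MATCH);
}

void convert_no_repcodes(OutSeq* dst, const InSeq* src, size_t n)
{
    size_t i = 0;
    assert(n == 0 || (dst != NULL && src != NULL));

#ifdef SKETCH_USE_NEON
    {
        const uint32x4_t repNum   = vdupq_n_u32(REP_NUM);
        const uint32x4_t minMatch = vdupq_n_u32(MIN_MATCH);
        const uint32x4_t mask16   = vdupq_n_u32(0xFFFF);
        for (; i + 4 <= n; i += 4) {
            /* De-interleaving load of 4 sequences:
             * val[0]=offsets, val[1]=litLengths, val[2]=matchLengths, val[3]=rep */
            uint32x4x4_t in = vld4q_u32((const uint32_t*)(src + i));
            uint32x4x2_t out;
            uint32x4_t ll = vandq_u32(in.val[1], mask16);
            uint32x4_t ml = vshlq_n_u32(vsubq_u32(in.val[2], minMatch), 16);
            out.val[0] = vaddq_u32(in.val[0], repNum);  /* offBase */
            out.val[1] = vorrq_u32(ll, ml);             /* litLength | mlBase << 16 */
            /* Interleaving store writes {offBase, lengths} pairs, 8 bytes each. */
            vst2q_u32((uint32_t*)(dst + i), out);
        }
    }
#endif
    /* Scalar tail, or the whole input when Neon is unavailable. */
    for (; i < n; i++) convert_one(&dst[i], &src[i]);
}
```

The de-interleaving load and interleaving store let four sequences be processed per iteration without shuffles in between; how the actual implementation arranges its loads, stores, and length checks may differ.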

Add missing `-O3` flag to the compilation of AArch64 SVE2 builds executed by QEMU. This can decrease the CI job runtime considerably.

arpadpanyik-arm · 2025-07-10T18:26:28Z

Thanks for the review! You are right, the internals of the ZSTD_get1BlockSummary function are not obvious at first sight, so I added some relevant clarifications in 8e44004.

facebook-github-bot added the CLA Signed label Jul 9, 2025

arpadpanyik-arm force-pushed the convertSequences_Neon branch 2 times, most recently from 144c3c9 to 1841830 on July 9, 2025 17:48

Cyan4973 self-assigned this Jul 10, 2025

Cyan4973 approved these changes Jul 10, 2025


arpadpanyik-arm added 3 commits July 10, 2025 18:20

AArch64: Enable optimized QEMU CI builds

703f855

Add missing `-O3` flag to the compilation of AArch64 SVE2 builds executed by QEMU. This can decrease the CI job runtime considerably.

arpadpanyik-arm force-pushed the convertSequences_Neon branch from 1841830 to 703f855 on July 10, 2025 18:21

Cyan4973 approved these changes Jul 10, 2025


Cyan4973 merged commit afa96bb into facebook:dev Jul 14, 2025
102 of 104 checks passed
