arpadpanyik-arm (Contributor) commented Jul 9, 2025

This PR improves the generic implementation of ZSTD_get1BlockSummary and adds a Neon implementation of convertSequences_noRepcodes to speed up ZSTD_compressSequencesAndLiterals. Details are in the commit messages. Unit tests included.

Relative performance of ZSTD_get1BlockSummary with ./fullbench -b19 -l5 enwik5 using GCC-13 as baseline:

    Neoverse-V2   before     after    uplift
    GCC-13:      100.000%  290.527%   2.905x
    GCC-14:      100.000%  291.714%   2.917x
    GCC-15:       99.914%  291.495%   2.917x
    Clang-18:    148.072%  264.524%   1.786x
    Clang-19:    148.075%  264.512%   1.786x
    Clang-20:    148.062%  264.490%   1.786x

    Cortex-A720   before     after    uplift
    GCC-13:      100.000%  235.261%   2.352x
    GCC-14:      101.064%  234.903%   2.324x
    GCC-15:      112.977%  218.547%   1.934x
    Clang-18:    127.135%  180.359%   1.418x
    Clang-19:    127.149%  180.297%   1.417x
    Clang-20:    127.154%  180.260%   1.417x
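
As a reading aid: the uplift column in these tables appears to be the "after" figure divided by the "before" figure for the same compiler, with both figures expressed relative to the baseline compiler's "before" result. A minimal C sketch of that arithmetic follows; the helper name `uplift` is hypothetical and is not part of zstd or fullbench.

```c
/* Hypothetical helper (not part of zstd): ratio of the "after" figure to the
 * "before" figure, both given as percentages of the same baseline. */
static double uplift(double before_pct, double after_pct)
{
    return after_pct / before_pct;
}
```

For the Neoverse-V2 Clang-18 row, `uplift(148.072, 264.524)` gives roughly 1.786, which matches the table.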

Relative performance of convertSequences_noRepcodes with ./fullbench -b18 -l5 enwik5 using Clang-18 as baseline:

    Neoverse-V2   before     after    uplift
    Clang-18:    100.000%  311.703%   3.117x
    Clang-19:    100.191%  311.714%   3.111x
    Clang-20:    100.181%  311.723%   3.111x
    GCC-13:      107.520%  252.309%   2.346x
    GCC-14:      107.652%  253.158%   2.351x
    GCC-15:      107.674%  253.168%   2.351x

    Cortex-A720   before     after    uplift
    Clang-18:    100.000%  204.512%   2.045x
    Clang-19:    102.825%  204.600%   1.989x
    Clang-20:    102.807%  204.558%   1.989x
    GCC-13:      110.668%  203.594%   1.839x
    GCC-14:      110.684%  203.978%   1.842x
    GCC-15:      102.864%  204.299%   1.986x

Relative performance of ZSTD_compressSequencesAndLiterals with ./fullbench -b17 -l5 enwik5 using GCC-13 as baseline:

    Neoverse-V2   before     after    uplift
    GCC-13:      100.000%  108.730%   1.087x
    GCC-14:      100.428%  109.235%   1.087x
    GCC-15:       98.730%  106.345%   1.077x
    Clang-18:    107.889%  117.064%   1.085x
    Clang-19:    108.088%  117.354%   1.085x
    Clang-20:    108.027%  117.614%   1.088x

    Cortex-A720   before     after    uplift
    GCC-13:      100.000%  108.104%   1.081x
    GCC-14:       99.860%  108.164%   1.083x
    GCC-15:       98.028%  105.675%   1.078x
    Clang-18:    104.380%  113.002%   1.082x
    Clang-19:    105.117%  113.998%   1.084x
    Clang-20:    105.276%  114.038%   1.083x

An additional commit adds the missing -O3 flag to the AArch64 SVE2 builds executed by QEMU, reducing the runtime of that CI job from ~27 minutes to ~14 minutes.

Co-authored by, Thomas Daubney <Thomas.Daubney@arm.com>

arpadpanyik-arm force-pushed the convertSequences_Neon branch 2 times, most recently from 144c3c9 to 1841830 on July 9, 2025 17:48
Cyan4973 self-assigned this on Jul 10, 2025
Cyan4973 (Contributor) commented Jul 10, 2025

This looks good to me.
Performance impact is well documented.
There is a relevant test added to the existing framework.
The modified code is rather well concentrated (notably the neon specific variant).
I think the optimization of ZSTD_get1BlockSummary() is pretty clever, and the fact that it is generic is very welcome.
However, it is not so obvious to the reader, and a short explanation would be welcome for future reviewers.

Add a faster scalar implementation of ZSTD_get1BlockSummary which
removes the data dependency of the accumulators in the hot loop to
leverage the superscalar potential of recent out-of-order CPUs.
The new algorithm leverages SWAR (SIMD Within A Register) methodology
to exploit the capabilities of 64-bit architectures. It achieves this
by packing two 32-bit data elements into a single 64-bit register,
enabling parallel operations on these subcomponents while ensuring
that the 32-bit boundaries prevent overflow, thereby optimizing
computational efficiency.
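
The packing idea described above can be illustrated with a minimal C sketch. It is not the code from this PR: the `Seq` and `Sums` types and the `pack`, `sum_scalar` and `sum_swar` names are hypothetical, and the sketch assumes that the total of all `litLength` values fits in 32 bits so that no carry leaks from the low lane into the high lane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sequence record for this sketch; not the zstd definition. */
typedef struct {
    uint32_t litLength;
    uint32_t matchLength;
} Seq;

/* Hypothetical result type holding the two totals. */
typedef struct {
    uint32_t litSum;
    uint32_t matchSum;
} Sums;

/* Pack both lengths into one 64-bit word:
 * litLength in the low 32-bit lane, matchLength in the high lane. */
static uint64_t pack(Seq s)
{
    return ((uint64_t)s.matchLength << 32) | (uint64_t)s.litLength;
}

/* Reference version: two separate 32-bit accumulators. */
static Sums sum_scalar(const Seq* seqs, size_t n)
{
    Sums r = { 0, 0 };
    size_t i;
    for (i = 0; i < n; i++) {
        r.litSum += seqs[i].litLength;
        r.matchSum += seqs[i].matchLength;
    }
    return r;
}

/* SWAR version: a single 64-bit add updates both lanes at once.
 * Two independent accumulators shorten the chain of dependent adds.
 * Assumption: the litLength total fits in 32 bits (no cross-lane carry). */
static Sums sum_swar(const Seq* seqs, size_t n)
{
    uint64_t acc0 = 0, acc1 = 0, acc;
    Sums r;
    size_t i = 0;
    assert(seqs != NULL || n == 0);
    for (; i + 2 <= n; i += 2) {
        acc0 += pack(seqs[i]);
        acc1 += pack(seqs[i + 1]);
    }
    if (i < n) acc0 += pack(seqs[i]);  /* odd element left over */
    acc = acc0 + acc1;
    r.litSum = (uint32_t)acc;          /* low lane */
    r.matchSum = (uint32_t)(acc >> 32); /* high lane */
    return r;
}
```

Under the stated assumption, `sum_swar` returns the same totals as `sum_scalar` while issuing half as many additions in the loop body.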

Corresponding unit tests are included.

Relative performance to GCC-13 using: `./fullbench -b19 -l5 enwik5`

Neoverse-V2   before     after
GCC-13:      100.000%  290.527%
GCC-14:      100.000%  291.714%
GCC-15:       99.914%  291.495%
Clang-18:    148.072%  264.524%
Clang-19:    148.075%  264.512%
Clang-20:    148.062%  264.490%

Cortex-A720   before     after
GCC-13:      100.000%  235.261%
GCC-14:      101.064%  234.903%
GCC-15:      112.977%  218.547%
Clang-18:    127.135%  180.359%
Clang-19:    127.149%  180.297%
Clang-20:    127.154%  180.260%

Co-authored by, Thomas Daubney <Thomas.Daubney@arm.com>

Add a 4-way Neon implementation for the convertSequences_noRepcodes
function. Remove 'static' keywords from all of its implementations to
be able to add unit tests.

Relative performance to Clang-18 using: `./fullbench -b18 -l5 enwik5`

Neoverse-V2   before     after
Clang-18:    100.000%  311.703%
Clang-19:    100.191%  311.714%
Clang-20:    100.181%  311.723%
GCC-13:      107.520%  252.309%
GCC-14:      107.652%  253.158%
GCC-15:      107.674%  253.168%

Cortex-A720   before     after
Clang-18:    100.000%  204.512%
Clang-19:    102.825%  204.600%
Clang-20:    102.807%  204.558%
GCC-13:      110.668%  203.594%
GCC-14:      110.684%  203.978%
GCC-15:      102.864%  204.299%

Co-authored by, Thomas Daubney <Thomas.Daubney@arm.com>

Add missing `-O3` flag to the compilation of AArch64 SVE2 builds
executed by QEMU. This can decrease the CI job runtime considerably.

arpadpanyik-arm force-pushed the convertSequences_Neon branch from 1841830 to 703f855 on July 10, 2025 18:21
arpadpanyik-arm (Contributor, Author) commented Jul 10, 2025

Thanks for the review! You are right, the internals of the ZSTD_get1BlockSummary function are not obvious at first sight, so I added some relevant clarifications in 8e44004.

Cyan4973 merged commit afa96bb into facebook:dev on Jul 14, 2025
102 of 104 checks passed