
Conversation

@brian-pane

Inspired by @nmoinvaz's suggestion in issue #18, I went through the functions that currently aren't inlined and tried adding a compiler hint to inline them. Some didn't help, or even caused regressions when inlined, but this one seems to improve performance at compression level 1:

  measurement          mean ± σ            min … max           outliers         delta
  wall_time          73.6ms ± 3.79ms    71.6ms …  101ms          4 ( 6%)        0%
  peak_rss           26.7MB ± 76.8KB    26.5MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          281M  ± 14.9M      278M  …  402M           2 ( 3%)        0%
  instructions        568M  ±  265       568M  …  568M           0 ( 0%)        0%
  cache_references    265K  ± 2.95K      263K  …  285K           6 ( 9%)        0%
  cache_misses        232K  ± 9.17K      201K  …  245K          14 (21%)        0%
  branch_misses      2.94M  ± 6.20K     2.90M  … 2.95M           5 ( 7%)        0%
Benchmark 2 (70 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          71.5ms ±  710us    70.5ms … 74.1ms          2 ( 3%)        ⚡-  2.8% ±  1.2%
  peak_rss           26.7MB ± 80.1KB    26.5MB … 26.7MB          0 ( 0%)          -  0.1% ±  0.1%
  cpu_cycles          274M  ±  500K      273M  …  275M           0 ( 0%)        ⚡-  2.6% ±  1.2%
  instructions        565M  ±  277       565M  …  565M           1 ( 1%)          -  0.6% ±  0.0%
  cache_references    266K  ± 4.67K      263K  …  300K           2 ( 3%)          +  0.5% ±  0.5%
  cache_misses        231K  ± 9.01K      203K  …  245K          12 (17%)          -  0.0% ±  1.3%
  branch_misses      3.03M  ± 8.03K     3.01M  … 3.05M           0 ( 0%)        💩+  3.1% ±  0.1%

while not hurting performance at higher compression levels:

Benchmark 1 (12 runs): ./blogpost-compress-baseline 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           447ms ± 1.47ms     446ms …  451ms          1 ( 8%)        0%
  peak_rss           24.5MB ± 64.5KB    24.4MB … 24.5MB          0 ( 0%)        0%
  cpu_cycles         1.88G  ± 2.25M     1.88G  … 1.88G           5 (42%)        0%
  instructions       3.18G  ±  238      3.18G  … 3.18G           0 ( 0%)        0%
  cache_references    274K  ± 4.73K      269K  …  283K           0 ( 0%)        0%
  cache_misses        240K  ± 3.72K      229K  …  244K           2 (17%)        0%
  branch_misses      19.4M  ± 73.3K     19.3M  … 19.5M           0 ( 0%)        0%
Benchmark 2 (12 runs): ./target/release/examples/blogpost-compress 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           447ms ±  672us     445ms …  448ms          1 ( 8%)          -  0.0% ±  0.2%
  peak_rss           24.4MB ± 68.5KB    24.4MB … 24.5MB          0 ( 0%)          -  0.1% ±  0.2%
  cpu_cycles         1.88G  ± 1.63M     1.88G  … 1.88G           0 ( 0%)          +  0.1% ±  0.1%
  instructions       3.19G  ±  358      3.19G  … 3.19G           0 ( 0%)          +  0.2% ±  0.0%
  cache_references    274K  ± 4.74K      269K  …  285K           0 ( 0%)          +  0.1% ±  1.5%
  cache_misses        240K  ± 3.28K      231K  …  244K           1 ( 8%)          +  0.2% ±  1.2%
  branch_misses      19.3M  ± 37.7K     19.3M  … 19.4M           0 ( 0%)          -  0.4% ±  0.3%

@nmoinvaz

compare256 functions are not inlined to longest_match?

@brian-pane
Author

> compare256 functions are not inlined to longest_match?

They are in zlib-rs, although in zlib-ng they seem to be called through a function pointer (I guess to allow run-time selection among different SSE/AVX implementations from a single x86 build).

@brian-pane
Author

zlib-rs also selects the AVX2 version of compare256 at runtime, using a conditional branch. It looks like that ends up being faster than the function-pointer approach because it enables the compiler to inline the hardware-specific implementation.
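The branch-based dispatch can be sketched as follows. This is a minimal illustration, not zlib-rs's actual API: the chunked routine stands in for the real AVX2 implementation so the example stays portable.

```rust
// Hypothetical sketch of branch-based dispatch; all names are illustrative.

fn compare256_scalar(a: &[u8; 256], b: &[u8; 256]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

// Stand-in for a SIMD implementation: compares 8-byte chunks at a time.
fn compare256_chunked(a: &[u8; 256], b: &[u8; 256]) -> usize {
    for (i, (ca, cb)) in a.chunks_exact(8).zip(b.chunks_exact(8)).enumerate() {
        if ca != cb {
            return i * 8 + ca.iter().zip(cb).take_while(|(x, y)| x == y).count();
        }
    }
    256
}

#[inline]
fn compare256(a: &[u8; 256], b: &[u8; 256]) -> usize {
    // Each arm is a direct call, so the compiler can inline the chosen
    // implementation into callers of compare256; a call through a
    // function pointer would block that inlining.
    #[cfg(target_arch = "x86_64")]
    if std::is_x86_feature_detected!("avx2") {
        return compare256_chunked(a, b);
    }
    compare256_scalar(a, b)
}
```

Both paths return the length of the common prefix; only the codegen differs.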

@folkertdev
Member

What happens exactly depends on the target features that are enabled at compile time: on most x86_64 CPUs, -Ctarget-cpu=native will enable avx2, and the right implementation is picked statically. My reading of the zlib-ng code and output assembly is that it does the same thing. When the feature is not enabled statically, we check for it at runtime. That check is cached in an atomic, so it usually has low cost, although in isolated tests it can be beaten by a function-pointer approach. In our measurements, though, the difference was so small that it did not matter for real-world input.
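An atomic-cached feature check of the kind described here can be sketched like this. The encoding and names are assumptions for illustration, not the actual zlib-rs implementation:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Assumed encoding: 0 = not yet checked, 1 = feature absent, 2 = feature present.
static AVX2_CACHE: AtomicU8 = AtomicU8::new(0);

#[cfg(target_arch = "x86_64")]
fn detect_avx2() -> bool {
    std::is_x86_feature_detected!("avx2")
}

#[cfg(not(target_arch = "x86_64"))]
fn detect_avx2() -> bool {
    false
}

#[inline]
fn have_avx2() -> bool {
    match AVX2_CACHE.load(Ordering::Relaxed) {
        0 => {
            // First call: do the (relatively expensive) CPUID-based
            // detection once and cache the answer.
            let detected = detect_avx2();
            AVX2_CACHE.store(if detected { 2 } else { 1 }, Ordering::Relaxed);
            detected
        }
        // Later calls: a load plus a compare. Cheap, but still a branch
        // per call, which a function-pointer approach avoids.
        v => v == 2,
    }
}
```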

Combined with that: by default the avx2 intrinsics are not inlined unless the surrounding function is marked with #[target_feature(enable = "avx2")]. I believe we've made sure that that is the case, but of course we may have missed a spot (a clippy lint for verifying this automatically is under review).
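The inlining rule can be illustrated with a small sketch (hypothetical helper names, not zlib-rs code): intrinsics compiled with avx2 are only inlined into callers that also enable avx2.

```rust
#[cfg(target_arch = "x86_64")]
mod avx2 {
    use core::arch::x86_64::*;

    // Returns the index of the first mismatching byte (32 if all 32 match).
    #[target_feature(enable = "avx2")]
    pub unsafe fn first_mismatch32(a: &[u8; 32], b: &[u8; 32]) -> usize {
        let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
        let eq = _mm256_cmpeq_epi8(va, vb);
        // Bits of `mask` are 1 where the bytes are equal, so the first
        // 0 bit marks the first mismatch.
        let mask = _mm256_movemask_epi8(eq) as u32;
        (!mask).trailing_zeros() as usize
    }

    // Because this caller also enables avx2, first_mismatch32 can be
    // inlined into it. Without the attribute it would remain an
    // outlined call, which is the missed-inlining case described above.
    #[target_feature(enable = "avx2")]
    pub unsafe fn caller(a: &[u8; 32], b: &[u8; 32]) -> usize {
        first_mismatch32(a, b)
    }
}
```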


@brian-pane I'm seeing an improvement for level 1, but a regression for compression level 2

Benchmark 1 (40 runs): target/release/examples/compress-baseline 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           127ms ±  962us     125ms …  129ms          0 ( 0%)        0%
  peak_rss           25.0MB ± 66.3KB    24.9MB … 25.0MB          0 ( 0%)        0%
  cpu_cycles          518M  ± 3.89M      510M  …  528M           0 ( 0%)        0%
  instructions       1.09G  ±  285      1.09G  … 1.09G           1 ( 3%)        0%
  cache_references   34.2M  ±  370K     33.6M  … 35.5M           3 ( 8%)        0%
  cache_misses       1.01M  ±  212K      735K  … 1.69M           2 ( 5%)        0%
  branch_misses      6.94M  ± 3.69K     6.93M  … 6.95M           3 ( 8%)        0%
Benchmark 2 (39 runs): target/release/examples/blogpost-compress 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           130ms ± 1.06ms     127ms …  133ms          1 ( 3%)        💩+  2.1% ±  0.4%
  peak_rss           25.0MB ± 63.7KB    24.9MB … 25.0MB          0 ( 0%)          +  0.1% ±  0.1%
  cpu_cycles          529M  ± 4.49M      519M  …  541M           4 (10%)        💩+  2.1% ±  0.4%
  instructions       1.09G  ±  327      1.09G  … 1.09G           1 ( 3%)          +  0.0% ±  0.0%
  cache_references   34.4M  ±  359K     34.0M  … 35.4M           2 ( 5%)          +  0.6% ±  0.5%
  cache_misses       1.16M  ±  231K      885K  … 1.88M           4 (10%)        💩+ 14.3% ±  9.8%
  branch_misses      6.93M  ± 3.13K     6.93M  … 6.94M           1 ( 3%)          -  0.1% ±  0.0%

Levels 3 and onward appear fine. So this might need some further tweaking (maybe moving some cold code out of the function you now inline?).
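The "move cold code out" idea can be sketched on a hypothetical window structure (illustrative only, not zlib-rs's actual fill_window code):

```rust
// Keep the hot function small so it inlines well, and outline the
// rarely taken path so it does not bloat every caller.
struct Window {
    pos: usize,
    data: Vec<u8>,
}

#[inline]
fn advance(w: &mut Window, n: usize) {
    // Hot path: usually there is room, and this compiles down to a
    // compare plus an add when inlined into the caller.
    if w.pos + n > w.data.len() {
        grow_cold(w, n);
    }
    w.pos += n;
}

// The cold path stays out of line; #[cold] also hints the branch
// predictor/layout that this call is unlikely.
#[cold]
#[inline(never)]
fn grow_cold(w: &mut Window, n: usize) {
    w.data.resize(w.pos + n, 0);
}
```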

@brian-pane
Author

I moved what I could out of the inlined part in bb2c510. It might not be enough, though.

@nmoinvaz

zlib-ng inlines compare256 by having an associated version of longest_match for each implementation; longest_match itself, which is called less frequently, is then reached through a function pointer. The only direct function-pointer call to compare256 is in deflate_quick.
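The pattern described here can be sketched generically in Rust (names are illustrative, not zlib-ng's actual code): monomorphize the match search over its compare function, so each specialized copy gets a fully inlined comparison, and only the outer entry point is selected indirectly.

```rust
fn common_prefix(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

// Generic over the comparison: each instantiation is a separate
// function in which `cmp` is a direct, inlinable call.
fn longest_match_with<F: Fn(&[u8], &[u8]) -> usize>(
    cmp: F,
    window: &[u8],
    cur: usize,
    candidates: &[usize],
) -> (usize, usize) {
    // Returns (match length, candidate position) for the best candidate.
    let mut best = (0, 0);
    for &cand in candidates {
        let len = cmp(&window[cand..], &window[cur..]);
        if len > best.0 {
            best = (len, cand);
        }
    }
    best
}

// One such specialization; even if this entry point is reached through
// a function pointer, its inner comparison is still inlined.
fn longest_match_scalar(window: &[u8], cur: usize, candidates: &[usize]) -> (usize, usize) {
    longest_match_with(common_prefix, window, cur, candidates)
}
```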

@folkertdev
Member

I've been playing around with this, and there is definitely something here, but I believe the only case where this matters (for quick, at least) is when we run on a CPU with avx2 but compile for a generic x86_64 target. E.g. on aarch64, neon is already enabled by default (so there is no runtime branch), and for simd128 on wasm the flag must be enabled at compile time; runtime detection is not available on that platform. Furthermore, compare256 does not benefit from avx512, so that is not relevant either.

I've done some experiments with adding #[target_feature(enable = "avx2")] to some functions, and it helps a lot. With -Ctarget-cpu=native we now roughly match the performance of zlib-ng, so the performance we lose comes from the dispatching and from worse codegen in certain functions when avx2 is not statically enabled.

But there is not a very ergonomic way to specialize for a range of target features right now (though there is a proposed project goal that will hopefully bring a solution closer). I did experiment with a function-pointer dispatch method, and that does help a bit (#273), but as mentioned, using avx2 in more places is advantageous. We'll need to carefully weigh performance against maintainability here.
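Function-pointer dispatch resolved on first use can be sketched like this (using `OnceLock`; an illustration only, not necessarily how #273 implements it):

```rust
use std::sync::OnceLock;

type CmpFn = fn(&[u8], &[u8]) -> usize;

fn compare_scalar(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

fn compare(a: &[u8], b: &[u8]) -> usize {
    static DISPATCH: OnceLock<CmpFn> = OnceLock::new();
    let f = DISPATCH.get_or_init(|| {
        // A real implementation would return the avx2 variant when the
        // feature is detected; this sketch always picks the scalar one.
        compare_scalar as CmpFn
    });
    // The indirect call avoids a per-call feature branch, but prevents
    // the compiler from inlining the chosen implementation into callers.
    f(a, b)
}
```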

@folkertdev
Member

After the recent changes (I guess to the state), just the inlining is an improvement now:

Benchmark 2 (64 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.7ms ±  813us    77.6ms … 82.5ms          4 ( 6%)        ⚡-  3.1% ±  0.4%
  peak_rss           26.7MB ± 93.3KB    26.6MB … 26.9MB          0 ( 0%)          +  0.1% ±  0.1%
  cpu_cycles          286M  ± 2.82M      283M  …  299M           6 ( 9%)        ⚡-  4.3% ±  0.4%
  instructions        591M  ±  268       591M  …  591M           0 ( 0%)        ⚡-  1.7% ±  0.0%
  cache_references   19.9M  ±  151K     19.7M  … 20.5M           3 ( 5%)          -  0.4% ±  0.3%
  cache_misses        405K  ± 73.2K      306K  …  650K           3 ( 5%)          -  0.7% ±  8.9%
  branch_misses      2.97M  ± 5.83K     2.96M  … 3.00M           6 ( 9%)          -  0.4% ±  0.1%

For the other levels there are some small improvements to instruction counts, but in any case no regressions.

@folkertdev merged commit be3740f into trifectatechfoundation:main on Jan 9, 2025.
@brian-pane deleted the inline-fill-window branch on April 1, 2025.