
Conversation

@brian-pane

Inspired by @nmoinvaz's suggestion in issue #18, I went through the functions that currently aren't inlined and tried adding a compiler hint to inline them. Some didn't help, or even caused regressions when inlined, but this one seems to improve performance at compression level 1:

  measurement          mean ± σ            min … max           outliers         delta
  wall_time          73.6ms ± 3.79ms    71.6ms …  101ms          4 ( 6%)        0%
  peak_rss           26.7MB ± 76.8KB    26.5MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          281M  ± 14.9M      278M  …  402M           2 ( 3%)        0%
  instructions        568M  ±  265       568M  …  568M           0 ( 0%)        0%
  cache_references    265K  ± 2.95K      263K  …  285K           6 ( 9%)        0%
  cache_misses        232K  ± 9.17K      201K  …  245K          14 (21%)        0%
  branch_misses      2.94M  ± 6.20K     2.90M  … 2.95M           5 ( 7%)        0%
Benchmark 2 (70 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          71.5ms ±  710us    70.5ms … 74.1ms          2 ( 3%)        ⚡-  2.8% ±  1.2%
  peak_rss           26.7MB ± 80.1KB    26.5MB … 26.7MB          0 ( 0%)          -  0.1% ±  0.1%
  cpu_cycles          274M  ±  500K      273M  …  275M           0 ( 0%)        ⚡-  2.6% ±  1.2%
  instructions        565M  ±  277       565M  …  565M           1 ( 1%)          -  0.6% ±  0.0%
  cache_references    266K  ± 4.67K      263K  …  300K           2 ( 3%)          +  0.5% ±  0.5%
  cache_misses        231K  ± 9.01K      203K  …  245K          12 (17%)          -  0.0% ±  1.3%
  branch_misses      3.03M  ± 8.03K     3.01M  … 3.05M           0 ( 0%)        💩+  3.1% ±  0.1%

while not hurting performance at higher compression levels:

Benchmark 1 (12 runs): ./blogpost-compress-baseline 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           447ms ± 1.47ms     446ms …  451ms          1 ( 8%)        0%
  peak_rss           24.5MB ± 64.5KB    24.4MB … 24.5MB          0 ( 0%)        0%
  cpu_cycles         1.88G  ± 2.25M     1.88G  … 1.88G           5 (42%)        0%
  instructions       3.18G  ±  238      3.18G  … 3.18G           0 ( 0%)        0%
  cache_references    274K  ± 4.73K      269K  …  283K           0 ( 0%)        0%
  cache_misses        240K  ± 3.72K      229K  …  244K           2 (17%)        0%
  branch_misses      19.4M  ± 73.3K     19.3M  … 19.5M           0 ( 0%)        0%
Benchmark 2 (12 runs): ./target/release/examples/blogpost-compress 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           447ms ±  672us     445ms …  448ms          1 ( 8%)          -  0.0% ±  0.2%
  peak_rss           24.4MB ± 68.5KB    24.4MB … 24.5MB          0 ( 0%)          -  0.1% ±  0.2%
  cpu_cycles         1.88G  ± 1.63M     1.88G  … 1.88G           0 ( 0%)          +  0.1% ±  0.1%
  instructions       3.19G  ±  358      3.19G  … 3.19G           0 ( 0%)          +  0.2% ±  0.0%
  cache_references    274K  ± 4.74K      269K  …  285K           0 ( 0%)          +  0.1% ±  1.5%
  cache_misses        240K  ± 3.28K      231K  …  244K           1 ( 8%)          +  0.2% ±  1.2%
  branch_misses      19.3M  ± 37.7K     19.3M  … 19.4M           0 ( 0%)          -  0.4% ±  0.3%

@nmoinvaz

compare256 functions are not inlined to longest_match?

@brian-pane
Author

> compare256 functions are not inlined to longest_match?

They are in zlib-rs, although in zlib-ng they seem to be called through a function pointer (I guess to allow run-time selection among different SSE/AVX implementations from a single x86 build).

@brian-pane
Author

zlib-rs also selects the AVX2 version of compare256 at runtime, using a conditional branch. It looks like that ends up being faster than the function-pointer approach because it enables the compiler to inline the hardware-specific implementation.
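The branch-based dispatch can be sketched as follows. This is a minimal illustration, not zlib-rs's actual API: the chunked routine stands in for the real AVX2 implementation so the example stays portable.

```rust
// Hypothetical sketch of branch-based dispatch; all names are illustrative.

fn compare256_scalar(a: &[u8; 256], b: &[u8; 256]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

// Stand-in for a SIMD implementation: compares 8-byte chunks at a time.
fn compare256_chunked(a: &[u8; 256], b: &[u8; 256]) -> usize {
    for (i, (ca, cb)) in a.chunks_exact(8).zip(b.chunks_exact(8)).enumerate() {
        if ca != cb {
            return i * 8 + ca.iter().zip(cb).take_while(|(x, y)| x == y).count();
        }
    }
    256
}

#[inline]
fn compare256(a: &[u8; 256], b: &[u8; 256]) -> usize {
    // Each arm is a direct call, so the compiler can inline the chosen
    // implementation into callers of compare256; a call through a
    // function pointer would block that inlining.
    #[cfg(target_arch = "x86_64")]
    if std::is_x86_feature_detected!("avx2") {
        return compare256_chunked(a, b);
    }
    compare256_scalar(a, b)
}
```

Both paths return the length of the common prefix; only the codegen differs.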

@folkertdev
Member

What happens exactly depends on the target features that are enabled at compile time: on most x86_64 CPUs, -Ctarget-cpu=native will enable avx2, and the right implementation is picked statically. My reading of the zlib-ng code and output assembly is that it does the same thing. When the feature is not enabled statically, we check for it at runtime. That check is cached in an atomic, so it usually has low cost, although in isolated tests it can be beaten by a function-pointer approach. In our measurements, though, the difference was so small that it did not matter for real-world input.
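An atomic-cached feature check of the kind described here can be sketched like this. The encoding and names are assumptions for illustration, not the actual zlib-rs implementation:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Assumed encoding: 0 = not yet checked, 1 = feature absent, 2 = feature present.
static AVX2_CACHE: AtomicU8 = AtomicU8::new(0);

#[cfg(target_arch = "x86_64")]
fn detect_avx2() -> bool {
    std::is_x86_feature_detected!("avx2")
}

#[cfg(not(target_arch = "x86_64"))]
fn detect_avx2() -> bool {
    false
}

#[inline]
fn have_avx2() -> bool {
    match AVX2_CACHE.load(Ordering::Relaxed) {
        0 => {
            // First call: do the (relatively expensive) CPUID-based
            // detection once and cache the answer.
            let detected = detect_avx2();
            AVX2_CACHE.store(if detected { 2 } else { 1 }, Ordering::Relaxed);
            detected
        }
        // Later calls: a load plus a compare. Cheap, but still a branch
        // per call, which a function-pointer approach avoids.
        v => v == 2,
    }
}
```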

Combined with that: by default the avx2 intrinsics are not inlined unless the surrounding function is marked with #[target_feature(enable = "avx2")]. I believe we've made sure that that is the case, but of course we may have missed a spot (a clippy lint for verifying this automatically is under review).
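The inlining rule can be illustrated with a small sketch (hypothetical helper names, not zlib-rs code): intrinsics compiled with avx2 are only inlined into callers that also enable avx2.

```rust
#[cfg(target_arch = "x86_64")]
mod avx2 {
    use core::arch::x86_64::*;

    // Returns the index of the first mismatching byte (32 if all 32 match).
    #[target_feature(enable = "avx2")]
    pub unsafe fn first_mismatch32(a: &[u8; 32], b: &[u8; 32]) -> usize {
        let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
        let eq = _mm256_cmpeq_epi8(va, vb);
        // Bits of `mask` are 1 where the bytes are equal, so the first
        // 0 bit marks the first mismatch.
        let mask = _mm256_movemask_epi8(eq) as u32;
        (!mask).trailing_zeros() as usize
    }

    // Because this caller also enables avx2, first_mismatch32 can be
    // inlined into it. Without the attribute it would remain an
    // outlined call, which is the missed-inlining case described above.
    #[target_feature(enable = "avx2")]
    pub unsafe fn caller(a: &[u8; 32], b: &[u8; 32]) -> usize {
        first_mismatch32(a, b)
    }
}
```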


@brian-pane I'm seeing an improvement for level 1, but a regression for compression level 2

Benchmark 1 (40 runs): target/release/examples/compress-baseline 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           127ms ±  962us     125ms …  129ms          0 ( 0%)        0%
  peak_rss           25.0MB ± 66.3KB    24.9MB … 25.0MB          0 ( 0%)        0%
  cpu_cycles          518M  ± 3.89M      510M  …  528M           0 ( 0%)        0%
  instructions       1.09G  ±  285      1.09G  … 1.09G           1 ( 3%)        0%
  cache_references   34.2M  ±  370K     33.6M  … 35.5M           3 ( 8%)        0%
  cache_misses       1.01M  ±  212K      735K  … 1.69M           2 ( 5%)        0%
  branch_misses      6.94M  ± 3.69K     6.93M  … 6.95M           3 ( 8%)        0%
Benchmark 2 (39 runs): target/release/examples/blogpost-compress 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           130ms ± 1.06ms     127ms …  133ms          1 ( 3%)        💩+  2.1% ±  0.4%
  peak_rss           25.0MB ± 63.7KB    24.9MB … 25.0MB          0 ( 0%)          +  0.1% ±  0.1%
  cpu_cycles          529M  ± 4.49M      519M  …  541M           4 (10%)        💩+  2.1% ±  0.4%
  instructions       1.09G  ±  327      1.09G  … 1.09G           1 ( 3%)          +  0.0% ±  0.0%
  cache_references   34.4M  ±  359K     34.0M  … 35.4M           2 ( 5%)          +  0.6% ±  0.5%
  cache_misses       1.16M  ±  231K      885K  … 1.88M           4 (10%)        💩+ 14.3% ±  9.8%
  branch_misses      6.93M  ± 3.13K     6.93M  … 6.94M           1 ( 3%)          -  0.1% ±  0.0%

Levels 3 and onward appear fine. So this might need some further tweaking (maybe moving some cold code out of the function you now inline?).
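The "move cold code out" idea can be sketched on a hypothetical window structure (illustrative only, not zlib-rs's actual fill_window code):

```rust
// Keep the hot function small so it inlines well, and outline the
// rarely taken path so it does not bloat every caller.
struct Window {
    pos: usize,
    data: Vec<u8>,
}

#[inline]
fn advance(w: &mut Window, n: usize) {
    // Hot path: usually there is room, and this compiles down to a
    // compare plus an add when inlined into the caller.
    if w.pos + n > w.data.len() {
        grow_cold(w, n);
    }
    w.pos += n;
}

// The cold path stays out of line; #[cold] also hints the branch
// predictor/layout that this call is unlikely.
#[cold]
#[inline(never)]
fn grow_cold(w: &mut Window, n: usize) {
    w.data.resize(w.pos + n, 0);
}
```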

@brian-pane
Author

I moved what I could out of the inlined part in bb2c510. It might not be enough, though.

@nmoinvaz

zlib-ng inlines compare256 by having an associated version of longest_match for each implementation; longest_match itself, which is called less frequently, is then reached through a function pointer. The only direct function-pointer call to compare256 is in deflate_quick.
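The pattern described here can be sketched generically in Rust (names are illustrative, not zlib-ng's actual code): monomorphize the match search over its compare function, so each specialized copy gets a fully inlined comparison, and only the outer entry point is selected indirectly.

```rust
fn common_prefix(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

// Generic over the comparison: each instantiation is a separate
// function in which `cmp` is a direct, inlinable call.
fn longest_match_with<F: Fn(&[u8], &[u8]) -> usize>(
    cmp: F,
    window: &[u8],
    cur: usize,
    candidates: &[usize],
) -> (usize, usize) {
    // Returns (match length, candidate position) for the best candidate.
    let mut best = (0, 0);
    for &cand in candidates {
        let len = cmp(&window[cand..], &window[cur..]);
        if len > best.0 {
            best = (len, cand);
        }
    }
    best
}

// One such specialization; even if this entry point is reached through
// a function pointer, its inner comparison is still inlined.
fn longest_match_scalar(window: &[u8], cur: usize, candidates: &[usize]) -> (usize, usize) {
    longest_match_with(common_prefix, window, cur, candidates)
}
```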

@folkertdev
Member

I've been playing around with this, and there is definitely something here, but I believe the only case where this matters (for quick, at least) is when we run on a CPU with avx2 but compile for a generic x86_64 target. E.g. on aarch64, neon is already enabled by default (so there is no runtime branch), and for simd128 on wasm the flag must be enabled at compile time; runtime detection is not available on that platform. Furthermore, compare256 does not benefit from avx512, so that is not relevant either.

I've done some experiments with adding #[target_feature(enable = "avx2")] to some functions, and it helps a lot. With -Ctarget-cpu=native we now roughly match the performance of zlib-ng, so the performance we lose comes from the dispatching and from worse codegen in certain functions when avx2 is not statically enabled.

But there is not a very ergonomic way to specialize for a range of target features right now (though there is a proposed project goal that will hopefully bring a solution closer). I did experiment with a function-pointer dispatch method, and that does help a bit (#273), but as mentioned, using avx2 in more places is advantageous. We'll need to carefully weigh performance against maintainability here.
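Function-pointer dispatch resolved on first use can be sketched like this (using `OnceLock`; an illustration only, not necessarily how #273 implements it):

```rust
use std::sync::OnceLock;

type CmpFn = fn(&[u8], &[u8]) -> usize;

fn compare_scalar(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

fn compare(a: &[u8], b: &[u8]) -> usize {
    static DISPATCH: OnceLock<CmpFn> = OnceLock::new();
    let f = DISPATCH.get_or_init(|| {
        // A real implementation would return the avx2 variant when the
        // feature is detected; this sketch always picks the scalar one.
        compare_scalar as CmpFn
    });
    // The indirect call avoids a per-call feature branch, but prevents
    // the compiler from inlining the chosen implementation into callers.
    f(a, b)
}
```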

@folkertdev
Member

After the recent changes (I guess to the state), just the inlining is an improvement now:

Benchmark 2 (64 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.7ms ±  813us    77.6ms … 82.5ms          4 ( 6%)        ⚡-  3.1% ±  0.4%
  peak_rss           26.7MB ± 93.3KB    26.6MB … 26.9MB          0 ( 0%)          +  0.1% ±  0.1%
  cpu_cycles          286M  ± 2.82M      283M  …  299M           6 ( 9%)        ⚡-  4.3% ±  0.4%
  instructions        591M  ±  268       591M  …  591M           0 ( 0%)        ⚡-  1.7% ±  0.0%
  cache_references   19.9M  ±  151K     19.7M  … 20.5M           3 ( 5%)          -  0.4% ±  0.3%
  cache_misses        405K  ± 73.2K      306K  …  650K           3 ( 5%)          -  0.7% ±  8.9%
  branch_misses      2.97M  ± 5.83K     2.96M  … 3.00M           6 ( 9%)          -  0.4% ±  0.1%

For the other levels there are some small improvements to instruction counts, but in any case no regressions.

@folkertdev merged commit be3740f into trifectatechfoundation:main on Jan 9, 2025.
@brian-pane deleted the inline-fill-window branch on April 1, 2025.