-
-
Notifications
You must be signed in to change notification settings - Fork 29
Hint to the compiler to inline the fill_window function #270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hint to the compiler to inline the fill_window function #270
Conversation
|
compare256 functions are not inlined to longest_match? |
They are in zlib-rs, although in zlib-ng they seem to be called through a function pointer (I guess to allow run-time selection of different SSE/AVX implementations from a single x86 build) |
|
What happens exactly depends on the target features that are enabled at compile time: on most x86_64 CPUs Combined with that, by default the avx2 intrinsics are not inlined, unless the surrounding function is marked as having @brian-pane I'm seeing an improvement for level 1, but a regression for compression level 2 Level 3 and onward appear fine. So this might need some further tweaking (maybe moving some cold code out of the function you now inline?) |
|
I moved what I could out of the inlined part in bb2c510. It might not be enough, though. |
|
Zlib-ng inlines compare256 by having associated versions of longest_match, then longest_match which is called more infrequently is called by a function pointer. Only direct function pointer to compare256 is called in deflate_quick. |
|
I've been playing around with this, and there is definitely something here, but I believe that the only case where this matters for I've done some experiments with adding But, there is not a very ergonomic way to specialize for a range of target features now (though there is a proposed project goal that hopefully brings a solution closer). I did experiment with using a function pointer dispatch method, and that does help a bit #273, but as mentioned using avx2 in more places is advantageous. We'll need to carefully weigh performance and maintainability here. |
bb2c510 to
cc6f005
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after the recent changes (i guess to the state), just the inlining is an improvement now
Benchmark 2 (64 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 78.7ms ± 813us 77.6ms … 82.5ms 4 ( 6%) ⚡- 3.1% ± 0.4%
peak_rss 26.7MB ± 93.3KB 26.6MB … 26.9MB 0 ( 0%) + 0.1% ± 0.1%
cpu_cycles 286M ± 2.82M 283M … 299M 6 ( 9%) ⚡- 4.3% ± 0.4%
instructions 591M ± 268 591M … 591M 0 ( 0%) ⚡- 1.7% ± 0.0%
cache_references 19.9M ± 151K 19.7M … 20.5M 3 ( 5%) - 0.4% ± 0.3%
cache_misses 405K ± 73.2K 306K … 650K 3 ( 5%) - 0.7% ± 8.9%
branch_misses 2.97M ± 5.83K 2.96M … 3.00M 6 ( 9%) - 0.4% ± 0.1%
for the other levels there are some small improvements to instructions, but in any case no regressions.

Inspired by @nmoinvaz suggestion in issue #18, I went through the functions that currently aren't inlined and tried adding a compiler hint to inline them. Some didn't help or even caused regressions when inlined, but this one seems to improve performance at compression level 1:
while not hurting performance at higher compression levels: