- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 29
Hint to the compiler to inline the fill_window function #270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hint to the compiler to inline the fill_window function #270
Conversation
| compare256 functions are not inlined to longest_match? | 
| 
 They are in zlib-rs, although in zlib-ng they seem to be called through a function pointer (I guess to allow run-time selection of different SSE/AVX implementations from a single x86 build) | 
| What happens exactly depends on the target features that are enabled at compile time: on most x86_64 CPUs Combined with that, by default the avx2 intrinsics are not inlined, unless the surrounding function is marked as having  @brian-pane I'm seeing an improvement for level 1, but a regression for compression level 2 Level 3 and onward appear fine. So this might need some further tweaking (maybe moving some cold code out of the function you now inline?) | 
| I moved what I could out of the inlined part in bb2c510. It might not be enough, though. | 
| Zlib-ng inlines compare256 by having associated versions of longest_match, then longest_match which is called more infrequently is called by a function pointer. Only direct function pointer to compare256 is called in deflate_quick. | 
| I've been playing around with this, and there is definitely something here, but I believe that the only case where this matters for  I've done some experiments with adding  But, there is not a very ergonomic way to specialize for a range of target features now (though there is a proposed project goal that hopefully brings a solution closer). I did experiment with using a function pointer dispatch method, and that does help a bit #273, but as mentioned using avx2 in more places is advantageous. We'll need to carefully weigh performance and maintainability here. | 
bb2c510    to
    cc6f005      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after the recent changes (i guess to the state), just the inlining is an improvement now
Benchmark 2 (64 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.7ms ±  813us    77.6ms … 82.5ms          4 ( 6%)        ⚡-  3.1% ±  0.4%
  peak_rss           26.7MB ± 93.3KB    26.6MB … 26.9MB          0 ( 0%)          +  0.1% ±  0.1%
  cpu_cycles          286M  ± 2.82M      283M  …  299M           6 ( 9%)        ⚡-  4.3% ±  0.4%
  instructions        591M  ±  268       591M  …  591M           0 ( 0%)        ⚡-  1.7% ±  0.0%
  cache_references   19.9M  ±  151K     19.7M  … 20.5M           3 ( 5%)          -  0.4% ±  0.3%
  cache_misses        405K  ± 73.2K      306K  …  650K           3 ( 5%)          -  0.7% ±  8.9%
  branch_misses      2.97M  ± 5.83K     2.96M  … 3.00M           6 ( 9%)          -  0.4% ±  0.1%
for the other levels there are some small improvements to instructions, but in any case no regressions.

Inspired by @nmoinvaz suggestion in issue #18, I went through the functions that currently aren't inlined and tried adding a compiler hint to inline them. Some didn't help or even caused regressions when inlined, but this one seems to improve performance at compression level 1:
while not hurting performance at higher compression levels: