-
Notifications
You must be signed in to change notification settings - Fork 14k
[Perf experiment] Make checked unsigned add not use intrinsics in debug mode #138193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
r? @nnethercote rustbot has assigned @nnethercote. Use |
|
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
[Perf experiment] Make checked unsigned add not use intrinsics in debug mode **NOT MEANT FOR MERGING YET** This PR replaces unsigned checked addition LLVM intrinsic with simpler, inline implementations *in debug mode*. # Motivation According to the [LLVM performance tips]( https://llvm.org/docs/Frontend/PerformanceTips.html), arithmetics intrinsic (including overflow checks) should be avoided if possible, since the optimizer is not as good at reasoning about them: > Avoid using arithmetic intrinsics unless you are required by your source language specification to emit a particular code sequence. The optimizer is quite good at reasoning about general control flow and arithmetic, it is not anywhere near as strong at reasoning about the various intrinsics. If profitable for code generation purposes, the optimizer will likely form the intrinsics itself late in the optimization pipeline. It is very rarely profitable to emit these directly in the language frontend. This item explicitly includes the use of the overflow intrinsics. The goal of this PR is to check if using a simple, inline implementation of overflow checks is beneficial for compile times, compared to just using the intrinsic. From my research, in debug mode, the runtime performance of manual overflow checks is comparable to the performance of intrinsics. I will perform a perf run to see if: 1. This PR improves compiletimes. 2. This PR does not degrade debug runtime performance in a noticeable way. # Implementation The inline overflow checks are performed by comparing the sum of two numbers to the result of or-ing them together. The sum `a + b` is smaller than `a | b` if and only if an overflow occurs. This is used to detect overflows.
|
☀️ Try build successful - checks-actions |
Isn't it enough to compare the sum to either of the operands? That is, |
Seem so... it definetly holds for u8, and u16, so I assume it should work for all unsigned intiegers too. I'll change that, and then do a perf run. |
|
https://godbolt.org/z/PY53ax5PY So, this should also work fine in release. |
|
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
[Perf experiment] Make checked unsigned add not use intrinsics in debug mode **NOT MEANT FOR MERGING YET** This PR replaces unsigned checked addition LLVM intrinsic with simpler, inline implementations *in debug mode*. # Motivation According to the [LLVM performance tips]( https://llvm.org/docs/Frontend/PerformanceTips.html), arithmetics intrinsic (including overflow checks) should be avoided if possible, since the optimizer is not as good at reasoning about them: > Avoid using arithmetic intrinsics unless you are required by your source language specification to emit a particular code sequence. The optimizer is quite good at reasoning about general control flow and arithmetic, it is not anywhere near as strong at reasoning about the various intrinsics. If profitable for code generation purposes, the optimizer will likely form the intrinsics itself late in the optimization pipeline. It is very rarely profitable to emit these directly in the language frontend. This item explicitly includes the use of the overflow intrinsics. The goal of this PR is to check if using a simple, inline implementation of overflow checks is beneficial for compile times, compared to just using the intrinsic. From my research, in debug mode, the runtime performance of manual overflow checks is comparable to the performance of intrinsics. I will perform a perf run to see if: 1. This PR improves compiletimes. 2. This PR does not degrade debug runtime performance in a noticeable way. # Implementation The inline overflow checks are performed by comparing the sum of two numbers to the result of or-ing them together. The sum `a + b` is smaller than `a | b` if and only if an overflow occurs. This is used to detect overflows.
|
@rust-timer build b96f962 |
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (b96f962): comparison URL. Overall result: ❌ regressions - no action neededBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never Instruction countThis is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.
Max RSS (memory usage)Results (primary -2.2%, secondary -2.6%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResults (secondary -3.0%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResults (primary 0.0%, secondary 0.0%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 765.734s -> 766.119s (0.05%) |
|
☀️ Try build successful - checks-actions |
|
Maybe the cannonical version will do better? @rust-timer build 6021669 |
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (6021669): comparison URL. Overall result: ❌ regressions - please read the text belowBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.
Max RSS (memory usage)This benchmark run did not return any relevant results for this metric. CyclesResults (primary -1.6%, secondary 0.4%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResults (primary 0.0%, secondary 0.0%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 766.8s -> 766.847s (0.01%) |
NOT MEANT FOR MERGING YET
This PR replaces unsigned checked addition LLVM intrinsic with simpler, inline implementations in debug mode.
Motivation
According to the LLVM performance tips, arithmetics intrinsic (including overflow checks) should be avoided if possible, since the optimizer is not as good at reasoning about them:
The goal of this PR is to check if using a simple, inline implementation of overflow checks is beneficial for compile times, compared to just using the intrinsic.
From my research, in debug mode, the runtime performance of manual overflow checks is comparable to the performance of intrinsics.
I will perform a perf run to see if:
Implementation
The inline overflow checks are performed by comparing the sum of two numbers to the result of or-ing them together.
The sum
a + bis smaller thana | bif and only if an overflow occurs. This is used to detect overflows.