[X86][CodeGen] - Use shift operators for const value shifts, instead of built-ins for SSE emulation of MMX intrinsics. #129197
Conversation
Commits:
Enable support for fcanonicalize intrinsic lowering to generic dag combiner
Use shift operators for const value shifts, instead of built-ins for SSE emulation of MMX intrinsics: when performing constant value shifts, the generated code using SSE emulation via intrinsics is less efficient than using standard left/right shift operators. Allow for better performance by using operators instead of built-ins.
@e-kud @phoebewang @mahesh-attarde @arsenm could you please review.
  if (__builtin_constant_p(__count))
    return (__m64)((__count > 63) ? 0 : ((long long)__m << __count));
  return __trunc64(__builtin_ia32_psllqi128((__v2di)__anyext128(__m), __count));
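For context, a minimal call site of the kind this branch targets; the function and variable names below are hypothetical, and the exact code generation depends on compiler version and target:

#include <mmintrin.h>

/* Hypothetical example: the shift count is a compile-time constant, so
   __builtin_constant_p(__count) is true once the header code is inlined
   and the plain C shift path is taken. */
__m64 shift_left_by_3(__m64 v) {
  return _mm_slli_si64(v, 3);
}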
Why not just do this unconditionally? The fold can always be done in the backend
The non-constant shifts become worse
https://godbolt.org/z/9xv5hxjv7
I think accommodating both conditions at the DAG level might not be as trivial.
-1 on the whole concept of doing this in the header
@arsenm do we have options? Solving it in CodeGen is too late, since the middle end could have already optimized the shifts.
static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2 _mm_slli_si64(__m64 __m,
                                                              int __count) {
  if (__builtin_constant_p(__count))
    return (__m64)((__count > 63) ? 0 : ((long long)__m << __count));
Should limit it to 64-bit only?
Do you mean making a similar change for _mm_slli_si32? If so, we don't have an example at hand where that is needed. I will add it as needed/requested.
No, I mean long long is less efficient on 32-bit: https://godbolt.org/z/4KeY4hrhq
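One possible way to address this, assuming the concern is 32-bit x86 targets, would be to restrict the plain-shift path to targets with 64-bit GPRs. This is only a hypothetical sketch, not what the patch does; __trunc64, __anyext128 and __DEFAULT_FN_ATTRS_SSE2 are the existing helpers from the header:

static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2 _mm_slli_si64(__m64 __m,
                                                              int __count) {
#ifdef __x86_64__
  /* 64-bit targets: a constant shift count can map to a single scalar shift. */
  if (__builtin_constant_p(__count))
    return (__m64)((__count > 63) ? 0 : ((long long)__m << __count));
#endif
  /* 32-bit targets, and non-constant counts, keep the SSE emulation. */
  return __trunc64(__builtin_ia32_psllqi128((__v2di)__anyext128(__m), __count));
}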
                                                              int __count) {
  if (__builtin_constant_p(__count))
    return (__m64)((__count > 63) ? 0 : ((long long)__m >> __count));
  return __trunc64(__builtin_ia32_psrlqi128((__v2di)__anyext128(__m), __count));
I'd like to point out that we change the behavior for negative immediates. Before this change we returned zero; now we return the result as is, because a shift by a negative count is UB. The intrinsic description doesn't specify what should happen for a negative __count.
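One hypothetical way to preserve the previous zeroing behavior for negative (and otherwise out-of-range) counts would be to do the range check on the count as unsigned, shown here for the left-shift variant. This is only a sketch, not part of the patch:

  if (__builtin_constant_p(__count))
    /* A negative __count converts to a large unsigned value, so it takes the
       zero branch instead of reaching a shift by a negative count, which
       would be undefined behavior. */
    return (__m64)(((unsigned int)__count > 63) ? 0
                                                : ((long long)__m << __count));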
ping @arsenm
This patch improves the performance of MMX intrinsics that perform constant-count shifts on Intel platforms by replacing the built-in intrinsics with standard C/C++ shift operators.

When performing constant value shifts, the code generated through SSE emulation via intrinsics is less performant than standard left/right shift operators: when the pipeline is fully filled, the latency in the execution stage of subsequent iterations is higher with the intrinsic-based emulation, and smaller with the plain shift operators. This patch addresses that by preferring the operators in such cases. For better clarity, refer to the following pipeline trace tables.
Pros of using C++ operators instead of SSE intrinsics:
Intrinsic builtins case: https://uica.uops.info/tmp/989f067a9cc44cd99898bae7c214c436_trace.html
Using operators: https://uica.uops.info/tmp/70d4593610b84b8fba5a670949d596fb_trace.html
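As a further illustration, a hypothetical hot loop of the kind where the back-to-back execution-stage latency discussed above becomes visible (the names are made up; the loop is not taken from the patch or the traces):

#include <mmintrin.h>

/* Hypothetical micro-kernel: a constant-count shift applied in a tight loop,
   the scenario where the latency of subsequent iterations with a filled
   pipeline matters. */
void shift_all(__m64 *data, int n) {
  for (int i = 0; i < n; ++i)
    data[i] = _mm_slli_si64(data[i], 5);
}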