
Conversation

@pawan-nirpal-031 (Contributor) commented Feb 28, 2025

This patch improves the performance of constant-value shifts in MMX intrinsics on Intel platforms by replacing the built-in intrinsics with standard C/C++ shift operators.

When performing constant-value shifts, the code generated through SSE emulation of the intrinsics is slower than code using the standard left/right shift operators. Specifically, the execution-stage latency increases for subsequent iterations once the pipeline is fully filled. This patch addresses that by preferring the operators in such cases.

  • When using the left shift operator:
        movq    %r12, %r15
        shlq    $6, %r15
  • When using the built-in:
        vpsllq  $6, %xmm0, %xmm1
        vmovq   %xmm1, %r15

At runtime, the execution-stage latency of subsequent iterations is higher when emulating via intrinsics once the pipeline is fully filled; the delay is smaller with the normal left/right shift operators. For better clarity, refer to the following pipeline trace tables.

Pros of using the C++ operator instead of the SSE intrinsic:

  • For a single iteration, latency is reduced because the predecode phase is skipped.
  • Subsequent iterations start earlier because vpsllq is broken into simpler instructions.

Intrinsic built-ins case: https://uica.uops.info/tmp/989f067a9cc44cd99898bae7c214c436_trace.html
Using operators: https://uica.uops.info/tmp/70d4593610b84b8fba5a670949d596fb_trace.html
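
For context, a minimal reproducer in the spirit of the traces above might look like the following; the function name and loop are illustrative and not taken from the patch:

    #include <mmintrin.h> /* _mm_slli_si64, emulated via SSE2 on current x86-64 */

    /* Hypothetical kernel: shift by a compile-time constant inside a loop.
       With this patch the constant-count case lowers to movq + shlq on a
       scalar register instead of vpsllq + vmovq through an XMM register. */
    __m64 shift_by_six(__m64 v, int n) {
      for (int i = 0; i < n; ++i)
        v = _mm_slli_si64(v, 6); /* constant count: takes the operator path */
      return v;
    }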

@pawan-nirpal-031 (Contributor, Author):

@e-kud @phoebewang @mahesh-attarde @arsenm could you please review?

Comment on lines +885 to +887
  if (__builtin_constant_p(__count))
    return (__m64)((__count > 63) ? 0 : ((long long)__m << __count));
  return __trunc64(__builtin_ia32_psllqi128((__v2di)__anyext128(__m), __count));
Contributor:

Why not just do this unconditionally? The fold can always be done in the backend

Contributor Author:

The non-constant shifts become worse

https://godbolt.org/z/9xv5hxjv7

I think accommodating both conditions at the DAG level might not be trivial.
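
For intuition, with a non-constant count the operator form has to spell out the range check that psllq performs in hardware, roughly like this (an illustrative expansion, not code from the patch; the variable names are made up):

    /* Non-constant count: the >63 clamp that psllq does implicitly must be
       emitted explicitly, costing an extra compare plus a cmov or branch. */
    long long shifted = (count > 63) ? 0 : ((long long)m << count);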

Contributor:

-1 on the whole concept of doing this in the header

Contributor:

@arsenm do we have options? Solving it in CodeGen is too late, because the middle end could already have optimized the shifts.

static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2 _mm_slli_si64(__m64 __m,
                                                              int __count) {
  if (__builtin_constant_p(__count))
    return (__m64)((__count > 63) ? 0 : ((long long)__m << __count));
Contributor:

Should limit it to 64-bit only?

Contributor Author:

You mean also make a similar change for _mm_slli_si32? If yes, then we may not have an example at hand where that is needed. I will add it as needed/requested.

Contributor:

No, I mean long long is less efficient on 32-bit: https://godbolt.org/z/4KeY4hrhq
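
If the suggestion is to restrict the operator path to 64-bit targets, one possible shape is the sketch below; the __x86_64__ guard is an assumption about the intent here, not code from the patch:

    static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2 _mm_slli_si64(__m64 __m,
                                                                  int __count) {
    #ifdef __x86_64__
      /* 64-bit targets: a constant count can use a plain scalar shift. */
      if (__builtin_constant_p(__count))
        return (__m64)((__count > 63) ? 0 : ((long long)__m << __count));
    #endif
      /* 32-bit targets and non-constant counts keep the SSE2 emulation. */
      return __trunc64(__builtin_ia32_psllqi128((__v2di)__anyext128(__m), __count));
    }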

@pawan-nirpal-031 changed the title from "[X86][CodeGen] - Use shift operators instead of built-ins for SSE emulation of MMX intrinsics." to "[X86][CodeGen] - Use shift operators for const value shifts, instead of built-ins for SSE emulation of MMX intrinsics." Feb 28, 2025
int __count) {
  if (__builtin_constant_p(__count))
    return (__m64)((__count > 63) ? 0 : ((long long)__m >> __count));
  return __trunc64(__builtin_ia32_psrlqi128((__v2di)__anyext128(__m), __count));
@e-kud (Contributor) commented Feb 28, 2025:

I'd like to note that we change the behavior for negative immediates. Before this change we returned zero; now the value is shifted directly, and a shift by a negative count is UB. The intrinsic description doesn't specify what should happen for a negative __count.
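
One way to keep negative (and any out-of-range) counts returning zero, offered as a sketch rather than what the patch currently does, is to compare the count as unsigned:

      if (__builtin_constant_p(__count))
        /* The unsigned compare folds negative counts into the >63 case, so
           they yield 0 instead of reaching a UB shift; casting __m to unsigned
           also keeps the right shift logical, matching psrlq. */
        return (__m64)(((unsigned int)__count > 63)
                           ? 0
                           : ((unsigned long long)__m >> __count));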

@pawan-nirpal-031 requested a review from arsenm on March 17, 2025 13:50
@pawan-nirpal-031 (Contributor, Author):

ping @arsenm
