
Conversation

@pawan-nirpal-031 (Contributor) commented Feb 28, 2025

This patch improves the performance of constant-value shifts in MMX intrinsics on Intel platforms by replacing the built-in intrinsics with standard C/C++ shift operators.

When performing constant-value shifts, the code generated through SSE emulation of the intrinsics is slower than code using the standard left/right shift operators. Specifically, the execution-stage latency increases for subsequent iterations once the pipeline is fully filled. This patch addresses that by preferring the operators in such cases.

  • When using the left shift operator:
        movq    %r12, %r15
        shlq    $6, %r15
  • When using the built-in:
        vpsllq  $6, %xmm0, %xmm1
        vmovq   %xmm1, %r15

At runtime, the execution-stage latency of subsequent iterations is higher when emulating via intrinsics once the pipeline is fully filled; the delay is smaller with the normal left/right shift operators. For better clarity, refer to the following pipeline trace tables.

Pros of using the C++ operator instead of the SSE intrinsic:

  • For a single iteration, latency is reduced because the predecode phase is skipped.
  • Subsequent iterations start earlier because vpsllq is broken into simpler instructions.

Intrinsic built-ins case: https://uica.uops.info/tmp/989f067a9cc44cd99898bae7c214c436_trace.html
Using operators: https://uica.uops.info/tmp/70d4593610b84b8fba5a670949d596fb_trace.html
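
For context, a minimal reproducer in the spirit of the traces above might look like the following; the function name and loop are illustrative and not taken from the patch:

    #include <mmintrin.h> /* _mm_slli_si64, emulated via SSE2 on current x86-64 */

    /* Hypothetical kernel: shift by a compile-time constant inside a loop.
       With this patch the constant-count case lowers to movq + shlq on a
       scalar register instead of vpsllq + vmovq through an XMM register. */
    __m64 shift_by_six(__m64 v, int n) {
      for (int i = 0; i < n; ++i)
        v = _mm_slli_si64(v, 6); /* constant count: takes the operator path */
      return v;
    }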

@pawan-nirpal-031 (Contributor, Author):

@e-kud @phoebewang @mahesh-attarde @arsenm could you please review?

Comment on lines +885 to +887
  if (__builtin_constant_p(__count))
    return (__m64)((__count > 63) ? 0 : ((long long)__m << __count));
  return __trunc64(__builtin_ia32_psllqi128((__v2di)__anyext128(__m), __count));
Contributor:

Why not just do this unconditionally? The fold can always be done in the backend

Contributor Author:

The non-constant shifts become worse

https://godbolt.org/z/9xv5hxjv7

I think accommodating both conditions at the DAG level might not be trivial.
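
For intuition, with a non-constant count the operator form has to spell out the range check that psllq performs in hardware, roughly like this (an illustrative expansion, not code from the patch; the variable names are made up):

    /* Non-constant count: the >63 clamp that psllq does implicitly must be
       emitted explicitly, costing an extra compare plus a cmov or branch. */
    long long shifted = (count > 63) ? 0 : ((long long)m << count);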

Contributor:

-1 on the whole concept of doing this in the header

Contributor:

@arsenm do we have options? Solving it in CodeGen is too late, because the middle end could already have optimized the shifts.

static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2 _mm_slli_si64(__m64 __m,
                                                              int __count) {
  if (__builtin_constant_p(__count))
    return (__m64)((__count > 63) ? 0 : ((long long)__m << __count));
Contributor:

Should limit it to 64-bit only?

Contributor Author:

You mean also make a similar change for _mm_slli_si32? If yes, then we may not have an example at hand where that is needed. I will add it as needed/requested.

Contributor:

No, I mean long long is less efficient on 32-bit: https://godbolt.org/z/4KeY4hrhq
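
If the suggestion is to restrict the operator path to 64-bit targets, one possible shape is the sketch below; the __x86_64__ guard is an assumption about the intent here, not code from the patch:

    static __inline__ __m64 __DEFAULT_FN_ATTRS_SSE2 _mm_slli_si64(__m64 __m,
                                                                  int __count) {
    #ifdef __x86_64__
      /* 64-bit targets: a constant count can use a plain scalar shift. */
      if (__builtin_constant_p(__count))
        return (__m64)((__count > 63) ? 0 : ((long long)__m << __count));
    #endif
      /* 32-bit targets and non-constant counts keep the SSE2 emulation. */
      return __trunc64(__builtin_ia32_psllqi128((__v2di)__anyext128(__m), __count));
    }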

@pawan-nirpal-031 changed the title from "[X86][CodeGen] - Use shift operators instead of built-ins for SSE emulation of MMX intrinsics." to "[X86][CodeGen] - Use shift operators for const value shifts, instead of built-ins for SSE emulation of MMX intrinsics." Feb 28, 2025
int __count) {
  if (__builtin_constant_p(__count))
    return (__m64)((__count > 63) ? 0 : ((long long)__m >> __count));
  return __trunc64(__builtin_ia32_psrlqi128((__v2di)__anyext128(__m), __count));
@e-kud (Contributor) commented Feb 28, 2025:

I'd like to note that we change the behavior for negative immediates. Before this change we returned zero; now the value is shifted directly, and a shift by a negative count is UB. The intrinsic description doesn't specify what should happen for a negative __count.
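
One way to keep negative (and any out-of-range) counts returning zero, offered as a sketch rather than what the patch currently does, is to compare the count as unsigned:

      if (__builtin_constant_p(__count))
        /* The unsigned compare folds negative counts into the >63 case, so
           they yield 0 instead of reaching a UB shift; casting __m to unsigned
           also keeps the right shift logical, matching psrlq. */
        return (__m64)(((unsigned int)__count > 63)
                           ? 0
                           : ((unsigned long long)__m >> __count));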

@pawan-nirpal-031 requested a review from arsenm on March 17, 2025 13:50
@pawan-nirpal-031 (Contributor, Author):

ping @arsenm
