
Conversation

saucecontrol
Member

Use the instruction table data, rather than the intrinsic's base type, to determine whether a given constant can be converted to an embedded broadcast.
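
For context, here is a minimal standalone sketch of the check being described (not the JIT implementation itself; the helper names and the fixed 4/8-byte granularity list are illustrative assumptions):

using System;

static class BroadcastCheck
{
    // Returns true if the constant repeats with a period of elementSize bytes,
    // i.e. it is a splat of a single elementSize-byte value and could in principle
    // be encoded as an EVEX embedded broadcast of that element size.
    static bool IsSplatOf(ReadOnlySpan<byte> constant, int elementSize)
    {
        for (int i = elementSize; i < constant.Length; i++)
        {
            if (constant[i] != constant[i - elementSize])
                return false;
        }
        return true;
    }

    // The idea, roughly: test the constant against the broadcast element size the
    // instruction itself supports (taken from the instruction table), rather than
    // assuming the intrinsic's base type. Embedded broadcast uses 4- or 8-byte elements.
    public static bool CanEmbedBroadcast(ReadOnlySpan<byte> constant, int instructionBroadcastSize)
        => (instructionBroadcastSize == 4 || instructionBroadcastSize == 8)
           && IsSplatOf(constant, instructionBroadcastSize);
}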

github-actions bot added the area-CodeGen-coreclr label Jul 16, 2025
dotnet-policy-service bot added the community-contribution label Jul 16, 2025
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding
Member

Looks like there are some regressions related to certain bitwise instructions (smaller encoding, but larger data size; we want to prefer the smaller data size for improved cache locality):

-       vpandq   ymm0, ymm0, qword ptr [reloc @RWD144] {1to4}
+       vpand    ymm0, ymm0, ymmword ptr [reloc @RWD160]

The few bitwise instructions (including vpternlog) are swappable between the 32-bit and 64-bit base types. Either is fine, since we can also switch one to the other when an embedded masking scenario calls for it.
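
As a quick illustration of why the swap is safe (not code from the PR; it assumes the standard System.Runtime.Intrinsics Vector128 APIs available in .NET 7+): AND operates per bit, so reinterpreting the lanes between uint and ulong cannot change the result.

using System;
using System.Runtime.Intrinsics;

Vector128<ulong> v    = Vector128.Create(0x0123456789ABCDEFul, 0xFEDCBA9876543210ul);
Vector128<ulong> mask = Vector128.Create(0x00000001_00000001ul); // same bits as a dword splat of 1

// 64-bit lanes (vpandq-style) vs. 32-bit lanes (vpandd-style): bitwise AND gives identical bits.
Vector128<ulong> viaQword = v & mask;
Vector128<ulong> viaDword = (v.AsUInt32() & mask.AsUInt32()).AsUInt64();

Console.WriteLine(viaQword == viaDword); // True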

@saucecontrol
Member Author

saucecontrol commented Jul 16, 2025

Looks like there are some regressions related to certain bitwise instructions (smaller encoding, but larger data size;

I was fixing it while you were reviewing 😄.

I think this will end up being zero diff for the SPMI collections, but in addition to allowing removal of the special-casing for GFNI instructions, it catches a few other cases where the broadcast size doesn't match the base type, e.g.:

static Vector128<int> vnni(Vector128<byte> v) =>
    AvxVnni.MultiplyWideningAndAdd(Vector128<int>.Zero, v, Vector128<sbyte>.One);
 G_M42192_IG02:  ;; offset=0x0000
        C5F857C0             vxorps   xmm0, xmm0, xmm0
        C5F8100A             vmovups  xmm1, xmmword ptr [rdx]
-       C4E27150050F000000   vpdpbusd xmm0, xmm1, xmmword ptr [reloc @RWD00]
+       62F2751850050E000000 vpdpbusd xmm0, xmm1, dword ptr [reloc @RWD00] {1to4}
        C5F81101             vmovups  xmmword ptr [rcx], xmm0
        488BC1               mov      rax, rcx
-						;; size=24 bbWeight=1 PerfScore 12.58
+						;; size=25 bbWeight=1 PerfScore 12.58
 
-G_M42192_IG03:  ;; offset=0x0018
+G_M42192_IG03:  ;; offset=0x0019
        C3                   ret      
 						;; size=1 bbWeight=1 PerfScore 1.00
-RWD00  	dq	0101010101010101h, 0101010101010101h
-; Total bytes of code: 25
+RWD00  	dd	01010101h
+; Total bytes of code: 26

@saucecontrol saucecontrol marked this pull request as ready for review July 16, 2025 19:11
@saucecontrol
Member Author

SPMI shows zero diffs, but as noted above, a few intrinsics whose base types don't match the instruction's broadcast size light up with this change.

@tannergooding left a comment
Member

Changes LGTM and this is a positive simplification, even with zero diffs.

I did note some handling that's "missing" and could be added if we wanted to try and get some positive diffs as part of this.

@tannergooding tannergooding requested a review from EgorBo July 17, 2025 05:43
@saucecontrol
Member Author

saucecontrol commented Jul 22, 2025

The latest commit implements the suggested optimizations. The diffs show some places where code size is the same but data size is smaller, as well as patterns where we can now use embedded masking that would previously have been blocked.

Example where shrinking the broadcast enables use of a dword-granularity masked op:

static Vector128<uint> mask_andd(Vector128<ulong> v, Vector128<uint> w) => Vector128.ConditionalSelect(
    Vector128.GreaterThan(w, Vector128<uint>.Zero),
    (v & Vector128<uint>.One.AsUInt64()).AsUInt32(),
    Vector128<uint>.Zero
);
 G_M4931_IG02:  ;; offset=0x0000
        vmovups  xmm0, xmmword ptr [r8]
        vxorps   xmm1, xmm1, xmm1
        vpcmpud  k1, xmm0, xmm1, 6
        vmovups  xmm0, xmmword ptr [rdx]
-       vpandq   xmm0, xmm0, qword ptr [reloc @RWD00] {1to2}
-       vpblendmd xmm0 {k1}{z}, xmm0, xmm0
+       vpand    xmm0 {k1}{z}, xmm0, dword ptr [reloc @RWD00] {1to4}
        vmovups  xmmword ptr [rcx], xmm0
        mov      rax, rcx
-						;; size=43 bbWeight=1 PerfScore 15.92
+						;; size=37 bbWeight=1 PerfScore 15.58
 
-G_M4931_IG03:  ;; offset=0x002B
+G_M4931_IG03:  ;; offset=0x0025
        ret      
 						;; size=1 bbWeight=1 PerfScore 1.00
-RWD00  	dq	0000000100000001h
-; Total bytes of code: 44
+RWD00  	dd	00000001h
+; Total bytes of code: 38
 

And where expanding the broadcast enables use of a qword-granularity masked op:

static Vector128<ulong> mask_andq(Vector128<uint> v, Vector128<ulong> w) => Vector128.ConditionalSelect(
    Vector128.GreaterThan(w, Vector128<ulong>.Zero),
    (v & Vector128<uint>.One).AsUInt64(),
    Vector128<ulong>.Zero
);
 G_M6891_IG02:  ;; offset=0x0000
        vmovups  xmm0, xmmword ptr [r8]
        vxorps   xmm1, xmm1, xmm1
        vpcmpuq  k1, xmm0, xmm1, 6
        vmovups  xmm0, xmmword ptr [rdx]
-       vpandd   xmm0, xmm0, dword ptr [reloc @RWD00] {1to4}
-       vpblendmq xmm0 {k1}{z}, xmm0, xmm0
+       vpandq   xmm0 {k1}{z}, xmm0, qword ptr [reloc @RWD00] {1to2}
        vmovups  xmmword ptr [rcx], xmm0
        mov      rax, rcx
-						;; size=43 bbWeight=1 PerfScore 15.92
+						;; size=37 bbWeight=1 PerfScore 15.58
 
-G_M6891_IG03:  ;; offset=0x002B
+G_M6891_IG03:  ;; offset=0x0025
        ret      
 						;; size=1 bbWeight=1 PerfScore 1.00
-RWD00  	dd	00000001h
-; Total bytes of code: 44
+RWD00  	dq	0000000100000001h
+; Total bytes of code: 38
 
