-
Notifications
You must be signed in to change notification settings - Fork 5.2k
JIT: Allow embedded broadcast regardless of intrinsic base type #117700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
Looks like there's some regressions related to certain bitwise instructions (smaller encoding, but larger data size; we want to prefer smaller data size for improved cache locality): - vpandq ymm0, ymm0, qword ptr [reloc @RWD144] {1to4}
+ vpand ymm0, ymm0, ymmword ptr [reloc @RWD160] The few bitwise instructions (including vpternlog) are swappable between the |
e59b5f4
to
e164839
Compare
I was fixing it while you were reviewing 😄. I think this will end up being zero diff for the SPMI collections, but in addition to allowing for removal of the special-casing for GFNI instructions, this catches a few others where broadcast size doesn't match base type. ex: static Vector128<int> vnni(Vector128<byte> v) =>
AvxVnni.MultiplyWideningAndAdd(Vector128<int>.Zero, v, Vector128<sbyte>.One); G_M42192_IG02: ;; offset=0x0000
C5F857C0 vxorps xmm0, xmm0, xmm0
C5F8100A vmovups xmm1, xmmword ptr [rdx]
- C4E27150050F000000 vpdpbusd xmm0, xmm1, xmmword ptr [reloc @RWD00]
+ 62F2751850050E000000 vpdpbusd xmm0, xmm1, dword ptr [reloc @RWD00] {1to4}
C5F81101 vmovups xmmword ptr [rcx], xmm0
488BC1 mov rax, rcx
- ;; size=24 bbWeight=1 PerfScore 12.58
+ ;; size=25 bbWeight=1 PerfScore 12.58
-G_M42192_IG03: ;; offset=0x0018
+G_M42192_IG03: ;; offset=0x0019
C3 ret
;; size=1 bbWeight=1 PerfScore 1.00
-RWD00 dq 0101010101010101h, 0101010101010101h
-; Total bytes of code: 25
+RWD00 dd 01010101h
+; Total bytes of code: 26 |
SPMI shows zero diffs but as noted above, there are a few intrinsics with base types that don't match the instruction's broadcast size that light up with this change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM and this is a positive simplification, even with zero diffs.
I did note some handling that's "missing" and could be added if we wanted to try and get some positive diffs as part of this.
Latest commit implements the suggested optimizations. Diffs show some places where code size is the same but data size is smaller. There are also patterns where we can now use embedded masking that would have been blocked. Example where shrinking the broadcast enables use of a dword-granularity masked op: static Vector128<uint> mask_andd(Vector128<ulong> v, Vector128<uint> w) => Vector128.ConditionalSelect(
Vector128.GreaterThan(w, Vector128<uint>.Zero),
(v & Vector128<uint>.One.AsUInt64()).AsUInt32(),
Vector128<uint>.Zero
); G_M4931_IG02: ;; offset=0x0000
vmovups xmm0, xmmword ptr [r8]
vxorps xmm1, xmm1, xmm1
vpcmpud k1, xmm0, xmm1, 6
vmovups xmm0, xmmword ptr [rdx]
- vpandq xmm0, xmm0, qword ptr [reloc @RWD00] {1to2}
- vpblendmd xmm0 {k1}{z}, xmm0, xmm0
+ vpand xmm0 {k1}{z}, xmm0, dword ptr [reloc @RWD00] {1to4}
vmovups xmmword ptr [rcx], xmm0
mov rax, rcx
- ;; size=43 bbWeight=1 PerfScore 15.92
+ ;; size=37 bbWeight=1 PerfScore 15.58
-G_M4931_IG03: ;; offset=0x002B
+G_M4931_IG03: ;; offset=0x0025
ret
;; size=1 bbWeight=1 PerfScore 1.00
-RWD00 dq 0000000100000001h
-; Total bytes of code: 44
+RWD00 dd 00000001h
+; Total bytes of code: 38
And where expanding the broadcast enables use of a qword-granularity masked op: static Vector128<ulong> mask_andq(Vector128<uint> v, Vector128<ulong> w) => Vector128.ConditionalSelect(
Vector128.GreaterThan(w, Vector128<ulong>.Zero),
(v & Vector128<uint>.One).AsUInt64(),
Vector128<ulong>.Zero
); G_M6891_IG02: ;; offset=0x0000
vmovups xmm0, xmmword ptr [r8]
vxorps xmm1, xmm1, xmm1
vpcmpuq k1, xmm0, xmm1, 6
vmovups xmm0, xmmword ptr [rdx]
- vpandd xmm0, xmm0, dword ptr [reloc @RWD00] {1to4}
- vpblendmq xmm0 {k1}{z}, xmm0, xmm0
+ vpandq xmm0 {k1}{z}, xmm0, qword ptr [reloc @RWD00] {1to2}
vmovups xmmword ptr [rcx], xmm0
mov rax, rcx
- ;; size=43 bbWeight=1 PerfScore 15.92
+ ;; size=37 bbWeight=1 PerfScore 15.58
-G_M6891_IG03: ;; offset=0x002B
+G_M6891_IG03: ;; offset=0x0025
ret
;; size=1 bbWeight=1 PerfScore 1.00
-RWD00 dd 00000001h
-; Total bytes of code: 44
+RWD00 dq 0000000100000001h
+; Total bytes of code: 38
|
Use the instruction table data to determine whether a given constant can be converted to embedded broadcast, rather than the intrinsic's base type.