8-bit shifts by a variable amount are implemented as three blend steps. Each step selects between the value shifted by a power of two (4, then 2, then 1 bits) and the unshifted value, based on the corresponding bit of the shift amount:
shiftVarU8_sse41:
movdqa xmm2, xmm0
movdqa xmm3, xmm0
psllw xmm3, 4
pand xmm3, xmmword ptr [rip + .LCPI1_0]
psllw xmm1, 5
movdqa xmm0, xmm1
pblendvb xmm2, xmm3, xmm0
movdqa xmm3, xmm2
psllw xmm3, 2
pand xmm3, xmmword ptr [rip + .LCPI1_1]
paddb xmm1, xmm1
movdqa xmm0, xmm1
pblendvb xmm2, xmm3, xmm0
movdqa xmm3, xmm2
paddb xmm3, xmm2
paddb xmm1, xmm1
movdqa xmm0, xmm1
pblendvb xmm2, xmm3, xmm0
movdqa xmm0, xmm2
ret

If SSSE3 is available, it is cheaper to multiply by a power of two instead. The multiplier can be obtained by using a shuffle as a lookup table (although clang tends to implement multiplies less efficiently):
shiftVarU8_ideal:
pand xmm1, xmmword ptr [rip + .LCPI2_0]
movq xmm2, qword ptr [rip + .LCPI2_1]
pshufb xmm2, xmm1
movdqa xmm1, xmm2
pmullw xmm1, xmm0
pand xmm1, xmmword ptr [rip + .LCPI2_2]
pand xmm2, xmmword ptr [rip + .LCPI2_3]
psrlw xmm0, 8
pmullw xmm0, xmm2
por xmm0, xmm1
ret

The first pand may be skipped if amounts greater than 7 are considered undefined. This method appears to be optimal on targets that lack AVX512BW.
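To make the equivalence of the two sequences easier to check, here is a minimal scalar sketch in C that models each one and compares it against a plain byte shift. It is an illustration, not the compiler's lowering. The function names (`shift_u8_select`, `shift_u8x2_mul`, `count_mismatches`, `check_shift_models`) are made up for this example. The constants are assumed from the listings, since the contents of the `.LCPI` constant pools are not shown: a table of the eight powers of two, an amount mask of 7, and 0x00FF / 0xFF00 lane masks.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed contents of the 8-byte pshufb table: 1 << i for i in 0..7. */
static const uint8_t POW2_LUT[8] = {1, 2, 4, 8, 16, 32, 64, 128};

/* Scalar model of the blend-based sequence for a single byte lane.
 * pblendvb selects on the top bit of each mask byte, so the amount is
 * shifted left by 5 to put its bit 2 there, then doubled twice to
 * expose bit 1 and bit 0 in turn. */
static uint8_t shift_u8_select(uint8_t x, uint8_t amt)
{
    uint8_t sel = (uint8_t)(amt << 5);               /* amount bit 2 -> bit 7 */
    if (sel & 0x80) x = (uint8_t)((x << 4) & 0xF0);  /* psllw 4 + pand */
    sel = (uint8_t)(sel + sel);                      /* amount bit 1 -> bit 7 */
    if (sel & 0x80) x = (uint8_t)((x << 2) & 0xFC);  /* psllw 2 + pand */
    sel = (uint8_t)(sel + sel);                      /* amount bit 0 -> bit 7 */
    if (sel & 0x80) x = (uint8_t)(x + x);            /* paddb */
    return x;
}

/* Scalar model of the multiply-based sequence for one 16-bit lane that
 * holds two independent bytes (low byte and high byte). */
static uint16_t shift_u8x2_mul(uint16_t x, uint16_t amt)
{
    /* pand + pshufb: look up one power-of-two multiplier per byte. */
    uint8_t a_lo = (uint8_t)(amt & 0x07);
    uint8_t a_hi = (uint8_t)((amt >> 8) & 0x07);
    uint16_t mul = (uint16_t)(POW2_LUT[a_lo] | (POW2_LUT[a_hi] << 8));

    /* pmullw + pand 0x00FF: the low byte of the 16-bit product depends
     * only on the low bytes of both operands. */
    uint16_t lo = (uint16_t)(((uint32_t)mul * x) & 0x00FF);

    /* psrlw 8, pand 0xFF00, pmullw: the high byte's product lands in the
     * high byte of the lane and its overflow is truncated away. */
    uint16_t hi = (uint16_t)((uint32_t)(x >> 8) * (uint32_t)(mul & 0xFF00));

    return (uint16_t)(hi | lo);
}

/* Counts (value, amount) pairs for which either model disagrees with a
 * plain byte shift. Amounts are limited to 0..7. */
int count_mismatches(void)
{
    int bad = 0;
    for (unsigned x = 0; x < 256; ++x) {
        for (unsigned n = 0; n < 8; ++n) {
            unsigned m = 7 - n;  /* a different amount for the high byte */
            uint8_t want_lo = (uint8_t)(x << n);
            uint8_t want_hi = (uint8_t)(x << m);

            if (shift_u8_select((uint8_t)x, (uint8_t)n) != want_lo) ++bad;

            uint16_t lane = (uint16_t)(x | (x << 8));
            uint16_t amts = (uint16_t)(n | (m << 8));
            uint16_t want = (uint16_t)(want_lo | (want_hi << 8));
            if (shift_u8x2_mul(lane, amts) != want) ++bad;
        }
    }
    return bad;
}

/* Convenience check: aborts if either model disagrees with the reference. */
void check_shift_models(void)
{
    assert(count_mismatches() == 0);
}
```

Calling `check_shift_models()` from a test harness exercises every byte value with every amount from 0 to 7. Amounts above 7 are deliberately left out, since the issue treats them as potentially undefined.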