[X86] Vector 8-bit shifts by variable amounts should use power of two multiply #165964

@WalterKruger

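Vector 8-bit left shifts by a variable amount are currently implemented as three selection steps. Each step chooses, per byte, between the value shifted by a power of two (4, then 2, then 1) and the unshifted value, based on the corresponding bit of the shift amount: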

shiftVarU8_sse41:
        movdqa  xmm2, xmm0
        movdqa  xmm3, xmm0
        psllw   xmm3, 4
        pand    xmm3, xmmword ptr [rip + .LCPI1_0]
        psllw   xmm1, 5
        movdqa  xmm0, xmm1
        pblendvb        xmm2, xmm3, xmm0
        movdqa  xmm3, xmm2
        psllw   xmm3, 2
        pand    xmm3, xmmword ptr [rip + .LCPI1_1]
        paddb   xmm1, xmm1
        movdqa  xmm0, xmm1
        pblendvb        xmm2, xmm3, xmm0
        movdqa  xmm3, xmm2
        paddb   xmm3, xmm2
        paddb   xmm1, xmm1
        movdqa  xmm0, xmm1
        pblendvb        xmm2, xmm3, xmm0
        movdqa  xmm0, xmm2
        ret
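
For reference, the per-byte behaviour of the sequence above can be modelled in scalar C. This is only a sketch of the logic, not the compiler's lowering code; the function name `shift_u8_blend` is made up for illustration, and shift amounts are assumed to be in the range 0 to 7.

```c
#include <stdint.h>

/* Scalar model of one byte lane of the blend-based sequence.
 * Each `if` stands in for one pblendvb step: bits 2, 1 and 0 of the
 * amount select a conditional shift by 4, 2 and 1 respectively. */
uint8_t shift_u8_blend(uint8_t x, uint8_t amt) {
    uint8_t r = x;
    if (amt & 4) r = (uint8_t)(r << 4);  /* psllw 4 + pand, then blend */
    if (amt & 2) r = (uint8_t)(r << 2);  /* psllw 2 + pand, then blend */
    if (amt & 1) r = (uint8_t)(r << 1);  /* paddb r, r, then blend */
    return r;
}
```

The three conditional shifts compose to a shift by `amt`, which is why three dependent blends are needed.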

If SSSE3 is available, it is cheaper to multiply by a power of two instead. The power of two can be obtained by using a shuffle (pshufb) as a lookup table (although clang tends to implement multiplies less efficiently):

shiftVarU8_ideal:
        pand    xmm1, xmmword ptr [rip + .LCPI2_0]
        movq    xmm2, qword ptr [rip + .LCPI2_1]
        pshufb  xmm2, xmm1
        movdqa  xmm1, xmm2
        pmullw  xmm1, xmm0
        pand    xmm1, xmmword ptr [rip + .LCPI2_2]
        pand    xmm2, xmmword ptr [rip + .LCPI2_3]
        psrlw   xmm0, 8
        pmullw  xmm0, xmm2
        por     xmm0, xmm1
        ret
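
A scalar C sketch of what the sequence above computes for one 16-bit lane (two packed bytes) may make the trick easier to follow. The constants are not shown in the listing, so this assumes the lookup table holds 1, 2, 4, ..., 128 and that the two later masks are 0x00FF and 0xFF00 per word; the names `POW2_LUT` and `shift_u8x2_mul` are hypothetical.

```c
#include <stdint.h>

/* Assumed contents of the pshufb lookup table: 2^0 .. 2^7. */
static const uint8_t POW2_LUT[8] = {1, 2, 4, 8, 16, 32, 64, 128};

/* Model of one 16-bit lane: x and amt each hold two packed bytes. */
uint16_t shift_u8x2_mul(uint16_t x, uint16_t amt) {
    /* pand + pshufb: per-byte table lookup of 2^(amt & 7). */
    uint8_t amt_lo = (uint8_t)(amt & 7);
    uint8_t amt_hi = (uint8_t)((amt >> 8) & 7);
    uint16_t pow = (uint16_t)(POW2_LUT[amt_lo] | (POW2_LUT[amt_hi] << 8));

    /* pmullw + pand 0x00FF: the low byte of the 16-bit product depends
     * only on the low bytes of both operands. */
    uint16_t lo = (uint16_t)((uint32_t)pow * x) & 0x00FF;

    /* pand 0xFF00, psrlw 8, pmullw: multiplying the high byte of x by
     * (2^amt_hi << 8) leaves the shifted result in the high byte. */
    uint16_t hi = (uint16_t)((uint32_t)(x >> 8) * (pow & 0xFF00));

    return (uint16_t)(lo | hi);  /* por */
}
```

Only two 16-bit multiplies are needed because SSE has no 8-bit multiply: one handles the even bytes and the other the odd bytes of each word.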

The first pand may be skipped if amounts greater than 7 are considered undefined. This method appears to be optimal until AVX512BW.
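
To illustrate what happens without the first pand, here is a small scalar model of a single pshufb byte lane. pshufb zeroes the result when bit 7 of the index is set and otherwise uses only the low four index bits. The table contents are an assumption (eight powers of two loaded by movq, which zeroes the upper eight bytes), and the helper names are hypothetical.

```c
#include <stdint.h>

/* One byte lane of pshufb: bit 7 of the index forces zero, otherwise
 * bits 3:0 of the index select a byte from the 16-byte table. */
uint8_t pshufb_lane(const uint8_t table[16], uint8_t idx) {
    return (idx & 0x80) ? 0 : table[idx & 0x0F];
}

/* Assumed table: 2^0 .. 2^7 in the low half, zeros in the high half. */
static const uint8_t POW2_TABLE[16] = {1, 2, 4, 8, 16, 32, 64, 128};

/* Lookup without the initial pand. */
uint8_t pow2_unmasked(uint8_t amt) {
    return pshufb_lane(POW2_TABLE, amt);
}

/* Lookup with the initial pand (amount reduced modulo 8). */
uint8_t pow2_masked(uint8_t amt) {
    return pshufb_lane(POW2_TABLE, (uint8_t)(amt & 7));
}
```

Under these assumptions the two variants agree for amounts 0 to 7. Without the mask, amounts 8 to 15 yield a multiplier of zero, while an amount such as 16 wraps around to the first table entry, so the results for out-of-range amounts differ from the masked version.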

https://godbolt.org/z/GabW4h6TG
