[X86] Use vpinsrq in building 2-element vector of 64-bit int loads #136519

@dzaima

Description

When building a two-element vector of 64-bit integers, clang currently emits two separate loads and then packs them together. For example, this code:

#include <stdint.h>
#include <immintrin.h>

typedef uint64_t u64x2 __attribute__((vector_size(16)));
u64x2 generic_int(uint64_t* a, uint64_t* b) {
    return (u64x2){*a, *b};
}

__m128i intrinsics(uint64_t* a, uint64_t* b) {
    __m128i lo = _mm_loadu_si64(a);
    return _mm_insert_epi64(lo, *b, 1);
}

__m128i intrinsics_int_domain(uint64_t* a, uint64_t* b) {
    __m128i lo = _mm_loadu_si64(a);
    __m128i t = _mm_insert_epi64(lo, *b, 1);
    return _mm_add_epi64(t, t);
}

compiled with -O3 -march=haswell, produces:

generic_int:
        vmovsd  xmm0, qword ptr [rsi]
        vmovsd  xmm1, qword ptr [rdi]
        vmovlhps        xmm0, xmm1, xmm0
        ret

intrinsics:
        vmovsd  xmm0, qword ptr [rsi]
        vmovsd  xmm1, qword ptr [rdi]
        vmovlhps        xmm0, xmm1, xmm0
        ret

intrinsics_int_domain:
        vmovq   xmm0, qword ptr [rsi]
        vmovq   xmm1, qword ptr [rdi]
        vpunpcklqdq     xmm0, xmm1, xmm0
        vpaddq  xmm0, xmm0, xmm0
        ret

even though the load of b could be folded into the packing: vpinsrq in the integer domain, or vmovhps if float is preferred for the unspecified-domain case, i.e.:

vmovq  xmm0, qword ptr [rdi]
vpinsrq xmm0, xmm0, qword ptr [rsi], 1

Additionally, per uops.info data, on Ice Lake and later vpinsrq has higher throughput than vmovhps, and in some local microbenchmarking on Haswell I don't see any domain-crossing penalties for either instruction in any direction, so it could make sense to always use vpinsrq and never vmovhps (or at least on the applicable targets).
