[X86] Use vpinsrq in building 2-element vector of 64-bit int loads #136519

@dzaima

Description

When building a two-element vector of 64-bit integers, clang currently emits two separate loads and then packs them together. For example, this code:

#include <stdint.h>
#include <immintrin.h>

typedef uint64_t u64x2 __attribute__((vector_size(16)));
u64x2 generic_int(uint64_t* a, uint64_t* b) {
    return (u64x2){*a, *b};
}

__m128i intrinsics(uint64_t* a, uint64_t* b) {
    __m128i lo = _mm_loadu_si64(a);
    return _mm_insert_epi64(lo, *b, 1);
}

__m128i intrinsics_int_domain(uint64_t* a, uint64_t* b) {
    __m128i lo = _mm_loadu_si64(a);
    __m128i t = _mm_insert_epi64(lo, *b, 1);
    return _mm_add_epi64(t, t);
}

compiled with -O3 -march=haswell, produces:

generic_int:
        vmovsd  xmm0, qword ptr [rsi]
        vmovsd  xmm1, qword ptr [rdi]
        vmovlhps        xmm0, xmm1, xmm0
        ret

intrinsics:
        vmovsd  xmm0, qword ptr [rsi]
        vmovsd  xmm1, qword ptr [rdi]
        vmovlhps        xmm0, xmm1, xmm0
        ret

intrinsics_int_domain:
        vmovq   xmm0, qword ptr [rsi]
        vmovq   xmm1, qword ptr [rdi]
        vpunpcklqdq     xmm0, xmm1, xmm0
        vpaddq  xmm0, xmm0, xmm0
        ret

even though the load of b could be folded into the packing: vpinsrq in the integer domain, or vmovhps if float is preferred for the unspecified-domain case, i.e.:

vmovq  xmm0, qword ptr [rdi]
vpinsrq xmm0, xmm0, qword ptr [rsi], 1

Additionally, per uops.info data, on Ice Lake and later vpinsrq has higher throughput than vmovhps, and in some local microbenchmarking on Haswell I don't see any domain-crossing penalties for either instruction in any direction, so it could make sense to always use vpinsrq and never vmovhps (or at least on the applicable targets).
