When building a vector of two 64-bit elements, clang currently emits two separate loads and packs them together, e.g. this code:
#include <stdint.h>
#include <immintrin.h>

typedef uint64_t u64x2 __attribute__((vector_size(16)));

u64x2 generic_int(uint64_t* a, uint64_t* b) {
    return (u64x2){*a, *b};
}

__m128i intrinsics(uint64_t* a, uint64_t* b) {
    __m128i lo = _mm_loadu_si64(a);
    return _mm_insert_epi64(lo, *b, 1);
}

__m128i intrinsics_int_domain(uint64_t* a, uint64_t* b) {
    __m128i lo = _mm_loadu_si64(a);
    __m128i t = _mm_insert_epi64(lo, *b, 1);
    return _mm_add_epi64(t, t);
}

which via -O3 -march=haswell compiles to:
generic_int:
vmovsd xmm0, qword ptr [rsi]
vmovsd xmm1, qword ptr [rdi]
vmovlhps xmm0, xmm1, xmm0
ret

intrinsics:
vmovsd xmm0, qword ptr [rsi]
vmovsd xmm1, qword ptr [rdi]
vmovlhps xmm0, xmm1, xmm0
ret

intrinsics_int_domain:
vmovq xmm0, qword ptr [rsi]
vmovq xmm1, qword ptr [rdi]
vpunpcklqdq xmm0, xmm1, xmm0
vpaddq xmm0, xmm0, xmm0
ret

even though the load of b could be done together with the packing: via vpinsrq for the integer domain, or vmovhps for the unspecified domain if preferring float is desired, i.e.:

vmovq xmm0, qword ptr [rdi]
vpinsrq xmm0, xmm0, qword ptr [rsi], 1

Additionally, per uops.info data, on Ice Lake and later vpinsrq has higher throughput than vmovhps, and from some local microbenchmarking on Haswell I don't see any domain-crossing penalties for either in any direction, so it could make sense to always use vpinsrq and never vmovhps (or at least on the applicable targets).
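For reference, the vmovhps form mentioned above would presumably look like this (a sketch, assuming the same register assignment as in the examples):

vmovsd xmm0, qword ptr [rdi]
vmovhps xmm0, xmm0, qword ptr [rsi]

i.e. the load of b is likewise folded into the pack, but the sequence stays in the float domain.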