Use AVX-512F setzero instead of set1_epi8 #1101
From the Intel docs: _mm512_set1_epi8 and _mm512_setzero_si512 are both listed under AVX512F, but the latency and throughput of _mm512_setzero_si512 are better.
Note how it says "Sequence": when compiled targeting only AVX-512F (which is the case here), _mm512_set1_epi8 may generate a functionally equivalent but strictly inferior sequence of instructions.
A quick test: https://godbolt.org/z/de4aEdjas
See these functions in the above test:
__attribute__((target("avx512f,no-avx512bw"), optimize("O0")))
__m512i test_f_set1_epi8_O0() {
// NOTE: vpbroadcastb ymm0 (AVX2)
return _mm512_set1_epi8(0xAB);
}
__attribute__((target("avx512f,avx512bw"), optimize("O0")))
__m512i test_bw_set1_epi8_O0() {
// NOTE: vpbroadcastb zmm0 (512BW)
return _mm512_set1_epi8(0xAB);
}
The same instruction is also documented at another source: https://www.felixcloutier.com/x86/vpbroadcast
But |
Most likely, _mm512_setzero_si512() is exactly the right choice for performance (it would be better to also post some perf test results). Both intrinsics are correct in terms of semantics and the generated instruction sequences. Intrinsics are not mapped one-to-one to instructions; they are implemented with different instruction sequences under different constraints. In particular, for And _mm512_setzero_si512() always generates ``vpxor xmm0, xmm0, xmm0``
For zero-initializing a 512-bit vector, it is better to use
_mm512_setzero_si512(). This shouldn't change anything in compiled binaries, but it could avoid issues in unusual optimization scenarios.
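
As a rough illustration of why the swap is safe, the sketch below checks that both spellings yield the same all-zero 512-bit value when only AVX-512F is enabled. The function names `zero_spellings_match_avx512f` and `check_zero_spellings` are hypothetical and not part of this PR; the sketch assumes GCC or Clang on x86-64 (for the `target` attribute and `__builtin_cpu_supports`) and skips the comparison on CPUs without AVX-512F. It says nothing about performance, only about the resulting value.

```c
#include <assert.h>     /* static_assert */
#include <immintrin.h>  /* AVX-512 intrinsics */
#include <string.h>     /* memcmp */

/* Hypothetical helper (not from the PR): builds a zero vector with each
 * spelling and compares the stored bytes. Only AVX-512F is enabled here,
 * mirroring the target discussed in this conversation. */
__attribute__((target("avx512f")))
static int zero_spellings_match_avx512f(void) {
    static_assert(sizeof(__m512i) == 64, "__m512i should be 64 bytes");

    __m512i via_setzero = _mm512_setzero_si512();
    __m512i via_set1    = _mm512_set1_epi8(0);

    unsigned char a[64], b[64];
    _mm512_storeu_si512(a, via_setzero);
    _mm512_storeu_si512(b, via_set1);

    unsigned char zeros[64] = {0};
    return memcmp(a, zeros, 64) == 0 && memcmp(b, zeros, 64) == 0;
}

/* Returns 1 if both spellings produce all-zero bytes, 0 if they do not,
 * and -1 if the CPU lacks AVX-512F and the check was skipped. */
int check_zero_spellings(void) {
    __builtin_cpu_init();
    if (!__builtin_cpu_supports("avx512f")) {
        return -1;
    }
    return zero_spellings_match_avx512f() ? 1 : 0;
}
```

Calling `check_zero_spellings()` from a small `main` and printing the result is enough to try it locally. The runtime guard is there so the sketch does not fault on machines without AVX-512F.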