use AVX-512F setzero instead of set1_epi8 #1101

Open
NexusXe wants to merge 1 commit into Cyan4973:dev from NexusXe:dev

NexusXe commented Mar 17, 2026

For zero-initializing a 512-bit vector, it is better to use _mm512_setzero_si512(). This shouldn't change anything in compiled binaries, but could avoid issues under unusual optimization scenarios.
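
To illustrate the claim that the two spellings are interchangeable in meaning, here is a minimal sketch (not code from this PR) that stores the result of each intrinsic and checks that both are 64 zero bytes. The helper names `store_zero_vectors` and `zero_vectors_match` are hypothetical, and the `target` attribute and `__builtin_cpu_supports` assume GCC or Clang on x86.

```c
#include <immintrin.h>
#include <string.h>

/* Hypothetical helper (not from the PR): writes both zero vectors to memory.
   Compiled for AVX-512F only, matching the build discussed in this thread. */
__attribute__((target("avx512f")))
static void store_zero_vectors(unsigned char via_setzero[64],
                               unsigned char via_set1[64])
{
    _mm512_storeu_si512(via_setzero, _mm512_setzero_si512());
    _mm512_storeu_si512(via_set1, _mm512_set1_epi8(0));
}

/* Returns 1 if both intrinsics produced 64 zero bytes, 0 if either did not,
   and -1 if the CPU lacks AVX-512F, in which case no vector code is run. */
int zero_vectors_match(void)
{
    unsigned char a[64], b[64], expected[64] = {0};

    __builtin_cpu_init();
    if (!__builtin_cpu_supports("avx512f"))
        return -1;

    /* Pre-fill with a non-zero pattern so a missing store would be detected. */
    memset(a, 0xFF, sizeof a);
    memset(b, 0xFF, sizeof b);
    store_zero_vectors(a, b);

    return memcmp(a, expected, sizeof expected) == 0 &&
           memcmp(b, expected, sizeof expected) == 0;
}
```

This only checks the values, not which instructions the compiler chose; that part depends on the compiler, flags, and optimization level.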

gzm55 (Contributor) commented Mar 18, 2026

From the Intel docs: _mm512_set1_epi8 and _mm512_setzero_si512 are both AVX512F.

But the latency and throughput of _mm512_setzero_si512 are better.

NexusXe changed the title from "use AVX-512F setzero instead of AVX-512BW set1_epi8" to "use AVX-512F setzero instead of set1_epi8" on Mar 18, 2026

NexusXe (Author) commented Mar 18, 2026

> From the Intel docs: _mm512_set1_epi8 and _mm512_setzero_si512 are both AVX512F.
>
> But the latency and throughput of _mm512_setzero_si512 are better.

Note how it says "Sequence" - if only compiled targeting AVX-512F (which it is in this case) it may generate a functionally equivalent but strictly inferior sequence of instructions. vpbroadcastb (which _mm512_set1_epi8 prefers to compile to) is only available with AVX-512BW.

gzm55 (Contributor) commented Mar 18, 2026

> > From the Intel docs: _mm512_set1_epi8 and _mm512_setzero_si512 are both AVX512F.
> > But the latency and throughput of _mm512_setzero_si512 are better.
>
> Note how it says "Sequence" - if only compiled targeting AVX-512F (which it is in this case) it may generate a functionally equivalent but strictly inferior sequence of instructions. vpbroadcastb (which _mm512_set1_epi8 prefers to compile to) is only available with AVX-512BW.

A quick test: https://godbolt.org/z/de4aEdjas

As a function, _mm512_set1_epi8 should generate AVX-512F-only instructions when AVX-512BW is not available.

See these functions from the test above:

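#include <immintrin.h>

// optimize("O0") is a GCC-specific function attribute.
__attribute__((target("avx512f,no-avx512bw"), optimize("O0")))
__m512i test_f_set1_epi8_O0(void) {
    // NOTE: AVX-512F without AVX-512BW emits vpbroadcastb ymm0  (AVX2)
    return _mm512_set1_epi8((char)0xAB);
}
__attribute__((target("avx512f,avx512bw"), optimize("O0")))
__m512i test_bw_set1_epi8_O0(void) {
    // NOTE: with AVX-512BW enabled it emits vpbroadcastb zmm0  (512BW)
    return _mm512_set1_epi8((char)0xAB);
}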

vpbroadcastb ymm should be part of AVX2.

The previous document only describes the masked broadcast version of vpbroadcastb.

In another document, https://www.felixcloutier.com/x86/vpbroadcast, we can find that VPBROADCASTB ymm without a mask is from AVX2.

NexusXe (Author) commented Mar 18, 2026

> vpbroadcastb ymm should be part of AVX2.
>
> The previous document only describes the masked broadcast version of vpbroadcastb.
>
> In another document, https://www.felixcloutier.com/x86/vpbroadcast, we can find that VPBROADCASTB ymm without a mask is from AVX2.

But vpbroadcastb zmm is only available with AVX-512BW, which _mm512_set1_epi8 is equivalent to. Either way, using the setzero intrinsic is the correct way to do it.

gzm55 (Contributor) commented Mar 19, 2026

> But vpbroadcastb zmm is only available with AVX-512BW, which _mm512_set1_epi8 is equivalent to. Either way, using the setzero intrinsic is the correct way to do it.

Most likely, I think _mm512_setzero is exactly the right choice for performance. (It would be better to also post some perf test results.)

Both intrinsic functions are correct in terms of semantics and the generated instruction sequences. Intrinsic functions are not mapped 1-to-1 to instructions; they are implemented as different instruction sequences under different constraints.

In particular, _mm512_set1_epi8(0) could be lowered to vpbroadcastb ymm (AVX2) --> vinserti64x4 zmm (512F), or to vpxor xmm0, xmm0, xmm0 (AVX), depending on the -O? and -m??? constraints.

And `_mm512_setzero_si512()` always generates `vpxor xmm0, xmm0, xmm0`, which is an AVX instruction and is shorter and faster than `vpxord zmm0` (AVX512F).
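
One way to look at the instruction sequences for these cases side by side is to compile small functions under each constraint and inspect the disassembly (for example with `-S` or on Compiler Explorer). The following is only a sketch under stated assumptions: the function names are hypothetical, the `target` attribute and `__builtin_cpu_supports` assume GCC or Clang on x86, and the actual instructions emitted will vary with compiler and flags. The driver at the end only confirms that every variant yields the same all-zero value.

```c
#include <immintrin.h>
#include <string.h>

/* Hypothetical names; each function can be inspected separately in a disassembly. */

/* AVX-512F only, AVX-512BW explicitly disabled. */
__attribute__((target("avx512f,no-avx512bw")))
void zero_f_set1(void *out)
{
    _mm512_storeu_si512(out, _mm512_set1_epi8(0));
}

__attribute__((target("avx512f,no-avx512bw")))
void zero_f_setzero(void *out)
{
    _mm512_storeu_si512(out, _mm512_setzero_si512());
}

/* AVX-512F with AVX-512BW enabled. */
__attribute__((target("avx512f,avx512bw")))
void zero_bw_set1(void *out)
{
    _mm512_storeu_si512(out, _mm512_set1_epi8(0));
}

/* Returns 1 if every variant wrote 64 zero bytes, 0 otherwise, and -1
   without running any vector code if the CPU lacks AVX-512F or AVX-512BW. */
int all_variants_zero(void)
{
    unsigned char buf[3][64], expected[64] = {0};
    int i;

    __builtin_cpu_init();
    if (!__builtin_cpu_supports("avx512f") || !__builtin_cpu_supports("avx512bw"))
        return -1;

    /* Pre-fill with a non-zero pattern so a missing store would be detected. */
    memset(buf, 0xFF, sizeof buf);
    zero_f_set1(buf[0]);
    zero_f_setzero(buf[1]);
    zero_bw_set1(buf[2]);

    for (i = 0; i < 3; i++)
        if (memcmp(buf[i], expected, sizeof expected) != 0)
            return 0;
    return 1;
}
```

Storing through a pointer instead of returning `__m512i` keeps the vector type out of the signature of the non-AVX-512 caller.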
