Looks like there's an optimisation regression between clang 21.1.0 and trunk in the handling of AVX2 shuffle operations. It seems to manifest only when shuffling the result of a comparison (e.g. _mm256_cmpeq_epi32). Clang trunk now narrows the comparison result into a 128-bit xmm register, does the shuffling there, and then reconstructs the 256-bit ymm output, instead of emitting a single shuffle on the ymm register.
Test case (Compiler Explorer: https://gcc.godbolt.org/z/xGGs8hW5P)
Compiled with -O2 -mavx2
#include <immintrin.h>

__m256i foo(__m256i a, __m256i b) {
    __m256i x = _mm256_cmpeq_epi32(a, b);
    return _mm256_shuffle_epi32(x, 0b11101111);
}
Clang trunk (clang version 22.0.0git, https://github.com/llvm/llvm-project.git 03e66aeb96928592ee6cd51913bf72a6e21066fc):
foo(long long vector[4], long long vector[4]):
        vpcmpeqd ymm0, ymm0, ymm1
        vextracti128 xmm1, ymm0, 1
        vpackssdw xmm0, xmm0, xmm1
        vpshuflw xmm0, xmm0, 239
        vpshufhw xmm0, xmm0, 239
        vpmovsxwd ymm0, xmm0
        ret
Clang 21.1.0:
foo(long long vector[4], long long vector[4]):
        vpcmpeqd ymm0, ymm0, ymm1
        vpshufd ymm0, ymm0, 239
        ret
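
For reference, the two listings above can be compared with a scalar model. This is only a sketch for reasoning about the codegen and is not taken from LLVM. The function names (`model_vpshufd_256`, `model_trunk_sequence`, `models_agree`) are hypothetical, and the per-instruction behaviour is my reading of the instruction semantics. It suggests that both sequences give the same result when every input element is 0 or -1, as a comparison mask is, so the trunk output looks correct but takes five instructions after the compare where one is enough.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of `vpshufd ymm, ymm, imm8`: the same 4-element permutation,
   encoded as four 2-bit selectors, is applied to each 128-bit lane. */
static void model_vpshufd_256(const int32_t src[8], int32_t dst[8], unsigned imm) {
    assert(imm <= 0xFF);
    for (int lane = 0; lane < 2; ++lane)
        for (int i = 0; i < 4; ++i)
            dst[lane * 4 + i] = src[lane * 4 + ((imm >> (2 * i)) & 3)];
}

/* Signed saturation from 32 to 16 bits, as performed by vpackssdw. */
static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scalar model of the trunk sequence:
   vextracti128 + vpackssdw, vpshuflw, vpshufhw, vpmovsxwd. */
static void model_trunk_sequence(const int32_t src[8], int32_t dst[8], unsigned imm) {
    int16_t packed[8], lo[8], hi[8];
    assert(imm <= 0xFF);
    /* vextracti128 + vpackssdw: low lane -> words 0..3, high lane -> words 4..7. */
    for (int i = 0; i < 8; ++i)
        packed[i] = sat16(src[i]);
    /* vpshuflw: permute words 0..3, copy words 4..7 unchanged. */
    for (int i = 0; i < 4; ++i) {
        lo[i] = packed[(imm >> (2 * i)) & 3];
        lo[i + 4] = packed[i + 4];
    }
    /* vpshufhw: permute words 4..7, copy words 0..3 unchanged. */
    for (int i = 0; i < 4; ++i) {
        hi[i] = lo[i];
        hi[i + 4] = lo[4 + ((imm >> (2 * i)) & 3)];
    }
    /* vpmovsxwd: sign-extend each 16-bit word back to 32 bits. */
    for (int i = 0; i < 8; ++i)
        dst[i] = hi[i];
}

/* Returns 1 if both models produce the same output for `mask`, else 0. */
static int models_agree(const int32_t mask[8], unsigned imm) {
    int32_t a[8], b[8];
    model_vpshufd_256(mask, a, imm);
    model_trunk_sequence(mask, b, imm);
    for (int i = 0; i < 8; ++i)
        if (a[i] != b[i]) return 0;
    return 1;
}
```

With the immediate 0b11101111 (239), the selectors are 3, 3, 2, 3, so each lane of the result is {x[3], x[3], x[2], x[3]}. The two models diverge for inputs outside the int16_t range because of the saturating pack, which would explain why the narrowing only shows up when the shuffle input is known to be a comparison result.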