[clang] Optimisation regression in trunk for x86-64 AVX2 shuffle #165813

@desal

There appears to be an optimisation regression between clang 21.1.0 and trunk in the handling of AVX2 shuffle operations. It seems to manifest only when shuffling the result of a comparison (e.g. _mm256_cmpeq_epi32). Clang trunk now packs the comparison result into a 128-bit xmm register, does the shuffling there, and then reconstructs the 256-bit ymm output.

Test case (Compiler Explorer: https://gcc.godbolt.org/z/xGGs8hW5P)
Compiled with -O2 -mavx2

#include <immintrin.h>

__m256i foo(__m256i a, __m256i b) {
    __m256i x = _mm256_cmpeq_epi32(a, b);
    return _mm256_shuffle_epi32(x, 0b11101111); // imm8 = 239, as shown in the assembly
}

Clang trunk (clang version 22.0.0git, https://github.com/llvm/llvm-project.git 03e66aeb96928592ee6cd51913bf72a6e21066fc):

foo(long long vector[4], long long vector[4]):
  vpcmpeqd ymm0, ymm0, ymm1
  vextracti128 xmm1, ymm0, 1
  vpackssdw xmm0, xmm0, xmm1
  vpshuflw xmm0, xmm0, 239
  vpshufhw xmm0, xmm0, 239
  vpmovsxwd ymm0, xmm0
  ret

Clang 21.1.0:

foo(long long vector[4], long long vector[4]):
  vpcmpeqd ymm0, ymm0, ymm1
  vpshufd ymm0, ymm0, 239
  ret
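
For reference, here is a scalar sketch in plain C (no intrinsics) of what the two instruction sequences above compute. It is not taken from LLVM; the function names (model_cmpeq_epi32, model_shuffle_epi32, model_trunk_sequence, foo_models_agree) are hypothetical and exist only for illustration. It models vectors as arrays of eight int32_t values and assumes the usual instruction semantics: vpshufd permutes within each 128-bit lane, vpackssdw narrows with signed saturation, and vpmovsxwd sign-extends. Under those assumptions the two sequences give the same result, presumably because a comparison mask only contains 0 or -1, so the narrowing to 16 bits loses nothing. The trunk output is therefore correct, just longer.

```c
#include <stdint.h>

/* Scalar model of vpcmpeqd: each 32-bit element becomes -1 (all ones)
   when the inputs are equal, and 0 otherwise. */
void model_cmpeq_epi32(const int32_t a[8], const int32_t b[8], int32_t out[8]) {
    for (int i = 0; i < 8; ++i)
        out[i] = (a[i] == b[i]) ? -1 : 0;
}

/* Scalar model of vpshufd on a ymm register: the same 4-element
   permutation, selected by imm, is applied to each 128-bit lane. */
void model_shuffle_epi32(const int32_t in[8], unsigned imm, int32_t out[8]) {
    for (int lane = 0; lane < 2; ++lane)
        for (int i = 0; i < 4; ++i)
            out[lane * 4 + i] = in[lane * 4 + ((imm >> (2 * i)) & 3)];
}

/* Signed saturation from 32 to 16 bits, as performed by vpackssdw. */
int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scalar model of the trunk sequence applied to a comparison mask. */
void model_trunk_sequence(const int32_t mask[8], unsigned imm, int32_t out[8]) {
    int16_t packed[8], shuffled[8];

    /* vextracti128 + vpackssdw: low lane -> words 0..3, high lane -> words 4..7. */
    for (int i = 0; i < 8; ++i)
        packed[i] = sat16(mask[i]);

    /* vpshuflw then vpshufhw: permute words 0..3 and words 4..7
       with the same immediate. */
    for (int half = 0; half < 2; ++half)
        for (int i = 0; i < 4; ++i)
            shuffled[half * 4 + i] = packed[half * 4 + ((imm >> (2 * i)) & 3)];

    /* vpmovsxwd: sign-extend each 16-bit word back to 32 bits. */
    for (int i = 0; i < 8; ++i)
        out[i] = (int32_t)shuffled[i];
}

/* Returns 1 when both models of foo() agree for the given inputs. */
int foo_models_agree(const int32_t a[8], const int32_t b[8]) {
    int32_t mask[8], direct[8], narrow[8];
    model_cmpeq_epi32(a, b, mask);
    model_shuffle_epi32(mask, 0xEF, direct);   /* 0xEF == 0b11101111 == 239 */
    model_trunk_sequence(mask, 0xEF, narrow);
    for (int i = 0; i < 8; ++i)
        if (direct[i] != narrow[i])
            return 0;
    return 1;
}
```

With imm8 = 239 the per-lane element selection is [3, 3, 2, 3], so calling foo_models_agree on any pair of 8-element arrays should return 1. The same equivalence would not hold for arbitrary 32-bit data, since the saturating pack would then lose information, which fits the observation that only comparison results are affected.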
