
[Clang][x86] Bad codegen for assume_aligned with structured binding packs #156206

@kaimfrai

Description

I've been trying to reduce intrinsic-dependent code for SIMD and rely more on compiler optimizations. With structured binding packs this becomes very terse, easy to read, and generic. While doing so I noticed the following: Clang does not emit an aligned load for std::assume_aligned when it is used with a pack of indices:

#include <array>
#include <memory>
#include <numeric>

// Vector type of N elements of T.
template<typename T, int N>
using vec [[gnu::vector_size(sizeof(T) * N)]] = T;

// consteval helper producing the index sequence 0..N-1.
template<int N>
consteval auto Indices()
{
    std::array<int, N> arr;
    std::iota(arr.begin(), arr.end(), 0);
    return arr;
}

template<int A, typename T, int N>
auto load1(T* p) -> vec<T, N>
{
    auto const [...i] = Indices<N>();
    return vec<T, N>{ std::assume_aligned<alignof(T) * A>(p)[i]... };
}

auto a1 = load1<16, float, 16>; // aligned load expected
auto u1 = load1<1, float, 16>;  // unaligned load expected

I would expect a vmovaps and a vmovups respectively when using -std=c++26 -O3 -mavx512f -mavx512vl; instead it becomes this:

        vmovsd  xmm0, qword ptr [rdi]
        vmovsd  xmm1, qword ptr [rdi + 8]
        vmovsd  xmm2, qword ptr [rdi + 16]
        vmovsd  xmm3, qword ptr [rdi + 32]
        vinsertf128     ymm0, ymm0, xmm2, 1
        vbroadcastsd    ymm2, qword ptr [rdi + 24]
        vunpcklpd       ymm0, ymm0, ymm1
        vblendpd        ymm0, ymm0, ymm2, 8
        vinsertf32x4    zmm0, zmm0, xmm3, 2
        vmovsd  xmm1, qword ptr [rdi + 40]
        vmovapd zmm2, zmmword ptr [rip + .LCPI0_0]
        vpermi2pd       zmm2, zmm0, zmm1
        vmovsd  xmm0, qword ptr [rdi + 48]
        vinsertf32x4    zmm1, zmm2, xmm0, 3
        vmovsd  xmm2, qword ptr [rdi + 56]
        vmovapd zmm0, zmmword ptr [rip + .LCPI0_1]
        vpermi2pd       zmm0, zmm1, zmm2
        ret

Even when written by hand without a pack, the result is the same. In that case you would most likely use a separate variable anyway, which makes the issue go away, so this is mostly about the convenience of writing it in a single expression. GCC, however, optimizes it as expected. Up to Clang 16, in AVX2 mode, this also optimized to a single instruction, but as soon as -mavx512vl is set, even while still using a vector of 8 floats, the codegen suddenly becomes much worse:

        vmovsd  xmm0, qword ptr [rdi]
        vmovsd  xmm1, qword ptr [rdi + 8]
        vmovsd  xmm2, qword ptr [rdi + 16]
        vmovsd  xmm3, qword ptr [rdi + 24]
        vinsertf128     ymm3, ymm0, xmm3, 1
        vinsertf128     ymm1, ymm0, xmm1, 1
        vperm2f128      ymm1, ymm1, ymm3, 49
        vinsertf128     ymm0, ymm0, xmm2, 1
        vunpcklpd       ymm0, ymm0, ymm1
        ret

I also tried casting to an aligned structure, which I would prefer not to do. While this resulted in a single instruction, despite the alignment it was still a vmovups. See https://godbolt.org/z/TdTarrKsK for the full example.
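For reference, such a cast looks roughly like this (a sketch with placeholder names, reusing vec and Indices from the example above; the exact variant is in the godbolt link):

// Hypothetical over-aligned aggregate used only to convey the alignment to
// the compiler; AlignedBlock and load3 are placeholder names, not part of
// the reduced example above.
template<typename T, int N, int A>
struct alignas(alignof(T) * A) AlignedBlock
{
    T data[N];
};

template<int A, typename T, int N>
auto load3(T* p) -> vec<T, N>
{
    auto const& block = *reinterpret_cast<AlignedBlock<T, N, A> const*>(p);
    auto const [...i] = Indices<N>();
    return vec<T, N>{ block.data[i]... };
}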

This isn't a pressing issue, since there is a workaround using a separate variable (sketched below), but the inconsistency was surprising.
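The separate-variable form is roughly the following (a sketch; load2 is a placeholder name, and vec and Indices are reused from the example above). As noted above, with this shape the issue goes away:

// Workaround: bind the result of std::assume_aligned to a named variable
// first, then index that variable in the pack expansion.
template<int A, typename T, int N>
auto load2(T* p) -> vec<T, N>
{
    T* ap = std::assume_aligned<alignof(T) * A>(p);
    auto const [...i] = Indices<N>();
    return vec<T, N>{ ap[i]... };
}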

Metadata

Labels: clang (Clang issues not falling into any other category)