[x86] Using `select` instead of `pblendvb` leads to very poor codegen on AVX2

I tried this code:

```ll
define <32 x i8> @cond_double_blendv(<32 x i8> %a, <32 x i8> %mask) {
    %aa = add <32 x i8> %a, %a
    %ret = call <32 x i8> @llvm.x86.avx2.pblendvb(<32 x i8> %a, <32 x i8> %aa, <32 x i8> %mask)
    ret <32 x i8> %ret
}

define <32 x i8> @cond_double_select(<32 x i8> %a, <32 x i8> %mask) {
    %aa = add <32 x i8> %a, %a
    %bitmask = icmp slt <32 x i8> %mask, splat (i8 0)
    %ret = select <32 x i1> %bitmask, <32 x i8> %aa, <32 x i8> %a
    ret <32 x i8> %ret
}
```

Both functions have the same behavior: they double each lane of `%a` whose corresponding lane in `%mask` has its MSB set. However, they generate wildly different assembly (using clang 21.1.0 with `-O3 -march=x86-64-v3`):

```asm
cond_double_blendv:
        vpaddb  ymm2, ymm0, ymm0
        vpblendvb       ymm0, ymm0, ymm2, ymm1
        ret

.LCPI1_0:
        .zero   32,252
.LCPI1_1:
        .zero   32,32
cond_double_select:
        vpsllw  ymm2, ymm0, 2
        vpand   ymm2, ymm2, ymmword ptr [rip + .LCPI1_0]
        vpsrlw  ymm1, ymm1, 2
        vpand   ymm1, ymm1, ymmword ptr [rip + .LCPI1_1]
        vpaddb  ymm1, ymm1, ymm1
        vpblendvb       ymm0, ymm0, ymm2, ymm1
        vpaddb  ymm2, ymm0, ymm0
        vpaddb  ymm1, ymm1, ymm1
        vpblendvb       ymm0, ymm0, ymm2, ymm1
        ret
```

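The version using the `@llvm.x86.avx2.pblendvb` intrinsic emits the expected assembly. After staring at the version using `select` for a while, I can say that all instructions except `vpaddb ymm2, ymm0, ymm0` and the last `vpblendvb` form an elaborate no-op: the first `vpblendvb` can never select the shifted value, because bit 7 of every mask byte is zero at that point, and the remaining mask manipulation only moves the original MSB back into bit 7 for the final blend. However, I have no idea what the code generator's intent was with these instructions. I don't see any reason why these two functions should not emit exactly the same assembly.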
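To back up the claim that the two IR functions are equivalent, here is a minimal scalar sketch in Rust that models one byte lane of each function and compares them exhaustively. The function names (`blendv_lane`, `select_lane`, `models_agree`) are made up for illustration, and this models only the per-lane semantics, not the vector code itself.

```rust
/// Per-lane model of `vpblendvb`: pick `b` when the mask byte's MSB is set, else `a`.
fn blendv_lane(a: u8, b: u8, mask: u8) -> u8 {
    if mask & 0x80 != 0 { b } else { a }
}

/// Per-lane model of `icmp slt %mask, 0` followed by `select`.
fn select_lane(a: u8, b: u8, mask: u8) -> u8 {
    // A signed byte is negative exactly when its MSB is set.
    let is_negative = (mask as i8) < 0;
    if is_negative { b } else { a }
}

/// Returns true if both models agree for every (a, mask) byte pair,
/// with `b = a + a` (wrapping), as in the IR above.
fn models_agree() -> bool {
    (0..=u8::MAX).all(|a| {
        let aa = a.wrapping_add(a);
        (0..=u8::MAX).all(|mask| blendv_lane(a, aa, mask) == select_lane(a, aa, mask))
    })
}
```

Calling `models_agree()` checks all 65536 combinations of one lane of `%a` and `%mask`. Since every lane is independent, this covers the full vector operation.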

Note: I originally encountered this while using AVX2 intrinsics in Rust, where the output from rustc was much worse than the output from clang for an equivalent function, with the difference being that rustc lowers `_mm256_blendv_epi8` to `icmp slt + select`, whereas clang lowers it to `call @llvm.x86.avx2.pblendvb`.
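For reference, a Rust function of roughly the following shape should go through the `_mm256_blendv_epi8` path described in the note. This is a sketch rather than my exact original code: the names `cond_double_avx2` and `cond_double` are made up for illustration, and the safe wrapper with its scalar fallback exists only so the example can run on machines without AVX2.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Doubles each byte lane of `a` whose mask byte has its MSB set.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn cond_double_avx2(a: __m256i, mask: __m256i) -> __m256i {
    let aa = _mm256_add_epi8(a, a);
    // Picks `aa` where the mask byte's MSB is set, `a` otherwise.
    _mm256_blendv_epi8(a, aa, mask)
}

/// Hypothetical safe wrapper: uses AVX2 when available, otherwise a scalar loop.
pub fn cond_double(a: [u8; 32], mask: [u8; 32]) -> [u8; 32] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was checked at runtime, and the
            // loads and stores used here do not require alignment.
            unsafe {
                let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
                let vm = _mm256_loadu_si256(mask.as_ptr() as *const __m256i);
                let r = cond_double_avx2(va, vm);
                let mut out = [0u8; 32];
                _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, r);
                return out;
            }
        }
    }
    // Scalar fallback with the same per-lane semantics.
    let mut out = a;
    for i in 0..32 {
        if mask[i] & 0x80 != 0 {
            out[i] = a[i].wrapping_add(a[i]);
        }
    }
    out
}
```

Compiling `cond_double_avx2` with optimizations and AVX2 enabled is the easiest way to compare the rustc output against the two assembly listings above.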

[x86] Using `select` instead of `pblendvb` leads to very poor codegen on AVX2 #162812
