Description
I tried this code:
define <32 x i8> @cond_double_blendv(<32 x i8> %a, <32 x i8> %mask) {
%aa = add <32 x i8> %a, %a
%ret = call <32 x i8> @llvm.x86.avx2.pblendvb(<32 x i8> %a, <32 x i8> %aa, <32 x i8> %mask)
ret <32 x i8> %ret
}
define <32 x i8> @cond_double_select(<32 x i8> %a, <32 x i8> %mask) {
%aa = add <32 x i8> %a, %a
%bitmask = icmp slt <32 x i8> %mask, splat (i8 0)
%ret = select <32 x i1> %bitmask, <32 x i8> %aa, <32 x i8> %a
ret <32 x i8> %ret
}
Both functions have the same behavior: they double each lane of %a whose corresponding lane in %mask has its MSB set. However, they generate wildly different assembly (using clang 21.1.0 with -O3 -march=x86-64-v3):
cond_double_blendv:
vpaddb ymm2, ymm0, ymm0
vpblendvb ymm0, ymm0, ymm2, ymm1
ret
.LCPI1_0:
.zero 32,252
.LCPI1_1:
.zero 32,32
cond_double_select:
vpsllw ymm2, ymm0, 2
vpand ymm2, ymm2, ymmword ptr [rip + .LCPI1_0]
vpsrlw ymm1, ymm1, 2
vpand ymm1, ymm1, ymmword ptr [rip + .LCPI1_1]
vpaddb ymm1, ymm1, ymm1
vpblendvb ymm0, ymm0, ymm2, ymm1
vpaddb ymm2, ymm0, ymm0
vpaddb ymm1, ymm1, ymm1
vpblendvb ymm0, ymm0, ymm2, ymm1
ret
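As a sanity check on the claim that the two IR functions behave identically, here is a minimal scalar sketch of a single 8-bit lane in Python. The helper names (`add_i8`, `blendv_lane`, `select_lane`) are made up for illustration and are not LLVM APIs; the model assumes that pblendvb picks its second operand exactly when the top bit of the mask byte is set, and that `icmp slt ..., 0` reads the mask byte as a signed i8.

```python
# Scalar model of one 8-bit lane; all names here are illustrative only.

def add_i8(x, y):
    """Wrapping 8-bit add, as in `add <32 x i8>`."""
    return (x + y) & 0xFF

def blendv_lane(a, b, mask):
    """One lane of pblendvb: pick b if the mask byte's top bit is set, else a."""
    return b if mask & 0x80 else a

def select_lane(a, b, mask):
    """One lane of `icmp slt mask, 0` followed by `select`."""
    signed = mask - 256 if mask >= 128 else mask  # reinterpret byte as signed i8
    return b if signed < 0 else a

def cond_double_blendv(a, mask):
    return blendv_lane(a, add_i8(a, a), mask)

def cond_double_select(a, mask):
    return select_lane(a, add_i8(a, a), mask)

# Exhaustive over every (lane value, mask byte) pair.
mismatches = [(a, m) for a in range(256) for m in range(256)
              if cond_double_blendv(a, m) != cond_double_select(a, m)]
assert not mismatches
```

Since every lane is independent, agreement on all 65536 byte pairs covers the full 32-lane vectors under these assumptions.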
The version using the @llvm.x86.avx2.pblendvb intrinsic emits the expected assembly. After staring at the version using select for a while, I can say that all instructions except vpaddb ymm2, ymm0, ymm0 and the last vpblendvb together form an elaborate no-op. However, I have no idea what the code generator's intent was with these instructions, and I don't see any reason why the two functions should not emit exactly the same assembly.
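The no-op claim above can be checked with a rough Python model of the emitted cond_double_select sequence. This is only a sketch: it models one 16-bit word (two byte lanes), because vpsllw and vpsrlw shift 16-bit words while vpaddb and vpblendvb work per byte. The function names are hypothetical, and the operand order assumes Intel syntax (destination first, vpblendvb taking its second source where the mask byte's top bit is set). Random sampling with a fixed seed is used instead of an exhaustive sweep.

```python
import random

def add(x, y):
    return (x + y) & 0xFF                 # per-byte wrapping add (vpaddb)

def blend(a, b, m):
    return b if m & 0x80 else a           # per-byte vpblendvb

def per_byte(f, *words):
    """Apply a byte operation to the low and high byte of 16-bit words."""
    lo = f(*(w & 0xFF for w in words))
    hi = f(*(w >> 8 for w in words))
    return lo | (hi << 8)

def emitted_select_word(a_word, m_word):
    """Model of the long sequence; also returns the first blend's mask."""
    ymm0, ymm1 = a_word, m_word
    ymm2 = (ymm0 << 2) & 0xFFFF           # vpsllw ymm2, ymm0, 2
    ymm2 &= 0xFCFC                        # vpand with .LCPI1_0 (bytes of 252)
    ymm1 >>= 2                            # vpsrlw ymm1, ymm1, 2
    ymm1 &= 0x2020                        # vpand with .LCPI1_1 (bytes of 32)
    ymm1 = per_byte(add, ymm1, ymm1)      # vpaddb ymm1, ymm1, ymm1
    first_mask = ymm1
    ymm0 = per_byte(blend, ymm0, ymm2, ymm1)  # first vpblendvb
    ymm2 = per_byte(add, ymm0, ymm0)      # vpaddb ymm2, ymm0, ymm0
    ymm1 = per_byte(add, ymm1, ymm1)      # vpaddb ymm1, ymm1, ymm1
    ymm0 = per_byte(blend, ymm0, ymm2, ymm1)  # last vpblendvb
    return ymm0, first_mask

def short_word(a_word, m_word):
    """Model of the two-instruction sequence emitted for cond_double_blendv."""
    return per_byte(blend, a_word, per_byte(add, a_word, a_word), m_word)

rng = random.Random(0)
for _ in range(20000):
    a, m = rng.randrange(0x10000), rng.randrange(0x10000)
    result, first_mask = emitted_select_word(a, m)
    assert first_mask & 0x8080 == 0       # first blend never selects ymm2
    assert result == short_word(a, m)     # long and short sequences agree
```

In this model the first vpblendvb always sees a mask whose top bits are clear, so it leaves ymm0 untouched, and the shift, mask, and add steps on ymm1 merely move the original MSB of each mask byte down to bit 5 and back up to bit 7.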
Note: I originally encountered this while using AVX2 intrinsics in Rust, where the output from rustc was much worse than the output from clang for an equivalent function. The difference is that rustc lowers _mm256_blendv_epi8 to icmp slt + select, whereas clang lowers it to call @llvm.x86.avx2.pblendvb.