Given the following code:
define <16 x i16> @mulbyconst(<16 x i16> %"a") #0 {
top:
%0 = mul <16 x i16> %"a", <i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4>
ret <16 x i16> %0
}
LLVM compiles this to a single vpsllvw instruction with AVX512, but in the absence of AVX512 it instead compiles to two vpsllw instructions and a vpblendw (as shown in https://godbolt.org/z/PMehWerEd).
The issue is that although AVX2 CPUs are missing the vpsllvw instruction (because AVX2 is a bit of a mess), they do have the vpmullw instruction, so this could have compiled to a single vpmullw by an alternating vector of 8 and 4, i.e. the original multiplier constants. This missed optimization is especially annoying because LLVM went through a bunch of work to canonicalize the multiplication by a non-uniform vector of powers of 2 into a variable shift left, even though just leaving it as a multiply would have been more efficient.
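To make the equivalence concrete, here is a minimal scalar sketch in C. It does not use intrinsics. The helper names (shl_lane, mullo_lane, count_mismatches) are made up for illustration and only model what one 16-bit lane of a variable left shift and of a low 16-bit multiply would compute, assuming the usual wrap-around semantics. It checks every i16 input against the constants from the IR above (shift by 3 vs. multiply by 8, shift by 2 vs. multiply by 4).

```c
#include <stdint.h>

/* Hypothetical scalar model of one 16-bit lane of a variable left shift.
   Assumption: counts above 15 produce 0, as a per-lane vector shift would. */
uint16_t shl_lane(uint16_t x, unsigned count) {
    return count > 15 ? 0 : (uint16_t)((uint32_t)x << count);
}

/* Hypothetical scalar model of one lane of a low 16-bit multiply:
   keep only the low 16 bits of the product. */
uint16_t mullo_lane(uint16_t x, uint16_t m) {
    return (uint16_t)((uint32_t)x * (uint32_t)m);
}

/* Count the inputs for which shift and multiply disagree, using the
   alternating constants from the issue: mul by 8 (shift 3), mul by 4 (shift 2). */
int count_mismatches(void) {
    int bad = 0;
    for (uint32_t x = 0; x <= 0xFFFF; ++x) {
        if (shl_lane((uint16_t)x, 3) != mullo_lane((uint16_t)x, 8)) ++bad;
        if (shl_lane((uint16_t)x, 2) != mullo_lane((uint16_t)x, 4)) ++bad;
    }
    return bad;
}
```

If count_mismatches() returns 0, the two forms agree for every lane value, which is why keeping the multiply (or turning the shift back into one) should be safe for this pattern. This only demonstrates the arithmetic identity; it says nothing about how the backend should implement the lowering.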