The introduction of v2f32 support appears to cause a performance regression in IR that uses `<1 x float>` vectors. LLVM then tends to use `shufflevector` to build `<2 x float>`, and our lowering for that ends up doing it the hard way, which regresses performance in some of our benchmarks.
Minimized reproducer: https://godbolt.org/z/8efcrna8b
One kernel constructs `<2 x float>` using `insertelement`, and all of it is removed during lowering. The kernel that uses `<1 x float>` and `shufflevector` ends up doing a lot more unnecessary work.
%i4 = shufflevector <1 x float> %i1, <1 x float> %i2, <2 x i32> <i32 0, i32 1>
->
cvt.u64.u32 %rd3, %r1;
cvt.u64.u32 %rd4, %r2;
shl.b64 %rd5, %rd4, 32;
or.b64 %rd6, %rd3, %rd5;
Vs:
%i4 = insertelement <2 x float> undef, float %i1, i64 0
%a = insertelement <2 x float> %i4, float %i2, i64 1
->
... [nothing: LLVM removes the vector creation and uses the original inputs %i1/%i2]
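
For readers less familiar with PTX, the four instructions emitted for the `shufflevector` case amount to packing two 32-bit values into one 64-bit register, with element 0 in the low half. A minimal C++ sketch of that arithmetic follows. It assumes that `%r1` and `%r2` hold the bit patterns of `%i1` and `%i2`, which the excerpt does not show, and the name `pack_v2f32` is hypothetical (it is not an LLVM or NVPTX API):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: mirrors the PTX sequence from the shufflevector case.
// Assumption: r1/r2 are the raw 32-bit patterns of the two float inputs.
std::uint64_t pack_v2f32(float lo, float hi) {
    std::uint32_t r1, r2;
    std::memcpy(&r1, &lo, sizeof r1);  // reinterpret float bits as u32
    std::memcpy(&r2, &hi, sizeof r2);

    std::uint64_t rd3 = r1;            // cvt.u64.u32 %rd3, %r1
    std::uint64_t rd4 = r2;            // cvt.u64.u32 %rd4, %r2
    std::uint64_t rd5 = rd4 << 32;     // shl.b64     %rd5, %rd4, 32
    return rd3 | rd5;                  // or.b64      %rd6, %rd3, %rd5
}
```

In the `insertelement` case none of this packing survives lowering, since the consumers read `%i1` and `%i2` directly.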