
[NVPTX] Performance regression in IR that uses <1 x float> #153109

@Artem-B


The introduction of v2f32 support appears to have caused a performance regression in IR that uses <1 x float> vectors.

In such IR, LLVM tends to build the <2 x float> with a shufflevector, and our lowering of that shufflevector does it the hard way, which regresses performance in some of our benchmarks.

Minimized reproducer: https://godbolt.org/z/8efcrna8b

The reproducer contains two kernels. In the one that builds the <2 x float> with insertelement, all of that construction is removed during lowering. The one that uses <1 x float> and shufflevector instead ends up doing a lot of unnecessary work:

```llvm
%i4 = shufflevector <1 x float> %i1, <1 x float> %i2, <2 x i32> <i32 0, i32 1>
```

->

```
cvt.u64.u32     %rd3, %r1;
cvt.u64.u32     %rd4, %r2;
shl.b64         %rd5, %rd4, 32;
or.b64          %rd6, %rd3, %rd5;
```

Vs:

```llvm
%i4 = insertelement <2 x float> undef, float %i1, i64 0
%a = insertelement <2 x float> %i4, float %i2, i64 1
```

->
(nothing: LLVM removes the vector construction and uses the original inputs %i1/%i2 directly)
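To make the cost of the shufflevector lowering concrete: the four PTX instructions only concatenate two 32-bit patterns into one 64-bit register. The sketch below is a hypothetical Python emulation, not part of the reproducer. It assumes %r1 and %r2 hold the raw bits of the two floats (presumably %i1 and %i2), and it follows the PTX step by step: zero-extend each value (`cvt.u64.u32`), move the second into the upper half (`shl.b64 ..., 32`), and combine them (`or.b64`). Because %rd3 (from %r1) ends up in the low half, element 0 occupies the low 32 bits, which matches a little-endian `<2 x float>` laid out in memory.

```python
import struct

MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def float_bits(x: float) -> int:
    """Raw IEEE-754 binary32 bit pattern of x (what a 32-bit PTX register holds)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def pack_like_ptx(r1: int, r2: int) -> int:
    """Emulate the four-instruction sequence emitted for the shufflevector case."""
    rd3 = r1 & MASK32            # cvt.u64.u32 %rd3, %r1   (zero-extend)
    rd4 = r2 & MASK32            # cvt.u64.u32 %rd4, %r2   (zero-extend)
    rd5 = (rd4 << 32) & MASK64   # shl.b64     %rd5, %rd4, 32
    rd6 = rd3 | rd5              # or.b64      %rd6, %rd3, %rd5
    return rd6


def pack_direct(a: float, b: float) -> int:
    """Store <2 x float> {a, b} little-endian (element 0 first) and read it back as one u64."""
    return struct.unpack("<Q", struct.pack("<ff", a, b))[0]


if __name__ == "__main__":
    a, b = 1.5, -2.25
    packed = pack_like_ptx(float_bits(a), float_bits(b))
    assert packed == pack_direct(a, b)
    print(hex(packed))  # prints 0xc01000003fc00000
```

The emulation only shows what the sequence computes, a plain bit concatenation; whether the lowering could avoid it (as happens in the insertelement kernel, where nothing is emitted) is the question this issue raises.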
