[NVPTX] Performance regression in IR that uses `<1 x float>` #153109

The introduction of v2f32 support appears to cause a performance regression in IR that uses `<1 x float>` vectors.

LLVM then tends to use `shufflevector` to build `<2 x float>`, and our lowering for that does it the hard way (integer conversions, a shift, and an OR to pack the two elements), which regresses performance in some of our benchmarks.

Minimized reproducer: https://godbolt.org/z/8efcrna8b

One kernel constructs `<2 x float>` using `insertelement`, and all of that construction is removed during lowering. The other kernel, which uses `<1 x float>` and `shufflevector`, ends up doing a lot of unnecessary work.

```
  %i4 = shufflevector <1 x float> %i1, <1 x float> %i2, <2 x i32> <i32 0, i32 1>

->
        cvt.u64.u32     %rd3, %r1;
        cvt.u64.u32     %rd4, %r2;
        shl.b64         %rd5, %rd4, 32;
        or.b64          %rd6, %rd3, %rd5;
```
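For readers less familiar with PTX, here is a rough C++ model of what those four instructions compute: the two 32-bit registers are zero-extended to 64 bits and packed into one 64-bit register. The function name `pack_v2f32` is made up for illustration, and the mapping of `%r1`/`%r2` to the bit patterns of `%i1`/`%i2` is an assumption, since the snippet above does not show where those registers are loaded.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative model only; pack_v2f32 is a hypothetical name, not an LLVM API.
// Assumes %r1 and %r2 hold the raw bits of the two float inputs.
std::uint64_t pack_v2f32(float elem0, float elem1) {
    std::uint32_t r1, r2;
    std::memcpy(&r1, &elem0, sizeof r1);  // reinterpret float bits as u32
    std::memcpy(&r2, &elem1, sizeof r2);

    std::uint64_t rd3 = r1;          // cvt.u64.u32 %rd3, %r1  (zero-extend)
    std::uint64_t rd4 = r2;          // cvt.u64.u32 %rd4, %r2  (zero-extend)
    std::uint64_t rd5 = rd4 << 32;   // shl.b64     %rd5, %rd4, 32
    std::uint64_t rd6 = rd3 | rd5;   // or.b64      %rd6, %rd3, %rd5
    return rd6;                      // element 0 in the low half, element 1 in the high half
}
```

That is four integer instructions just to form the packed value, compared with none in the `insertelement` case.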

Vs:
```
  %i4 = insertelement <2 x float> undef, float %i1, i64 0
  %a = insertelement <2 x float> %i4, float %i2, i64 1

->
  ... [nothing: LLVM removes the vector construction and uses the original inputs %i1/%i2]
```
