Max191 commented Jan 28, 2026

Adds the ROCDLLoadToTransposeLoad pass to the LLVMGPUTileAndFuse pipeline. This is only enabled for ROCDL, and a test flag is added to turn off the feature if needed (mainly for benchmark testing).

Convolution Benchmark Results

The data below is all for bf16 convolutions. I didn't put comprehensive GEMM (MxK @ KxN) data in a spreadsheet, but for GEMM, the speedup is in the range of 0-17% for f16 and 0-60% for i8 GEMMs.

Full spreadsheet of results: https://docs.google.com/spreadsheets/d/1QEwemqviUzk4GginGdaDT7u8pP9x_Bku7r1aAvOUOCk/edit?usp=sharing

Weight Backward Convolutions

Equates to KxM @ KxN GEMM layout.
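
A minimal sketch of what a "KxM @ KxN" contraction means (shapes and the function name here are illustrative, not from the PR): the lhs stores the reduction dimension K outermost, so the result is effectively lhs^T @ rhs. For a fixed output row m, the lhs is read down a column (`lhs[k][m]` across k), a strided access pattern; my reading is that this is the kind of non-contiguous load that transpose-load instructions are meant to serve.

```python
def gemm_kxm_kxn(lhs, rhs):
    """lhs is K rows of M values, rhs is K rows of N values; result is MxN."""
    K, M, N = len(lhs), len(lhs[0]), len(rhs[0])
    out = [[0.0] * N for _ in range(M)]
    for k in range(K):           # walk the shared reduction dimension
        for m in range(M):       # lhs[k][m] for fixed m is a column walk -> strided
            for n in range(N):
                out[m][n] += lhs[k][m] * rhs[k][n]
    return out

lhs = [[1, 2], [3, 4], [5, 6]]            # K=3, M=2
rhs = [[1, 0, 1], [0, 1, 0], [1, 1, 1]]   # K=3, N=3
print(gemm_kxm_kxn(lhs, rhs))             # [[6.0, 8.0, 6.0], [8.0, 10.0, 8.0]]
```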

Benchmark Summary:
  • Total benchmarks: 162
  • Significant changes (>2.0%): 142
  • Improvements (transpose_load faster): 133
  • Regressions (default faster): 9
  • Mean % change: -20.44%
  • Range: -56.29% to 12.22%

Input Backward Convolutions

Equates to MxK @ KxN GEMM layout.
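
For contrast, a sketch of the standard "MxK @ KxN" layout (again, the function name and shapes are illustrative): here each lhs row is read contiguously along K, so the lhs side does not need transposed access, which may be why the gains in this category are smaller.

```python
def gemm_mxk_kxn(lhs, rhs):
    """Standard row-major GEMM: lhs is MxK, rhs is KxN; result is MxN."""
    M, K, N = len(lhs), len(lhs[0]), len(rhs[0])
    out = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for k in range(K):       # lhs[m][k] walks a row -> contiguous access
            for n in range(N):
                out[m][n] += lhs[m][k] * rhs[k][n]
    return out

lhs = [[1, 2, 3], [4, 5, 6]]     # M=2, K=3
rhs = [[1, 0], [0, 1], [1, 1]]   # K=3, N=2
print(gemm_mxk_kxn(lhs, rhs))    # [[4.0, 5.0], [10.0, 11.0]]
```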

Benchmark Summary:
  • Total benchmarks: 146
  • Significant changes (>2.0%): 43
  • Improvements (transpose_load faster): 24
  • Regressions (default faster): 19
  • Mean % change: -1.09%
  • Range: -34.02% to 12.08%

Additional notes

  • All benchmarks were collected on a single MI355 GPU
  • There are regressions, but the largest ones appear to be noise; retesting them did not show real regressions.
  • Forward convolution data is not included, since forward convolutions do not target transpose_load instructions.

ci-extra: test_torch


Max191 commented Jan 28, 2026

Based on #23267

I haven't collected comprehensive benchmark numbers yet; I will add them to the PR description once I have.

Max191 force-pushed the enable-transpose-load branch from 7323b43 to fa98bda on February 2, 2026 at 18:56
Max191 marked this pull request as ready for review on February 2, 2026 at 18:57

Max191 commented Feb 2, 2026

I was finally able to get some good benchmark runs, and I added the results to the description.

cc @nirvedhmeshram @yzhang93 @MaheshRavishankar

@nirvedhmeshram left a comment

LGTM, great to see performance improvement on weight backward!
