Max191 commented Jan 28, 2026

Adds the ROCDLLoadToTransposeLoad pass to the LLVMGPUTileAndFuse pipeline. This is only enabled for ROCDL, and a test flag is added to turn off the feature if needed (mainly for benchmark testing).

Convolution Benchmark Results

The data below is all for bf16 convolutions. I didn't put comprehensive GEMM (MxK @ KxN) data in a spreadsheet, but for GEMM, the speedup is in the range of 0-17% for f16 and 0-60% for i8 GEMMs.

Full spreadsheet of results: https://docs.google.com/spreadsheets/d/1QEwemqviUzk4GginGdaDT7u8pP9x_Bku7r1aAvOUOCk/edit?usp=sharing

Weight Backward Convolutions

Equates to KxM @ KxN GEMM layout.
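
A minimal sketch of what a "KxM @ KxN" contraction means (shapes and the function name here are illustrative, not from the PR): the lhs stores the reduction dimension K outermost, so the result is effectively lhs^T @ rhs. For a fixed output row m, the lhs is read down a column (`lhs[k][m]` across k), a strided access pattern; my reading is that this is the kind of non-contiguous load that transpose-load instructions are meant to serve.

```python
def gemm_kxm_kxn(lhs, rhs):
    """lhs is K rows of M values, rhs is K rows of N values; result is MxN."""
    K, M, N = len(lhs), len(lhs[0]), len(rhs[0])
    out = [[0.0] * N for _ in range(M)]
    for k in range(K):           # walk the shared reduction dimension
        for m in range(M):       # lhs[k][m] for fixed m is a column walk -> strided
            for n in range(N):
                out[m][n] += lhs[k][m] * rhs[k][n]
    return out

lhs = [[1, 2], [3, 4], [5, 6]]            # K=3, M=2
rhs = [[1, 0, 1], [0, 1, 0], [1, 1, 1]]   # K=3, N=3
print(gemm_kxm_kxn(lhs, rhs))             # [[6.0, 8.0, 6.0], [8.0, 10.0, 8.0]]
```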

Benchmark Summary:
  • Total benchmarks: 162
  • Significant changes (>2.0%): 142
  • Improvements (transpose_load faster): 133
  • Regressions (default faster): 9
  • Mean % change: -20.44%
  • Range: -56.29% to 12.22%

Input Backward Convolutions

Equates to MxK @ KxN GEMM layout.
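
For contrast, a sketch of the standard "MxK @ KxN" layout (again, the function name and shapes are illustrative): here each lhs row is read contiguously along K, so the lhs side does not need transposed access, which may be why the gains in this category are smaller.

```python
def gemm_mxk_kxn(lhs, rhs):
    """Standard row-major GEMM: lhs is MxK, rhs is KxN; result is MxN."""
    M, K, N = len(lhs), len(lhs[0]), len(rhs[0])
    out = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for k in range(K):       # lhs[m][k] walks a row -> contiguous access
            for n in range(N):
                out[m][n] += lhs[m][k] * rhs[k][n]
    return out

lhs = [[1, 2, 3], [4, 5, 6]]     # M=2, K=3
rhs = [[1, 0], [0, 1], [1, 1]]   # K=3, N=2
print(gemm_mxk_kxn(lhs, rhs))    # [[4.0, 5.0], [10.0, 11.0]]
```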

Benchmark Summary:
  • Total benchmarks: 146
  • Significant changes (>2.0%): 43
  • Improvements (transpose_load faster): 24
  • Regressions (default faster): 19
  • Mean % change: -1.09%
  • Range: -34.02% to 12.08%

Additional notes

  • All benchmarks were collected on a single MI355 GPU
  • There are regressions, but the largest ones appear to be noise; retesting them did not show real regressions.
  • Forward convolution data is not included, since forward convolutions do not target transpose_load instructions.

ci-extra: test_torch


Max191 commented Jan 28, 2026

Based on #23267

I haven't collected comprehensive benchmark numbers yet; I will add them to the PR description once I have.

Max191 force-pushed the enable-transpose-load branch from 7323b43 to fa98bda on February 2, 2026 at 18:56
Max191 marked this pull request as ready for review on February 2, 2026 at 18:57

Max191 commented Feb 2, 2026

I was finally able to get some good benchmark runs, and I added the results to the description.

cc @nirvedhmeshram @yzhang93 @MaheshRavishankar

@nirvedhmeshram left a comment

LGTM, great to see performance improvement on weight backward!
