
Conversation

@alexbaden (Contributor)

We cannot lower a transposed A matrix to a transposed 2D block load. Instead, the load is lowered via the LLVM path introduced in #2181. There appears to be a performance regression in this path: it is slower than materializing the block in SLM, reading it into registers, and computing the dot product from there. Using the work in #2420, I am able to drop the block-load attribute for this case and go down the non-block-pointer path.
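
For context, this is roughly the shape of a block-pointer matmul kernel affected here (an illustrative sketch, not the repository's benchmark; the kernel name, tile sizes, and launch details are my assumptions). Passing `a.T` only swaps the view's strides, so the A tile becomes column-major in memory and cannot be fetched with a row-major 2D block load:

```python
# Illustrative sketch only -- not the benchmark kernel from this PR.
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # When the caller passes a.T, stride_am/stride_ak arrive swapped, so the
    # A tile is no longer contiguous along K -- the case this PR routes away
    # from the 2D block load.
    a_bptr = tl.make_block_ptr(a_ptr, shape=(M, K),
                               strides=(stride_am, stride_ak),
                               offsets=(pid_m * BLOCK_M, 0),
                               block_shape=(BLOCK_M, BLOCK_K), order=(1, 0))
    b_bptr = tl.make_block_ptr(b_ptr, shape=(K, N),
                               strides=(stride_bk, stride_bn),
                               offsets=(0, pid_n * BLOCK_N),
                               block_shape=(BLOCK_K, BLOCK_N), order=(1, 0))
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        acc += tl.dot(tl.load(a_bptr, boundary_check=(0, 1)),
                      tl.load(b_bptr, boundary_check=(0, 1)))
        a_bptr = tl.advance(a_bptr, (0, BLOCK_K))
        b_bptr = tl.advance(b_bptr, (BLOCK_K, 0))
    c_bptr = tl.make_block_ptr(c_ptr, shape=(M, N),
                               strides=(stride_cm, stride_cn),
                               offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
                               block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    tl.store(c_bptr, acc, boundary_check=(0, 1))
```

Launching this with a transposed A view, e.g. passing `a.T` together with `a.T.stride(0)` and `a.T.stride(1)`, is what exercises the slow path measured below.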

Performance on main:

Compute A x B
✅ Triton and Torch match
Time for torch: 0.32444801926612854 ms
Time for triton: 0.44371041655540466 ms
Compute A x B.T
✅ Triton and Torch match
Time for torch: 0.32708799839019775 ms
Time for triton: 0.634996771812439 ms
Compute A.T x B
✅ Triton and Torch match
Time for torch: 0.31204161047935486 ms
Time for triton: 3.4140689373016357 ms
Compute A.T x B.T
✅ Triton and Torch match
Time for torch: 0.45701122283935547 ms
Time for triton: 3.7463345527648926 ms

Performance on this PR:

Compute A x B
✅ Triton and Torch match
Time for torch: 0.3081200122833252 ms
Time for triton: 0.44333598017692566 ms
Compute A x B.T
✅ Triton and Torch match
Time for torch: 0.33799198269844055 ms
Time for triton: 0.6391856074333191 ms
Compute A.T x B
✅ Triton and Torch match
Time for torch: 0.31700319051742554 ms
Time for triton: 1.5733630657196045 ms
Compute A.T x B.T
✅ Triton and Torch match
Time for torch: 0.45083683729171753 ms
Time for triton: 1.8271965980529785 ms
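
For reference, the timings above read like output from a small harness along these lines (a hedged sketch of one way to produce comparable numbers; `triton_matmul`, the problem size, and the device string are my assumptions, not from this PR):

```python
import torch
import triton

def bench(label, a, b):
    # triton_matmul is a hypothetical wrapper that launches the Triton kernel.
    ref = torch.matmul(a, b)
    out = triton_matmul(a, b)
    print(f"Compute {label}")
    print("✅ Triton and Torch match" if torch.allclose(ref, out, atol=1e-2, rtol=1e-2)
          else "❌ Triton and Torch differ")
    # do_bench reports the median runtime in milliseconds.
    print(f"Time for torch: {triton.testing.do_bench(lambda: torch.matmul(a, b))} ms")
    print(f"Time for triton: {triton.testing.do_bench(lambda: triton_matmul(a, b))} ms")

M = N = K = 4096  # assumed square problem size; the PR does not state it
A = torch.randn((M, K), device="xpu", dtype=torch.float16)
B = torch.randn((K, N), device="xpu", dtype=torch.float16)
bench("A x B", A, B)
bench("A x B.T", A, B.T)   # the transposed views only swap strides;
bench("A.T x B", A.T, B)   # no data is moved
bench("A.T x B.T", A.T, B.T)
```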

Note that the important commit is 31386ef1132c3f6cf9cb5f1063ecfab705f4c2a1. Once #2420 is merged, I will rebase this PR.

Depends on #2420. Links to #1795.

@vlad-penkin linked an issue on Oct 10, 2024 that may be closed by this pull request
@alexbaden force-pushed the alex/skip_transposed_a_matrix_in_mbp branch from 31386ef to 61ef8a7 on October 10, 2024 at 11:41
@whitneywhtsang (Contributor) left a comment


This change can be used to work around the performance regression in the short term; in the long term, let's improve the block-pointer-to-tensor-of-pointers lowering.
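
For readers following along, the two lowering forms that comment contrasts look roughly like this at the Triton source level (a schematic illustration of the two load styles, not the compiler pass itself; the helper names are mine):

```python
import triton
import triton.language as tl

@triton.jit
def load_a_tile_block_ptr(a_ptr, M, K, stride_am, stride_ak,
                          BLOCK_M: tl.constexpr, BLOCK_K: tl.constexpr):
    # Block-pointer form: a single tile descriptor that the backend can map
    # onto a hardware 2D block load when the layout allows it.
    bptr = tl.make_block_ptr(a_ptr, (M, K), (stride_am, stride_ak),
                             (0, 0), (BLOCK_M, BLOCK_K), (1, 0))
    return tl.load(bptr, boundary_check=(0, 1))

@triton.jit
def load_a_tile_tensor_of_ptrs(a_ptr, stride_am, stride_ak,
                               BLOCK_M: tl.constexpr, BLOCK_K: tl.constexpr):
    # Tensor-of-pointers form: an explicit BLOCK_M x BLOCK_K grid of
    # addresses lowered to gather-style loads -- the fallback path whose
    # performance the comment proposes to improve.
    offs_m = tl.arange(0, BLOCK_M)
    offs_k = tl.arange(0, BLOCK_K)
    ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    return tl.load(ptrs)
```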

@alexbaden (Contributor, Author)

Sounds good to me!

@alexbaden merged commit 89868e2 into main on Oct 10, 2024
@alexbaden deleted the alex/skip_transposed_a_matrix_in_mbp branch on October 10, 2024 at 15:28


Development

Successfully merging this pull request may close these issues.

[GEMM-perf] matmul is slower when one input needs to be transposed
