
Conversation

@alexbaden (Contributor)

We cannot lower a transposed A matrix to a transposed 2D block load. Instead, the load is lowered via the LLVM path introduced in #2181. There appears to be a performance regression in this path: it is slower than materializing the block in SLM, reading it into registers, and computing the dot product from there. Using the work in #2420, I am able to drop the block-load attribute for this case and go down the non-block-pointer path.
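
For context, this is roughly the shape of a block-pointer matmul kernel affected here (an illustrative sketch, not the repository's benchmark; the kernel name, tile sizes, and launch details are my assumptions). Passing `a.T` only swaps the view's strides, so the A tile becomes column-major in memory and cannot be fetched with a row-major 2D block load:

```python
# Illustrative sketch only -- not the benchmark kernel from this PR.
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # When the caller passes a.T, stride_am/stride_ak arrive swapped, so the
    # A tile is no longer contiguous along K -- the case this PR routes away
    # from the 2D block load.
    a_bptr = tl.make_block_ptr(a_ptr, shape=(M, K),
                               strides=(stride_am, stride_ak),
                               offsets=(pid_m * BLOCK_M, 0),
                               block_shape=(BLOCK_M, BLOCK_K), order=(1, 0))
    b_bptr = tl.make_block_ptr(b_ptr, shape=(K, N),
                               strides=(stride_bk, stride_bn),
                               offsets=(0, pid_n * BLOCK_N),
                               block_shape=(BLOCK_K, BLOCK_N), order=(1, 0))
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        acc += tl.dot(tl.load(a_bptr, boundary_check=(0, 1)),
                      tl.load(b_bptr, boundary_check=(0, 1)))
        a_bptr = tl.advance(a_bptr, (0, BLOCK_K))
        b_bptr = tl.advance(b_bptr, (BLOCK_K, 0))
    c_bptr = tl.make_block_ptr(c_ptr, shape=(M, N),
                               strides=(stride_cm, stride_cn),
                               offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
                               block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    tl.store(c_bptr, acc, boundary_check=(0, 1))
```

Launching this with a transposed A view, e.g. passing `a.T` together with `a.T.stride(0)` and `a.T.stride(1)`, is what exercises the slow path measured below.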

Performance on main:

Compute A x B
✅ Triton and Torch match
Time for torch: 0.32444801926612854 ms
Time for triton: 0.44371041655540466 ms
Compute A x B.T
✅ Triton and Torch match
Time for torch: 0.32708799839019775 ms
Time for triton: 0.634996771812439 ms
Compute A.T x B
✅ Triton and Torch match
Time for torch: 0.31204161047935486 ms
Time for triton: 3.4140689373016357 ms
Compute A.T x B.T
✅ Triton and Torch match
Time for torch: 0.45701122283935547 ms
Time for triton: 3.7463345527648926 ms

Performance on this PR:

Compute A x B
✅ Triton and Torch match
Time for torch: 0.3081200122833252 ms
Time for triton: 0.44333598017692566 ms
Compute A x B.T
✅ Triton and Torch match
Time for torch: 0.33799198269844055 ms
Time for triton: 0.6391856074333191 ms
Compute A.T x B
✅ Triton and Torch match
Time for torch: 0.31700319051742554 ms
Time for triton: 1.5733630657196045 ms
Compute A.T x B.T
✅ Triton and Torch match
Time for torch: 0.45083683729171753 ms
Time for triton: 1.8271965980529785 ms
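
For reference, the timings above read like output from a small harness along these lines (a hedged sketch of one way to produce comparable numbers; `triton_matmul`, the problem size, and the device string are my assumptions, not from this PR):

```python
import torch
import triton

def bench(label, a, b):
    # triton_matmul is a hypothetical wrapper that launches the Triton kernel.
    ref = torch.matmul(a, b)
    out = triton_matmul(a, b)
    print(f"Compute {label}")
    print("✅ Triton and Torch match" if torch.allclose(ref, out, atol=1e-2, rtol=1e-2)
          else "❌ Triton and Torch differ")
    # do_bench reports the median runtime in milliseconds.
    print(f"Time for torch: {triton.testing.do_bench(lambda: torch.matmul(a, b))} ms")
    print(f"Time for triton: {triton.testing.do_bench(lambda: triton_matmul(a, b))} ms")

M = N = K = 4096  # assumed square problem size; the PR does not state it
A = torch.randn((M, K), device="xpu", dtype=torch.float16)
B = torch.randn((K, N), device="xpu", dtype=torch.float16)
bench("A x B", A, B)
bench("A x B.T", A, B.T)   # the transposed views only swap strides;
bench("A.T x B", A.T, B)   # no data is moved
bench("A.T x B.T", A.T, B.T)
```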

Note that the important commit is 31386ef1132c3f6cf9cb5f1063ecfab705f4c2a1. Once #2420 is merged, I will rebase this PR.

Depends on #2420. Links to #1795.

@vlad-penkin linked an issue on Oct 10, 2024 that may be closed by this pull request
@alexbaden force-pushed the alex/skip_transposed_a_matrix_in_mbp branch from 31386ef to 61ef8a7 on October 10, 2024 at 11:41
@whitneywhtsang (Contributor) left a comment


This change can be used to work around the performance regression in the short term; in the long term, let's improve the block-pointer-to-tensor-of-pointers lowering.
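
For readers following along, the two lowering forms that comment contrasts look roughly like this at the Triton source level (a schematic illustration of the two load styles, not the compiler pass itself; the helper names are mine):

```python
import triton
import triton.language as tl

@triton.jit
def load_a_tile_block_ptr(a_ptr, M, K, stride_am, stride_ak,
                          BLOCK_M: tl.constexpr, BLOCK_K: tl.constexpr):
    # Block-pointer form: a single tile descriptor that the backend can map
    # onto a hardware 2D block load when the layout allows it.
    bptr = tl.make_block_ptr(a_ptr, (M, K), (stride_am, stride_ak),
                             (0, 0), (BLOCK_M, BLOCK_K), (1, 0))
    return tl.load(bptr, boundary_check=(0, 1))

@triton.jit
def load_a_tile_tensor_of_ptrs(a_ptr, stride_am, stride_ak,
                               BLOCK_M: tl.constexpr, BLOCK_K: tl.constexpr):
    # Tensor-of-pointers form: an explicit BLOCK_M x BLOCK_K grid of
    # addresses lowered to gather-style loads -- the fallback path whose
    # performance the comment proposes to improve.
    offs_m = tl.arange(0, BLOCK_M)
    offs_k = tl.arange(0, BLOCK_K)
    ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    return tl.load(ptrs)
```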

@alexbaden (Contributor, Author)

Sounds good to me!

@alexbaden merged commit 89868e2 into main on Oct 10, 2024
@alexbaden deleted the alex/skip_transposed_a_matrix_in_mbp branch on October 10, 2024 at 15:28


Development

Successfully merging this pull request may close these issues.

[GEMM-perf] matmul is slower when one input needs to be transposed
