[AMD][Draft] Eliminate redundant matmul by adjusting HeadDot wavefront partitioning #8449
Related RFC
[RFC][AMD] Optimizations for Paged Attention: Proposal with Multiple Features (#8281)
[Feature 1] Elimination of Redundant Matrix Multiplications
Problem
We found that the QK computation contains redundancy. Analysis shows that `AccelerateAMDMatmul` uses `{numWarps, 1}` by default for `HeadDot`, meaning that matrix Q is partitioned by rows (`seq_len * num_q_heads`) across the wavefronts, while matrix K is replicated to every wavefront. Due to the particular shape of the PA computation (`seq_len = 1`, `num_q_heads = 16`), row-based partitioning of Q cannot occupy all wavefronts, which leads to a large amount of redundant computation. The goal of this feature is to adjust the wavefront partitioning strategy to avoid this redundancy.
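To make the waste concrete, here is a minimal back-of-the-envelope sketch (not code from this PR) of how many wavefronts do unique work under the default `{numWarps, 1}` split. The MFMA tile size and the wavefront count are illustrative assumptions, not values taken from the pass.

```cpp
#include <algorithm>
#include <cstdio>

int main() {
  const int M = 1 * 16;    // seq_len * num_q_heads for the PA decode shape
  const int mfmaM = 16;    // rows covered by one MFMA tile (assumed)
  const int numWarps = 8;  // wavefronts per CTA (assumed)

  // Default HeadDot layout: warpsPerCTA = {numWarps, 1}, i.e. all wavefronts
  // split the 16 Q rows among themselves; K columns are not split at all.
  int distinctRowTiles = std::max(1, M / mfmaM);           // only 1 row tile
  int usefulWarps = std::min(numWarps, distinctRowTiles);  // 1 of 8
  std::printf("useful wavefronts: %d of %d -> %dx redundant QK work\n",
              usefulWarps, numWarps, numWarps / usefulWarps);
}
```

With these assumed numbers, all eight wavefronts compute the same 16-row tile against the full K matrix, so seven of the eight QK products are duplicates.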
Core Process
Adjust the wavefront partitioning of `HeadDot` so that the left operand is prioritized for row partitioning. When the row-axis length is insufficient to occupy all wavefronts, allocate the remaining wavefronts to column partitioning of the right operand.
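A minimal sketch of the intended selection logic, written as plain C++ rather than the actual MLIR pass code; the helper name `pickWarpsPerTile`, the `mfmaM` tile parameter, and the power-of-two `numWarps` requirement are all illustrative assumptions.

```cpp
#include <algorithm>
#include <array>
#include <cstdio>

// Hypothetical helper: prefer row partitioning of the left operand (Q); once
// the row axis runs out of distinct MFMA tiles, spill the remaining
// wavefronts into column partitioning of the right operand (K).
std::array<int, 2> pickWarpsPerTile(int M, int numWarps, int mfmaM) {
  int warpsM = std::min(numWarps, std::max(1, M / mfmaM));
  int warpsN = numWarps / warpsM;  // assumes numWarps is a power of two
  return {warpsM, warpsN};
}

int main() {
  // PA decode shape: Q has seq_len * num_q_heads = 16 rows.
  auto w = pickWarpsPerTile(/*M=*/16, /*numWarps=*/8, /*mfmaM=*/16);
  std::printf("warpsPerTile = {%d, %d}\n", w[0], w[1]);  // {1, 8}, not {8, 1}
}
```

Under this policy the PA shape above yields `{1, 8}`: one row tile of Q is enough, and the eight wavefronts instead split the columns of K, so each computes a distinct slice of QK instead of replicating the whole product.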