
@the-strawhat

Related RFC

[RFC][AMD] Optimizations for Paged Attention: Proposal with Multiple Features (#8281)

[Feature 1] Elimination of Redundant Matrix Multiplications

Problem
We found that the QK computation contains redundant work. Analysis shows that AccelerateAMDMatmul uses {numWarps, 1} by default for HeadDot, so matrix Q is partitioned by rows (seq_len * num_q_heads) across the wavefronts while matrix K is replicated to every wavefront.
Because paged attention decodes with seq_len = 1 and num_q_heads = 16, Q has too few rows for row-based partitioning to give each wavefront distinct work, which leads to a large amount of redundant computation. The goal of this feature is to adjust the wavefront partitioning strategy to eliminate that redundancy.
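
To make the redundancy concrete, here is a rough back-of-the-envelope sketch (plain Python, not Triton code) of per-wavefront QK FLOPs under the two partitionings. The 16x16 minimum MFMA tile and the n, k values are illustrative assumptions, not figures from the RFC:

```python
# Rough sketch of the redundancy described above (not Triton code).
# The 16x16 MFMA tile size is an assumption for illustration only.
import math

def qk_flops_per_wavefront(m, n, k, warps_per_cta, mfma_m=16, mfma_n=16):
    """FLOPs each wavefront performs for an (m x k) @ (k x n) HeadDot.

    A wavefront's row/column slice cannot shrink below one MFMA tile, so
    when m is too small to split, every wavefront recomputes the same rows.
    """
    warps_m, warps_n = warps_per_cta
    rows_per_warp = max(math.ceil(m / warps_m), mfma_m)
    cols_per_warp = max(math.ceil(n / warps_n), mfma_n)
    return 2 * rows_per_warp * cols_per_warp * k

# Paged-attention decode: seq_len = 1, num_q_heads = 16 -> Q has 16 rows.
m, n, k = 1 * 16, 128, 128          # n and k chosen only for illustration

default = qk_flops_per_wavefront(m, n, k, warps_per_cta=(4, 1))
proposed = qk_flops_per_wavefront(m, n, k, warps_per_cta=(1, 4))
print(default / proposed)  # 4.0: the default layout does 4x the work
```

Under these assumptions, the default {numWarps, 1} layout makes all four wavefronts compute the same 16-row tile, a 4x duplication that splitting K's columns instead avoids.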

Core Process

  1. Modify the partitioning of HeadDot so that the left operand's rows are partitioned first; when that axis is too short to split further, allocate the remaining wavefronts to column partitioning of the right operand (see the sketch after this list).

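A minimal sketch of this partitioning rule, assuming a smallest per-wavefront row slice of tile_m = 16 (e.g. one MFMA tile); this is an illustration, not the actual AccelerateAMDMatmul implementation:

```python
# Minimal sketch of the rule in step 1, not the real compiler pass.
# `tile_m` (the smallest row slice a wavefront can own) is an assumption.
def pick_warps_per_cta(m, num_warps, tile_m=16):
    """Return [warps_m, warps_n] for a HeadDot with m rows in the left operand.

    Rows of the left operand are partitioned first; once m can no longer be
    split (fewer than tile_m rows per wavefront), the remaining wavefronts
    are spent partitioning the right operand's columns instead.
    """
    warps_m = 1
    # Grow the row split only while each wavefront still gets >= tile_m rows.
    while warps_m * 2 <= num_warps and m // (warps_m * 2) >= tile_m:
        warps_m *= 2
    warps_n = num_warps // warps_m     # leftover wavefronts go to columns
    return [warps_m, warps_n]

# seq_len = 1, num_q_heads = 16: only 16 rows, so all 4 wavefronts
# move to the column axis of K.
print(pick_warps_per_cta(m=16, num_warps=4))   # [1, 4]
print(pick_warps_per_cta(m=128, num_warps=4))  # [4, 1] (enough rows)
```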

// -----

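// Default blocked layout for HeadDot: warpsPerCTA = [4, 1] places all 4 wavefronts along the row axis.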
#blocked = #ttg.blocked<{sizePerThread = [4, 4], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}>
Collaborator:

please minimize the test
