
@the-strawhat

Related RFC

[RFC][AMD] Optimizations for Paged Attention: Proposal with Multiple Features (#8281)

[Feature 1] Elimination of Redundant Matrix Multiplications

Problem
We found that the QK computation contains redundant work. Analysis shows that AccelerateAMDMatmul uses {numWarps, 1} by default for HeadDot, so matrix Q is partitioned by rows (seq_len * num_q_heads) across the wavefronts while matrix K is replicated to every wavefront.
Because paged attention decodes with seq_len = 1 and num_q_heads = 16, Q has too few rows for row-based partitioning to give each wavefront distinct work, which leads to a large amount of redundant computation. The goal of this feature is to adjust the wavefront partitioning strategy to eliminate that redundancy.
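
To make the redundancy concrete, here is a rough back-of-the-envelope sketch (plain Python, not Triton code) of per-wavefront QK FLOPs under the two partitionings. The 16x16 minimum MFMA tile and the n, k values are illustrative assumptions, not figures from the RFC:

```python
# Rough sketch of the redundancy described above (not Triton code).
# The 16x16 MFMA tile size is an assumption for illustration only.
import math

def qk_flops_per_wavefront(m, n, k, warps_per_cta, mfma_m=16, mfma_n=16):
    """FLOPs each wavefront performs for an (m x k) @ (k x n) HeadDot.

    A wavefront's row/column slice cannot shrink below one MFMA tile, so
    when m is too small to split, every wavefront recomputes the same rows.
    """
    warps_m, warps_n = warps_per_cta
    rows_per_warp = max(math.ceil(m / warps_m), mfma_m)
    cols_per_warp = max(math.ceil(n / warps_n), mfma_n)
    return 2 * rows_per_warp * cols_per_warp * k

# Paged-attention decode: seq_len = 1, num_q_heads = 16 -> Q has 16 rows.
m, n, k = 1 * 16, 128, 128          # n and k chosen only for illustration

default = qk_flops_per_wavefront(m, n, k, warps_per_cta=(4, 1))
proposed = qk_flops_per_wavefront(m, n, k, warps_per_cta=(1, 4))
print(default / proposed)  # 4.0: the default layout does 4x the work
```

Under these assumptions, the default {numWarps, 1} layout makes all four wavefronts compute the same 16-row tile, a 4x duplication that splitting K's columns instead avoids.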

Core Process

  1. Modify the partitioning of HeadDot so that the left operand's rows are partitioned first; when that axis is too short to split further, allocate the remaining wavefronts to column partitioning of the right operand (see the sketch after this list).

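A minimal sketch of this partitioning rule, assuming a smallest per-wavefront row slice of tile_m = 16 (e.g. one MFMA tile); this is an illustration, not the actual AccelerateAMDMatmul implementation:

```python
# Minimal sketch of the rule in step 1, not the real compiler pass.
# `tile_m` (the smallest row slice a wavefront can own) is an assumption.
def pick_warps_per_cta(m, num_warps, tile_m=16):
    """Return [warps_m, warps_n] for a HeadDot with m rows in the left operand.

    Rows of the left operand are partitioned first; once m can no longer be
    split (fewer than tile_m rows per wavefront), the remaining wavefronts
    are spent partitioning the right operand's columns instead.
    """
    warps_m = 1
    # Grow the row split only while each wavefront still gets >= tile_m rows.
    while warps_m * 2 <= num_warps and m // (warps_m * 2) >= tile_m:
        warps_m *= 2
    warps_n = num_warps // warps_m     # leftover wavefronts go to columns
    return [warps_m, warps_n]

# seq_len = 1, num_q_heads = 16: only 16 rows, so all 4 wavefronts
# move to the column axis of K.
print(pick_warps_per_cta(m=16, num_warps=4))   # [1, 4]
print(pick_warps_per_cta(m=128, num_warps=4))  # [4, 1] (enough rows)
```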

// -----

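// Default blocked layout for HeadDot: warpsPerCTA = [4, 1] places all 4 wavefronts along the row axis.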
#blocked = #ttg.blocked<{sizePerThread = [4, 4], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}>
Collaborator:

please minimize the test
