Conversation

@quintinwang5 (Contributor) commented Oct 31, 2024

This change improves the cache behavior for N_CTX=512: we get a 20%+ performance gain from it, but it may be harmful for large N_CTX, so the change is restricted to N_CTX <= 512. CI data:

@etiotto etiotto changed the title [FA]:Optimize FlalshAttention for N_CTX <= 512 [FA]:Optimize FlashAttention for N_CTX <= 512 Oct 31, 2024
@chengjunlu (Contributor)

The code change LGTM.
Can you explain the difference in memory footprint between the old version and the new one?

@quintinwang5 (Contributor, Author)

> The code change LGTM. Can you explain the difference in memory footprint between the old version and the new one?

Actually, I'm still working on figuring it out. It's puzzling.

@quintinwang5 quintinwang5 merged commit b8fc4b9 into main Nov 1, 2024
5 checks passed
@quintinwang5 quintinwang5 deleted the quintin/perf_n_ctx_512 branch November 1, 2024 01:12
@whitneywhtsang (Contributor)

Did you get the idea from XeTLA, i.e., is the implementation closer to XeTLA now?

@quintinwang5 (Contributor, Author)

> Did you get the idea from XeTLA, i.e., is the implementation closer to XeTLA now?

It mainly came from many experiments guided by profiling data. XeTLA uses a different arrangement that is suitable for all the shapes.

yudongsi added a commit that referenced this pull request Nov 11, 2024
This change (a `grid` order adjustment to improve cache hits) originates from #2600.
It applies to batched gemm only.
~99% of XeTLA performance for `4096x8x128x16384`.

![image](https://github.com/user-attachments/assets/ef7e9750-b3f7-4adc-aa66-5be704383e40)
@quintinwang5 (Contributor, Author)

> The code change LGTM. Can you explain the difference in memory footprint between the old version and the new one?

The reason is: we use BLOCK_M to split Q along the Y-axis. All the blocks along that axis share the same K and V, so if the workgroups that process different BLOCK_M tiles are scheduled consecutively, the cached K and V can be reused. Taking (32, 32, 512, 64) as an example with BLOCK_M = 128, N_CTX is split into 4 blocks (4 x 128 = 512). The old nd_range is {(4, 32, 32), (128, 1, 1)}, where those 4 blocks are not consecutive: the stride between them is 32 x 32 workgroups. If we change it to {(32, 32, 4), (128, 1, 1)}, the 4 blocks become consecutive.
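The scheduling effect described above can be sketched numerically. This is not the actual kernel-launch code; the helper names below are illustrative, assuming workgroups are dispatched in linear order with the first nd_range dimension varying slowest:

```python
# Map a workgroup's (dim0, dim1, dim2) coordinates to its linear dispatch
# index, assuming dim0 is the slowest-varying dimension.
def linear_id_old(m, h, z, H=32, Z=32):
    # Old nd_range {(4, 32, 32), ...}: m (the Q block index) is slowest,
    # so blocks sharing the same K/V are 32*32 = 1024 workgroups apart.
    return (m * H + h) * Z + z

def linear_id_new(h, z, m, Z=32, M=4):
    # New nd_range {(32, 32, 4), ...}: m is fastest, so the 4 blocks
    # that reuse the same K/V run back to back.
    return (h * Z + z) * M + m

# The 4 Q blocks for head h=0, batch z=0:
old_ids = [linear_id_old(m, 0, 0) for m in range(4)]
new_ids = [linear_id_new(0, 0, m) for m in range(4)]
# old_ids == [0, 1024, 2048, 3072]; new_ids == [0, 1, 2, 3]
```

With the new ordering, the 4 workgroups that load the same K and V are adjacent in dispatch order, so the K/V tiles loaded by the first are likely still resident in cache when the others run.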




Development

Successfully merging this pull request may close these issues.

[FA] Improve performance of shapes <95% on advanced path - 32x32x512, 4x32x4096, 2x32x8192
