
Conversation

@vnnm404 vnnm404 commented Dec 23, 2025

In the original algorithm, the intermediate scores S_i are not stored in shared memory. Instead, the output O_i is accumulated incrementally as each block is processed. This PR adopts that approach, removing the need to materialize S_i and aligning the implementation more directly with the paper.
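For reference, here is a minimal sketch of that accumulation step. The identifiers (fold_in_block, Bc, softmax_scale) are illustrative and not taken from this PR: each thread keeps a running row max m, a running normalizer l, and an output accumulator o, and rescales them as each K/V block is folded in, so the scores are only ever held in registers and never written to shared memory.

```cuda
// Hypothetical per-thread accumulation over one K/V block of Bc keys.
// m, l, o[] are thread-local running statistics for this query row,
// initialized to -INFINITY, 0, and 0 before the first block.
__device__ void fold_in_block(const float* q, const float* kj, const float* vj,
                              int Bc, int d, float softmax_scale,
                              float& m, float& l, float* o /* [d] */) {
    for (int c = 0; c < Bc; ++c) {
        // score for this (row, key) pair -- kept in a register, never stored
        float s = 0.f;
        for (int x = 0; x < d; ++x) s += q[x] * kj[c * d + x];
        s *= softmax_scale;

        float m_new = fmaxf(m, s);
        float scale = __expf(m - m_new);   // rescale the previous accumulator
        float p     = __expf(s - m_new);   // weight for this key

        l = l * scale + p;
        for (int x = 0; x < d; ++x)
            o[x] = o[x] * scale + p * vj[c * d + x];
        m = m_new;
    }
    // After the last block, the caller divides o[] by l to finish the softmax.
}
```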

Additionally, the kernel launch configuration has been changed to use one thread per row. This removes the outer T_c loop, making the control flow much closer to the pseudocode in the paper and easier to reason about and compare against the reference algorithm.
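A rough sketch of what that control flow might look like, reusing the hypothetical fold_in_block helper above and assuming a single head, batch size 1, N divisible by the thread-block size, and d <= 128; none of these names or sizes are taken from the PR itself.

```cuda
// Hypothetical kernel skeleton: one thread per query row, and the only loop
// is over the Tc key/value blocks, mirroring the per-row pseudocode.
__global__ void flash_attn_row_per_thread(const float* Q, const float* K,
                                          const float* V, float* O,
                                          int d, int Tc, int Bc,
                                          float softmax_scale) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's query row

    extern __shared__ float smem[];                    // current K/V block only
    float* Kj = smem;
    float* Vj = smem + Bc * d;

    float m = -INFINITY, l = 0.f, o[128] = {0.f};      // running statistics

    for (int j = 0; j < Tc; ++j) {                     // loop over K/V blocks
        // cooperatively stage the j-th K and V blocks into shared memory
        for (int idx = threadIdx.x; idx < Bc * d; idx += blockDim.x) {
            Kj[idx] = K[j * Bc * d + idx];
            Vj[idx] = V[j * Bc * d + idx];
        }
        __syncthreads();

        // fold this block into (m, l, o) via the online-softmax update above
        fold_in_block(&Q[row * d], Kj, Vj, Bc, d, softmax_scale, m, l, o);
        __syncthreads();
    }

    for (int x = 0; x < d; ++x) O[row * d + x] = o[x] / l;   // final normalize
}

// Hypothetical launch configuration: one thread per query row.
// dim3 grid(N / Br);  dim3 block(Br);
// size_t smem_bytes = 2 * Bc * d * sizeof(float);
// flash_attn_row_per_thread<<<grid, block, smem_bytes>>>(Q, K, V, O, d, Tc, Bc, scale);
```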

For the particular tensor sizes used, this version uses more shared memory and runs slightly slower (by ~1 ms) on a 3060; however, it may be easier to understand and extend, especially for readers learning how the algorithm works.


Results

=== profiling manual attention ===
Self CPU time total: 97.501ms
Self CUDA time total: 97.638ms

=== profiling minimal flash attention ===
Self CPU time total: 15.558ms
Self CUDA time total: 6.453ms

