
Conversation

@vnnm404 vnnm404 commented Dec 23, 2025

In the original algorithm, the intermediate scores S_i are not stored in shared memory. Instead, the output O_i is accumulated incrementally as each block is processed. This PR adopts that approach, removing the need to materialize S_i and aligning the implementation more directly with the paper.
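For reference, here is a minimal sketch of that accumulation step. The identifiers (fold_in_block, Bc, softmax_scale) are illustrative and not taken from this PR: each thread keeps a running row max m, a running normalizer l, and an output accumulator o, and rescales them as each K/V block is folded in, so the scores are only ever held in registers and never written to shared memory.

```cuda
// Hypothetical per-thread accumulation over one K/V block of Bc keys.
// m, l, o[] are thread-local running statistics for this query row,
// initialized to -INFINITY, 0, and 0 before the first block.
__device__ void fold_in_block(const float* q, const float* kj, const float* vj,
                              int Bc, int d, float softmax_scale,
                              float& m, float& l, float* o /* [d] */) {
    for (int c = 0; c < Bc; ++c) {
        // score for this (row, key) pair -- kept in a register, never stored
        float s = 0.f;
        for (int x = 0; x < d; ++x) s += q[x] * kj[c * d + x];
        s *= softmax_scale;

        float m_new = fmaxf(m, s);
        float scale = __expf(m - m_new);   // rescale the previous accumulator
        float p     = __expf(s - m_new);   // weight for this key

        l = l * scale + p;
        for (int x = 0; x < d; ++x)
            o[x] = o[x] * scale + p * vj[c * d + x];
        m = m_new;
    }
    // After the last block, the caller divides o[] by l to finish the softmax.
}
```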

Additionally, the kernel launch configuration has been changed to use one thread per row. This removes the outer T_c loop, making the control flow much closer to the pseudocode in the paper and easier to reason about and compare against the reference algorithm.
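A rough sketch of what that control flow might look like, reusing the hypothetical fold_in_block helper above and assuming a single head, batch size 1, N divisible by the thread-block size, and d <= 128; none of these names or sizes are taken from the PR itself.

```cuda
// Hypothetical kernel skeleton: one thread per query row, and the only loop
// is over the Tc key/value blocks, mirroring the per-row pseudocode.
__global__ void flash_attn_row_per_thread(const float* Q, const float* K,
                                          const float* V, float* O,
                                          int d, int Tc, int Bc,
                                          float softmax_scale) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's query row

    extern __shared__ float smem[];                    // current K/V block only
    float* Kj = smem;
    float* Vj = smem + Bc * d;

    float m = -INFINITY, l = 0.f, o[128] = {0.f};      // running statistics

    for (int j = 0; j < Tc; ++j) {                     // loop over K/V blocks
        // cooperatively stage the j-th K and V blocks into shared memory
        for (int idx = threadIdx.x; idx < Bc * d; idx += blockDim.x) {
            Kj[idx] = K[j * Bc * d + idx];
            Vj[idx] = V[j * Bc * d + idx];
        }
        __syncthreads();

        // fold this block into (m, l, o) via the online-softmax update above
        fold_in_block(&Q[row * d], Kj, Vj, Bc, d, softmax_scale, m, l, o);
        __syncthreads();
    }

    for (int x = 0; x < d; ++x) O[row * d + x] = o[x] / l;   // final normalize
}

// Hypothetical launch configuration: one thread per query row.
// dim3 grid(N / Br);  dim3 block(Br);
// size_t smem_bytes = 2 * Bc * d * sizeof(float);
// flash_attn_row_per_thread<<<grid, block, smem_bytes>>>(Q, K, V, O, d, Tc, Bc, scale);
```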

For the particular tensor sizes used, this version uses more shared memory and runs slightly slower (by ~1 ms) on a 3060; however, it may be easier to understand and extend, especially for readers learning how the algorithm works.


Results

=== profiling manual attention ===
Self CPU time total: 97.501ms
Self CUDA time total: 97.638ms

=== profiling minimal flash attention ===
Self CPU time total: 15.558ms
Self CUDA time total: 6.453ms

