Comparing performance

Hi Peter,

Thanks for this nice repo. When I ran it for the first time, the naive attention algorithm was indeed much slower. But on the second run, it was drastically faster than the flash attention kernel. I take it the second run reflects the performance once the system is warmed up.
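To separate the one-time warm-up cost from steady-state timing, I think a harness along these lines would help. This is only a rough sketch: `benchmark`, `warmup`, `iters`, and `sync` are names I made up for illustration and are not part of the repo. For GPU work, I am assuming `sync` would be something like `torch.cuda.synchronize`, since kernel launches are asynchronous.

```python
import statistics
import time


def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return the median wall-clock time of fn() in seconds, after warm-up runs.

    `sync` is an optional callable (hypothetical parameter) that blocks until
    queued asynchronous work has finished, e.g. torch.cuda.synchronize.
    """
    # Warm-up runs absorb one-time costs (compilation, caches, allocator setup).
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warm-up work is done before timing starts

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work to finish before stopping the clock
        samples.append(time.perf_counter() - start)

    # The median is less sensitive to occasional outliers than the mean.
    return statistics.median(samples)
```

Calling `benchmark` once per implementation with the same inputs should give numbers that are comparable between the first and later runs.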

Is it fair to say that naive attention on these small input sizes is faster than minimal flash attention? I think that would make sense intuitively since the gains from Flash Attention should come from long sequence lengths.

<img width="1263" height="455" alt="Image" src="https://github.com/user-attachments/assets/6da69714-3d19-4e21-a20c-d3dcf1f5f684" />

<img width="1101" height="365" alt="Image" src="https://github.com/user-attachments/assets/f2c681a7-aefc-4843-bf39-841bcb59b541" />


