[Attention] Update to latest FA3 code #13111
Conversation
@khluu can we run the perf CI on this? Would be nice to check for regressions since there's a lot of FA changes
Force-pushed from 61a411e to c65ddce
Do we need to handle the fact that flash_attn_varlen_diff_headdims returns both output and *rest in this case?
We did, but the return was slicing a tensor.
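For illustration, here is a toy sketch of the calling convention under discussion: a varlen-attention helper that pads to a common head dim, slices the output back down to the value head dim, and optionally returns extra values (such as the softmax LSE) alongside the output. This is a hypothetical stand-in, not vLLM's actual implementation; `fake_attn_diff_headdims` and its parameters are invented here to mimic the real `flash_attn_varlen_diff_headdims` pattern without needing a GPU.

```python
# Toy stand-in for a flash_attn_varlen_diff_headdims-style helper.
# NOT the real kernel: plain lists replace tensors so this runs anywhere.
def fake_attn_diff_headdims(q_len, v_headdim, return_softmax_lse=False):
    padded_headdim = 128  # hypothetical common head dim used internally
    # "Attention output" at the padded head dim...
    out = [[0.0] * padded_headdim for _ in range(q_len)]
    # ...sliced back down to the value head dim (the "slicing a tensor" step).
    out = [row[:v_headdim] for row in out]
    lse = [0.0] * q_len  # stand-in for the per-query softmax LSE
    if return_softmax_lse:
        # Returns output *and* extra values, so callers must unpack both.
        return out, lse
    return out

# Caller handling the (output, *rest) shape:
out, lse = fake_attn_diff_headdims(4, 64, return_softmax_lse=True)
```

The point of the review comment is that a caller written for the single-return shape silently breaks once `return_softmax_lse=True` makes the helper return a tuple.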
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from c69b970 to d0ce155
Why does this not work for ROCm?
Sorry, this was cruft from an earlier Slack discussion that cast doubt on whether return_softmax_lse was supported on ROCm.
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Force-pushed from eb958e7 to fcc54d8
Hi, @LucasWilkinson @simon-mo. This PR uses a new function, so it seems that a new release of vllm-flash-attn is needed.
https://pypi.org/project/vllm-flash-attn/ — it's been a long time since vllm-flash-attn was last released on PyPI.
We currently ship vllm-flash-attn bundled inside the vLLM wheel.
Hi, @LucasWilkinson. Does the latest FA3 still fail on Lovelace GPUs due to shared memory limits for some shapes?
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Yang Wang <elainewy@meta.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
NOTE: Tested MLA on AMD: V0 is working; V1 is broken, but it is also broken on main.
Perf: https://docs.google.com/spreadsheets/d/1U5lsoCKuWq99Cz1QbWkc0dBn1bij1Ifb3tphE2UXJj0/edit?usp=sharing
Main: (benchmark results elided; see linked sheet)
This PR: (benchmark results elided; see linked sheet)
Accuracy drops are seen for: (affected configurations elided), due to the dynamic split scheduler.