[Attention] Update to latest FA3 code #13111
Conversation
@khluu can we run the perf CI on this? Would be nice to check for regressions since there's a lot of FA changes
Force-pushed from 61a411e to c65ddce
Do we need to handle the fact that flash_attn_varlen_diff_headdims returns both output and *rest in this case?
We did, but the return was slicing a tensor.
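For illustration, here is a toy sketch of the calling convention under discussion: a varlen-attention helper that pads to a common head dim, slices the output back down to the value head dim, and optionally returns extra values (such as the softmax LSE) alongside the output. This is a hypothetical stand-in, not vLLM's actual implementation; `fake_attn_diff_headdims` and its parameters are invented here to mimic the real `flash_attn_varlen_diff_headdims` pattern without needing a GPU.

```python
# Toy stand-in for a flash_attn_varlen_diff_headdims-style helper.
# NOT the real kernel: plain lists replace tensors so this runs anywhere.
def fake_attn_diff_headdims(q_len, v_headdim, return_softmax_lse=False):
    padded_headdim = 128  # hypothetical common head dim used internally
    # "Attention output" at the padded head dim...
    out = [[0.0] * padded_headdim for _ in range(q_len)]
    # ...sliced back down to the value head dim (the "slicing a tensor" step).
    out = [row[:v_headdim] for row in out]
    lse = [0.0] * q_len  # stand-in for the per-query softmax LSE
    if return_softmax_lse:
        # Returns output *and* extra values, so callers must unpack both.
        return out, lse
    return out

# Caller handling the (output, *rest) shape:
out, lse = fake_attn_diff_headdims(4, 64, return_softmax_lse=True)
```

The point of the review comment is that a caller written for the single-return shape silently breaks once `return_softmax_lse=True` makes the helper return a tuple.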
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from c69b970 to d0ce155
Why does this not work for ROCm?
Sorry, this was cruft from an earlier Slack discussion that cast doubt on whether return_softmax_lse was supported on ROCm.
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Force-pushed from eb958e7 to fcc54d8
Hi, @LucasWilkinson @simon-mo. This PR uses a new function, so it seems that a new release of vllm-flash-attn is needed.
https://pypi.org/project/vllm-flash-attn/ — it's been a long time since vllm-flash-attn was last released on PyPI.
We currently ship vllm-flash-attn bundled inside the vLLM wheel.
Hi, @LucasWilkinson. Does the latest FA3 still fail on Lovelace GPUs due to shared memory limits for some shapes?
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Yang Wang <elainewy@meta.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
NOTE: Tested MLA on AMD: V0 is working; V1 is broken, but it is also broken on main.
Perf: https://docs.google.com/spreadsheets/d/1U5lsoCKuWq99Cz1QbWkc0dBn1bij1Ifb3tphE2UXJj0/edit?usp=sharing
Main: (benchmark results elided; see linked sheet)
This PR: (benchmark results elided; see linked sheet)
Accuracy drops are seen for: (affected configurations elided), due to the dynamic split scheduler.