[Attention] FA4 integration by LucasWilkinson · Pull Request #32974 · vllm-project/vllm

LucasWilkinson · 2026-01-23T22:30:42Z

Integrate upstream FA4, which is currently only faster for prefill and spec-decode. Follow-up PRs will try to use this prefill path for MLA and/or in a composite backend (FlashInfer decode, FlashAttention prefill).

Results:
                                             Attention Benchmark Results                                             
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Batch        ┃                        ┃ Batch ┃    flash ┃   flash ┃   triton ┃  triton ┃ flashinfer ┃ flashinfer ┃
┃ Spec         ┃ Type                   ┃  Size ┃ Time (s) ┃ vs Best ┃ Time (s) ┃ vs Best ┃   Time (s) ┃    vs Best ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ q512         │ prefill                │     1 │ 0.000039 │  111.0% │ 0.000051 │  145.1% │   0.000035 │     100.0% │
│ q2k          │ prefill                │     1 │ 0.000041 │  100.0% │ 0.000473 │ 1141.7% │   0.000073 │     176.5% │
│ q4k          │ prefill                │     1 │ 0.000122 │  100.0% │ 0.001715 │ 1402.3% │   0.000181 │     147.9% │
│ q8k          │ prefill                │     1 │ 0.000415 │  100.0% │ 0.006517 │ 1570.0% │   0.000555 │     133.6% │
│ 8q1s1k       │ decode                 │     8 │ 0.000062 │  172.1% │ 0.000068 │  188.5% │   0.000036 │     100.0% │
│ 16q1s2k      │ decode                 │    16 │ 0.000039 │  105.8% │ 0.000069 │  187.3% │   0.000037 │     100.0% │
│ 32q1s1k      │ decode                 │    32 │ 0.000038 │  108.1% │ 0.000063 │  177.2% │   0.000036 │     100.0% │
│ 64q1s4k      │ decode                 │    64 │ 0.000189 │  118.2% │ 0.000268 │  167.8% │   0.000160 │     100.0% │
│ 2q2k_8q1s1k  │ mixed (decode+prefill) │    10 │ 0.000095 │  100.0% │ 0.000896 │  938.9% │   0.000128 │     134.1% │
│ 4q1k_16q1s2k │ mixed (decode+prefill) │    20 │ 0.000094 │  100.0% │ 0.000552 │  586.2% │   0.000115 │     122.5% │
│ 2q4k_32q1s1k │ mixed (decode+prefill) │    34 │ 0.000282 │  100.0% │ 0.003350 │ 1188.7% │   0.000324 │     114.9% │
│ 16q2s1k      │ spec-decode            │    16 │ 0.000038 │  100.0% │ 0.000058 │  154.5% │   0.000064 │     170.6% │
│ 16q4s1k      │ spec-decode            │    16 │ 0.000040 │  100.0% │ 0.000057 │  142.9% │   0.000065 │     163.4% │
│ 16q8s1k      │ spec-decode            │    16 │ 0.000038 │  100.0% │ 0.000061 │  158.8% │   0.000065 │     168.9% │
│ 32q4s2k      │ spec-decode            │    32 │ 0.000057 │  100.0% │ 0.000134 │  236.3% │   0.000189 │     334.0% │
│ 8q8s4k       │ spec-decode            │     8 │ 0.000060 │  100.0% │ 0.000202 │  337.3% │   0.000110 │     183.8% │
│ q1ks2k       │ extend                 │     1 │ 0.000039 │  100.0% │ 0.000365 │  937.1% │   0.000046 │     118.1% │
│ 2q1ks4k      │ extend                 │     2 │ 0.000111 │  100.0% │ 0.001482 │ 1330.3% │   0.000126 │     113.4% │
└──────────────┴────────────────────────┴───────┴──────────┴─────────┴──────────┴─────────┴────────────┴────────────┘
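The "vs Best" columns read as each backend's time divided by the fastest time in the same row. Below is a minimal sketch under that assumption; `vs_best` is a hypothetical helper, not part of the benchmark script. The table prints times rounded to the microsecond, so values recomputed from it will differ slightly from the printed percentages.

```python
def vs_best(times: dict[str, float]) -> dict[str, float]:
    """Return each backend's time as a percentage of the row's fastest time.

    Assumed definition of the "vs Best" column; not taken from the
    benchmark source.
    """
    best = min(times.values())
    return {name: 100.0 * (t / best) for name, t in times.items()}


# Times (seconds) copied from the q2k prefill row of the table above.
row_q2k = {"flash": 0.000041, "triton": 0.000473, "flashinfer": 0.000073}
ratios = vs_best(row_q2k)

for name, pct in ratios.items():
    print(f"{name}: {pct:.1f}%")
```

The fastest backend in a row always comes out at 100%, and every other entry shows how many times slower it is.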

Currently blocked by #34043.

gemini-code-assist

Code Review

This pull request successfully integrates FlashAttention 4 (FA4) by updating CMake configurations, adding necessary dependencies, and extending the attention configuration and version detection logic. The new vllm/vllm_flash_attn/flash_attn_interface.py file correctly centralizes the logic for selecting and using different FlashAttention versions (FA2, FA3, FA4).

However, there's a critical issue with the newly added file vllm/third_party/flashmla/flash_mla_interface.py. This file appears to duplicate the general FlashAttention (FA4) functions (_flash_attn_varlen_forward, flash_attn_varlen_func, etc.) which are already correctly implemented and managed in vllm/vllm_flash_attn/flash_attn_interface.py. This leads to code duplication and potential confusion regarding which implementation is canonical. The vllm/third_party/flashmla/flash_mla_interface.py file should ideally only contain FlashMLA-specific functionalities.

vllm/third_party/flashmla/flash_mla_interface.py

cmake/external_projects/vllm_flash_attn.cmake

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

vllm/v1/attention/backends/fa_utils.py

cmake/external_projects/vllm_flash_attn.cmake

mergify · 2026-01-28T23:57:32Z

Hi @LucasWilkinson, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

mergify · 2026-01-29T00:13:02Z

Hi @LucasWilkinson, the pre-commit checks have failed again. Please rerun pre-commit run --all-files, then commit the changes and push to your branch.

mergify · 2026-01-29T21:11:17Z

Hi @LucasWilkinson, the pre-commit checks have failed again. Please rerun pre-commit run --all-files, then commit the changes and push to your branch.

mergify · 2026-01-29T21:31:18Z

Documentation preview: https://vllm--32974.org.readthedocs.build/en/32974/

mergify · 2026-01-29T21:35:05Z

Hi @LucasWilkinson, the pre-commit checks have failed again. Please rerun pre-commit run --all-files, then commit the changes and push to your branch.

mergify · 2026-01-31T14:59:53Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-02-05T19:44:06Z

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @LucasWilkinson.

mergify · 2026-02-06T22:10:09Z

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @LucasWilkinson.

mgoin · 2026-02-11T22:58:49Z

vllm/v1/attention/backends/fa_utils.py

+        elif device_capability.major >= 10 and is_fa_version_supported(4):
+            # Blackwell (SM100+): prefer FA4
+            fa_version = 4


Will the CuTe DSL kernels run on all architectures above SM100, or should we restrict this to SM10x? I can try to test on SM120 on my desktop at home if you think it would work.

LucasWilkinson · Feb 12, 2026

I think it should, but it is probably safer to restrict for now. Will do.
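The restriction discussed in this thread can be sketched as follows. This is an illustration under stated assumptions, not the code that landed in vllm/v1/attention/backends/fa_utils.py: `select_fa_version` and its parameters are hypothetical, the head-size limit comes from the commit "FA4 doesn't support head sizes > 128", and the FA3/FA2 fallback order is assumed.

```python
from typing import Callable


def select_fa_version(
    major: int,
    head_size: int,
    is_supported: Callable[[int], bool],
) -> int:
    """Pick a FlashAttention version for a given compute capability.

    Hypothetical helper for illustration only; the name and signature
    are not taken from vLLM.
    """
    # Restrict FA4 to SM10x (major == 10) rather than every arch >= SM100,
    # and skip it for head sizes above 128.
    if major == 10 and head_size <= 128 and is_supported(4):
        return 4
    # Assumed fallback: FA3 on Hopper (SM90), otherwise FA2.
    if major == 9 and is_supported(3):
        return 3
    return 2


def everything_supported(version: int) -> bool:
    # Stand-in for a real support check such as is_fa_version_supported.
    return True


print(select_fa_version(10, 128, everything_supported))  # SM100
print(select_fa_version(12, 128, everything_supported))  # SM120
```

With `major == 10` in place of `major >= 10`, an SM120 device falls through to the older versions until FA4 has been tested there.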

LucasWilkinson requested review from ProExpertProg, WoosukKwon, hmellor, houseroad, mgoin, robertgshaw2-redhat, tlrmchlsmth, yewentao256 and youkaichao as code owners January 23, 2026 22:30

mergify bot added the ci/build, nvidia, and v1 labels Jan 23, 2026

github-project-automation bot added this to NVIDIA Jan 23, 2026

gemini-code-assist bot reviewed Jan 23, 2026

vllm/third_party/flashmla/flash_mla_interface.py (outdated, resolved)

cmake/external_projects/vllm_flash_attn.cmake (outdated, resolved)

cursor bot reviewed Jan 23, 2026

vllm/v1/attention/backends/fa_utils.py (resolved)

LucasWilkinson force-pushed the lwilkinson/fa4-integration branch from d81034e to 93e0c83 Compare January 25, 2026 06:12

ProExpertProg reviewed Jan 27, 2026

cmake/external_projects/vllm_flash_attn.cmake (resolved)

ProExpertProg reviewed Jan 27, 2026

cmake/external_projects/vllm_flash_attn.cmake (outdated, resolved)

LucasWilkinson force-pushed the lwilkinson/fa4-integration branch from 3adb2e3 to 4e04951 Compare January 29, 2026 00:09

LucasWilkinson force-pushed the lwilkinson/fa4-integration branch from 4e04951 to dd21519 Compare January 29, 2026 21:06

mergify bot added the documentation label (Improvements or additions to documentation) Jan 29, 2026

LucasWilkinson changed the title from "[Attention][WIP] FA4 integration" to "[Attention] FA4 integration" Jan 29, 2026

mergify bot added needs-rebase and removed needs-rebase labels Jan 31, 2026

LucasWilkinson and others added 6 commits February 4, 2026 02:52

fix

d0d0ff6

Signed-off-by: Lucas Wilkinson <[email protected]>

Update docs

0bc7d40

Signed-off-by: Matthew Bonanni <[email protected]>

fix

99ef916

Signed-off-by: Lucas Wilkinson <[email protected]>

update

d27b7ed

Signed-off-by: Lucas Wilkinson <[email protected]>

Fix docs build

f5395e1

Signed-off-by: Matthew Bonanni <[email protected]>

Update .gitignore

0cac305

Signed-off-by: Matthew Bonanni <[email protected]>

LucasWilkinson force-pushed the lwilkinson/fa4-integration branch from aa09eeb to 0cac305 Compare February 4, 2026 02:52

mergify bot removed the needs-rebase label Feb 4, 2026

Merge branch 'main' into lwilkinson/fa4-integration

23caed9

LucasWilkinson mentioned this pull request Feb 4, 2026

[Misc] Fix up attention benchmarks #33810

Merged

LucasWilkinson added the ready label (ONLY add when PR is ready to merge/full CI is needed) Feb 4, 2026

MatthewBonanni added 2 commits February 4, 2026 16:43

Merge branch 'main' into lwilkinson/fa4-integration

e67aff1

FA4 doesn't support head sizes > 128

c84d443

Signed-off-by: Matthew Bonanni <[email protected]>

mergify bot added the needs-rebase label Feb 5, 2026

Merge branch 'main' into lwilkinson/fa4-integration

d9bdb06

Signed-off-by: Matthew Bonanni <[email protected]>

mergify bot removed the needs-rebase label Feb 6, 2026

MatthewBonanni added 2 commits February 6, 2026 18:31

Fix cuda driver initialization

86a86d5

Signed-off-by: Matthew Bonanni <[email protected]>

Fix cudagraph dispatch

8240f9f

Signed-off-by: Matthew Bonanni <[email protected]>

mergify bot added the needs-rebase label Feb 6, 2026

Merge branch 'main' into lwilkinson/fa4-integration

346b760

Signed-off-by: Matthew Bonanni <[email protected]>

mergify bot removed the needs-rebase label Feb 11, 2026

LucasWilkinson and others added 2 commits February 11, 2026 11:55

Merge branch 'main' into lwilkinson/fa4-integration

f6bf305

Update FA git tag

e4f5ddf

Signed-off-by: Matthew Bonanni <[email protected]>

mgoin reviewed Feb 11, 2026

LucasWilkinson and others added 4 commits February 12, 2026 01:34

update FA

c760b6f

Signed-off-by: Lucas Wilkinson <[email protected]>

update dependencies

5fd8e7f

Signed-off-by: Lucas Wilkinson <[email protected]>

review comments

4f966a6

Signed-off-by: Lucas Wilkinson <[email protected]>

Merge branch 'main' into lwilkinson/fa4-integration

ec47e3e


4 participants
