[Attention] FA4 integration#32974

Open
LucasWilkinson wants to merge 24 commits intovllm-project:mainfrom
neuralmagic:lwilkinson/fa4-integration

Conversation

@LucasWilkinson
Collaborator

@LucasWilkinson LucasWilkinson commented Jan 23, 2026

Integrate upstream FA4. It is currently faster only for prefill and spec-decode. Follow-up PRs will try to use this prefill path for MLA and/or as part of a composite backend (FlashInfer decode, FlashAttention prefill).

Results:
                                             Attention Benchmark Results                                             
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Batch        ┃                        ┃ Batch ┃    flash ┃   flash ┃   triton ┃  triton ┃ flashinfer ┃ flashinfer ┃
┃ Spec         ┃ Type                   ┃  Size ┃ Time (s) ┃ vs Best ┃ Time (s) ┃ vs Best ┃   Time (s) ┃    vs Best ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ q512         │ prefill                │     1 │ 0.000039 │  111.0% │ 0.000051 │  145.1% │   0.000035 │     100.0% │
│ q2k          │ prefill                │     1 │ 0.000041 │  100.0% │ 0.000473 │ 1141.7% │   0.000073 │     176.5% │
│ q4k          │ prefill                │     1 │ 0.000122 │  100.0% │ 0.001715 │ 1402.3% │   0.000181 │     147.9% │
│ q8k          │ prefill                │     1 │ 0.000415 │  100.0% │ 0.006517 │ 1570.0% │   0.000555 │     133.6% │
│ 8q1s1k       │ decode                 │     8 │ 0.000062 │  172.1% │ 0.000068 │  188.5% │   0.000036 │     100.0% │
│ 16q1s2k      │ decode                 │    16 │ 0.000039 │  105.8% │ 0.000069 │  187.3% │   0.000037 │     100.0% │
│ 32q1s1k      │ decode                 │    32 │ 0.000038 │  108.1% │ 0.000063 │  177.2% │   0.000036 │     100.0% │
│ 64q1s4k      │ decode                 │    64 │ 0.000189 │  118.2% │ 0.000268 │  167.8% │   0.000160 │     100.0% │
│ 2q2k_8q1s1k  │ mixed (decode+prefill) │    10 │ 0.000095 │  100.0% │ 0.000896 │  938.9% │   0.000128 │     134.1% │
│ 4q1k_16q1s2k │ mixed (decode+prefill) │    20 │ 0.000094 │  100.0% │ 0.000552 │  586.2% │   0.000115 │     122.5% │
│ 2q4k_32q1s1k │ mixed (decode+prefill) │    34 │ 0.000282 │  100.0% │ 0.003350 │ 1188.7% │   0.000324 │     114.9% │
│ 16q2s1k      │ spec-decode            │    16 │ 0.000038 │  100.0% │ 0.000058 │  154.5% │   0.000064 │     170.6% │
│ 16q4s1k      │ spec-decode            │    16 │ 0.000040 │  100.0% │ 0.000057 │  142.9% │   0.000065 │     163.4% │
│ 16q8s1k      │ spec-decode            │    16 │ 0.000038 │  100.0% │ 0.000061 │  158.8% │   0.000065 │     168.9% │
│ 32q4s2k      │ spec-decode            │    32 │ 0.000057 │  100.0% │ 0.000134 │  236.3% │   0.000189 │     334.0% │
│ 8q8s4k       │ spec-decode            │     8 │ 0.000060 │  100.0% │ 0.000202 │  337.3% │   0.000110 │     183.8% │
│ q1ks2k       │ extend                 │     1 │ 0.000039 │  100.0% │ 0.000365 │  937.1% │   0.000046 │     118.1% │
│ 2q1ks4k      │ extend                 │     2 │ 0.000111 │  100.0% │ 0.001482 │ 1330.3% │   0.000126 │     113.4% │
└──────────────┴────────────────────────┴───────┴──────────┴─────────┴──────────┴─────────┴────────────┴────────────┘
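
The "vs Best" columns can be read as each backend's time relative to the fastest backend in that row. This is an assumption inferred from the table (the fastest backend always shows 100.0%), not something the PR states. A minimal sketch using the rounded `q2k` row; the helper name `vs_best` is hypothetical. Because the table's times are rounded to six decimals, the recomputed percentages will differ slightly from the printed ones.

```python
# Sketch: derive "vs Best" percentages from per-backend timings.
# Assumption: "vs Best" = 100 * time / (fastest time in the row).

def vs_best(times: dict[str, float]) -> dict[str, float]:
    """Return each backend's time as a percentage of the fastest one."""
    best = min(times.values())
    return {name: 100.0 * t / best for name, t in times.items()}


# Rounded timings (seconds) copied from the q2k prefill row above.
q2k = {"flash": 0.000041, "triton": 0.000473, "flashinfer": 0.000073}

ratios = vs_best(q2k)
fastest = min(q2k, key=q2k.get)

for name, pct in ratios.items():
    print(f"{name:>10}: {pct:7.1f}%")
print(f"fastest backend: {fastest}")
```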

currently blocked by: #34043

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request successfully integrates FlashAttention 4 (FA4) by updating CMake configurations, adding necessary dependencies, and extending the attention configuration and version detection logic. The new vllm/vllm_flash_attn/flash_attn_interface.py file correctly centralizes the logic for selecting and using different FlashAttention versions (FA2, FA3, FA4).

However, there's a critical issue with the newly added file vllm/third_party/flashmla/flash_mla_interface.py. This file appears to duplicate the general FlashAttention (FA4) functions (_flash_attn_varlen_forward, flash_attn_varlen_func, etc.) which are already correctly implemented and managed in vllm/vllm_flash_attn/flash_attn_interface.py. This leads to code duplication and potential confusion regarding which implementation is canonical. The vllm/third_party/flashmla/flash_mla_interface.py file should ideally only contain FlashMLA-specific functionalities.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/fa4-integration branch from d81034e to 93e0c83 Compare January 25, 2026 06:12
@mergify
mergify bot commented Jan 28, 2026

Hi @LucasWilkinson, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/fa4-integration branch from 3adb2e3 to 4e04951 Compare January 29, 2026 00:09
@mergify

mergify bot commented Jan 29, 2026

Hi @LucasWilkinson, the pre-commit checks have failed.

@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/fa4-integration branch from 4e04951 to dd21519 Compare January 29, 2026 21:06
@mergify

mergify bot commented Jan 29, 2026

Hi @LucasWilkinson, the pre-commit checks have failed.

@mergify
mergify bot commented Jan 29, 2026

Documentation preview: https://vllm--32974.org.readthedocs.build/en/32974/

@mergify mergify bot added the documentation Improvements or additions to documentation label Jan 29, 2026
@mergify

mergify bot commented Jan 29, 2026

Hi @LucasWilkinson, the pre-commit checks have failed.

@LucasWilkinson LucasWilkinson changed the title [Attention][WIP] FA4 integration [Attention] FA4 integration Jan 29, 2026
@mergify
mergify bot commented Jan 31, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

LucasWilkinson and others added 6 commits February 4, 2026 02:52
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/fa4-integration branch from aa09eeb to 0cac305 Compare February 4, 2026 02:52
@mergify mergify bot removed the needs-rebase label Feb 4, 2026
@LucasWilkinson LucasWilkinson added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 4, 2026
@mergify
mergify bot commented Feb 5, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 5, 2026
@mergify mergify bot removed the needs-rebase label Feb 6, 2026
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
@mergify
mergify bot commented Feb 6, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 6, 2026
@mergify mergify bot removed the needs-rebase label Feb 11, 2026
Comment on lines 80 to 82
elif device_capability.major >= 10 and is_fa_version_supported(4):
    # Blackwell (SM100+): prefer FA4
    fa_version = 4
Member

@mgoin mgoin Feb 11, 2026

Will the CuTe DSL kernels run on all arches above sm100, or should we restrict this to sm10x? I can try to test on sm120 on my desktop at home if you think it would work.

Collaborator Author

I think it should, but it's probably safer to restrict for now; will do.
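
For illustration, the restriction discussed in this thread could look roughly like the sketch below: FA4 is chosen only when the compute capability major version is exactly 10 (SM10x), not for anything newer. This is not the code in the PR. `DeviceCapability` and `pick_fa_version` are hypothetical names, the support check is passed in as a plain callable standing in for `is_fa_version_supported`, and the FA3-on-SM90 and FA2-fallback branches are assumptions added only to make the example complete.

```python
# Sketch: restrict FA4 selection to SM10x instead of "SM100 and above".
# Hypothetical names; not the actual vLLM implementation.
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DeviceCapability:
    major: int
    minor: int


def pick_fa_version(
    cap: DeviceCapability,
    is_supported: Callable[[int], bool],
) -> int:
    """Pick a FlashAttention version for the given compute capability."""
    if cap.major == 10 and is_supported(4):
        # Blackwell SM10x only: prefer FA4.
        return 4
    if cap.major == 9 and is_supported(3):
        # Assumption for this sketch: Hopper prefers FA3.
        return 3
    # Assumption for this sketch: everything else falls back to FA2.
    return 2


def all_supported(version: int) -> bool:
    """Stand-in support check that accepts every version."""
    return True


sm100 = pick_fa_version(DeviceCapability(10, 0), all_supported)
sm120 = pick_fa_version(DeviceCapability(12, 0), all_supported)
print(f"SM100 -> FA{sm100}, SM120 -> FA{sm120}")
```

With `major == 10` in place of `major >= 10`, an SM120 device no longer takes the FA4 branch even when FA4 reports itself as supported, which matches the "safer to restrict for now" decision above.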

LucasWilkinson and others added 4 commits February 12, 2026 01:34
Labels

ci/build documentation Improvements or additions to documentation nvidia ready ONLY add when PR is ready to merge/full CI is needed v1

4 participants