Conversation

@benchislett (Collaborator) commented Oct 6, 2025

Purpose

The full CUDA graph support annotation was missing from the FlashInfer-MLA backend, even though the implementation already supports it.
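
For context, a minimal sketch of what the missing annotation looks like, based on the reviewer summary further down; the module paths and method names are assumptions about vLLM's v1 MLA backend layout, not the exact diff:

from vllm.v1.attention.backends.mla.common import (MLACommonBackend,
                                                    MLACommonMetadataBuilder)
from vllm.v1.attention.backends.utils import AttentionCGSupport


class FlashInferMLAMetadataBuilder(MLACommonMetadataBuilder):
    # Advertise full-CUDA-graph capture for batches in which every request
    # has a uniform query length (plain decode, or uniform spec decode).
    cudagraph_support: AttentionCGSupport = AttentionCGSupport.UNIFORM_BATCH


class FlashInferMLABackend(MLACommonBackend):
    @staticmethod
    def get_builder_cls() -> type[FlashInferMLAMetadataBuilder]:
        return FlashInferMLAMetadataBuilder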

Running DSR1-FP4 on 4xB200 gets me 97 TPS:

VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_ATTENTION_BACKEND=FLASHINFER_MLA vllm serve nvidia/DeepSeek-R1-FP4 -tp 4 --max-model-len 32768 --max-num-seqs 128 --no-enable-prefix-caching --async-scheduling --port 8049

I also tested on a local development branch for MTP containing #25984 and #25987.

On that branch, with 3 MTP speculative tokens, I get 165 TPS and passing GSM8k evals.
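
For reference, an MTP run on that branch might be launched roughly as follows; vLLM's --speculative-config flag takes a JSON config, but the method name and exact schema shown here are assumptions about the development branch, not the command actually used:

VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_ATTENTION_BACKEND=FLASHINFER_MLA vllm serve nvidia/DeepSeek-R1-FP4 -tp 4 \
  --max-model-len 32768 --max-num-seqs 128 --no-enable-prefix-caching \
  --async-scheduling --port 8049 \
  --speculative-config '{"method": "deepseek_mtp", "num_speculative_tokens": 3}'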

Test Plan

GSM8k was run as follows:

lm_eval \
  --model local-completions \
  --tasks gsm8k \
  --model_args base_url=http://0.0.0.0:8049/v1/completions,model=nvidia/DeepSeek-R1-FP4,tokenized_requests=False,tokenizer_backend=None,num_concurrent=128,timeout=120,max_retries=5

Test Result

Matches the baseline:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9439|±  |0.0063|
|     |       |strict-match    |     5|exact_match|↑  |0.9439|±  |0.0063|

@mergify mergify bot added the v1 label Oct 6, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request correctly enables full CUDA graph support for decode operations in the FlashInfer-MLA attention backend. The change is implemented by creating a new FlashInferMLAMetadataBuilder class that inherits from MLACommonMetadataBuilder and sets the cudagraph_support attribute to AttentionCGSupport.UNIFORM_BATCH. The FlashInferMLABackend is then updated to use this new builder. The approach is clean, follows the existing design patterns in the codebase, and seems to correctly enable the feature as described. The changes are minimal and well-targeted. I found no issues of high or critical severity.

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@LucasWilkinson (Collaborator) left a comment

LGTM; thanks!

@LucasWilkinson LucasWilkinson enabled auto-merge (squash) October 6, 2025 21:17
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 6, 2025
@mgoin (Member) left a comment

Nice

@LucasWilkinson LucasWilkinson merged commit f77df94 into vllm-project:main Oct 6, 2025
54 checks passed
southfreebird pushed a commit to southfreebird/vllm that referenced this pull request Oct 7, 2025