[Perf][Kernel] Improve topKperRow for large context decode path - DeepSeek-V3.2 sparse attention #34265

Draft

LopezCastroRoberto wants to merge 2 commits into vllm-project:main from LopezCastroRoberto:perf/topKperRow-FI

Conversation

@LopezCastroRoberto (Contributor) commented Feb 10, 2026

Summary

This PR integrates FlashInfer's radix-based top-k kernel as an alternative implementation for the large context top-k operation in the sparse attention indexer, specifically for DeepSeek-V3.2 models.

Kernel adapted from: flashinfer-ai/flashinfer#2215
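
For context, the operation being accelerated is a per-row top-k over the indexer logits of each decode token. Below is a minimal reference sketch of the dispatch idea, assuming a generic fast_topk callable and an arbitrary context-length threshold (neither is the actual vLLM code; torch.topk is used only as the reference path):

import torch

LARGE_CONTEXT_THRESHOLD = 8192  # assumption: where a "large context" path would kick in

def topk_per_row(logits: torch.Tensor, top_k: int, context_len: int,
                 fast_topk=None) -> torch.Tensor:
    """Return the top-k token indices per row of the indexer logits."""
    k = min(top_k, logits.shape[-1])
    if fast_topk is not None and context_len > LARGE_CONTEXT_THRESHOLD:
        # Radix-select style kernels (e.g. the FlashInfer kernel referenced above)
        # avoid a full sort over very long rows.
        return fast_topk(logits, k)
    # Reference path: torch.topk computes the same selection, just more slowly
    # for very long rows.
    return torch.topk(logits, k=k, dim=-1).indices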

Motivation - Microbenchmark results (NVIDIA B200)

[Figure: topKperRow kernel comparison microbenchmark]

E2E results (NVIDIA B200)

vllm serve nvidia/DeepSeek-V3.2-NVFP4 -tp 4
vllm bench serve --backend vllm --model nvidia/DeepSeek-V3.2-NVFP4 --input-len 128000 --output-len 4096 --num-prompts 1

MAIN:

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  59.15     
Total input tokens:                      128000    
Total generated tokens:                  4096      
Request throughput (req/s):              0.02      
Output token throughput (tok/s):         69.24     
Peak output token throughput (tok/s):    71.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          2233.14   
---------------Time to First Token----------------
Mean TTFT (ms):                          717.80    
Median TTFT (ms):                        717.80    
P99 TTFT (ms):                           717.80    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.27     
Median TPOT (ms):                        14.27     
P99 TPOT (ms):                           14.27     
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.27     
Median ITL (ms):                         14.27     
P99 ITL (ms):                            14.57     
==================================================

PR:

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  53.62     
Total input tokens:                      128000    
Total generated tokens:                  4096      
Request throughput (req/s):              0.02      
Output token throughput (tok/s):         76.39     
Peak output token throughput (tok/s):    80.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          2463.46   
---------------Time to First Token----------------
Mean TTFT (ms):                          732.15    
Median TTFT (ms):                        732.15    
P99 TTFT (ms):                           732.15    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.92     
Median TPOT (ms):                        12.92     
P99 TPOT (ms):                           12.92     
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.92     
Median ITL (ms):                         12.72     
P99 ITL (ms):                            13.81     
==================================================

In this example, the PR improves output token throughput over MAIN by ~10% (69.24 → 76.39 tok/s) and reduces mean TPOT from 14.27 ms to 12.92 ms.

Accuracy

python tests/evals/gsm8k/gsm8k_eval.py

MAIN:

Results:
Accuracy: 0.926
Invalid responses: 0.000
Total latency: 54.086 s
Questions per second: 24.387
Total output tokens: 121416
Output tokens per second: 2244.889

PR:

Results:
Accuracy: 0.930
Invalid responses: 0.000
Total latency: 45.613 s
Questions per second: 28.917
Total output tokens: 120617
Output tokens per second: 2644.330

Signed-off-by: LopezCastroRoberto <[email protected]>
@LopezCastroRoberto marked this pull request as draft February 10, 2026
@LopezCastroRoberto changed the title from "Add FlashInfer top-k support to large context decode path" to "[Perf] Add FlashInfer top-k support to large context decode path" Feb 10, 2026
@mergify bot added the rocm (Related to AMD ROCm) and v1 labels Feb 10, 2026
@github-project-automation bot moved this to Todo in AMD Feb 10, 2026
@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request replaces the custom large_context_topk kernel with flashinfer.top_k_ragged_transform for handling top-k operations in the large context decode path. The changes primarily involve updating sparse_attn_indexer.py to use the FlashInfer function and passing a new offsets_buffer. Corresponding changes are made for API compatibility in the ROCm path. The tests are also updated to validate the new implementation. My review found a critical issue in the test file where a new test function shadows an existing one due to having the same name, and also misuses a pytest parameter. I've provided a suggestion to fix this.

Comment on lines 398 to 401
def test_deepseek_hybrid_topk(clean_logits: bool, top_k: int) -> None:
    torch.set_default_device("cuda:0")

    top_k = 2048

critical

This new test function test_deepseek_hybrid_topk has the same name as an existing test function at line 286. In Python, this will cause the new function to overwrite the old one, and the original test for torch.ops._C.large_context_topk will no longer be executed. Please rename this new test function to avoid this conflict, for example to test_deepseek_hybrid_topk_flashinfer.

Additionally, the top_k parameter from pytest.mark.parametrize is immediately overwritten on line 401. This makes the parameterization ineffective. Please remove the hardcoded value to allow the test to run with different top_k values as intended.

Suggested change
-def test_deepseek_hybrid_topk(clean_logits: bool, top_k: int) -> None:
-    torch.set_default_device("cuda:0")
-    top_k = 2048
+def test_deepseek_hybrid_topk_flashinfer(clean_logits: bool, top_k: int) -> None:
+    torch.set_default_device("cuda:0")
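
For illustration, a runnable skeleton of the renamed, parametrized test (tensor shapes, parameter values, and the torch.topk stand-in for the implementation under test are assumptions, not the PR's actual test code):

import pytest
import torch

@pytest.mark.parametrize("clean_logits", [True, False])
@pytest.mark.parametrize("top_k", [1024, 2048])
def test_deepseek_hybrid_topk_flashinfer(clean_logits: bool, top_k: int) -> None:
    torch.set_default_device("cuda:0")
    # Each parametrized top_k value is now exercised instead of being overwritten.
    logits = torch.randn(4, 16384)
    if clean_logits:
        logits = torch.nan_to_num(logits)
    ref = torch.topk(logits, k=top_k, dim=-1).indices
    # Swap the line below for the FlashInfer-backed top-k path being tested.
    out = torch.topk(logits, k=top_k, dim=-1).indices
    assert torch.equal(torch.sort(out, dim=-1).values, torch.sort(ref, dim=-1).values)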

@LopezCastroRoberto changed the title from "[Perf] Add FlashInfer top-k support to large context decode path" to "[Perf] Add FlashInfer top-k support to large context decode path - DeepSeek-V3.2 sparse attention" Feb 10, 2026
@mergify bot added the deepseek (Related to DeepSeek models) label Feb 10, 2026
Signed-off-by: LopezCastroRoberto <[email protected]>
@mergify bot added the nvidia label Feb 12, 2026
@LopezCastroRoberto changed the title from "[Perf] Add FlashInfer top-k support to large context decode path - DeepSeek-V3.2 sparse attention" to "[Perf][Kernel] Improve topKperRow routine for large context decode path - DeepSeek-V3.2 sparse attention" Feb 12, 2026
@LopezCastroRoberto changed the title from "[Perf][Kernel] Improve topKperRow routine for large context decode path - DeepSeek-V3.2 sparse attention" to "[Perf][Kernel] Improve topKperRow for large context decode path - DeepSeek-V3.2 sparse attention" Feb 12, 2026

Labels

deepseek (Related to DeepSeek models) · nvidia · rocm (Related to AMD ROCm) · v1
