minor: zero workspace buffer init for flashinfer trtllm-gen attn #22603

yyihuang · 2025-08-10T20:59:13Z

Purpose

Zero-initialize the workspace buffer used by FlashInfer's trtllm-gen attention, following the flashinfer 0.2.11 updates: flashinfer-ai/flashinfer#1444

cc @elvischenv

Test Plan

Test Result

(Optional) Documentation Update

github-actions · 2025-08-10T20:59:20Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Code Review

This pull request addresses a critical correctness issue by initializing the FlashInfer workspace buffer with zeros using torch.zeros instead of torch.empty. This is necessary for the proper functioning of the TensorRT-LLM attention kernels in FlashInfer. The change is applied correctly in both the core library code and the test suite. My review includes suggestions to align the data type of the workspace buffer in the tests with the main implementation (torch.uint8) to ensure consistency and prevent potential data interpretation bugs.
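As a rough illustration of why an uninitialized workspace can matter, here is a toy sketch in plain Python. It does not model FlashInfer's kernels: `toy_kernel` and its counter layout are hypothetical, and the "uninitialized" buffer is simulated with fixed stale bytes, standing in for the arbitrary memory contents that torch.empty may return.

```python
import struct

WORKSPACE_BYTES = 16  # tiny stand-in for the 128 MiB buffer in the PR


def toy_kernel(workspace: bytearray, num_blocks: int) -> bool:
    """Hypothetical kernel: each block bumps a counter kept in the workspace.

    Returns True when the counter equals num_blocks afterwards, which only
    happens if the counter started at zero.
    """
    for _ in range(num_blocks):
        (count,) = struct.unpack_from("<I", workspace, 0)
        struct.pack_into("<I", workspace, 0, count + 1)
    (final,) = struct.unpack_from("<I", workspace, 0)
    return final == num_blocks


# Like torch.zeros: every byte starts at 0.
zeroed = bytearray(WORKSPACE_BYTES)

# Like torch.empty: contents are whatever was there before (simulated here).
stale = bytearray(b"\x07" + b"\x00" * (WORKSPACE_BYTES - 1))

zeroed_ok = toy_kernel(zeroed, num_blocks=4)
stale_ok = toy_kernel(stale, num_blocks=4)
```

With the zeroed buffer the toy counter ends at the expected value; with stale bytes it does not. Whether the real kernels keep counters, flags, or something else in the workspace is not stated in this thread.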

gemini-code-assist · 2025-08-10T21:00:16Z

tests/kernels/attention/test_flashinfer_trtllm_attention.py

@@ -113,7 +113,7 @@ def test_flashinfer_trtllm_decode_with_baseline(
    kv_indices = torch.tensor(kv_indices, dtype=torch.int32)
    kv_last_page_lens = torch.tensor(kv_last_page_lens, dtype=torch.int32)

-    workspace_buffer = torch.empty(128 * 1024 * 1024, dtype=torch.int8)
+    workspace_buffer = torch.zeros(128 * 1024 * 1024, dtype=torch.int8)


The workspace buffer is created with torch.int8 dtype, while the main implementation in vllm/v1/attention/backends/flashinfer.py uses torch.uint8. While this might not cause issues with a zero-initialized buffer, using an inconsistent data type can lead to subtle bugs if the underlying kernel has specific expectations about the data being signed or unsigned. For consistency and to prevent potential correctness issues, it's recommended to use torch.uint8 here.

Suggested change

-    workspace_buffer = torch.zeros(128 * 1024 * 1024, dtype=torch.int8)
+    workspace_buffer = torch.zeros(128 * 1024 * 1024, dtype=torch.uint8)
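The signed-versus-unsigned concern can be shown with a small standard-library sketch. It uses `struct` rather than torch, so it only demonstrates how the same raw bytes read under each interpretation; it makes no claim about how the FlashInfer kernel actually reads the buffer.

```python
import struct

# Four raw bytes, including values at and above the sign boundary.
raw = bytes([0x00, 0x7F, 0x80, 0xFF])

as_uint8 = struct.unpack("4B", raw)  # unsigned bytes, like torch.uint8
as_int8 = struct.unpack("4b", raw)   # signed bytes, like torch.int8

# An all-zero buffer reads the same under both interpretations.
zeros = bytes(4)
zeros_match = struct.unpack("4B", zeros) == struct.unpack("4b", zeros)
```

Bytes of 0x80 and above read as negative values when treated as signed, while an all-zero buffer is identical either way. This matches the reviewer's point that the mismatch is harmless for a zeroed buffer but worth aligning for consistency.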

gemini-code-assist · 2025-08-10T21:00:17Z

tests/kernels/attention/test_flashinfer_trtllm_attention.py

@@ -247,7 +247,7 @@ def test_flashinfer_trtllm_prefill_with_baseline(
    kv_indices = torch.tensor(kv_indices, dtype=torch.int32)
    kv_last_page_lens = torch.tensor(kv_last_page_lens, dtype=torch.int32)

-    workspace_buffer = torch.empty(128 * 1024 * 1024, dtype=torch.int8)
+    workspace_buffer = torch.zeros(128 * 1024 * 1024, dtype=torch.int8)


The workspace buffer here is created with torch.int8, which is inconsistent with the torch.uint8 used in the main implementation. To ensure consistency across the codebase and avoid potential issues related to signed versus unsigned byte interpretation by the FlashInfer kernel, it is advisable to use torch.uint8 for this buffer as well.

Suggested change

-    workspace_buffer = torch.zeros(128 * 1024 * 1024, dtype=torch.int8)
+    workspace_buffer = torch.zeros(128 * 1024 * 1024, dtype=torch.uint8)

Signed-off-by: Avery Yingyi Huang <[email protected]>

elvischenv · 2025-08-11T01:27:20Z

vllm/v1/attention/backends/flashinfer.py

@@ -251,7 +251,7 @@ def __init__(self, kv_cache_spec: AttentionSpec, layer_names: list[str],

    def _get_workspace_buffer(self):
        if self._workspace_buffer is None:
-            self._workspace_buffer = torch.empty(
+            self._workspace_buffer = torch.zeros(


The same change is also needed in vllm/attention/backends/flashinfer.py.

yyihuang · 2025-08-11

Updated. Thanks for your review!

yyihuang · 2025-08-11

Sorry for accidentally pushing to another PR. It's added now.
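The lazy-allocation pattern in `_get_workspace_buffer` can be sketched without torch as follows. `ToyBackend` and `PLACEHOLDER_WORKSPACE_BYTES` are hypothetical names; the real size and device arguments are elided in the diff above and are not reproduced here. `bytearray(n)` is zero-filled, so it plays the role of torch.zeros.

```python
from typing import Optional

# Placeholder size; the real size is elided in the diff above.
PLACEHOLDER_WORKSPACE_BYTES = 1024


class ToyBackend:
    """Hypothetical stand-in for the class that owns the workspace buffer."""

    def __init__(self) -> None:
        self._workspace_buffer: Optional[bytearray] = None

    def _get_workspace_buffer(self) -> bytearray:
        # Allocate on first use only; bytearray(n) starts zero-filled.
        if self._workspace_buffer is None:
            self._workspace_buffer = bytearray(PLACEHOLDER_WORKSPACE_BYTES)
        return self._workspace_buffer


backend = ToyBackend()
first = backend._get_workspace_buffer()
first[0] = 1  # simulate a kernel writing into the workspace
second = backend._get_workspace_buffer()
```

Because the buffer is created once and then reused, the zero fill happens only at allocation; later calls return the same object with whatever has been written to it since. Whether the kernels restore the workspace to zero after each call is not stated in this thread.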

IwakuraRein · 2025-08-11T23:03:40Z

I tested this PR on a B200, and benchmarks/benchmark_serving.py passed with flashinfer-ai/flashinfer#1463. Arguments:

python3 ./benchmarks/benchmark_serving.py --model gpt-oss-120b --dataset-name random --ignore-eos --num-prompts 12288 --random-input-len 1024 --random-output-len 1024 --max-concurrency 4096
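For a sense of scale, the nominal load implied by these arguments can be worked out directly. This is plain arithmetic on the flag values above, not output from the benchmark; it assumes each request uses exactly the stated input and output lengths, and the variable names are illustrative.

```python
# Values copied from the command-line arguments above.
num_prompts = 12288
input_len = 1024
output_len = 1024
max_concurrency = 4096

# Nominal token counts if every request uses the full lengths.
total_input_tokens = num_prompts * input_len
total_output_tokens = num_prompts * output_len

# How many prompts there are per concurrent slot.
prompts_per_slot = num_prompts / max_concurrency
```

Under that assumption the run covers about 12.6 million input tokens and the same number of output tokens, with three prompts per concurrent slot.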

yyihuang requested review from tlrmchlsmth, WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners August 10, 2025 20:59

mergify bot added the v1 label Aug 10, 2025

gemini-code-assist bot reviewed Aug 10, 2025

yyihuang added 2 commits August 10, 2025 17:08

init

7327d0a

Signed-off-by: Avery Yingyi Huang <[email protected]>

upd test

2f56b6a

Signed-off-by: Avery Yingyi Huang <[email protected]>

yyihuang force-pushed the init_zero_workspace branch from fc85b03 to 2f56b6a Compare August 10, 2025 21:08

elvischenv suggested changes Aug 11, 2025

yyihuang marked this pull request as draft August 11, 2025 08:36

yyihuang mentioned this pull request Aug 11, 2025

fix: remove redundant zero_init reverted by #1459 flashinfer-ai/flashinfer#1463

yyihuang force-pushed the init_zero_workspace branch from 3d73616 to 2f56b6a Compare August 11, 2025 21:16

upd

963518a

yyihuang marked this pull request as ready for review August 11, 2025 22:09
