[wip][Spec Decoding] Zero-bubble async scheduling + spec decoding #7640
HF-001 wants to merge 3 commits into vllm-project:main
Conversation
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Summary of Changes (Gemini Code Assist): This pull request improves speculative decoding performance by introducing zero-bubble asynchronous scheduling. The core idea is to optimistically process draft tokens on the CPU, assuming they are all accepted, and then apply the necessary corrections on the NPU after the model's forward pass, reducing latency and improving hardware utilization. The changes shift how sequence lengths and computed-token counts are managed across CPU and GPU, move KV-cache slot-mapping computation into a new kernel, and add deferred state corrections to keep data consistent under this asynchronous execution model.
Code Review
This PR implements significant improvements for asynchronous speculative decoding and NPU (Ascend) specific optimizations within vLLM. It shifts toward GPU-centric state management for attention metadata, introducing optimistic_seq_lens_cpu for speculative decoding and moving slot-mapping computation to a GPU kernel. It also adds deferred state corrections and re-synchronization for Mamba cache alignment. A critical review comment flags a potential integer-overflow risk in changing num_accepted_tokens_cpu_tensor from torch.int64 to torch.int32.
```diff
 # Speculative decoding
 self.num_accepted_tokens_cpu_tensor = torch.ones(
-    (max_num_reqs,), dtype=torch.int64, device="cpu", pin_memory=pin_memory
+    (max_num_reqs,), dtype=torch.int32, device="cpu", pin_memory=pin_memory
```
Changing the dtype of num_accepted_tokens_cpu_tensor from torch.int64 to torch.int32 could lead to an integer overflow if the number of accepted tokens for a request exceeds the maximum value for a 32-bit signed integer (2,147,483,647). Please confirm that int32 is sufficient for all expected scenarios, or revert to int64 to prevent potential data loss or incorrect behavior.
Suggested change:

```diff
-    (max_num_reqs,), dtype=torch.int32, device="cpu", pin_memory=pin_memory
+    (max_num_reqs,), dtype=torch.int64, device="cpu", pin_memory=pin_memory
```
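For context on the risk: PyTorch integer tensors wrap around silently on overflow rather than raising, so an int32 counter that crosses 2**31 - 1 turns negative without warning. A minimal demonstration:

```python
import torch

# int32 tensors wrap around silently on overflow (no exception is raised)
x = torch.tensor([2**31 - 1], dtype=torch.int32)
y = x + 1  # wraps to the minimum int32 value
print(y.item())   # -2147483648

# the same arithmetic is lossless in int64
z = torch.tensor([2**31 - 1], dtype=torch.int64) + 1
print(z.item())   # 2147483648
```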
```python
if common_attn_metadata.seq_lens_cpu is not None:
    common_attn_metadata.seq_lens_cpu[:batch_size] = common_attn_metadata.seq_lens_cpu[:batch_size] + 1
    exceeds_mask = common_attn_metadata.seq_lens_cpu[:batch_size] >= self.max_model_len
    common_attn_metadata.seq_lens_cpu[:batch_size].masked_fill_(exceeds_mask, 1)
if common_attn_metadata.num_computed_tokens_cpu is not None:
```
The addition of if ... is not None checks for common_attn_metadata.seq_lens_cpu and common_attn_metadata.num_computed_tokens_cpu is a critical improvement. This prevents AttributeError in scenarios where these attributes might be None due to the async spec decode logic, ensuring robustness and correctness.
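The guarded increment-and-clamp can be exercised standalone; a minimal sketch with assumed values (max_model_len and the tensor contents are illustrative only):

```python
import torch

max_model_len = 8  # assumed value for illustration
seq_lens_cpu = torch.tensor([3, 8, 7], dtype=torch.int32)
batch_size = 3

# The None guard avoids an AttributeError when async spec decode has
# intentionally set seq_lens_cpu to None (GPU tensors are authoritative).
if seq_lens_cpu is not None:
    seq_lens_cpu[:batch_size] = seq_lens_cpu[:batch_size] + 1
    exceeds_mask = seq_lens_cpu[:batch_size] >= max_model_len
    seq_lens_cpu[:batch_size].masked_fill_(exceeds_mask, 1)

print(seq_lens_cpu.tolist())  # [4, 1, 1]
```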
```python
# Update num_computed_tokens on GPU. In async spec decode,
# CPU values are optimistic (all drafts accepted). The kernel
# corrects on GPU using the previous step's
# valid_sampled_token_count_gpu. Otherwise, just copy from CPU.
if (
    self.use_async_spec_decode
    and self.valid_sampled_token_count_gpu is not None
    and prev_req_id_to_index
):
    self.prev_positions.copy_to_gpu(num_reqs)
    self.prev_num_draft_tokens.copy_to_gpu()
    cpu_values = self.input_batch.num_computed_tokens_cpu_tensor[:num_reqs].to(
        device=self.device, non_blocking=True
    )
    update_num_computed_tokens_for_batch_change(
        self.num_computed_tokens,
        self.num_accepted_tokens.gpu[:num_reqs],
        self.prev_positions.gpu[:num_reqs],
        self.valid_sampled_token_count_gpu,
        self.prev_num_draft_tokens.gpu,
        cpu_values,
    )
else:
    self.num_computed_tokens[:num_reqs].copy_(
        self.input_batch.num_computed_tokens_cpu_tensor[:num_reqs],
        non_blocking=True,
    )
```
The logic for conditionally updating self.num_computed_tokens based on use_async_spec_decode is a core part of the asynchronous scheduling. When use_async_spec_decode is enabled, the GPU-side correction using update_num_computed_tokens_for_batch_change is essential for maintaining data consistency between the optimistic CPU state and the authoritative NPU state. This is a critical correctness change for the new speculative decoding approach.
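The exact arithmetic lives in the update_num_computed_tokens_for_batch_change kernel, but the intent of the correction can be sketched in plain PyTorch. The optimistic CPU value counts every draft token as accepted, so the correction subtracts the drafts that were actually rejected. All tensor names and the accounting below are illustrative assumptions, not the kernel's real signature or math:

```python
import torch

# Optimistic CPU bookkeeping: every draft token assumed accepted.
optimistic_cpu = torch.tensor([10, 12])   # num_computed_tokens if all drafts landed
prev_num_draft = torch.tensor([3, 3])     # drafts proposed in the previous step
valid_sampled = torch.tensor([4, 2])      # tokens actually emitted (accepted drafts + 1 bonus)

# Correction: remove the drafts the verifier rejected.
rejected = prev_num_draft - (valid_sampled - 1)
num_computed_tokens = optimistic_cpu - rejected
print(num_computed_tokens.tolist())  # [10, 10]
```

Request 0 had all 3 drafts accepted, so its optimistic value stands; request 1 had 2 drafts rejected, so its count is walked back by 2.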
```python
self.input_batch.block_table.compute_slot_mapping(
    num_reqs,
    self.query_start_loc.gpu[: num_reqs + 1],
    self.positions[:total_num_scheduled_tokens],
)
```
The compute_slot_mapping call has been moved and now correctly uses the GPU-side self.positions and self.query_start_loc.gpu. This is a critical change to ensure that the slot mapping is computed based on the most up-to-date and authoritative GPU state, which is essential for the attention mechanism.
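For readers unfamiliar with slot mapping: in a paged KV cache, a token's absolute position is translated to a physical cache slot through the per-request block table. A simplified single-request version of that computation (block_size and all tensor contents are illustrative):

```python
import torch

block_size = 4
# block_table[req, i] = physical block id backing logical block i of request `req`
block_table = torch.tensor([[7, 2, 5],
                            [1, 9, 0]])

# e.g. the tokens at positions 5..7 of request 0
req_id = 0
positions = torch.tensor([5, 6, 7])

block_ids = block_table[req_id, positions // block_size]   # logical -> physical block
slot_mapping = block_ids * block_size + positions % block_size
print(slot_mapping.tolist())  # [9, 10, 11]
```

Computing this from the GPU-side positions tensor is what lets the mapping reflect the corrected (not optimistic) token positions.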
```python
if deferred_state_corrections_fn:
    deferred_state_corrections_fn()
    deferred_state_corrections_fn = None
```
Applying deferred_state_corrections_fn before mamba_utils.preprocess_mamba is a critical correctness fix. preprocess_mamba relies on req_state.num_computed_tokens (CPU), so ensuring these corrections are applied beforehand prevents preprocess_mamba from operating on an outdated or optimistic CPU state.
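The deferral mechanism itself reduces to a captured closure that is applied exactly once before any consumer of the CPU state runs. A stripped-down sketch of the pattern (all names and values here are illustrative, not the PR's actual code):

```python
# CPU state starts out optimistic (all drafts assumed accepted).
state = {"num_computed_tokens": 12}

def make_correction(authoritative_value):
    """Capture the correction now, apply it later."""
    def apply():
        state["num_computed_tokens"] = authoritative_value
    return apply

deferred_state_corrections_fn = make_correction(10)

# ... forward pass proceeds against the optimistic state ...

# Before anything that reads the CPU state (e.g. Mamba preprocessing):
if deferred_state_corrections_fn:
    deferred_state_corrections_fn()
    deferred_state_corrections_fn = None  # guarantee at-most-once application

print(state["num_computed_tokens"])  # 10
```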
```python
if self.use_async_spec_decode:
    # GPU tensors are authoritative in async mode.
    seq_lens_cpu = None
    num_computed_tokens_cpu = None
```
Setting seq_lens_cpu and num_computed_tokens_cpu to None when use_async_spec_decode is enabled is a critical change. This explicitly signals that in async mode, the GPU tensors are authoritative, preventing accidental reliance on potentially optimistic or outdated CPU values in AscendCommonAttentionMetadata. This is crucial for maintaining the integrity of the async scheduling logic.
```python
self.positions[:total_num_scheduled_tokens] = (
    self.num_computed_tokens[req_indices_gpu].to(torch.int64)
    + self.query_pos.gpu[:total_num_scheduled_tokens]
)
self.seq_lens[:num_reqs] = (
    self.num_computed_tokens[:num_reqs] + num_scheduled_tokens_gpu
)
```
The calculation of self.positions and self.seq_lens directly on the GPU using self.num_computed_tokens and self.query_pos.gpu is a significant change. This aligns with the strategy of making NPU-side tensors the source of truth and reduces CPU-GPU synchronization overhead. This is a high-severity correctness and performance improvement.
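The same arithmetic is easy to check on CPU tensors. With two requests that have 5 and 2 already-computed tokens and 3 and 2 newly scheduled tokens (values assumed for illustration):

```python
import torch

num_computed_tokens = torch.tensor([5, 2], dtype=torch.int32)
num_scheduled_tokens = torch.tensor([3, 2], dtype=torch.int32)

# One entry per scheduled token: owning request, and offset within its query.
req_indices = torch.tensor([0, 0, 0, 1, 1])
query_pos = torch.tensor([0, 1, 2, 0, 1])

positions = num_computed_tokens[req_indices].to(torch.int64) + query_pos
seq_lens = num_computed_tokens + num_scheduled_tokens

print(positions.tolist())  # [5, 6, 7, 2, 3]
print(seq_lens.tolist())   # [8, 4]
```

Because every operand already lives on the device in the real code, no host round-trip is needed to derive positions or sequence lengths.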
What this PR does / why we need it?
Refer to vllm-project/vllm#32951. This PR improves the async-ness of spec decoding by optimistically assuming on the CPU that all draft tokens are accepted, and deferring the correction until after the forward pass. The NPU-side tensors are taken as the source of truth.
The feature currently works correctly, but there is a slight performance regression, likely caused by the Triton operator; optimization is in progress.
How was this patch tested?
todo