[Feat] skip lightning indexer for the first 2048 preceding tokens #7418
1024daniel wants to merge 1 commit into vllm-project:main
Conversation
Summary of Changes (Gemini Code Assist)

This pull request introduces a significant performance optimization by allowing the lightning indexer to skip the initial 2048 tokens during computation. This change aims to reduce unnecessary processing, thereby enhancing the efficiency of the attention mechanism. The implementation extends existing metadata structures, adds new utility functions for index manipulation, and integrates this logic into the model's execution flow.
Code Review
This pull request introduces a "Lightning Indexer Skip" feature to optimize attention computation by selectively skipping tokens based on a defined threshold. This involves adding new metadata fields to AscendSFAMetadata and AscendCommonAttentionMetadata, implementing utility functions like get_sfa_skip_indices and get_index_of_skipped_queries_numpy for index calculation and reordering, and modifying the attention and model execution logic to handle skipped sequences. Specifically, the indexer_select_post_process method was refactored to conditionally process tokens and concatenate results from skipped and non-skipped sequences, and the forward method was updated to prevent kv_cache updates for fully skipped sequences. A new enable_lightning_indexer_skip function was added to control this feature. Review comments suggest fixing a typo in skip_threold to skip_threshold and defining 2048 as a shared constant to improve readability and maintainability, as well as removing a redundant import numpy as np statement from within a function.
```python
def get_sfa_skip_indices(num_comptuted_tokens, query_lens):
    num_comptuted_tokens = to_numpy(num_comptuted_tokens)
    query_lens = to_numpy(query_lens)
    skip_threold = 2048
```
The variable skip_threold has a typo and should be skip_threshold. Additionally, the value 2048 is a magic number. It should be defined as a constant with a descriptive name at the module level, for example, LIGHTNING_INDEXER_SKIP_THRESHOLD = 2048. This will improve code readability and make it easier to change this value in the future if needed. The same magic number 2048 is also used in get_index_of_skipped_queries_numpy in vllm_ascend/attention/sfa_v1.py. Using a shared constant would be ideal.
```diff
-    skip_threold = 2048
+    skip_threshold = 2048
```
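A minimal sketch of the suggested fix, assuming the feature's stated semantics (requests whose tokens all fall within the first 2048 positions can skip the indexer). The constant name and the exact skip condition are illustrative, not taken from the PR:

```python
import numpy as np

# Hypothetical name for the shared module-level constant the review asks for.
LIGHTNING_INDEXER_SKIP_THRESHOLD = 2048

def get_sfa_skip_indices(num_computed_tokens, query_lens):
    # Return indices of requests whose tokens all lie within the first
    # LIGHTNING_INDEXER_SKIP_THRESHOLD positions of their sequence, so
    # the lightning indexer can be skipped for them (assumed semantics).
    num_computed_tokens = np.asarray(num_computed_tokens)
    query_lens = np.asarray(query_lens)
    end_positions = num_computed_tokens + query_lens
    return np.nonzero(end_positions <= LIGHTNING_INDEXER_SKIP_THRESHOLD)[0]
```

With the constant defined once at module level, both this function and `get_index_of_skipped_queries_numpy` in `vllm_ascend/attention/sfa_v1.py` can reference the same value.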
```python
    actual_seq_lengths_query = to_numpy(actual_seq_lengths_query)
    actual_seq_lengths_key = to_numpy(actual_seq_lengths_key)
    num_actual_seqs = to_numpy(num_actual_seqs)
    import numpy as np
```
The import numpy as np statement is inside the get_index_of_skipped_queries_numpy function. According to Python best practices, imports should be at the top of the module to improve readability and avoid repeated imports. Since numpy is already imported at the top of the file in this pull request, this line is redundant and should be removed.
Has moved the import to the top of the module.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
vllm_ascend/attention/utils.py
Outdated
```python
    return x


def get_sfa_skip_indices(num_comptuted_tokens, query_lens):
```
I think it would be better to rename this function, e.g. `get_li_skip_indices`.
Has changed the function name.
vllm_ascend/attention/sfa_v1.py
Outdated
```python
        # =========================
        if attn_metadata.skip:
            num_tokens = attn_metadata.non_skip_num_actual_tokens
            if num_tokens > 0:
```
I suggest refactoring this logic once the capability work is finished, in order to separate the skip and non-skip token sequences for the function invocation.
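One possible shape for that refactor, sketched with NumPy. The helper name and the token-level reordering are assumptions for illustration, not the PR's actual `get_index_of_skipped_queries_numpy`:

```python
import numpy as np

def reorder_for_skip(query_lens, skip_mask):
    # Build per-token gather indices that place tokens of non-skipped
    # sequences first, followed by tokens of skipped sequences, so each
    # group can be handed to its own code path as one contiguous block.
    query_lens = np.asarray(query_lens, dtype=np.int64)
    starts = np.concatenate(([0], np.cumsum(query_lens)[:-1]))
    non_skip = [np.arange(s, s + n) for s, n, m
                in zip(starts, query_lens, skip_mask) if not m]
    skipped = [np.arange(s, s + n) for s, n, m
               in zip(starts, query_lens, skip_mask) if m]
    groups = non_skip + skipped
    return np.concatenate(groups) if groups else np.empty(0, dtype=np.int64)
```

The inverse permutation of the returned indices would scatter the indexer's outputs back into the original request order.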
vllm_ascend/attention/sfa_v1.py
Outdated
```diff
     k_li = self._get_full_kv(k_li, attn_metadata)

-    if kv_cache is not None:
+    if kv_cache is not None and (not attn_metadata.skip or attn_metadata.non_skip_num_actual_tokens > 0):
```
This condition may lead to precision issues. The skip indices are introduced to reduce matmul and LI operator computation, but k_li still needs to be stored globally for use in subsequent scheduling batches.
Has fixed this branch condition.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from 4c0f99e to 3fbbf99
Force-pushed from b823a22 to 3e51b42
Co-authored-by: YzTongNiar <1667927948@qq.com> Co-authored-by: wyh145 <1987244901@qq.com> Signed-off-by: 1024daniel <xxltju324@gmail.com>
```python
    # Metadata for Prefill Context Parallelism (PCP) operations.
    prefill_context_parallel_metadata: AscendPrefillContextParallelMetadata | None = None

    lightning_indexer_metadata: AscendLightningIndexerMetadata | None = None
```
We can refactor 'lightning_indexer_metadata' as 'lightning_indexer_context' just like 'dsa_cp_context' in the sfa metadata builder.
```python
            .pin_memory()
            .to(dtype=torch.bool, device=self.device, non_blocking=True)
        )
        common_attn_metadata.lightning_indexer_metadata = AscendLightningIndexerMetadata(
```
Refactor 'AscendLightningIndexerMetadata' as 'lightning_indexer_context' (just like 'dsa_cp_context') to avoid building an extra metadata object.
```python
        seq_lens = common_attn_metadata.seq_lens[:num_reqs]

        query_start_loc = common_attn_metadata.query_start_loc[: num_reqs + 1]
        tokens = query_start_loc[1:] - query_start_loc[:-1]
```
should rename 'tokens' as 'num_computed_tokens' to clarify its usage.
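The quantity being renamed is the per-request token count recovered from the cumulative start offsets. A small self-contained illustration; the concrete values are made up:

```python
import numpy as np

# query_start_loc[i] is the cumulative token offset at which request i
# starts; adjacent differences give each request's token count.
query_start_loc = np.array([0, 4, 4, 9])  # hypothetical offsets for 3 requests
tokens = query_start_loc[1:] - query_start_loc[:-1]
print(tokens.tolist())  # -> [4, 0, 5]
```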
```python
        dsa_cp_context = None
        if self.enable_dsa_cp:
            num_of_non_skip_tokens = 0
            num_segs_for_cp = cum_query_lens.shape[0]
```
should rename 'num_segs_for_cp' as 'num_segs' since it does not actually depend on cp.
What this PR does / why we need it?
Skip the first 2048 tokens for the lightning indexer to avoid redundant computation.
Does this PR introduce any user-facing change?
No
How was this patch tested?