Optimize the inference performance of the FLA operator on Qwen3.5 Model #7597

mikequan0425 wants to merge 2 commits into vllm-project:main
Conversation
Clean up debug print statements and commented-out code to keep the code tidy
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on optimizing the inference performance of the FLA (Fused Linear Attention) operator, particularly for the Qwen3.5 model, by enhancing how chunk-related metadata is prepared and managed. The core improvement introduces a caching mechanism in the fla_utils module that prevents redundant computation of chunk indices and offsets. This refactoring, coupled with the integration of new utility functions across the FLA kernels, aims to reduce computational overhead and improve overall efficiency during model inference.
Code Review
This pull request significantly optimizes the Flash-Linear-Attention (FLA) operator by refactoring and caching chunk metadata preparation. The changes replace inefficient Python-level loops with vectorized PyTorch operations, and introduce a caching layer to avoid redundant computations, which also removes expensive CPU-GPU synchronizations. The PR also includes new unit tests to validate the correctness of these optimizations.
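The chunk-metadata quantities being vectorized and cached can be illustrated with a plain-Python reference sketch. The PR itself computes these values with vectorized PyTorch operations; the function and argument names below are illustrative, following the common `cu_seqlens` (cumulative sequence lengths) convention of FLA kernels:

```python
def prepare_chunk_offsets(cu_seqlens, chunk_size):
    """Cumulative chunk offsets per sequence from cumulative sequence lengths.

    cu_seqlens = [0, s1, s1+s2, ...]; a sequence of length L contributes
    ceil(L / chunk_size) chunks.
    """
    offsets = [0]
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        num_chunks = (end - start + chunk_size - 1) // chunk_size  # ceil division
        offsets.append(offsets[-1] + num_chunks)
    return offsets


def prepare_chunk_indices(cu_seqlens, chunk_size):
    """Flattened (sequence_id, chunk_id) pairs over all sequences."""
    indices = []
    for seq_id, (start, end) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
        num_chunks = (end - start + chunk_size - 1) // chunk_size
        indices.extend((seq_id, c) for c in range(num_chunks))
    return indices
```

For `cu_seqlens = [0, 5, 8]` and `chunk_size = 4`, the two sequences have lengths 5 and 3, so the offsets are `[0, 2, 3]` and the indices `[(0, 0), (0, 1), (1, 0)]`. Because these values depend only on `cu_seqlens` and `chunk_size`, they are natural candidates for the caching the PR adds.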
My main feedback is to address a potential memory leak in the new caching implementation by adding a size limit to the cache.
Following the repository's style guide, I've also suggested an updated PR title and summary.
Suggested PR Title:

[Ops][Feature] Optimize FLA operator performance by caching metadata

Suggested PR Summary:
### What this PR does / why we need it?
This PR optimizes the Flash-Linear-Attention (FLA) operator by improving the performance of chunk metadata preparation.
- The utility functions in `vllm_ascend/ops/triton/fla/utils.py` are refactored to use efficient, vectorized PyTorch operations instead of Python loops and list comprehensions.
- A caching mechanism is introduced for these utility functions to avoid recomputing metadata for the same input tensors. The cache uses the tensor's ID and version for keying.
- This change avoids expensive CPU-GPU synchronization that was present in the previous implementation.
- Other FLA operator files are updated to use the new optimized utility functions.
- Unit tests are added to verify the new utility functions and their caching logic.
This optimization improves the overall inference performance of models using the FLA operator.
### Does this PR introduce _any_ user-facing change?
No. This is a performance optimization and should not change any user-facing behavior.
### How was this patch tested?
- Added new unit tests in `tests/ut/ops/test_fla_utils.py` to verify the correctness of the refactored utility functions and the caching mechanism.
- CI passed with new and existing tests.

```python
def _cache_prepare_result(cu_seqlens: torch.LongTensor, chunk_size: int, name: str, value):
    key = _get_prepare_cache_key(cu_seqlens, chunk_size, name)
    _PREPARE_CACHE[key] = (weakref.ref(cu_seqlens), value)
    return value
```
The `_PREPARE_CACHE` dictionary is unbounded and can grow indefinitely if many different `cu_seqlens` tensors are used over the lifetime of the application, which can lead to a memory leak. Although `weakref` is used for the tensor, cache entries for garbage-collected tensors are not proactively removed; they are only overwritten if a new tensor happens to reuse the same `id`, which is not a reliable cleanup mechanism.
To prevent potential out-of-memory errors, you should bound the cache size. A simple approach is to evict items when the cache exceeds a certain threshold.
For example, you could implement a simple FIFO eviction policy:

```python
# At module level
_PREPARE_CACHE_MAX_SIZE = 256

# In _cache_prepare_result
def _cache_prepare_result(cu_seqlens: torch.LongTensor, chunk_size: int, name: str, value):
    if len(_PREPARE_CACHE) >= _PREPARE_CACHE_MAX_SIZE:
        # Evict the first item inserted (FIFO). Relies on dict insertion order (Python 3.7+).
        _PREPARE_CACHE.pop(next(iter(_PREPARE_CACHE)))
    key = _get_prepare_cache_key(cu_seqlens, chunk_size, name)
    _PREPARE_CACHE[key] = (weakref.ref(cu_seqlens), value)
    return value
```

A more robust solution would be to use an LRU (Least Recently Used) cache.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?