[kernel] Recompilation optimization triggered by triton function parameter optimization #7647
HarpsealCC wants to merge 0 commits into vllm-project:releases/v0.18.0
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the performance of Triton kernels through strategic optimizations to parameter handling. The core changes involve modifying how kernel parameters are declared.

Highlights
Code Review
This pull request refactors Triton kernel definitions across multiple files, primarily by expanding the do_not_specialize list for @triton.jit decorators and converting several tl.constexpr parameters to regular runtime arguments. Additionally, it simplifies the grid and block size calculation logic in reject_sample.py and fused_gdn_gating.py. However, the changes to reject_sample.py introduce critical bugs by removing a necessary import (get_vectorcore_num) and implementing an incorrect grid/block size calculation, which will lead to indexing errors in dependent kernels. Furthermore, the simplified batching logic in fused_gdn_gating.py may cause a performance regression for small batch sizes, and an optimized approach for these cases is recommended.
```diff
  from vllm.triton_utils import tl, triton

- from vllm_ascend.ops.triton.triton_utils import get_element, get_vectorcore_num
+ from vllm_ascend.ops.triton.triton_utils import get_element
```
The function get_vectorcore_num is needed for the correct implementation of cal_grid_and_block_size but was removed. Please add it back to the imports.
```diff
- from vllm_ascend.ops.triton.triton_utils import get_element
+ from vllm_ascend.ops.triton.triton_utils import get_element, get_vectorcore_num
```
```diff
  def cal_grid_and_block_size(batch_size: int):
-     vectorcore_num = get_vectorcore_num()
-     if batch_size <= vectorcore_num:
-         grid = batch_size
-         block_size = 1
-     else:
-         grid = vectorcore_num
-         block_size = triton.next_power_of_2(triton.cdiv(batch_size, grid))
+     grid = batch_size
+     block_size = 64
      return grid, block_size
```
The new implementation of cal_grid_and_block_size is incorrect. It sets grid = batch_size and block_size = 64, which causes incorrect indexing within the Triton kernels that use it (e.g., rejection_greedy_sample_triton, expand_kernel). This will lead to bugs and incorrect results because the kernels are written to process block_size items per program instance, but with grid = batch_size, the indexing logic will be wrong. The previous implementation correctly calculated grid and block sizes. Please revert to the previous logic.
```python
def cal_grid_and_block_size(batch_size: int):
    vectorcore_num = get_vectorcore_num()
    if batch_size <= vectorcore_num:
        grid = batch_size
        block_size = 1
    else:
        grid = vectorcore_num
        block_size = triton.next_power_of_2(triton.cdiv(batch_size, grid))
    return grid, block_size
```

In fused_gdn_gating.py, the simplified logic under discussion:

```diff
+ progs = num_cores
+ row_per_core = triton.cdiv(batch, progs)
+ BLK_BATCHES = 64
+ ROW_ITER = triton.cdiv(row_per_core, BLK_BATCHES)
```
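The reviewer's point about the previous grid/block logic can be checked with a plain-Python model. Here `next_power_of_2` and `cdiv` stand in for `triton.next_power_of_2` / `triton.cdiv`, and the `vectorcore_num` default of 32 is an illustrative assumption in place of `get_vectorcore_num()`:

```python
import math


def next_power_of_2(n: int) -> int:
    # Plain-Python stand-in for triton.next_power_of_2.
    return 1 if n <= 0 else 1 << (n - 1).bit_length()


def cal_grid_and_block_size(batch_size: int, vectorcore_num: int = 32):
    # Previous logic: for small batches, one item per program; for large
    # batches, cap the grid at the core count and size the block so that
    # grid * block_size covers the whole batch.
    if batch_size <= vectorcore_num:
        grid = batch_size
        block_size = 1
    else:
        grid = vectorcore_num
        block_size = next_power_of_2(math.ceil(batch_size / grid))
    return grid, block_size
```

This preserves the invariant `grid * block_size >= batch_size` with each program handling `block_size` consecutive items, whereas the hard-coded `grid = batch_size, block_size = 64` pairing breaks the per-program indexing the kernels were written against.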
The logic for determining progs, BLK_BATCHES, and ROW_ITER has been simplified, but this may cause a performance regression for small batch sizes. The previous logic handled small batches (batch <= num_cores) more efficiently by setting progs = batch and BLK_BATCHES = 1. With the new logic, if batch is small, many program instances in the grid will be idle or perform masked work, leading to inefficiency.
For example, if batch=1 and num_cores=32, the new logic launches a grid of 32 programs, but only one will do useful work. The old logic would have launched a grid of 1.
This suggestion re-introduces the efficient handling for small batches while keeping the simplified logic for larger batches.
```diff
- progs = num_cores
- row_per_core = triton.cdiv(batch, progs)
- BLK_BATCHES = 64
- ROW_ITER = triton.cdiv(row_per_core, BLK_BATCHES)
+ if batch <= num_cores:
+     progs = batch
+     BLK_BATCHES = 1
+     ROW_ITER = 1
+ else:
+     progs = num_cores
+     row_per_core = triton.cdiv(batch, progs)
+     BLK_BATCHES = 64
+     ROW_ITER = triton.cdiv(row_per_core, BLK_BATCHES)
```
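The suggested dispatch can be modeled in plain Python to confirm the small-batch behavior. The `num_cores` default of 32 is an illustrative assumption standing in for the device's vector-core count, and `math.ceil` replaces `triton.cdiv`:

```python
import math


def launch_config(batch: int, num_cores: int = 32):
    # Small batches: one row per program, so no program in the grid is
    # idle. Large batches: cap the grid at num_cores and iterate over
    # blocks of 64 rows per program.
    if batch <= num_cores:
        progs = batch
        blk_batches = 1
        row_iter = 1
    else:
        progs = num_cores
        row_per_core = math.ceil(batch / progs)
        blk_batches = 64
        row_iter = math.ceil(row_per_core / blk_batches)
    return progs, blk_batches, row_iter
```

With `batch=1` this launches a grid of 1 program, matching the reviewer's example, while the simplified logic would launch `num_cores` programs of which only one does useful work.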
[kernel] Recompilation optimization triggered by triton function parameter optimization
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?