
[kernel] Recompilation optimization triggered by triton function para… #7647

Closed

HarpsealCC wants to merge 0 commits into vllm-project:releases/v0.18.0 from HarpsealCC:main

Conversation

@HarpsealCC

…meter optimization

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of Triton kernels by implementing strategic optimizations related to parameter handling. The core changes involve modifying do_not_specialize decorators and removing tl.constexpr from selected kernel parameters. This approach aims to minimize kernel recompilation overhead, leading to more efficient execution, especially in scenarios where these parameters vary dynamically. Additionally, the PR simplifies the underlying batching and grid sizing mechanisms in several Triton operations, contributing to overall code clarity and performance.

Highlights

  • Triton Kernel Optimization: Adjusted do_not_specialize decorators in several Triton kernels to prevent unnecessary recompilations when certain parameters change at runtime.
  • Parameter Refactoring: Removed tl.constexpr from various kernel parameters, allowing them to be passed as dynamic values without triggering new kernel compilations.
  • Simplified Batching Logic: Streamlined the calculation of grid and block sizes, and batching logic in specific Triton operations for improved efficiency and reduced complexity.
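The recompilation behavior these highlights describe can be illustrated without Triton itself: a JIT that specializes on an argument's value keys its compile cache on that value, so every new value triggers a fresh compilation, while parameters listed in do_not_specialize are excluded from the key. A minimal plain-Python sketch (a toy cache standing in for the JIT, not Triton's actual internals):

```python
# Toy model of value specialization vs. do_not_specialize: a dict acts as
# the compile cache; a cache miss simulates one kernel recompilation.

compile_count = 0

def make_jit(do_not_specialize=()):
    cache = {}
    def jit(fn):
        def launch(**kwargs):
            global compile_count
            # Specialized params become part of the cache key, so a new
            # value means a new "compilation"; excluded params do not.
            key = tuple(sorted(
                (k, v) for k, v in kwargs.items() if k not in do_not_specialize
            ))
            if key not in cache:
                compile_count += 1  # simulate a kernel recompilation
                cache[key] = fn
            return cache[key](**kwargs)
        return launch
    return jit

# Specializing on seq_len: every distinct length recompiles.
@make_jit()
def kernel_a(seq_len):
    return seq_len

for n in (128, 256, 512):
    kernel_a(seq_len=n)
print(compile_count)  # 3 compilations

compile_count = 0

# seq_len marked do_not_specialize: one compilation serves all lengths.
@make_jit(do_not_specialize=("seq_len",))
def kernel_b(seq_len):
    return seq_len

for n in (128, 256, 512):
    kernel_b(seq_len=n)
print(compile_count)  # 1 compilation
```

Removing tl.constexpr has the same flavor: a constexpr value is baked into the compiled kernel, so varying it at runtime forces recompilation, whereas a regular runtime argument does not.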


@mergify

mergify bot commented Mar 25, 2026

⚠️ The sha of the head commit of this PR conflicts with #7481. Mergify cannot evaluate rules on this PR. Once #7481 is merged or closed, Mergify will resume processing this PR. ⚠️

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request refactors Triton kernel definitions across multiple files, primarily by expanding the do_not_specialize list for @triton.jit decorators and converting several tl.constexpr parameters to regular runtime arguments. Additionally, it simplifies the grid and block size calculation logic in reject_sample.py and fused_gdn_gating.py. However, the changes to reject_sample.py introduce critical bugs by removing a necessary import (get_vectorcore_num) and implementing an incorrect grid/block size calculation, which will lead to indexing errors in dependent kernels. Furthermore, the simplified batching logic in fused_gdn_gating.py may cause a performance regression for small batch sizes, and an optimized approach for these cases is recommended.

 from vllm.triton_utils import tl, triton

-from vllm_ascend.ops.triton.triton_utils import get_element, get_vectorcore_num
+from vllm_ascend.ops.triton.triton_utils import get_element
Contributor

critical

The function get_vectorcore_num is needed for the correct implementation of cal_grid_and_block_size but was removed. Please add it back to the imports.

Suggested change
-from vllm_ascend.ops.triton.triton_utils import get_element
+from vllm_ascend.ops.triton.triton_utils import get_element, get_vectorcore_num

Comment on lines 23 to 26
 def cal_grid_and_block_size(batch_size: int):
-    vectorcore_num = get_vectorcore_num()
-    if batch_size <= vectorcore_num:
-        grid = batch_size
-        block_size = 1
-    else:
-        grid = vectorcore_num
-        block_size = triton.next_power_of_2(triton.cdiv(batch_size, grid))
+    grid = batch_size
+    block_size = 64
     return grid, block_size
Contributor

critical

The new implementation of cal_grid_and_block_size is incorrect. It sets grid = batch_size and block_size = 64, which breaks indexing in the Triton kernels that consume these values (e.g. rejection_greedy_sample_triton, expand_kernel): those kernels are written to process block_size items per program instance, so with grid = batch_size the offset computation no longer maps onto the batch correctly. The previous implementation calculated the grid and block sizes correctly. Please revert to the previous logic.

def cal_grid_and_block_size(batch_size: int):
    vectorcore_num = get_vectorcore_num()
    if batch_size <= vectorcore_num:
        grid = batch_size
        block_size = 1
    else:
        grid = vectorcore_num
        block_size = triton.next_power_of_2(triton.cdiv(batch_size, grid))
    return grid, block_size
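The sizing contract behind this suggestion can be checked in plain Python (a sketch: vectorcore_num=40 is a placeholder value, the helpers mirror triton.cdiv and triton.next_power_of_2, and whether the over-launch merely wastes cores or corrupts results depends on the kernels' offset math, which is not shown here):

```python
# Compare the suggested grid/block sizing with the PR's simplified one.
# cdiv and next_pow2 mirror triton.cdiv and triton.next_power_of_2.

def cdiv(a: int, b: int) -> int:
    return -(-a // b)

def next_pow2(x: int) -> int:
    return 1 << (x - 1).bit_length()

def old_sizing(batch_size: int, vectorcore_num: int = 40):
    # Previous logic: at most vectorcore_num programs, with block_size
    # chosen so grid * block_size covers the batch with minimal padding.
    if batch_size <= vectorcore_num:
        return batch_size, 1
    grid = vectorcore_num
    return grid, next_pow2(cdiv(batch_size, grid))

def new_sizing(batch_size: int):
    # PR's logic: one program per batch element, fixed 64-wide block.
    return batch_size, 64

for bs in (1, 32, 1000):
    g0, b0 = old_sizing(bs)
    g1, b1 = new_sizing(bs)
    print(f"batch={bs}: old grid*block={g0 * b0}, new grid*block={g1 * b1}")
```

For batch_size=1000 the old scheme launches 40 programs covering 1280 work slots, while the new one launches 1000 programs covering 64000 slots; any kernel that assumes grid * block_size partitions the batch exactly once will index far past it.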

Comment on lines +74 to +77
progs = num_cores
row_per_core = triton.cdiv(batch, progs)
BLK_BATCHES = 64
ROW_ITER = triton.cdiv(row_per_core, BLK_BATCHES)
Contributor

high

The logic for determining progs, BLK_BATCHES, and ROW_ITER has been simplified, but this may cause a performance regression for small batch sizes. The previous logic handled small batches (batch <= num_cores) more efficiently by setting progs = batch and BLK_BATCHES = 1. With the new logic, if batch is small, many program instances in the grid will be idle or perform masked work, leading to inefficiency.

For example, if batch=1 and num_cores=32, the new logic launches a grid of 32 programs, but only one will do useful work. The old logic would have launched a grid of 1.

This suggestion re-introduces the efficient handling for small batches while keeping the simplified logic for larger batches.

Suggested change
-progs = num_cores
-row_per_core = triton.cdiv(batch, progs)
-BLK_BATCHES = 64
-ROW_ITER = triton.cdiv(row_per_core, BLK_BATCHES)
+if batch <= num_cores:
+    progs = batch
+    BLK_BATCHES = 1
+    ROW_ITER = 1
+else:
+    progs = num_cores
+    row_per_core = triton.cdiv(batch, progs)
+    BLK_BATCHES = 64
+    ROW_ITER = triton.cdiv(row_per_core, BLK_BATCHES)
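The small-batch effect can be quantified with a plain-Python sketch (num_cores=32 is a placeholder for the real core count; cdiv mirrors triton.cdiv):

```python
# Contrast the PR's fixed launch sizing with the small-batch branch the
# suggestion re-introduces. Returns (progs, BLK_BATCHES, ROW_ITER).

def cdiv(a: int, b: int) -> int:
    return -(-a // b)

def pr_sizing(batch: int, num_cores: int):
    # PR's logic: always launch num_cores programs.
    progs = num_cores
    row_per_core = cdiv(batch, progs)
    BLK_BATCHES = 64
    return progs, BLK_BATCHES, cdiv(row_per_core, BLK_BATCHES)

def suggested_sizing(batch: int, num_cores: int):
    # Suggested logic: one program per row when the batch fits on the cores.
    if batch <= num_cores:
        return batch, 1, 1
    progs = num_cores
    row_per_core = cdiv(batch, progs)
    BLK_BATCHES = 64
    return progs, BLK_BATCHES, cdiv(row_per_core, BLK_BATCHES)

for batch in (1, 8, 4096):
    print(batch, pr_sizing(batch, 32), suggested_sizing(batch, 32))
# batch=1: PR launches 32 programs (31 idle); the suggestion launches 1.
# batch=4096: both schemes agree, so large batches are unaffected.
```

This matches the example in the comment above: for batch=1 and num_cores=32 the PR's logic launches a grid of 32 with only one program doing useful work, while the suggested branch launches a grid of 1.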
