
[None][feat] add skip condition in AutoDeploy's triton fused moe kernel #8632

Merged
suyoggupta merged 8 commits into NVIDIA:main from nv-auto-deploy:sg/triton-fix-rebased on Oct 24, 2025
Conversation

suyoggupta (Collaborator) commented Oct 23, 2025

Skip compute when kernel has no valid work

Summary by CodeRabbit

  • Performance Improvements

    • Optimized MOE kernel to efficiently skip padded tokens during processing, reducing unnecessary computation and improving overall throughput for sequences with padding.
  • Bug Fixes

    • Improved boundary condition handling to prevent out-of-bounds memory access in padded data processing.
  • Refactor

    • Enhanced kernel parameter passing for better control over padding handling.

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
@suyoggupta suyoggupta requested a review from a team as a code owner October 23, 2025 20:32
@suyoggupta suyoggupta requested a review from MrGeva October 23, 2025 20:32
@suyoggupta suyoggupta mentioned this pull request Oct 23, 2025
1 task
suyoggupta (Collaborator, Author):

/bot run

coderabbitai bot (Contributor) commented Oct 23, 2025

📝 Walkthrough

Walkthrough

Adds a num_tokens_post_padded_ptr parameter to the Triton MOE kernel to track the actual token count. The kernel exits early when processing padded tokens, and the parameter is propagated through _invoke_kernel and the grid calculations to prevent out-of-bounds operations.
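The skip condition can be illustrated with a host-side sketch. This is a hypothetical pure-Python model of which program instances would survive the kernel's early return (the function name and shape of the grid are illustrative, not the actual kernel code):

```python
# Each Triton program instance handles BLOCK_SIZE_M rows of the padded
# token grid and exits early when its block lies entirely in the padded
# tail, mirroring: if pid_m * BLOCK_SIZE_M >= num_tokens_post_padded: return
def blocks_with_work(num_tokens_post_padded: int,
                     padded_len: int,
                     BLOCK_SIZE_M: int) -> list:
    """Return the pid_m values that would NOT early-return."""
    num_blocks = (padded_len + BLOCK_SIZE_M - 1) // BLOCK_SIZE_M
    return [
        pid_m for pid_m in range(num_blocks)
        # Skip blocks that start at or beyond the actual token count.
        if pid_m * BLOCK_SIZE_M < num_tokens_post_padded
    ]

# With 70 real tokens padded to 128 and BLOCK_SIZE_M=16, blocks 0-4
# contain real tokens and blocks 5-7 are pure padding and get skipped.
print(blocks_with_work(70, 128, 16))  # [0, 1, 2, 3, 4]
```

Because the check runs before any token or index computation, skipped blocks issue no loads or stores at all, which is what removes both the wasted compute and the out-of-bounds risk.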

Changes

Cohort / File(s) Summary
Triton MOE Kernel Parameter Addition
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py
Added num_tokens_post_padded_ptr parameter to fused_mlp_moe_kernel function. Kernel loads the value and performs early return when pid_m * BLOCK_SIZE_M >= num_tokens_post_padded to skip padded token processing. Updated _invoke_kernel to accept and forward num_tokens_post_padded argument. Added guard check before token/index computations. Updated kernel call sites in grid calculation path to supply the new parameter.

Sequence Diagram

sequenceDiagram
    participant Caller as Grid/Caller
    participant _invoke_kernel
    participant fused_mlp_moe_kernel
    
    Caller->>_invoke_kernel: _invoke_kernel(..., num_tokens_post_padded)
    _invoke_kernel->>fused_mlp_moe_kernel: launch kernel with num_tokens_post_padded_ptr
    
    rect rgb(200, 220, 240)
        Note over fused_mlp_moe_kernel: Load num_tokens_post_padded
    end
    
    fused_mlp_moe_kernel->>fused_mlp_moe_kernel: Check: pid_m * BLOCK_SIZE_M >= num_tokens_post_padded?
    
    alt Padded token range
        fused_mlp_moe_kernel->>fused_mlp_moe_kernel: Early return (skip processing)
    else Valid token range
        rect rgb(220, 240, 200)
            Note over fused_mlp_moe_kernel: Process token computations
        end
        fused_mlp_moe_kernel->>fused_mlp_moe_kernel: Execute kernel logic
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Description Check ⚠️ Warning — The pull request description is significantly incomplete compared to the required template. The author provided only the brief statement "Skip compute when kernel has no valid work," which lacks the template's essential sections: a detailed explanation of the issue and solution, a Test Coverage section listing relevant tests, and a completed PR Checklist. The description should be expanded to explain what goes wrong without the skip condition, how num_tokens_post_padded is used to skip padding tokens, which tests validate the change, and the checklist items.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 50.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (1 passed)

  • Title Check ✅ Passed — The title "[None][feat] add skip condition in AutoDeploy's triton fused moe kernel" clearly describes the main change: an early return in fused_mlp_moe_kernel that avoids processing padding tokens when pid_m * BLOCK_SIZE_M >= num_tokens_post_padded. It follows the repository's template with the [None] ticket identifier and [feat] type designation, and is concise without unnecessary noise.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py (1)

67-77: Consider updating the docstring.

The function docstring could be updated to document the new num_tokens_post_padded_ptr parameter and mention the early-exit optimization for padded blocks.
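A Google-style Args entry along these lines would address the nitpick. This is a hypothetical sketch of the wording, attached to a stub rather than the real kernel:

```python
def fused_mlp_moe_kernel_stub():
    """Sketch of a Google-style docstring entry for the new parameter.

    Args:
        num_tokens_post_padded_ptr: Pointer to a single-element integer
            tensor holding the actual token count after expert-block
            padding. Program instances whose row offset
            (pid_m * BLOCK_SIZE_M) is at or beyond this value cover only
            padding and return early without touching memory.
    """

# The parameter name appears in the rendered docstring.
print("num_tokens_post_padded_ptr" in fused_mlp_moe_kernel_stub.__doc__)  # True
```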

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32e1ad6 and b20a340.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Check PR Checklist Resolution
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py (3)

44-44: LGTM: Parameter addition is clean.

The new num_tokens_post_padded_ptr parameter is correctly added to track the actual token count after padding.


88-90: LGTM: Early-exit logic is correct and safe.

The early return correctly skips blocks that only contain padded tokens. Loading the scalar value from the pointer is safe (all threads read the same value), and returning before any token processing or memory operations prevents out-of-bounds access.


248-248: LGTM: Parameter propagation is correct.

The num_tokens_post_padded parameter is properly threaded through _invoke_kernel and passed to the kernel. The comment about CUDA graph compatibility helpfully explains why this is a tensor rather than a scalar.

Also applies to: 278-278
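The note above about CUDA graph compatibility is worth unpacking: graph capture bakes scalar launch arguments into the recorded graph, whereas a tensor argument is a pointer whose pointed-to value can be updated in place between replays. A plain-Python analogy (FakeGraph and kernel are illustrative stand-ins, not the CUDA graph API):

```python
class FakeGraph:
    """Toy model of capture/replay: arguments are frozen at capture time."""
    def __init__(self):
        self.recorded = None

    def capture(self, kernel, *args):
        self.recorded = (kernel, args)  # args baked into the "graph"

    def replay(self):
        kernel, args = self.recorded
        return kernel(*args)

def kernel(num_tokens_buf):
    # Dereference the "pointer" at run time, as tl.load does on device.
    return num_tokens_buf[0]

graph = FakeGraph()
num_tokens = [70]          # one-element buffer standing in for a device tensor
graph.capture(kernel, num_tokens)
print(graph.replay())      # 70
num_tokens[0] = 42         # update in place; no re-capture needed
print(graph.replay())      # 42
```

Had a Python int been captured instead, every replay would see the original value; passing a buffer is what lets the token count vary across graph replays.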

tensorrt-cicd (Collaborator):

PR_Github #22321 [ run ] triggered by Bot. Commit: b20a340

@github-project-automation github-project-automation bot moved this from Backlog to In review in AutoDeploy Board Oct 23, 2025
@suyoggupta suyoggupta enabled auto-merge (squash) October 23, 2025 20:49
tensorrt-cicd (Collaborator):

PR_Github #22321 [ run ] completed with state SUCCESS. Commit: b20a340
/LLM/main/L0_MergeRequest_PR pipeline #16828 completed with status: 'FAILURE'

suyoggupta (Collaborator, Author):

/bot run

tensorrt-cicd (Collaborator):

PR_Github #22325 [ run ] triggered by Bot. Commit: 2597388

tensorrt-cicd (Collaborator):

PR_Github #22325 [ run ] completed with state SUCCESS. Commit: 2597388
/LLM/main/L0_MergeRequest_PR pipeline #16831 completed with status: 'FAILURE'

suyoggupta (Collaborator, Author):

/bot run

tensorrt-cicd (Collaborator):

PR_Github #22342 [ run ] triggered by Bot. Commit: 588eccc

tensorrt-cicd (Collaborator):

PR_Github #22342 [ run ] completed with state SUCCESS. Commit: 588eccc
/LLM/main/L0_MergeRequest_PR pipeline #16844 completed with status: 'FAILURE'

suyoggupta (Collaborator, Author):

/bot run

tensorrt-cicd (Collaborator):

PR_Github #22373 [ run ] triggered by Bot. Commit: 1745541

tensorrt-cicd (Collaborator):

PR_Github #22373 [ run ] completed with state SUCCESS. Commit: 1745541
/LLM/main/L0_MergeRequest_PR pipeline #16864 completed with status: 'FAILURE'

suyoggupta (Collaborator, Author):

/bot run

tensorrt-cicd (Collaborator):

PR_Github #22402 [ run ] triggered by Bot. Commit: ff6c06f

tensorrt-cicd (Collaborator):

PR_Github #22402 [ run ] completed with state SUCCESS. Commit: ff6c06f
/LLM/main/L0_MergeRequest_PR pipeline #16885 completed with status: 'FAILURE'

suyoggupta (Collaborator, Author):

/bot run

tensorrt-cicd (Collaborator):

PR_Github #22423 [ run ] triggered by Bot. Commit: a90fa7b

tensorrt-cicd (Collaborator):

PR_Github #22423 [ run ] completed with state SUCCESS. Commit: a90fa7b
/LLM/main/L0_MergeRequest_PR pipeline #16900 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@suyoggupta suyoggupta merged commit f512dda into NVIDIA:main Oct 24, 2025
5 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in AutoDeploy Board Oct 24, 2025
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Oct 24, 2025
…el (NVIDIA#8632)

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 1, 2025
…el (NVIDIA#8632)

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
…el (NVIDIA#8632)

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>