[None][feat] add skip condition in AutoDeploy's triton fused moe kernel #8632
suyoggupta merged 8 commits into NVIDIA:main from
Conversation
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
/bot run
📝 Walkthrough: Adds a num_tokens_post_padded input to AutoDeploy's Triton fused MoE kernel so that blocks covering only padded tokens can exit early.
Sequence Diagram
sequenceDiagram
participant Caller as Grid/Caller
participant _invoke_kernel
participant fused_mlp_moe_kernel
Caller->>_invoke_kernel: _invoke_kernel(..., num_tokens_post_padded)
_invoke_kernel->>fused_mlp_moe_kernel: launch kernel with num_tokens_post_padded_ptr
rect rgb(200, 220, 240)
Note over fused_mlp_moe_kernel: Load num_tokens_post_padded
end
fused_mlp_moe_kernel->>fused_mlp_moe_kernel: Check: pid_m * BLOCK_SIZE_M >= num_tokens_post_padded?
alt Padded token range
fused_mlp_moe_kernel->>fused_mlp_moe_kernel: Early return (skip processing)
else Valid token range
rect rgb(220, 240, 200)
Note over fused_mlp_moe_kernel: Process token computations
end
fused_mlp_moe_kernel->>fused_mlp_moe_kernel: Execute kernel logic
end
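The control flow in the diagram can be sketched as a plain-Python simulation (a hedged illustration, not the Triton kernel itself; the function name `launch_blocks` and the parameter `max_padded_tokens` are invented for this example, while `pid_m`, `BLOCK_SIZE_M`, and `num_tokens_post_padded` mirror the kernel's names):

```python
def launch_blocks(num_tokens_post_padded, max_padded_tokens, BLOCK_SIZE_M=16):
    """Simulate the grid launch: return (active, skipped) block ids."""
    num_blocks = (max_padded_tokens + BLOCK_SIZE_M - 1) // BLOCK_SIZE_M
    active, skipped = [], []
    for pid_m in range(num_blocks):
        # Early-exit check from the kernel: this block's first row index
        # is already past the last real (post-padding) token, so there is
        # nothing to compute and the block returns immediately.
        if pid_m * BLOCK_SIZE_M >= num_tokens_post_padded:
            skipped.append(pid_m)
        else:
            active.append(pid_m)
    return active, skipped

active, skipped = launch_blocks(num_tokens_post_padded=40,
                                max_padded_tokens=128, BLOCK_SIZE_M=16)
print(active, skipped)  # [0, 1, 2] [3, 4, 5, 6, 7]
```

With 40 valid tokens and a worst-case grid sized for 128 padded tokens, 5 of the 8 blocks exit without doing any work.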
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py (1)
67-77: Consider updating the docstring. The function docstring could be updated to document the new num_tokens_post_padded_ptr parameter and mention the early-exit optimization for padded blocks.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Check PR Checklist Resolution
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py (3)
44-44: LGTM: Parameter addition is clean. The new num_tokens_post_padded_ptr parameter is correctly added to track the actual token count after padding.
88-90: LGTM: Early-exit logic is correct and safe. The early return correctly skips blocks that only contain padded tokens. Loading the scalar value from the pointer is safe (all threads read the same value), and returning before any token processing or memory operations prevents out-of-bounds access.
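For context on where a post-padding count like this comes from: in block-aligned MoE routing, each expert's slice of tokens is typically padded up to a multiple of the block size, so the padded total can exceed the real token count. A hedged sketch of that arithmetic (this is illustrative, not the library's alignment routine; `tokens_per_expert` is an invented input):

```python
def num_tokens_post_padded(tokens_per_expert, BLOCK_SIZE_M=16):
    """Pad each expert's token count up to a multiple of BLOCK_SIZE_M
    and sum; the result bounds the rows the kernel must actually touch."""
    pad = lambda n: ((n + BLOCK_SIZE_M - 1) // BLOCK_SIZE_M) * BLOCK_SIZE_M
    return sum(pad(n) for n in tokens_per_expert)

# 5 tokens pad to 16, 20 pad to 32, an idle expert contributes 0:
print(num_tokens_post_padded([5, 20, 0]))  # 48
```

Any block whose starting row is at or beyond this sum holds only padding, which is exactly what the early-return check detects.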
248-248: LGTM: Parameter propagation is correct. The num_tokens_post_padded parameter is properly threaded through _invoke_kernel and passed to the kernel. The comment about CUDA graph compatibility helpfully explains why this is a tensor rather than a scalar. Also applies to: 278-278
PR_Github #22321 [ run ] triggered by Bot. Commit:
PR_Github #22321 [ run ] completed with state
/bot run
PR_Github #22325 [ run ] triggered by Bot. Commit:
PR_Github #22325 [ run ] completed with state
/bot run
PR_Github #22342 [ run ] triggered by Bot. Commit:
PR_Github #22342 [ run ] completed with state
/bot run
PR_Github #22373 [ run ] triggered by Bot. Commit:
PR_Github #22373 [ run ] completed with state
/bot run
PR_Github #22402 [ run ] triggered by Bot. Commit:
PR_Github #22402 [ run ] completed with state
/bot run
PR_Github #22423 [ run ] triggered by Bot. Commit:
PR_Github #22423 [ run ] completed with state
…el (NVIDIA#8632) Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Skip compute when kernel has no valid work
Summary by CodeRabbit
Performance Improvements: fused MoE kernel blocks that contain only padded tokens now exit early, skipping unnecessary compute.
Bug Fixes
Refactor