[Performance]: Willing to PR for optimizations about several moe-related kernels #8143

@fzyzcjy

Description

Proposal to improve performance

Hi, first I want to thank the TensorRT-LLM team: its kernels are quite fast and are integrated into SGLang. I have found some room for improvement in several MoE-related kernels, and I would like to contribute the optimizations back to TensorRT-LLM.

I have made a small prototype at flashinfer-ai/flashinfer#1717; it achieves a 5% end-to-end speedup and up to a 2.5x kernel speedup on DeepSeek V3/R1 prefill. I am happy to polish the code and open PRs against TensorRT-LLM and FlashInfer, so I am opening this issue first to briefly discuss the approach.
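
For context, kernel speedups of this kind are usually measured with a CUDA-event timing loop. Below is a minimal sketch of such a harness; the shapes are only loosely modeled on a DeepSeek-V3-style MoE layer, and it times a plain `torch.matmul` as a stand-in rather than the prototype's actual FlashInfer kernels:

```python
# Minimal CUDA-event timing harness of the kind used to compare a baseline
# MoE kernel against an optimized one. The op and shapes here are
# illustrative stand-ins, not the prototype's code.
import torch

def bench(fn, *args, warmup=10, iters=100):
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

# Shapes loosely modeled on a DeepSeek-V3-style MoE expert GEMM during prefill.
x = torch.randn(4096, 7168, device="cuda", dtype=torch.bfloat16)
w = torch.randn(7168, 2048, device="cuda", dtype=torch.bfloat16)

t = bench(torch.matmul, x, w)
print(f"baseline: {t:.3f} ms/iter")
# A kernel speedup would then be bench(baseline_op, ...) / bench(candidate_op, ...).
```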

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

System Information:

  • OS:
  • Python version:
  • CUDA version:
  • GPU model(s):
  • Driver version:
  • TensorRT version:
  • PyTorch version:
  • TensorRT-LLM version:
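
The template's original collection commands were not preserved above, so the following is an illustrative Python sketch for gathering most of these fields (it assumes PyTorch is installed, with nvidia-smi, tensorrt, and tensorrt_llm optional):

```python
# Illustrative environment-collection script; not the issue template's
# original commands, which were lost in extraction.
import platform
import subprocess

import torch

print("OS:", platform.platform())
print("Python version:", platform.python_version())
print("CUDA version (torch build):", torch.version.cuda)
if torch.cuda.is_available():
    names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    print("GPU model(s):", names)
print("PyTorch version:", torch.__version__)

# Driver version via nvidia-smi, if present.
try:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    print("Driver version:", out.strip())
except (OSError, subprocess.CalledProcessError):
    print("Driver version: nvidia-smi not available")

# TensorRT / TensorRT-LLM versions, if installed.
for mod in ("tensorrt", "tensorrt_llm"):
    try:
        m = __import__(mod)
        print(f"{mod} version:", getattr(m, "__version__", "unknown"))
    except ImportError:
        print(f"{mod}: not installed")
```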

Detailed output:

Paste the output of the above commands here

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

  • Customized kernels<NV>: Specialized/modified CUDA kernels in TRTLLM for LLM ops, beyond standard TRT. Dev & perf.
  • Performance: TRTLLM model inference speed, throughput, efficiency. Latency, benchmarks, regressions, opts.
