Open
Labels
Customized kernels<NV>: Specialized/modified CUDA kernels in TRTLLM for LLM ops, beyond standard TRT. Dev & perf.
Performance: TRTLLM model inference speed, throughput, efficiency. Latency, benchmarks, regressions, opts.
Description
Proposal to improve performance
Hi, first I want to say thanks to TensorRT-LLM, whose fast kernels are integrated into SGLang. I think I have found some room for improvement in the MoE-related kernels, and I would like to contribute it back to TensorRT-LLM.
I have made a small prototype at flashinfer-ai/flashinfer#1717; it achieves a 5% end-to-end speedup and up to a 2.5x kernel speedup on DeepSeek V3/R1 prefill. I am happy to polish the code and submit PRs to TensorRT-LLM and FlashInfer, so I am opening this issue first to briefly discuss it.
Report of performance regression
No response
Misc discussion on performance
No response