[Performance]: Willing to PR for optimizations about several moe-related kernels #8143

@fzyzcjy

Description

Proposal to improve performance

Hi, first I want to thank the TensorRT-LLM team: its kernels are quite fast and are integrated into SGLang. I have found some room for improvement in several MoE-related kernels, and I would like to contribute the optimizations back to TensorRT-LLM.

I have made a small prototype at flashinfer-ai/flashinfer#1717; it achieves a 5% end-to-end speedup and up to a 2.5x kernel speedup on DeepSeek V3/R1 prefill. I am happy to polish the code and open PRs against TensorRT-LLM and FlashInfer, so I am opening this issue first to briefly discuss the approach.
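
For context, kernel speedups of this kind are usually measured with a CUDA-event timing loop. Below is a minimal sketch of such a harness; the shapes are only loosely modeled on a DeepSeek-V3-style MoE layer, and it times a plain `torch.matmul` as a stand-in rather than the prototype's actual FlashInfer kernels:

```python
# Minimal CUDA-event timing harness of the kind used to compare a baseline
# MoE kernel against an optimized one. The op and shapes here are
# illustrative stand-ins, not the prototype's code.
import torch

def bench(fn, *args, warmup=10, iters=100):
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

# Shapes loosely modeled on a DeepSeek-V3-style MoE expert GEMM during prefill.
x = torch.randn(4096, 7168, device="cuda", dtype=torch.bfloat16)
w = torch.randn(7168, 2048, device="cuda", dtype=torch.bfloat16)

t = bench(torch.matmul, x, w)
print(f"baseline: {t:.3f} ms/iter")
# A kernel speedup would then be bench(baseline_op, ...) / bench(candidate_op, ...).
```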

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

System Information:

  • OS:
  • Python version:
  • CUDA version:
  • GPU model(s):
  • Driver version:
  • TensorRT version:
  • PyTorch version:
  • TensorRT-LLM version:
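
The template's original collection commands were not preserved above, so the following is an illustrative Python sketch for gathering most of these fields (it assumes PyTorch is installed, with nvidia-smi, tensorrt, and tensorrt_llm optional):

```python
# Illustrative environment-collection script; not the issue template's
# original commands, which were lost in extraction.
import platform
import subprocess

import torch

print("OS:", platform.platform())
print("Python version:", platform.python_version())
print("CUDA version (torch build):", torch.version.cuda)
if torch.cuda.is_available():
    names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    print("GPU model(s):", names)
print("PyTorch version:", torch.__version__)

# Driver version via nvidia-smi, if present.
try:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    print("Driver version:", out.strip())
except (OSError, subprocess.CalledProcessError):
    print("Driver version: nvidia-smi not available")

# TensorRT / TensorRT-LLM versions, if installed.
for mod in ("tensorrt", "tensorrt_llm"):
    try:
        m = __import__(mod)
        print(f"{mod} version:", getattr(m, "__version__", "unknown"))
    except ImportError:
        print(f"{mod}: not installed")
```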

Detailed output:

Paste the output of the above commands here

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

  • Customized kernels<NV>: Specialized/modified CUDA kernels in TRTLLM for LLM ops, beyond standard TRT. Dev & perf.
  • Performance: TRTLLM model inference speed, throughput, efficiency. Latency, benchmarks, regressions, opts.
