Closed
Labels
AutoDeploy, <NV> AutoDeploy Backend, Customized kernels (Specialized/modified CUDA kernels in TRT-LLM for LLM ops, beyond standard TRT. Dev & perf.), bug (Something isn't working), triaged (Issue has been triaged by maintainers)
Description
As a workaround (WAR), select the Triton kernels in default.yaml:

```yaml
fuse_moe:
  stage: post_load_fusion
  enabled: true
  backend: triton
```
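If the override is supplied through the `--extra_llm_api_options` file used in the reproduction command rather than by editing default.yaml directly, the block would presumably sit under the AutoDeploy transforms section. The `transforms:` nesting below is an assumption; only the `fuse_moe` block itself is confirmed by this report:

```yaml
# Hypothetical llm_args_ad.yaml fragment. The `transforms:` key is an
# assumption about the config schema; the fuse_moe block is taken verbatim
# from the workaround above.
transforms:
  fuse_moe:
    stage: post_load_fusion
    enabled: true
    backend: triton
```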
System Info
All
Who can help?
Information
- The official example scripts
- My own modified scripts

Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
```shell
trtllm-bench --model nvidia/NVIDIA-Nemotron-Nano-31B-A3-v3 throughput --dataset tmp/nemotron_128_128_256.inp --warmup 0 --backend _autodeploy --max_batch_size 256 --extra_llm_api_options ~/llm_args_ad.yaml --tp=1
```

```
File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/gkwasniewski/dev/TensorRT-LLM/tensorrt_llm/_torch/custom_ops/torch_custom_ops.py", line 224, in fused_moe
    output = run_moe(input, token_selected_experts, token_final_scales,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: fc1_expert_weights inter size must be 2 times fc2_expert_weights inter size.
```
Expected behavior
The benchmark should run without throwing an exception.

Actual behavior
The run aborts with the RuntimeError shown above.
Additional notes
The CUTLASS MoE kernel is invoked with the default activation function (silu) instead of the activation function actually used by the model.
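The error message follows from that mismatch: gated activations such as silu/SwiGLU split fc1's output into a gate half and a value half, so the kernel expects fc1's intermediate size to be exactly twice fc2's, while a non-gated activation uses fc1's output directly and the two sizes match 1:1. A minimal sketch of that shape invariant (the function name and structure are illustrative, not TensorRT-LLM code):

```python
# Illustrative sketch of the shape check behind the RuntimeError.
# check_moe_shapes and its signature are hypothetical; they only model
# the gated-vs-non-gated expectation, not the actual kernel code.

def check_moe_shapes(fc1_inter_size: int, fc2_inter_size: int, gated: bool) -> None:
    # Gated activations (silu/SwiGLU) consume [gate | value], hence the 2x.
    expected = 2 * fc2_inter_size if gated else fc2_inter_size
    if fc1_inter_size != expected:
        raise RuntimeError(
            "fc1_expert_weights inter size must be "
            f"{'2 times ' if gated else ''}fc2_expert_weights inter size."
        )

# A model with non-gated experts has fc1 inter == fc2 inter:
check_moe_shapes(1024, 1024, gated=False)  # passes

# Checking the same weights under the default (silu) assumption fails,
# which matches the exception seen in the traceback:
try:
    check_moe_shapes(1024, 1024, gated=True)
except RuntimeError as e:
    print(e)
```

This is why forcing the Triton backend works around the crash: it sidesteps the CUTLASS path that hard-codes the gated-activation shape expectation.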
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.