
[AutoDeploy][Bug]: half/full precision MoE cutlass kernel invocation is missing style+activation arguments #9338


Description


To work around (WAR) the issue, select the Triton backend for fuse_moe in default.yaml:

  fuse_moe:
    stage: post_load_fusion
    enabled: true
    backend: triton

System Info

All

Who can help?

@nzmora-nvidia

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

trtllm-bench --model nvidia/NVIDIA-Nemotron-Nano-31B-A3-v3 throughput --dataset tmp/nemotron_128_128_256.inp --warmup 0 --backend _autodeploy --max_batch_size 256 --extra_llm_api_options ~/llm_args_ad.yaml --tp=1


File "/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/gkwasniewski/dev/TensorRT-LLM/tensorrt_llm/_torch/custom_ops/torch_custom_ops.py", line 224, in fused_moe
    output = run_moe(input, token_selected_experts, token_final_scales,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: fc1_expert_weights inter size must be 2 times fc2_expert_weights inter size.

Expected behavior

The benchmark should run without throwing an exception.

Actual behavior

Throws the RuntimeError shown in the traceback above.

Additional notes

The CUTLASS MoE kernel is invoked with the default activation function (silu) instead of the activation function used by the model: the MLP style and activation arguments are missing from the kernel invocation.
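For intuition on why the shape check fires: gated activations such as silu (SwiGLU) fuse the gate and up projections into fc1, so fc1's inter size must be twice fc2's, while a non-gated activation keeps them equal. A minimal sketch of that invariant (the function name and activation strings below are illustrative assumptions, not TRT-LLM's real API):

```python
def expected_fc1_inter_size(fc2_inter_size: int, activation: str) -> int:
    """Inter size a MoE kernel expects for fc1, given the activation.

    Gated activations (e.g. silu/SwiGLU) concatenate the gate and up
    projections into fc1, doubling its inter size; non-gated ones do not.
    Names and activation strings here are illustrative, not TRT-LLM's API.
    """
    gated = activation in ("silu", "swiglu", "geglu")
    return 2 * fc2_inter_size if gated else fc2_inter_size


# If the kernel defaults to silu but the checkpoint uses a non-gated
# activation, the kernel's expected fc1 inter size (2x) does not match
# the weights' actual layout (1x), and the runtime check fails.
assert expected_fc1_inter_size(4096, "silu") == 8192
assert expected_fc1_inter_size(4096, "relu2") == 4096
```

Passing the model's actual activation (and MLP style) through to the kernel would make the check consistent with the checkpoint's weight shapes.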

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata



Labels

  • AutoDeploy: <NV> AutoDeploy Backend
  • Customized kernels: <NV> Specialized/modified CUDA kernels in TRTLLM for LLM ops, beyond standard TRT. Dev & perf.
  • bug: Something isn't working
  • triaged: Issue has been triaged by maintainers


Status

Done
