[Perf] Optimize cutlass moe problem size calculation, 5.3% E2E Throughput improvement, 2.2% TTFT improvement #31830
Conversation
Signed-off-by: yewentao256 <[email protected]>
Code Review
This pull request introduces a performance optimization for calculating Mixture-of-Experts (MoE) problem sizes in CUTLASS kernels. It replaces a kernel that computes expert token counts from topk_ids with a more efficient kernel that derives them from expert_first_token_offset. This change is propagated through the C++ ops, Python bindings, and the cutlass_moe.py layer, resulting in a notable end-to-end throughput improvement. The changes are logical and well-implemented. I have one critical comment regarding a potential integer overflow that could lead to correctness issues.
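In other words, the per-expert token counts fall out of adjacent differences of the offsets, with no scan over `topk_ids`. A minimal PyTorch sketch of the idea (example values are made up; the actual work happens in the CUDA kernel this PR adds):

```python
import torch

# expert_first_token_offset[e] is the index of the first token routed to
# local expert e in the permuted activation buffer; length is local_E + 1.
expert_first_token_offset = torch.tensor([0, 3, 3, 7, 8])

# Tokens per local expert are just adjacent differences of the offsets.
tokens_per_expert = expert_first_token_offset[1:] - expert_first_token_offset[:-1]
print(tokens_per_expert)  # tensor([3, 0, 4, 1])
```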
ProExpertProg left a comment
This seems like a good opportunity for a Triton kernel, did you try that?
Signed-off-by: yewentao256 <[email protected]>
Co-authored-by: Luka Govedič <[email protected]> Signed-off-by: Wentao Ye <[email protected]>
Signed-off-by: yewentao256 <[email protected]>
Not yet. Why a Triton kernel?
I agree, a Triton kernel would be simpler and could be completely local to cutlass_moe.py
@ProExpertProg @mgoin I wrote a Triton kernel, and it makes TTFT slower:

```
============ Serving Benchmark Result ============
Successful requests: 128
Failed requests: 0
Benchmark duration (s): 4.78
Total input tokens: 256
Total generated tokens: 16384
Request throughput (req/s): 26.77
Output token throughput (tok/s): 3426.12
Peak output token throughput (tok/s): 3584.00
Peak concurrent requests: 128.00
Total token throughput (tok/s): 3479.65
---------------Time to First Token----------------
Mean TTFT (ms): 172.63
Median TTFT (ms): 165.87
P99 TTFT (ms): 199.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 36.01
Median TPOT (ms): 36.04
P99 TPOT (ms): 36.08
---------------Inter-token Latency----------------
Mean ITL (ms): 36.01
Median ITL (ms): 35.96
P99 ITL (ms): 40.37
==================================================
```

I think we don't need to involve Triton in this PR; if needed, we can do a follow-up PR and tune it accordingly.
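For reference, a kernel along the lines discussed could look like the minimal Triton sketch below. This is a hypothetical sketch, not the code benchmarked above; the (M, 2N, K)/(M, K, N) problem-size layout and int32 outputs are assumptions:

```python
import triton
import triton.language as tl

@triton.jit
def _problem_sizes_from_offsets(
    offsets_ptr,       # int64 [E + 1]: first-token offset per local expert
    ps1_ptr, ps2_ptr,  # int32 [E, 3]: per-expert (M, N, K) for each grouped GEMM
    N, K, E,
    BLOCK: tl.constexpr,
):
    idx = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = idx < E
    start = tl.load(offsets_ptr + idx, mask=mask, other=0)
    end = tl.load(offsets_ptr + idx + 1, mask=mask, other=0)
    m = (end - start).to(tl.int32)
    # GEMM1 computes (M, 2N, K), GEMM2 computes (M, K, N) -- assumed layout.
    tl.store(ps1_ptr + idx * 3 + 0, m, mask=mask)
    tl.store(ps1_ptr + idx * 3 + 1, 2 * N, mask=mask)
    tl.store(ps1_ptr + idx * 3 + 2, K, mask=mask)
    tl.store(ps2_ptr + idx * 3 + 0, m, mask=mask)
    tl.store(ps2_ptr + idx * 3 + 1, K, mask=mask)
    tl.store(ps2_ptr + idx * 3 + 2, N, mask=mask)

# Launch with a 1-D grid, e.g.:
# _problem_sizes_from_offsets[(triton.cdiv(E, 256),)](
#     expert_first_token_offset, problem_sizes1, problem_sizes2,
#     N, K, E, BLOCK=256)
```

For typical `local_E` this is a single tiny program, so the fixed Triton launch overhead dominates, which is plausibly consistent with the TTFT regression measured above.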
```python
if expert_map is not None:
    # Translate info from expert_map to topk_ids
    local_topk_ids = torch.where(
        expert_map[topk_ids] != -1, expert_map[topk_ids], -1
    )
```
Please verify correctness for removing the expert_map logic here. I assume this works because moe_permute already handles the mapping, but I'm not sure. I think you should test accuracy with EP and EPLB to properly exercise this case.
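For context, the deleted branch translated global expert ids into local ones, mapping experts not hosted on this rank to -1. A minimal sketch of its semantics (example values are hypothetical):

```python
import torch

# expert_map[g] is the local index of global expert g, or -1 if that
# expert is not hosted on this rank (example values are made up).
expert_map = torch.tensor([-1, -1, 0, 1])
topk_ids = torch.tensor([[2, 3], [0, 2]])

local_topk_ids = torch.where(expert_map[topk_ids] != -1, expert_map[topk_ids], -1)
# tensor([[ 0,  1],
#         [-1,  0]])
```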
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9462|± |0.0062|
| | |strict-match | 5|exact_match|↑ |0.9454|± |0.0063|

Tested with EPLB; added to the PR description as well.
Signed-off-by: yewentao256 <[email protected]>
mgoin left a comment
Thanks, LGTM!
```python
ops.get_cutlass_moe_mm_problem_sizes_from_expert_offsets(
    expert_first_token_offset, problem_sizes1, problem_sizes2, N, K, swap_ab
)
```
Could we use it for cutlass moe fp4 too?
That seems out of this PR's scope. I can test it, and if it can be used, I will open a follow-up PR for that.
…hput improvement, 2.2% TTFT improvement (vllm-project#31830) Signed-off-by: yewentao256 <[email protected]> Signed-off-by: Wentao Ye <[email protected]> Co-authored-by: Luka Govedič <[email protected]>
…hput improvement, 2.2% TTFT improvement (vllm-project#31830) Signed-off-by: yewentao256 <[email protected]> Signed-off-by: Wentao Ye <[email protected]> Co-authored-by: Luka Govedič <[email protected]> Signed-off-by: dsuhinin <[email protected]>
…hput improvement, 2.2% TTFT improvement (vllm-project#31830) Signed-off-by: yewentao256 <[email protected]> Signed-off-by: Wentao Ye <[email protected]> Co-authored-by: Luka Govedič <[email protected]> Signed-off-by: daje0601 <[email protected]>
Purpose
Part of #31755
Here we add a kernel for faster calculation of the problem sizes.
Test
```bash
export MODEL="zai-org/GLM-4.7-FP8"
vllm serve $MODEL -tp 8 --port 9256 --enable-expert-parallel --max_num_seqs 128
```

Acc

```bash
lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=$MODEL,num_concurrent=1024" --tasks gsm8k
```

With EPLB:

```bash
vllm serve $MODEL -tp 8 --port 9256 --enable-expert-parallel --max_num_seqs 128 --enable-eplb
```

Perf

```bash
vllm bench serve --model $MODEL --dataset-name random --host 127.0.0.1 --port 9256 --random-input-len 2 --random-output-len 128 --request-rate inf --num-prompts 128
```

Note
Introduces a faster path for CUTLASS MoE problem-size setup and wires it through C++/Python to fused MoE.
- New kernel `compute_problem_sizes_from_expert_offsets` with callers and registry (`moe_data.cu`, `scaled_mm_entry.cu`, `torch_bindings.cpp`) and the Python API `ops.get_cutlass_moe_mm_problem_sizes_from_expert_offsets`
- Replaces `topk_ids`-based counting with expert-offset-based sizing in `cutlass_moe.py` (both FP8 and W4A8 paths): allocate `problem_sizes{1,2}` for `local_E`, call the new op, derive `expert_offsets` from `expert_first_token_offset`, and always pass `expert_first_token_offset` to `moe_unpermute`
- Uses `VLLM_DISPATCH_BOOL` for launch-time specialization; minor pointer/type cleanups
- C++ declarations (`ops.h`) and Python wrappers added in `_custom_ops.py`

Written by Cursor Bugbot for commit 2c089da.
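Taken together, the Python-side change amounts to something like the sketch below. This is a hedged illustration, not the actual cutlass_moe.py diff: the helper name, dtypes, and the `expert_offsets` derivation are assumptions; only the op call follows the signature quoted above.

```python
import torch
from vllm import _custom_ops as ops  # wrapper location per the summary above

def build_problem_sizes(expert_first_token_offset, local_E, N, K, swap_ab):
    # Hypothetical helper: before this PR the per-expert M dims were
    # recounted from topk_ids; now they come straight from the offsets.
    problem_sizes1 = torch.empty((local_E, 3), dtype=torch.int32, device="cuda")
    problem_sizes2 = torch.empty((local_E, 3), dtype=torch.int32, device="cuda")
    ops.get_cutlass_moe_mm_problem_sizes_from_expert_offsets(
        expert_first_token_offset, problem_sizes1, problem_sizes2, N, K, swap_ab
    )
    # The summary says expert_offsets are derived from
    # expert_first_token_offset; a plain slice is one plausible derivation.
    expert_offsets = expert_first_token_offset[:-1]
    return problem_sizes1, problem_sizes2, expert_offsets
```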