Fix Flashinfer Allreduce+Norm enable disable calculation based on fi_allreduce_fusion_max_token_num
#21325
Conversation
Code Review
This pull request correctly fixes a bug in the condition that enables FlashInfer's Allreduce+Norm fusion. The original logic incorrectly used `num_tokens` instead of `hidden_size` when calculating the dynamic size limit, causing the fusion to be disabled for models with large hidden sizes. The fix addresses this, and the provided test results confirm that the fusion is now correctly triggered. I've added one suggestion to improve the readability of the calculation.
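The buggy and fixed conditions can be sketched as follows. This is an illustrative reconstruction from the PR description only; the function name, argument names, and the exact limit formula are assumptions, not vLLM's actual code.

```python
# Illustrative sketch only -- names and the limit formula are assumptions
# based on the PR discussion, not vLLM's real implementation.
def fi_fusion_enabled(num_tokens: int, hidden_size: int,
                      fi_allreduce_fusion_max_token_num: int) -> bool:
    """Return True if the FlashInfer allreduce+norm fusion may be used."""
    # all_reduce_in has shape [num_tokens, hidden_size].
    shape = (num_tokens, hidden_size)
    # Buggy check (pre-fix): num_tokens * shape[0] == num_tokens**2,
    # which is unrelated to the tensor's real element count, so it blows
    # past the limit for large warm-up sizes and fusion never fires.
    # Fixed check: num_tokens * shape[1] is the true element count.
    input_elems = num_tokens * shape[1]
    limit_elems = fi_allreduce_fusion_max_token_num * hidden_size
    return input_elems <= limit_elems
```

With the fixed form, the check reduces to `num_tokens <= fi_allreduce_fusion_max_token_num`, independent of hidden size, which matches the intent of the config knob.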
Thank you for the fix! Looks good to me.
Thanks @mgoin @ilmarkov, the CI failure seems to be unrelated; it also fails on recent vLLM PRs, e.g. https://buildkite.com/vllm/ci/builds/24587/steps/canvas?jid=01983248-1e76-4766-bd71-2881bbbe90b2. Please let me know if there is anything I can do to help merge this :)
Purpose
Warm-up for the cudagraph passes passes `all_reduce_in` with shape `[num_tokens, hidden_size]` (for DSR1, the hidden size is 7168), but the enable check compares the limit against `num_tokens * shape[0]`, where `shape[0]` is also `num_tokens`, rather than the tensor's actual element count. With a large enough hidden size, this causes the FlashInfer fusion to always be disabled.
Test Result
On TOT, no FI kernels are called.
With this PR, the FI kernels are called correctly.
Benchmarks (WIP)
On DSR1 + 4xB200 (with hf_overwrite to 30 layers), TP4, different concurrency values.