[Do Not Merge] Fix after flashinfer fp4 autotuner PR #23209
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
vllm/v1/worker/gpu_worker.py
Outdated
# Warmup kernels used during model execution
kernel_warmup(self)
kernel_warmup(self, do_autotune=False)
Why do we need this flag and to run twice? I think we can just move this before cuda graphs, like your original commit.
Because I was thinking that auto-tuning and warm-up serve two different purposes here. Auto-tuning is meant to store the best kernel function index, so I placed it before cuda graph capture to make sure the cuda graph sees the correct kernel. Warm-up is a dry run before the actual job starts, so I added it right before the real execution, just like the original code. Based on your earlier comment, I thought you were suggesting that warm-up is necessary (maybe DeepGEMM requires it?). Please correct me if I've misunderstood.
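To make the intended ordering concrete, here is a minimal runnable sketch; kernel_warmup, Worker, and compile_or_warm_up_model below are simplified stand-ins for the vLLM worker code, not the PR's actual diff:

```python
# Sketch only: illustrates autotuning before CUDA graph capture and a plain
# warm-up right before real execution, as described in the comment above.

def kernel_warmup(worker, do_autotune: bool = True) -> None:
    # Placeholder: run warm-up kernels, optionally autotuning them first.
    print(f"kernel_warmup(do_autotune={do_autotune})")

class Worker:
    enforce_eager = False

    def capture_model(self) -> None:
        # Placeholder for CUDA graph capture.
        print("capturing CUDA graphs")

    def compile_or_warm_up_model(self) -> None:
        # 1) Autotune first so the best kernel indices are already cached
        #    when CUDA graphs are captured (the graphs bake in those kernels).
        kernel_warmup(self)
        # 2) Capture CUDA graphs with the tuned kernels.
        if not self.enforce_eager:
            self.capture_model()
        # 3) Plain dry run right before real execution; no re-tuning needed.
        kernel_warmup(self, do_autotune=False)

Worker().compile_or_warm_up_model()
```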
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Siyuan Fu <[email protected]>
Signed-off-by: siyuanf <[email protected]>
Force-pushed from c0b3809 to 7e1fb28 (Compare)
Signed-off-by: Siyuan Fu <[email protected]>
Looks good, could you also add an E2E accuracy test using lm-eval?
Hi @yewentao256. I have experimented with
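For reference, a minimal sketch of the kind of lm-eval accuracy check being asked for here; the model name, task, and environment variable are placeholders, not the exact setup used for this PR:

```python
import os

# Placeholder: enable the FlashInfer MXFP4/MXFP8 MoE path under test.
os.environ["VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8"] = "1"

import lm_eval

# Evaluate a (placeholder) model through lm-eval's vLLM backend on GSM8K.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=<model-under-test>,tensor_parallel_size=1",
    tasks=["gsm8k"],
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```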
Signed-off-by: Siyuan Fu <[email protected]>
Looks good, except please address the TODO.
Signed-off-by: Siyuan Fu <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
LGTM, thanks for the work!
Closed after #23537
Can be merged after flashinfer addresses the AOT installation.
Purpose
The flashinfer fp4 autotuner is merged. Need to update the API call in the mxfp4 moe.
Update the x_scale shape and use a hardcoded max tuning number of tokens in the mxfp4 moe.
Move kernel_warmup above the self.model_runner.capture_model().
Bump the flashinfer tag to 0.2.13.
Test Plan
Test Result
On B200, with VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1:
without autotuner
with autotuner
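For context, a minimal sketch of how the flag above might be exercised; the model name and prompt are placeholders, not the exact benchmark configuration:

```python
import os

# Flag from the test setup above: route MoE through FlashInfer MXFP4/MXFP8.
os.environ["VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8"] = "1"

from vllm import LLM, SamplingParams

# Placeholder model; any MXFP4 MoE checkpoint supported by this path would do.
llm = LLM(model="<mxfp4-moe-model>")
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```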
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.