feat: bf16 x mxfp4 cutlass fused moe for gpt-oss on hopper #23369
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; instead, only a limited run is triggered automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Force-pushed from 40ca782 to c215b71 (compare)
Signed-off-by: Duncan Moss <[email protected]>
Force-pushed from c215b71 to 927c179 (compare)
…ing in fused moe (#1565)

## 📌 Description
This fixes an OOB issue in the fused MoE and creates separate sm90 and sm100 paths for fp4 quantization.

## 🔍 Related Issues
Fix required for vllm-project/vllm#23369.

Signed-off-by: Duncan Moss <[email protected]>
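The OOB fix itself lives in FlashInfer's CUDA kernels. Purely as an illustration of the failure mode (not the FlashInfer code), the gather step of a fused MoE pulls token rows by expert-routing indices, and an out-of-bounds index there corrupts the gather; a bounds-checked version of that pattern in PyTorch looks roughly like this sketch:

```python
import torch

def gather_tokens_for_expert(hidden: torch.Tensor,
                             token_ids: torch.Tensor,
                             num_tokens: int) -> torch.Tensor:
    """Gather the rows of `hidden` routed to one expert.

    `token_ids` comes from the routing stage; an off-by-one there is the kind
    of out-of-bounds access the fix guards against. A CUDA kernel would clamp
    or early-exit the offending thread instead of raising.
    """
    assert 0 <= int(token_ids.min()) and int(token_ids.max()) < num_tokens
    return hidden.index_select(0, token_ids)

hidden = torch.randn(8, 16, dtype=torch.bfloat16)
routed = torch.tensor([0, 3, 7])   # token indices routed to this expert
print(gather_tokens_for_expert(hidden, routed, hidden.shape[0]).shape)
```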
This pull request has merge conflicts that must be resolved before it can be merged.
Closing, as #23696 contains these changes, including the Blackwell changes.
Purpose
This PR adds the FlashInfer CUTLASS fused MoE bf16 x mxfp4 kernel for gpt-oss on Hopper. The Blackwell integration will come in a follow-up PR soon.
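For readers unfamiliar with the format: mxfp4 here is the OCP microscaling FP4 layout, i.e. blocks of 32 E2M1 values sharing a single E8M0 power-of-two scale, multiplied against bf16 activations. Below is a minimal PyTorch reference dequantizer of the kind one might use to check the fused kernel against a plain bf16 matmul; the nibble packing order and tensor shapes are assumptions, and this is not the FlashInfer implementation.

```python
import torch

# The 16 values representable in FP4 E2M1: sign x {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
E2M1_LUT = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
)

def dequant_mxfp4(packed: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """packed: uint8 [..., K // 2], two FP4 codes per byte (low nibble first --
    the packing order is an assumption). scales: uint8 [..., K // 32], E8M0
    exponents where scale = 2 ** (e - 127). Returns bf16 [..., K]."""
    lo, hi = packed & 0x0F, packed >> 4
    codes = torch.stack([lo, hi], dim=-1).flatten(-2)          # [..., K]
    vals = E2M1_LUT.to(packed.device)[codes.long()]            # decode FP4
    scale = torch.pow(2.0, scales.float() - 127.0)             # decode E8M0
    vals = vals.unflatten(-1, (-1, 32)) * scale.unsqueeze(-1)  # per-32 blocks
    return vals.flatten(-2).to(torch.bfloat16)

# Tiny smoke test: 64 weights -> 32 packed bytes and 2 block scales.
packed = torch.randint(0, 256, (64 // 2,), dtype=torch.uint8)
scales = torch.full((64 // 32,), 127, dtype=torch.uint8)       # scale = 1.0
print(dequant_mxfp4(packed, scales).shape)                     # torch.Size([64])
```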
The accuracy looks good, but there is a performance regression compared to Triton (see the results section). This is something we are actively working on fixing and will push to FlashInfer once it is resolved. Please let me know how you want to proceed with backend selection, and I can update accordingly.
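As a sketch of what opting into this backend could look like from the user side: the selection mechanism is exactly what is being discussed here, so the environment variable name below is an assumption rather than a confirmed vLLM flag; the `LLM`/`SamplingParams` usage is standard vLLM offline inference.

```python
# Hypothetical opt-in to the FlashInfer CUTLASS bf16 x mxfp4 MoE path on Hopper.
# VLLM_USE_FLASHINFER_MOE_MXFP4_BF16 is an assumed switch, not a confirmed flag.
import os
os.environ["VLLM_USE_FLASHINFER_MOE_MXFP4_BF16"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b")
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```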
Test Plan
Unit Tests:
Accuracy Tests:
evals
from here
Reference scores for gpt-oss-20b and gpt-oss-120b are 0.58 and 0.68, respectively.
Performance:
Test Result
Unit Tests:
Accuracy Tests:
Performance:
FlashInfer:
Triton: