[V1] [Hybrid] Enable compile and piecewise CUDA graph for MiniMax-Text models #22589
Conversation
Code Review

This pull request refactors the MiniMax-Text model to enable torch.compile and piecewise CUDA graph capture. The changes primarily involve modifying forward passes to use output buffers instead of returning tensors, which is a key pattern for compiler compatibility. A custom op linear_attention is introduced to serve as a boundary for piecewise compilation. The changes are generally well executed and align with the goal of improving performance through compilation. My feedback focuses on improving code quality by correcting type hints and removing a leftover debug statement.
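To make that pattern concrete, here is a minimal, self-contained sketch of the idea (my illustration, not the PR's actual code): a mutating custom op that writes into a caller-provided output buffer and is registered via torch.library.custom_op (PyTorch 2.4+), so torch.compile treats it as an opaque boundary that graphs can be split around. The op name, the toy math, and the shapes are all assumptions.

```python
# Sketch (assumptions noted above): a custom op that writes into a
# caller-provided output buffer instead of returning a fresh tensor.
# torch.compile treats the op as opaque, so the surrounding graph can be
# compiled piecewise around it.
import torch


@torch.library.custom_op("myops::linear_attention_demo", mutates_args=("out",))
def linear_attention_demo(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                          out: torch.Tensor) -> None:
    # Toy stand-in for the real kernel: (q @ k^T) @ v written into `out`.
    out.copy_(torch.matmul(torch.matmul(q, k.transpose(-1, -2)), v))


@linear_attention_demo.register_fake
def _(q, k, v, out) -> None:
    # Shape/dtype inference only; nothing is computed while tracing.
    return None


@torch.compile
def layer(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(q)
    torch.ops.myops.linear_attention_demo(q, k, v, out)
    return out


if __name__ == "__main__":
    q = k = v = torch.randn(2, 8, 16)
    print(layer(q, k, v).shape)
```

The output-buffer convention is what lets the compiled caller pre-allocate and reuse memory across steps, which is also what makes CUDA graph capture around the op practical.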
LGTM! @tdoublep, can you update the documentation?
Please merge after the correctness is verified.
Thank you for your work! We ran validation using the same build environment as last time, but the results seemed a bit unusual, which could be due to a problem with the model inference process. Here are the reproduction steps and outcomes for your reference. We compiled and installed your PR inside the same build environment, and the deployment command was the same as in the previous PR, with only a minor change.

Here are the test results.

For gsm8k:

```
python3 bench_other.py --num-questions 500 --num-shots 5 --backend vllm --port 8000 --host http://127.0.0.1
# ...
Accuracy: 0.010
Invalid: 0.018
Latency: 202.115 s
```

For mmlu:

```
python3 bench_other.py --nsub 200 --backend vllm --port 8000 --host http://127.0.0.1
# ...
Total latency: 1405.775
Average accuracy: 0.801
```

Upon checking the model's output for GSM8K, we noticed a significant number of extra newlines and garbled characters, indicating an abnormal output format. On the other hand, for the MMLU benchmark, which only requires a single letter as the answer, the accuracy is only slightly lower than normal. We suspect this simpler output format might be masking underlying issues that are more apparent in the GSM8K results.
@rogeryoungh Thanks for the eval. There must be some bug that isn't being hit when I run with the tiny model. I will take another look at it.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 97fdef6 to 07b3dd7.
I was able to reproduce the bad lm_eval results and dug into what is going on here. The problem is related to the implementation of the rotary embedding for this model (it is not compatible with torch.compile). I've replaced it with a call to the shared implementation instead.

I deploy the model as follows:
and then I run eval with:
which now produces the expected output:
@rogeryoungh Could you give it another try on your end?
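For context, here is an illustrative sketch (mine, not the code from this PR; the exact call that replaced the custom implementation is not quoted above): a torch.compile-friendly approach is to build the rotary embedding through vLLM's shared get_rope helper rather than a hand-rolled module. The head size, rotary dimension, max position, and base below are made-up example values, and the snippet assumes get_rope's usual signature.

```python
# Illustrative sketch only; the sizes are assumed example values, not taken
# from the MiniMax-Text config.
from vllm.model_executor.layers.rotary_embedding import get_rope

rotary_emb = get_rope(
    head_size=128,       # example value
    rotary_dim=64,       # example: partial rotary dimension
    max_position=4096,   # example value
    base=10000,          # example value
    is_neox_style=True,
)

# At runtime the layer is applied in the usual vLLM way:
#   query, key = rotary_emb(positions, query, key)
# which avoids the custom implementation that torch.compile could not handle.
```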
After pulling in the latest changes from main, we can now also deploy with the default FlashAttention backend (instead of FlashInfer):
The gsm8k eval above now produces:
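For anyone reproducing the two setups, attention-backend selection in vLLM is controlled by the VLLM_ATTENTION_BACKEND environment variable; the snippet below is a generic illustration of pinning FlashInfer versus falling back to the default FlashAttention backend, not the exact commands used in this thread.

```python
# Illustration: pick the attention backend before vLLM builds the engine.
import os

# Pin FlashInfer explicitly ...
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
# ... or leave VLLM_ATTENTION_BACKEND unset to use the default
# FlashAttention backend.

from vllm import LLM  # imported after the environment variable is set
```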
Great work! I have verified the changes, and the implementation now works as expected. On GSM8k the accuracy is 0.908, and on MMLU the average accuracy is 0.847.

Deployment command:

```
VLLM_ATTENTION_BACKEND=FLASHINFER VLLM_USE_V1=1 python3 -m vllm.entrypoints.api_server \
    --model /data/xxx/model/MiniMax-Text-01/ \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --quantization experts_int8 \
    --max_model_len 4096 \
    --dtype bfloat16 \
    --no-enable-prefix-caching
```

GSM8k test:

```
Accuracy: 0.908
Invalid: 0.000
Latency: 203.588 s
```

MMLU test:

```
Total latency: 1489.795
Average accuracy: 0.847
```
Thanks a lot for the fix! The custom implementation didn't have any special reason; it was just how it was written back then. Really appreciate your improvement.
I also retested with the default FlashAttention backend. On GSM8k, the accuracy was 0.904, and on MMLU the average accuracy was 0.847. Everything looks good now.
Purpose

This PR removes the --enforce-eager constraint for MiniMax models. It adds support for piecewise CUDA graphs for the linear attention and enables torch.compile for the rest of the model.

It would be great if the MiniMax team could run additional correctness checks on the real model.
cc @rogeryoungh @qscqesze @heheda12345
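As a usage-level illustration (assumed checkpoint name and settings, not a command taken from this PR), with this change a MiniMax-Text model can be constructed without enforce_eager=True, letting torch.compile and piecewise CUDA graphs kick in:

```python
# Sketch with assumed settings: after this PR, enforce_eager=True is no
# longer required for MiniMax-Text models.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-Text-01",  # assumed checkpoint name
    trust_remote_code=True,
    tensor_parallel_size=8,             # example value for the full model
    max_model_len=4096,
    dtype="bfloat16",
    # enforce_eager=True                # previously needed; now optional
)

out = llm.generate(["The capital of France is"],
                   SamplingParams(temperature=0.0, max_tokens=16))
print(out[0].outputs[0].text)
```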
Test Plan

I have tested it using Goekdeniz-Guelmez/MiniMax01Text-Dev locally. I haven't included that test in this PR because we need to land #21549 first, since FlashInfer unfortunately doesn't support that tiny model.

Test Result
The test is passing (e.g., V1 results with compile match V0 results).
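The actual test is not included here (see the note about #21549 above), but a rough sketch of the kind of check described, comparing greedy outputs of the tiny Goekdeniz-Guelmez/MiniMax01Text-Dev model with and without compilation and CUDA graphs, could look like the following. The prompts, memory setting, and engine teardown are assumptions, and it simplifies the V0-vs-V1 comparison to an eager-vs-compiled one.

```python
# Rough sketch (assumptions noted above): compare greedy outputs of the tiny
# MiniMax dev model with eager execution vs. compilation + CUDA graphs.
import gc

import torch
from vllm import LLM, SamplingParams

MODEL = "Goekdeniz-Guelmez/MiniMax01Text-Dev"
PROMPTS = ["Hello, my name is", "The capital of France is"]  # example prompts
GREEDY = SamplingParams(temperature=0.0, max_tokens=32)


def generate(enforce_eager: bool) -> list[str]:
    llm = LLM(model=MODEL,
              trust_remote_code=True,
              enforce_eager=enforce_eager,
              gpu_memory_utilization=0.4)  # conservative, engines run one at a time
    texts = [o.outputs[0].text for o in llm.generate(PROMPTS, GREEDY)]
    # Free the engine before building the next one.
    del llm
    gc.collect()
    torch.cuda.empty_cache()
    return texts


eager_texts = generate(enforce_eager=True)
compiled_texts = generate(enforce_eager=False)
assert eager_texts == compiled_texts, (eager_texts, compiled_texts)
print("outputs match")
```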
(Optional) Documentation Update