
Conversation

tdoublep (Member) commented Aug 10, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

This PR removes the --enforce-eager constraint for MiniMax models. It adds support for piecewise CUDA graphs around the linear attention and enables torch.compile for the rest of the model.

It would be great if the MiniMax team could run additional correctness checks on the real model.

cc @rogeryoungh @qscqesze @heheda12345

Test Plan

I have tested it locally using Goekdeniz-Guelmez/MiniMax01Text-Dev. I haven't included that test in this PR because it depends on #21549 landing first; unfortunately, FlashInfer doesn't support that tiny model.

Test Result

The test passes (i.e., the V1 results with compilation enabled match the V0 results).
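
For context, a minimal local consistency check along these lines could look like the sketch below. This is not the test that will land with #21549: the tiny model name comes from the test plan above, but the prompts, sampling settings, and exact string comparison are assumptions (greedy outputs are not guaranteed to match token-for-token between the eager and compiled paths).

# Sketch of a local eager-vs-compiled consistency check (not the test from #21549).
# Assumes a CUDA machine with this branch installed; prompts are arbitrary.
from vllm import LLM, SamplingParams

MODEL = "Goekdeniz-Guelmez/MiniMax01Text-Dev"
PROMPTS = ["The capital of France is", "The quick brown fox"]
GREEDY = SamplingParams(temperature=0.0, max_tokens=32)

def generate(enforce_eager: bool) -> list[str]:
    # In practice the two engines should run in separate processes to avoid
    # holding GPU memory twice; kept in one script here for brevity.
    llm = LLM(model=MODEL, trust_remote_code=True, enforce_eager=enforce_eager)
    return [out.outputs[0].text for out in llm.generate(PROMPTS, GREEDY)]

eager = generate(enforce_eager=True)      # no CUDA graphs, no torch.compile
compiled = generate(enforce_eager=False)  # piecewise CUDA graphs + torch.compile

for prompt, a, b in zip(PROMPTS, eager, compiled):
    assert a == b, f"Mismatch for {prompt!r}: {a!r} vs {b!r}"
print("eager and compiled outputs match")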

(Optional) Documentation Update

Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request refactors the MiniMax-Text model to enable torch.compile and piecewise CUDA graph capture. The changes primarily involve modifying forward passes to use output buffers instead of returning tensors, which is a key pattern for compiler compatibility. A custom op linear_attention is introduced to serve as a boundary for piecewise compilation. The changes are generally well-executed and align with the goal of improving performance through compilation. My feedback focuses on improving code quality by correcting type hints and removing a leftover debug statement.
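
For readers unfamiliar with the pattern described here, below is a minimal, generic sketch in plain PyTorch (not the actual vLLM code; the op name demo::linear_attention, the shapes, and the toy math are all placeholders). It shows how a custom op that writes into a caller-provided output buffer stays opaque to torch.compile, which lets the compiler treat it as a single node and, in vLLM, split the graph around it for piecewise CUDA graph capture.

# Generic illustration of the "custom op with an output buffer" pattern.
# Plain PyTorch (>= 2.4); names, shapes, and the toy math are placeholders,
# not the vLLM MiniMax implementation.
import torch

@torch.library.custom_op("demo::linear_attention", mutates_args=("output",))
def linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     output: torch.Tensor) -> None:
    # The real kernel also updates recurrent state and is not traceable by
    # torch.compile; writing into `output` in place means the op returns
    # nothing and appears as a single opaque node in the captured graph.
    output.copy_(torch.matmul(torch.matmul(q, k.transpose(-1, -2)), v))

@linear_attention.register_fake
def _(q, k, v, output) -> None:
    return None

class Block(torch.nn.Module):
    def forward(self, q, k, v):
        out = torch.empty_like(v)
        torch.ops.demo.linear_attention(q, k, v, out)  # opaque boundary
        return out * 2.0  # ops around the boundary are still compiled

block = torch.compile(Block())
q = k = v = torch.randn(1, 8, 16)
print(block(q, k, v).shape)  # torch.Size([1, 8, 16])

In vLLM itself the op is, as far as I understand, registered through the project's own helper and its name added to the compilation config's splitting_ops, which is what actually drives the piecewise split; the snippet above only illustrates the compiler-boundary idea.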

heheda12345 (Collaborator) left a comment


LGTM! @tdoublep, can you update the documentation?
Please merge after the correctness is verified.

heheda12345 added the ready label ("ONLY add when PR is ready to merge/full CI is needed") on Aug 11, 2025
rogeryoungh commented Aug 14, 2025

Thank you for your work! We ran validation using the same build environment as last time, but the results seemed a bit unusual, which could be due to a problem with the model inference process. Here are the reproduction steps and outcomes for your reference.

We compiled and installed your PR inside the vllm-openai:v0.10.0 Docker image via pip install --no-build-isolation /tmp/vllm_patched/.

The deployment command was the same as in the previous PR, just with --enforce-eager removed:

python3 -m vllm.entrypoints.api_server --model /data/xxx/model/MiniMax-Text-01/ --tensor-parallel-size 8 --trust-remote-code --quantization experts_int8 --max_model_len 8192 --dtype bfloat16 --no-enable-prefix-caching

Here are the test results:

For gsm8k:

python3 bench_other.py --num-questions 500 --num-shots 5 --backend vllm --port 8000 --host http://127.0.0.1
# ...
Accuracy: 0.010
Invalid: 0.018
Latency: 202.115 s

For mmlu:

python3 bench_other.py --nsub 200 --backend vllm --port 8000 --host http://127.0.0.1
# ...
Total latency: 1405.775
Average accuracy: 0.801

Upon checking the model's output for GSM8K, we noticed a significant number of extra newlines and garbled characters, indicating an abnormal output format.

On the other hand, for the MMLU benchmark, which only requires a single letter as the answer, the accuracy is only slightly lower than normal. We suspect this simpler output format might be masking some underlying issues that are more apparent in the GSM8K results.

tdoublep (Member, Author):

@rogeryoungh Thanks for the eval. There must be some bug that isn't being hit when I run with the tiny model. I will take another look at it.


mergify bot commented Aug 15, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tdoublep.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Aug 15, 2025
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
tdoublep (Member, Author):

I was able to reproduce the bad lm_eval results and dug into what is going on here.

The problem is related to the implementation of the rotary embedding for this model (it is not compatible with torch.compile). I've replaced it with the call to get_rope that the other models use. If there was any particular reason why a custom implementation (e.g., MiniMaxText01RotaryEmbedding) was needed, please let me know; I took a quick look through the code and couldn't see one.
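
Roughly, the replacement follows the usual get_rope call pattern shown in the sketch below; the sizes, rope base, and is_neox_style value are placeholders rather than values read from the MiniMax config, so treat it as an illustration of the call pattern rather than the actual diff.

# Illustration of the get_rope call pattern (placeholder sizes, not the real diff).
import torch
from vllm.model_executor.layers.rotary_embedding import get_rope

head_dim, num_heads = 128, 8
rotary_emb = get_rope(
    head_size=head_dim,
    rotary_dim=head_dim // 2,   # placeholder partial rotary dim
    max_position=4096,          # placeholder max_position_embeddings
    base=10000,                 # placeholder rope_theta
    is_neox_style=True,         # assumption, not taken from the model config
)

positions = torch.arange(16)
q = torch.randn(16, num_heads * head_dim)
k = torch.randn(16, num_heads * head_dim)
q, k = rotary_emb(positions, q, k)  # same call pattern as other vLLM models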

I deploy the model as follows:

VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve MiniMaxAI/MiniMax-Text-01 \
	--tensor-parallel-size 8 \
	--trust-remote-code \
	--quantization experts_int8  \
	--max_model_len 4096 \
	--dtype bfloat16 \
	--gpu-memory-utilization 0.95 \
	--no-enable-prefix-caching

and then I run eval with:

lm_eval   --model local-completions   \
	--model_args base_url=http://localhost:8000/v1/completions,tokenizer=MiniMaxAI/MiniMax-Text-01 \
	--tasks gsm8k  \
	--batch_size 128 \
	--num_fewshot 5 \
	--limit 500

which now produces the expected output:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.898|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.892|±  |0.0139|

@rogeryoungh Could you give it another try on your end?

tdoublep (Member, Author):

After pulling in the latest changes from main, we can now also deploy with the default FlashAttention backend (instead of FlashInfer):

VLLM_USE_V1=1 vllm serve MiniMaxAI/MiniMax-Text-01 \
	--tensor-parallel-size 8 \
	--trust-remote-code \
	--quantization experts_int8  \
	--max_model_len 4096 \
	--dtype bfloat16 \
	--gpu-memory-utilization 0.95 \
	--no-enable-prefix-caching

The gsm8k eval above now produces:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.904|±  |0.0132|
|     |       |strict-match    |     5|exact_match|↑  |0.900|±  |0.0134|

Signed-off-by: Thomas Parnell <[email protected]>
rogeryoungh:

Great work! I have verified the changes, and the implementation now works as expected. On GSM8k the accuracy is 0.908, and on MMLU the average accuracy is 0.847.

Deployment command:

VLLM_ATTENTION_BACKEND=FLASHINFER VLLM_USE_V1=1 python3 -m vllm.entrypoints.api_server \
  --model /data/xxx/model/MiniMax-Text-01/ \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --quantization experts_int8 \
  --max_model_len 4096 \
  --dtype bfloat16 \
  --no-enable-prefix-caching

GSM8k test:

Accuracy: 0.908
Invalid: 0.000
Latency: 203.588 s

MMLU test:

Total latency: 1489.795
Average accuracy: 0.847

qscqesze (Contributor):

> I was able to reproduce the bad lm_eval results and dug into what is going on here.
>
> The problem is related to the implementation of the rotary embedding for this model (it is not compatible with torch.compile). I've replaced it with the call to get_rope that the other models use. If there was any particular reason why a custom implementation (e.g., MiniMaxText01RotaryEmbedding) was needed, please let me know; I took a quick look through the code and couldn't see one.

Thanks a lot for the fix! The custom implementation didn’t have any special reason — it was just how it was written back then. Really appreciate your improvement.

@rogeryoungh
Copy link

I also retested with the default FlashAttention backend. On GSM8k, the accuracy was 0.904, and on MMLU the average accuracy was 0.847. Everything looks good now.

heheda12345 merged commit dd58932 into vllm-project:main on Aug 27, 2025
40 checks passed
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
dumb0002 pushed a commit to dumb0002/vllm that referenced this pull request Aug 28, 2025
2015aroras pushed a commit to 2015aroras/vllm that referenced this pull request Aug 29, 2025