[V1] [Hybrid] Enable compile and piecewise CUDA graph for MiniMax-Text models #22589
base: main
Conversation
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This pull request refactors the MiniMax-Text model to enable torch.compile and piecewise CUDA graph capture. The changes primarily involve modifying forward passes to write into output buffers instead of returning tensors, which is a key pattern for compiler compatibility. A custom op, linear_attention, is introduced to serve as a boundary for piecewise compilation. The changes are generally well executed and align with the goal of improving performance through compilation. My feedback focuses on improving code quality by correcting type hints and removing a leftover debug statement.
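For context, here is a minimal sketch of how a mutating op like this can be registered as a compilation boundary using PyTorch's custom-op API. The PR itself registers torch.ops.vllm.linear_attention through vLLM's own registration helper (and such ops are typically also named among the compilation config's splitting ops); the sketch::* namespace and placeholder body below are purely illustrative:

import torch

@torch.library.custom_op("sketch::linear_attention", mutates_args=("output",))
def _linear_attention(hidden_states: torch.Tensor, output: torch.Tensor,
                      positions: torch.Tensor, layer_name: str) -> None:
    # A real implementation would look up the layer by name and run its eager
    # attention path, writing the result into the pre-allocated output buffer.
    output.copy_(hidden_states)  # placeholder compute only

@_linear_attention.register_fake
def _(hidden_states: torch.Tensor, output: torch.Tensor,
      positions: torch.Tensor, layer_name: str) -> None:
    # Fake (meta) implementation so the op can be traced without running the
    # kernel; the compiler only needs to know it mutates `output` in place.
    return None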
def forward(self, hidden_states: torch.Tensor, output: torch.Tensor,
            positions: torch.Tensor,
            kv_caches: MinimaxCacheParams) -> torch.Tensor:
    if not envs.VLLM_USE_V1:
        self._forward(hidden_states, output, positions, kv_caches)
    else:
        torch.ops.vllm.linear_attention(
            hidden_states,
            output,
            positions,
            self.prefix,
        )

def _forward(self, hidden_states: torch.Tensor, output: torch.Tensor,
             positions: torch.Tensor,
             kv_caches: MinimaxCacheParams) -> torch.Tensor:
The type hints for the forward and _forward methods in MiniMaxText01LinearAttention have some issues that should be corrected for code clarity and correctness:
- The return type for both forward (L514) and _forward (L527) is annotated as torch.Tensor, but neither function returns a value. They should be annotated with -> None.
- The kv_caches parameter in _forward (L527) is annotated as MinimaxCacheParams, but it is called with None from the linear_attention custom op (L1460). It should be Optional[MinimaxCacheParams].
Suggested change:

Before:
def forward(self, hidden_states: torch.Tensor, output: torch.Tensor,
            positions: torch.Tensor,
            kv_caches: MinimaxCacheParams) -> torch.Tensor:
    if not envs.VLLM_USE_V1:
        self._forward(hidden_states, output, positions, kv_caches)
    else:
        torch.ops.vllm.linear_attention(
            hidden_states,
            output,
            positions,
            self.prefix,
        )

def _forward(self, hidden_states: torch.Tensor, output: torch.Tensor,
             positions: torch.Tensor,
             kv_caches: MinimaxCacheParams) -> torch.Tensor:

After:
def forward(self, hidden_states: torch.Tensor, output: torch.Tensor,
            positions: torch.Tensor,
            kv_caches: MinimaxCacheParams) -> None:
    if not envs.VLLM_USE_V1:
        self._forward(hidden_states, output, positions, kv_caches)
    else:
        torch.ops.vllm.linear_attention(
            hidden_states,
            output,
            positions,
            self.prefix,
        )

def _forward(self, hidden_states: torch.Tensor, output: torch.Tensor,
             positions: torch.Tensor,
             kv_caches: Optional[MinimaxCacheParams]) -> None:
    layer_name: str,
) -> None:
    forward_context: ForwardContext = get_forward_context()
    print("layer_name: ", layer_name)
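For reference, a hedged sketch of what the rest of this op body conventionally looks like in vLLM (the print above is the leftover debug statement called out in the review summary). The forward-context lookup shown follows the usual pattern for such ops and may differ from the final PR:

import torch
from vllm.forward_context import ForwardContext, get_forward_context

def linear_attention(
    hidden_states: torch.Tensor,
    output: torch.Tensor,
    positions: torch.Tensor,
    layer_name: str,
) -> None:
    forward_context: ForwardContext = get_forward_context()
    # Resolve the eager (non-compiled) layer registered under this prefix and
    # let it write its result into the pre-allocated output buffer.
    self = forward_context.no_compile_layers[layer_name]
    self._forward(hidden_states, output, positions, None)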
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
LGTM! @tdoublep can you update the document?
Please merge after the correctness is verified.
Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.

Purpose
This PR removes the --enforce-eager constraint for MiniMax models. It adds support for piecewise CUDA graphs for the linear attention layers and enables torch.compile for the rest of the model. It would be great if the MiniMax team could run additional correctness checks on the real model.
cc @rogeryoungh @qscqesze @heheda12345
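For illustration only, a minimal offline-inference sketch of running the model without eager enforcement; the checkpoint name is the tiny dev model mentioned in the test plan below, and the flags shown are assumptions, not changes made by this PR:

from vllm import LLM, SamplingParams

# With this PR, enforce_eager should no longer be required for MiniMax-Text;
# the normal path exercises torch.compile and piecewise CUDA graphs.
llm = LLM(model="Goekdeniz-Guelmez/MiniMax01Text-Dev", enforce_eager=False)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)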
Test Plan
I have tested it locally using Goekdeniz-Guelmez/MiniMax01Text-Dev. I haven't included that test in this PR because #21549 needs to land first: unfortunately, FlashInfer doesn't support that tiny model.

Test Result
The test passes (i.e., V1 results with compile match V0 results).
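A hedged sketch of the kind of V0-vs-V1 consistency check described above (greedy decoding, identical prompts); the real test is not part of this PR and would normally isolate each engine configuration in its own process:

import os
from vllm import LLM, SamplingParams

PROMPTS = ["The capital of France is"]
PARAMS = SamplingParams(temperature=0.0, max_tokens=8)  # greedy for determinism

def generate_texts(use_v1: bool) -> list[str]:
    # In practice each configuration should run in a separate process so the
    # env var and GPU memory do not leak between runs.
    os.environ["VLLM_USE_V1"] = "1" if use_v1 else "0"
    llm = LLM(model="Goekdeniz-Guelmez/MiniMax01Text-Dev",
              enforce_eager=not use_v1)  # V1 path exercises compile + CUDA graphs
    return [o.outputs[0].text for o in llm.generate(PROMPTS, PARAMS)]

def test_v1_compile_matches_v0():
    assert generate_texts(True) == generate_texts(False)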
(Optional) Documentation Update