
Conversation


@tomeras91 tomeras91 commented Sep 10, 2025

Purpose

PR #13660 built on #10909 and fixed tensor-parallel (TP) loading issues for mamba2 mixer weights. Concretely, it handled the weight duplication needed when `n_groups % tp_size != 0`. This logic is fairly involved and required custom weight loaders, which makes quantized layers unusable, since their weight-loading logic can't be overridden. Following that, PR #14617 introduced an assertion to make sure the mamba2 mixer isn't run with both quantization and TP>1.

Yet, due to the complexity/impact tradeoff, PR #13660 didn't support all values of `n_groups % tp_size`, but only two cases: `n_groups % tp_size == 0` and `n_groups == 1`. The custom weight-loading logic is needed only for the latter; in the former, no weight duplication is required and the `MergedColumnParallelLinear` class can be used.

This PR splits the weight creation of the mamba2 mixer into two code paths:

  1. The current path in main is used only if `n_groups % tp_size != 0` (which, given the two supported cases above, means `n_groups == 1`)
  2. `MergedColumnParallelLinear` is used if `n_groups % tp_size == 0`
     This enables quantization of the mamba2 mixer block when `n_groups % tp_size == 0`, which is the more common case. A hedged sketch of the branch follows this list.
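
A minimal sketch of the resulting branch (the function name `build_in_proj`, the exact output split, the `bias` value, and the `NotImplementedError` stand-in for the custom-loader path are all illustrative assumptions, not the actual vLLM code):

```python
from vllm.model_executor.layers.linear import MergedColumnParallelLinear

def build_in_proj(hidden_size, intermediate_size, n_groups, ssm_state_size,
                  num_heads, tp_size, quant_config):
    # Combined size of the group-dependent projections (B and C in mamba2).
    groups_ssm_state_size = n_groups * ssm_state_size
    if n_groups % tp_size == 0:
        # Path 2: each TP rank owns a whole number of groups, so no weight
        # duplication is needed and the standard merged layer, which supports
        # quantization, can be used. The output split here is illustrative.
        return MergedColumnParallelLinear(
            hidden_size,
            [intermediate_size, intermediate_size,
             groups_ssm_state_size, groups_ssm_state_size, num_heads],
            bias=False,
            quant_config=quant_config,
        )
    # Path 1 (n_groups == 1): groups must be duplicated across TP ranks,
    # which requires the custom weight loaders and rules out quantization.
    raise NotImplementedError("custom-weight-loader path (unquantized only)")
```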

Test Plan

  1. Make sure TP>1 results of unquantized models stay the same. This is done by running the following on both main and this branch:

```bash
lm_eval --model vllm --model_args pretrained=mistralai/Mamba-Codestral-7B-v0.1,gpu_memory_utilization=0.8,max_model_len=4096,tensor_parallel_size=2 --batch_size auto --trust_remote_code --cache_requests true --tasks gsm8k
```

  2. Make sure a quantized mamba2 model can be loaded with TP>1 and gets results equivalent to TP=1 (see the hedged command sketch below).
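
For step 2, a hypothetical invocation along the same lines (the checkpoint name is a placeholder, since the quantized model used for testing is internal):

```bash
# Run once with tensor_parallel_size=1 and once with tensor_parallel_size=2,
# then compare the GSM8K scores.
lm_eval --model vllm \
  --model_args pretrained=<quantized-mamba2-checkpoint>,gpu_memory_utilization=0.8,max_model_len=4096,tensor_parallel_size=2 \
  --batch_size auto --trust_remote_code --cache_requests true --tasks gsm8k
```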

Test Result

  1. Got the following GSM8K scores:
    main:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4685|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.4549|±  |0.0137|

PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4685|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.4549|±  |0.0137|

Results are identical.

  2. Validated with an internal quantized mamba2 model.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


@tomeras91 tomeras91 requested a review from tdoublep as a code owner September 10, 2025 14:48
@tomeras91 (Author) commented:

CC @fabianlim

@gemini-code-assist bot left a comment:

Code Review

This pull request introduces a significant enhancement by enabling quantization for the Mamba2 mixer with tensor parallelism when n_groups is divisible by tp_size. This is achieved by adding a new code path using MergedColumnParallelLinear. The changes are well-structured. However, I've identified a critical logic error in the condition that selects between the new and old code paths, which would lead to incorrect behavior for the n_groups=1 case with TP>1. A fix is suggested to ensure the correct path is chosen based on the original model configuration.

tomeras91 and others added 2 commits September 10, 2025 18:09
…always divisible by tp_size

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: tomeras91 <[email protected]>
```python
self.intermediate_size // self.tp_size,
groups_time_state_size // self.tp_size,
groups_time_state_size // self.tp_size,
self.groups_ssm_state_size // self.tp_size,
```
Contributor commented:

oic was this a bug?

@tomeras91 (Author) replied:

No, this is just some renaming, since I now create `self.groups_ssm_state_size = self.n_groups * self.ssm_state_size` during init.
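
In code terms, the change amounts to the following (a hedged sketch; the surrounding code differs):

```python
# Before: the product was recomputed inline at each use site.
groups_time_state_size = self.n_groups * self.ssm_state_size

# After: computed once in __init__ and reused as an attribute.
self.groups_ssm_state_size = self.n_groups * self.ssm_state_size
```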

@fabianlim (Contributor) left a comment:

LGTM


mergify bot commented Sep 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tomeras91.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 12, 2025
@mergify mergify bot removed the needs-rebase label Sep 14, 2025
@tdoublep (Member) left a comment:

LGTM - thanks

@tdoublep tdoublep enabled auto-merge (squash) September 16, 2025 08:01
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 16, 2025
@tdoublep tdoublep merged commit 27fcfe7 into vllm-project:main Sep 16, 2025
42 checks passed
@tomeras91 tomeras91 deleted the fix-mamba2-quant-tp branch September 16, 2025 11:27
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
…oups % tp_size == 0` (vllm-project#24593)

Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: tomeras91 <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…oups % tp_size == 0` (vllm-project#24593)

Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: tomeras91 <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: xuebwang-amd <[email protected]>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
…oups % tp_size == 0` (vllm-project#24593)

Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: tomeras91 <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…oups % tp_size == 0` (vllm-project#24593)

Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: tomeras91 <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: xuebwang-amd <[email protected]>