
Conversation


@tomeras91 tomeras91 commented Sep 10, 2025

Purpose

PR #13660 built on #10909 and fixed tensor-parallel (TP) loading issues for mamba2 mixer weights. Concretely, it handled the weight duplication needed when `n_groups % tp_size != 0`. This logic is fairly involved and required custom weight loaders, which makes quantized layers unusable, since their weight-loading logic can't be overridden. Following that, PR #14617 introduced an assertion to make sure the mamba2 mixer isn't run with both quantization and TP>1.

Yet, due to the complexity/impact tradeoff, PR #13660 didn't support all values of `n_groups % tp_size`, but only two cases: `n_groups % tp_size == 0` and `n_groups == 1`. The custom weight-loading logic is needed only for the latter; in the former, no weight duplication is required and the `MergedColumnParallelLinear` class can be used.

This PR splits the weight creation of the mamba2 mixer into two code paths:

  1. The current path in main is used only if `n_groups % tp_size != 0` (which, given the two supported cases above, means `n_groups == 1`)
  2. `MergedColumnParallelLinear` is used if `n_groups % tp_size == 0`
     This enables quantization of the mamba2 mixer block when `n_groups % tp_size == 0`, which is the more common case. A hedged sketch of the branch follows this list.
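
A minimal sketch of the resulting branch (the function name `build_in_proj`, the exact output split, the `bias` value, and the `NotImplementedError` stand-in for the custom-loader path are all illustrative assumptions, not the actual vLLM code):

```python
from vllm.model_executor.layers.linear import MergedColumnParallelLinear

def build_in_proj(hidden_size, intermediate_size, n_groups, ssm_state_size,
                  num_heads, tp_size, quant_config):
    # Combined size of the group-dependent projections (B and C in mamba2).
    groups_ssm_state_size = n_groups * ssm_state_size
    if n_groups % tp_size == 0:
        # Path 2: each TP rank owns a whole number of groups, so no weight
        # duplication is needed and the standard merged layer, which supports
        # quantization, can be used. The output split here is illustrative.
        return MergedColumnParallelLinear(
            hidden_size,
            [intermediate_size, intermediate_size,
             groups_ssm_state_size, groups_ssm_state_size, num_heads],
            bias=False,
            quant_config=quant_config,
        )
    # Path 1 (n_groups == 1): groups must be duplicated across TP ranks,
    # which requires the custom weight loaders and rules out quantization.
    raise NotImplementedError("custom-weight-loader path (unquantized only)")
```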

Test Plan

  1. Make sure TP>1 results of unquantized models stay the same. This is done by running the following on both main and this branch:

```bash
lm_eval --model vllm --model_args pretrained=mistralai/Mamba-Codestral-7B-v0.1,gpu_memory_utilization=0.8,max_model_len=4096,tensor_parallel_size=2 --batch_size auto --trust_remote_code --cache_requests true --tasks gsm8k
```

  2. Make sure a quantized mamba2 model can be loaded with TP>1 and gets results equivalent to TP=1 (see the hedged command sketch below).
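
For step 2, a hypothetical invocation along the same lines (the checkpoint name is a placeholder, since the quantized model used for testing is internal):

```bash
# Run once with tensor_parallel_size=1 and once with tensor_parallel_size=2,
# then compare the GSM8K scores.
lm_eval --model vllm \
  --model_args pretrained=<quantized-mamba2-checkpoint>,gpu_memory_utilization=0.8,max_model_len=4096,tensor_parallel_size=2 \
  --batch_size auto --trust_remote_code --cache_requests true --tasks gsm8k
```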

Test Result

  1. Got the following GSM8K scores:
    main:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4685|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.4549|±  |0.0137|

PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4685|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.4549|±  |0.0137|

Results are identical.

  2. Validated with an internal quantized mamba2 model.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


@tomeras91 tomeras91 requested a review from tdoublep as a code owner September 10, 2025 14:48
@tomeras91 (Author) commented:

CC @fabianlim

@gemini-code-assist bot left a comment:

Code Review

This pull request introduces a significant enhancement by enabling quantization for the Mamba2 mixer with tensor parallelism when n_groups is divisible by tp_size. This is achieved by adding a new code path using MergedColumnParallelLinear. The changes are well-structured. However, I've identified a critical logic error in the condition that selects between the new and old code paths, which would lead to incorrect behavior for the n_groups=1 case with TP>1. A fix is suggested to ensure the correct path is chosen based on the original model configuration.

tomeras91 and others added 2 commits September 10, 2025 18:09
…always divisible by tp_size

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: tomeras91 <[email protected]>
```python
self.intermediate_size // self.tp_size,
groups_time_state_size // self.tp_size,
groups_time_state_size // self.tp_size,
self.groups_ssm_state_size // self.tp_size,
```
Contributor commented:

oic was this a bug?

@tomeras91 (Author) replied:

No, this is just some renaming, since I now create `self.groups_ssm_state_size = self.n_groups * self.ssm_state_size` during init.
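
In code terms, the change amounts to the following (a hedged sketch; the surrounding code differs):

```python
# Before: the product was recomputed inline at each use site.
groups_time_state_size = self.n_groups * self.ssm_state_size

# After: computed once in __init__ and reused as an attribute.
self.groups_ssm_state_size = self.n_groups * self.ssm_state_size
```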

@fabianlim (Contributor) left a comment:

LGTM


mergify bot commented Sep 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tomeras91.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 12, 2025
@mergify mergify bot removed the needs-rebase label Sep 14, 2025
@tdoublep (Member) left a comment:

LGTM - thanks

@tdoublep tdoublep enabled auto-merge (squash) September 16, 2025 08:01
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 16, 2025
@tdoublep tdoublep merged commit 27fcfe7 into vllm-project:main Sep 16, 2025
42 checks passed
@tomeras91 tomeras91 deleted the fix-mamba2-quant-tp branch September 16, 2025 11:27
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
…oups % tp_size == 0` (vllm-project#24593)

Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: tomeras91 <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…oups % tp_size == 0` (vllm-project#24593)

Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: tomeras91 <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: xuebwang-amd <[email protected]>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
…oups % tp_size == 0` (vllm-project#24593)

Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: tomeras91 <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…oups % tp_size == 0` (vllm-project#24593)

Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: tomeras91 <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: xuebwang-amd <[email protected]>