[Mamba] Support TP>1 with quantization for mamba2 mixer in case n_groups % tp_size == 0
#24593
Conversation
…groups % tp_size == 0 Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
CC @fabianlim
Code Review
This pull request introduces a significant enhancement by enabling quantization for the Mamba2 mixer with tensor parallelism when n_groups is divisible by tp_size. This is achieved by adding a new code path using MergedColumnParallelLinear. The changes are well-structured. However, I've identified a critical logic error in the condition that selects between the new and old code paths, which would lead to incorrect behavior for the n_groups=1 case with TP>1. A fix is suggested to ensure the correct path is chosen based on the original model configuration.
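The gist of the suggested fix, as a hedged sketch (`config` and the flag name here are illustrative, not the PR's actual identifiers):

```python
# Decide the code path from the original model config's n_groups, not
# from a value already adjusted for tensor parallelism; otherwise the
# n_groups == 1 case with TP > 1 can be routed to the wrong path.
use_merged_linear = config.n_groups % tp_size == 0
```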
…always divisible by tp_size Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: tomeras91 <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
```python
self.intermediate_size // self.tp_size,
groups_time_state_size // self.tp_size,
groups_time_state_size // self.tp_size,
self.groups_ssm_state_size // self.tp_size,
```
oic was this a bug?
no.. this is just some renaming since I now create `self.groups_ssm_state_size = self.n_groups * self.ssm_state_size` during init
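For reference, a sketch of the attribute described above (only the assignment itself is taken from the comment; its placement is assumed):

```python
# In the mixer's __init__: precompute n_groups * ssm_state_size once,
# replacing the former local variable groups_time_state_size.
self.groups_ssm_state_size = self.n_groups * self.ssm_state_size
```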
fabianlim
left a comment
LGTM
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Tomer Asida <[email protected]>
tdoublep
left a comment
LGTM - thanks
Purpose
PR #13660 built on #10909 and fixed TP loading issues of mamba2 mixer weights. Concretely, it dealt with the duplication of weights needed in cases where `n_groups % tp_size != 0`. This logic is a bit complicated and required creating custom weight loaders, which means quantized layers cannot be used, since their weight-loading logic can't be overridden. Following that, PR #14617 introduced a special assertion to make sure the mamba2 mixer isn't run with quantization and TP>1.

Yet, due to the complexity/impact tradeoff, PR #13660 didn't support all values of `n_groups % tp_size`, but rather only 2 cases: `n_groups % tp_size == 0` and `n_groups == 1`. The custom weight-loading logic is needed only for the latter case. In the former, no weight duplication is needed and the `MergedColumnParallelLinear` class can be used.

This PR splits the weight creation of the mamba2 mixer into two code paths (see the sketch below):
- custom weight loaders if `n_groups % tp_size != 0` (i.e. `n_groups == 1`)
- `MergedColumnParallelLinear` if `n_groups % tp_size == 0`

This enables quantization of the mamba2 mixer block in case `n_groups % tp_size == 0`, which is the more common case.
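A rough sketch of the split (illustrative only: the parameter names and exact output layout here are assumptions, not the code in this PR):

```python
from vllm.model_executor.layers.linear import (ColumnParallelLinear,
                                               MergedColumnParallelLinear)

groups_ssm_state_size = n_groups * ssm_state_size

if n_groups % tp_size == 0:
    # New path: no weight duplication across TP ranks, so the stock
    # MergedColumnParallelLinear weight loaders work and the layer can
    # be quantized via quant_config.
    in_proj = MergedColumnParallelLinear(
        hidden_size,
        [intermediate_size,         # gate
         intermediate_size,         # hidden states
         groups_ssm_state_size,     # B
         groups_ssm_state_size,     # C
         num_heads],                # dt
        bias=use_bias,
        quant_config=quant_config,
    )
else:
    # Old path (n_groups == 1): group weights must be duplicated per
    # rank, which requires custom weight loaders, so quantization is
    # not supported here.
    in_proj = ColumnParallelLinear(
        hidden_size,
        2 * intermediate_size + 2 * groups_ssm_state_size + num_heads,
        bias=use_bias,
        quant_config=None,
    )
    # ...custom per-parameter weight loaders attached here...
```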
Test Plan
Test Result
main:
PR:
Results are identical.