[V1][Mamba] - Enable V1 by default for Mamba Models #23650
Conversation
Signed-off-by: asafg <[email protected]>
@tdoublep @heheda12345 Once we get that PR and this PR merged, we should probably enable V1 by default for Mamba models.
Code Review
This pull request aims to enable the V1 engine by default for Mamba models by removing the check that was previously falling back to the V0 engine. While the change is correct in principle, it introduces a critical issue where prefix caching is enabled by default for these models, which they do not support, leading to a crash. A fix is required to adjust the default V1 arguments for Mamba-like models.
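For context, here is a minimal sketch of the kind of adjustment the review is asking for, assuming the fix goes through the per-model `verify_and_update_config` hook that appears in the diff further down; the exact attribute names are assumptions, not the actual patch:

```python
# Hypothetical sketch (not the actual fix): disable prefix caching for
# Mamba-like models inside the config-update hook, since these models have
# no KV cache that prefix caching could reuse.
class MambaModelConfig:
    @classmethod
    def verify_and_update_config(cls, vllm_config: "VllmConfig") -> None:
        cache_config = vllm_config.cache_config
        if cache_config.enable_prefix_caching:
            # Prefix caching is unsupported for pure SSM layers; turn it off.
            cache_config.enable_prefix_caching = False
```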
Signed-off-by: asafg <[email protected]>
…m into default_mamba_v1_support
This change will enable V1 by default for all models that use mamba1, mamba2, minimax linear attention and short conv layers.
For mamba2, we definitely don't want to do this until we first enable cudagraph_mode=FULL_AND_PIECEWISE as the default, because otherwise the performance drop from V0 to V1 is very large.
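Until that default changes, a rough sketch of how a user could opt into the full CUDA-graph mode explicitly is shown below; the `cudagraph_mode` key is taken from the comment above, and both the config spelling and the example checkpoint are assumptions:

```python
from vllm import LLM

# Explicitly request FULL_AND_PIECEWISE CUDA graphs for a Mamba-2 style model
# instead of relying on the current piecewise default. The checkpoint name is
# only an illustrative placeholder.
llm = LLM(
    model="ibm-ai-platform/Bamba-9B",
    compilation_config={"cudagraph_mode": "FULL_AND_PIECEWISE"},
)
print(llm.generate("State-space models differ from Transformers because")[0].outputs[0].text)
```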
Let's also merge this tiny one first (otherwise users will get a crash using default
@tdoublep what about merging the two PRs in parallel? There's no merge conflict between them.
LGTM
@heheda12345 We can. If this one goes first, there will be a short time in between when vLLM crashes by default for these models. But it's OK - both PRs can be merged in the next few hours.
@tdoublep I need to set
Signed-off-by: asafg <[email protected]>
Head branch was pushed to by a user without write access
Signed-off-by: asafg <[email protected]>
I'll wait for this PR to be merged first, as my tests kind of depend on it.
Head branch was pushed to by a user without write access
Signed-off-by: asafg <[email protected]>
@@ -417,4 +417,5 @@ def verify_and_update_config(cls, vllm_config: "VllmConfig") -> None:
     "GptOssForCausalLM": GptOssForCausalLMConfig,
     "MambaForCausalLM": MambaModelConfig,
     "Mamba2ForCausalLM": MambaModelConfig,
+    "FalconMambaForCausalLM": MambaModelConfig,
@tdoublep Fine for this PR, but I think this line makes vLLM not that pluggable for new models.
@tdoublep Can you create a new PR to update the docs with the current status and a simple guideline for contributing new Mamba models?
Signed-off-by: asafg <[email protected]>
Head branch was pushed to by a user without write access
Signed-off-by: asafg <[email protected]>
@heheda12345 @tdoublep I fixed some tests. Some tests are, I believe, only suited for V0, like
@Josephasafg Thanks - the tests are passing now. When we remove the V0 code, we can either adapt those 2 tests to V1 or drop them if they no longer make sense.
@heheda12345 Sure, I will do that. Are there any example guidelines for contributing other models that I could use as a reference?
Sounds good.
Signed-off-by: asafg <[email protected]>
Purpose
This PR enables V1 by default for Mamba models so they won't fall back to V0. It needs this PR and this PR (which enables full CUDA graph support by default) to be merged first, so users won't have to specify VLLM_ATTENTION_BACKEND=FLASHINFER when they start vLLM.
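As a concrete illustration of the intended user-facing effect (the checkpoint name is only an example and the exact behavior is an assumption, not something verified in this PR):

```python
from vllm import LLM

# With this PR (and the prerequisite full-CUDA-graph PR) merged, a Mamba model
# is expected to come up on the V1 engine without exporting
# VLLM_ATTENTION_BACKEND=FLASHINFER first. "state-spaces/mamba-130m-hf" is
# just an example Mamba checkpoint.
llm = LLM(model="state-spaces/mamba-130m-hf")
print(llm.generate("Selective state-space models")[0].outputs[0].text)
```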
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.