fix: handle LFM2/Mamba hybrid layers in _get_vllm_state_dict for fast_inference #504
devchilll wants to merge 8 commits into unslothai:main
Conversation
Code Review
This pull request effectively addresses the `UnboundLocalError` for LFM2/Mamba hybrid models in fast inference mode by adding support for `short_conv` and `feed_forward` layers in `_get_vllm_state_dict`. The changes also correctly handle different naming conventions for output projections (`o_proj` vs `out_proj`) and final model normalization (`norm` vs `embedding_norm`). My review includes a few suggestions to improve code clarity and reduce duplication, such as simplifying bias retrieval logic and refactoring norm handling. I also identified a potential issue with duplicate layer definitions that could lead to redundant processing.
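To make the failure mode concrete, here is a minimal, hypothetical sketch of the kind of loop that triggers the bug (simplified stand-ins, not the actual Unsloth code): `prefix` is only bound inside the attention branches, so an LFM2-style conv layer falls through and the later use of `prefix` raises.

```python
# Hypothetical, simplified version of the buggy layer loop: `prefix` is
# only assigned inside the attention branches.
class AttnLayer:
    self_attn = object()          # stand-in attention module

class ShortConvLayer:             # LFM2-style layer: no self_attn / cross_attn
    short_conv = object()

def layer_prefix(layer):
    if hasattr(layer, "self_attn"):
        prefix = "self_attn"
    elif hasattr(layer, "cross_attn"):
        prefix = "cross_attn"
    # No branch for short_conv layers, so `prefix` may never be assigned.
    return prefix

print(layer_prefix(AttnLayer()))      # works: "self_attn"
try:
    layer_prefix(ShortConvLayer())
except UnboundLocalError as e:
    print("crashes:", e)
```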
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fe9a2a69da
Hey @devchilll, thanks for your contributions. A few comments:

Also, we'd like you to run this function on both the new model (LFM in this case) and older models like Llama, Qwen (and VL), and Gemma (text and vision) to ensure that we are not breaking anything.
…_inference

## Summary
Fixes `UnboundLocalError` in `_get_vllm_state_dict` when using `fast_inference=True` with LFM2/Mamba hybrid models (e.g., `LiquidAI/LFM2.5-1.2B-Thinking`).

## Problem
When loading LFM2 models with `fast_inference=True`, `_get_vllm_state_dict` crashes because it only handles `self_attn` and `cross_attn` layers, failing on LFM2's `short_conv` layers. LFM2 also uses different naming conventions (`out_proj` vs `o_proj`, `embedding_norm` vs `norm`).

## Fix
- Added `_extract_short_conv_layer` helper to extract conv `in_proj`, `out_proj`, and `conv` weights
- Updated `_get_vllm_state_dict` to handle `short_conv` via the helper, cleaned up loop logic
- Moved `o_proj`/`out_proj` extraction inside attention branches
- Added conditional MLP extraction (`mlp` vs `feed_forward`)
- Handle `embedding_norm` for final model norm
- Added LFM2 templates to `get_model_layer_config` and fixed conv weight assignment in `set_additional_modules` to preserve `nn.Conv1d`

## Changes
- `vllm_utils.py`: added internal helper, updated state dict extraction logic
- `empty_model.py`: added LFM2 layer templates, specific norm handling

Related: unslothai/unsloth#4073
cc @danielhanchen
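A hedged sketch of what a helper like `_extract_short_conv_layer` collects: the attribute names (`in_proj`, `out_proj`, `conv`) come from the PR description, but the objects below are plain stand-ins rather than real vLLM modules.

```python
from types import SimpleNamespace

def extract_short_conv_layer(layer, prefix):
    # Gather the three weights an LFM2 short-conv block carries.
    return {
        f"{prefix}.in_proj.weight":  layer.in_proj.weight,
        f"{prefix}.out_proj.weight": layer.out_proj.weight,
        f"{prefix}.conv.weight":     layer.conv.weight,
    }

# Stand-in layer with the same attribute layout.
layer = SimpleNamespace(
    in_proj=SimpleNamespace(weight="W_in"),
    out_proj=SimpleNamespace(weight="W_out"),
    conv=SimpleNamespace(weight="W_conv"),
)
state = extract_short_conv_layer(layer, "model.layers.0")
print(sorted(state))
```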
Hey @devchilll, we'd ideally appreciate it if you can find a cleaner or simpler way to handle this instead of doing
Hey @devchilll, I ran an end-to-end test of this PR on an RTX 5080. Here's what I found:

Test Environment

What works

The core fix is solid — the original

The unit-level code structure is also correct —

Issues found

1. LoRA registration fails (blocker with default settings)

Unsloth defaults to

This might need handling in

2. Conv
Hey @gaztrabisme and @Datta0 - thanks both sm! Will have capacity this week to take a closer look.
Hi, invited both @gaztrabisme and @Datta0 as collaborators! I am fixing the bugs @gaztrabisme found, expect a commit soon. Please feel free to pull the branch, run tests, and fix things too. Also lmk if there's a better way to collaborate! I am using GitHub Codespaces as a remote instance since it supports the right torch version for installing local unsloth-zoo.
- Fix `in_proj` weight shape mismatch: extract the merged weight directly from the base layer instead of using `get_state_dict`, which incorrectly handles `MergedColumnParallelLinear`, producing (1, 3) instead of (3*hidden, hidden)
- Disable LoRA for LFM2 models: conv modules are not `BaseLayerWithLoRA` instances, causing `AssertionError` during LoRA registration
- `lm_head.weight` warning is benign: the existing `tie_word_embeddings` path already handles it correctly (no code change needed)
- Update `test_lfm_fix.py` with a LoRA exclusion unit test and improved docs
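The shape expectation from the first bullet can be illustrated with plain stand-ins (hypothetical objects, not the real vLLM layer): the merged `in_proj` weight read directly off the base layer should be `(3*hidden, hidden)`, not the degenerate `(1, 3)` the old round-trip produced.

```python
from types import SimpleNamespace

hidden = 4
# Stand-in for a merged in_proj whose weight fuses three (hidden, hidden)
# projections row-wise, as the fix expects.
base_layer = SimpleNamespace(weight=SimpleNamespace(shape=(3 * hidden, hidden)))
w = base_layer.weight            # read the merged weight directly off the base layer
assert w.shape == (3 * hidden, hidden)
print(w.shape)   # (12, 4)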
- Move all LFM2-specific logic into `_extract_lfm2_layer()`
- Use outer if/else dispatch by `model_type` instead of per-layer checks
- Standard model loop is now completely free of LFM2 conditionals
- Also fixes `o_proj`/`mlp` being outside `self_attn`/`cross_attn` guards in main
Refactor
Moved all LFM2-specific logic out of the main `_get_vllm_state_dict` loop into a dedicated `_extract_lfm2_layer` function. The main loop now uses an outer `if model_type == "lfm2"` / `else` dispatch so:

- LFM2 gets its own clean loop calling `_extract_lfm2_layer`
- The standard model loop body is unchanged from main (just re-indented)
- Zero `elif hasattr(...)` branches for LFM2 polluting the standard path
- Easy to add other hybrid architectures (e.g. Mamba, RWKV) the same way later

Also extracted `_extract_short_conv_layer` as a helper for the conv layer weight extraction.
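A minimal sketch of this outer dispatch, with placeholder helpers standing in for the real `_extract_lfm2_layer` and the standard-path logic:

```python
def extract_lfm2_layer(name):
    # Placeholder for conv-style extraction.
    return {f"{name}.conv.weight": "w"}

def extract_standard_layer(name):
    # Placeholder for attention-style extraction.
    return {f"{name}.self_attn.o_proj.weight": "w"}

def get_state_dict(model_type, layer_names):
    sd = {}
    if model_type == "lfm2":           # hybrid path: its own loop
        for name in layer_names:
            sd.update(extract_lfm2_layer(name))
    else:                              # standard path: untouched by LFM2 logic
        for name in layer_names:
            sd.update(extract_standard_layer(name))
    return sd

print(get_state_dict("lfm2", ["model.layers.0"]))
```

Other hybrid architectures would slot in as additional branches of the outer dispatch without touching the standard loop body.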
Bug Fixes
- Conv `in_proj` weight shape mismatch
- LoRA registration fails on conv modules
- `lm_head.weight` not initialized warning — no code change needed
1. Disable LoRA in `load_vllm()` for LFM2 — the exclusion was only in `_test_get_vllm_state_dict` but not in the actual `load_vllm()` called by `FastLanguageModel.from_pretrained`. Conv modules are not `BaseLayerWithLoRA`, causing `AssertionError` during LoRA registration.
2. Fix attribute name: `"short_conv"` → `"conv"` in `_extract_lfm2_layer` — vLLM's `Lfm2ShortConvDecoderLayer` stores the `ShortConv` module as `self.conv`, not `self.short_conv`. The `hasattr` check was silently skipping all conv layers, leaving weights unextracted.
3. Unwrap `ModelWeightParameter` in `set_additional_modules` — vLLM 0.11.x wraps weights in `ModelWeightParameter` (a tensor subclass) whose `detach()` returns a plain `Tensor`, breaking `nn.Parameter()`. Use `getattr(w, 'data', w)` to unwrap, matching the existing pattern in `convert_vllm_to_huggingface` (line 1498).

Tested end-to-end: LFM2.5-1.2B-Thinking loads with `fast_inference=True` and generates text correctly on an RTX 5080.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
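The `getattr(w, 'data', w)` unwrap pattern from fix 3 is easy to see in a library-agnostic sketch; `WrappedWeight` below is a stand-in for a tensor subclass like vLLM's `ModelWeightParameter` that exposes the underlying tensor via `.data`.

```python
class WrappedWeight:
    """Stand-in for a wrapper (e.g. a tensor subclass) exposing .data."""
    def __init__(self, data):
        self.data = data

def unwrap(w):
    # Take .data if present, else pass the object through unchanged.
    return getattr(w, "data", w)

wrapped = WrappedWeight([1.0, 2.0])
assert unwrap(wrapped) == [1.0, 2.0]   # unwrapped to the raw value
plain = [3.0]
assert unwrap(plain) is plain          # plain objects are returned as-is
```

The same one-liner handles both wrapped and plain weights, which is why it fits a loop that sees a mix of module types.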
Hey @devchilll — pulled your latest refactor, ran it through GPU integration testing on an RTX 5080 (vLLM 0.11.2, torch 2.9.0+cu128). The refactor structure is clean 👍

Found and fixed 3 runtime bugs:

1. LoRA not disabled in `load_vllm()`
2. Wrong attribute name: `"short_conv"` → `"conv"`
3. `ModelWeightParameter` not unwrapped in `set_additional_modules`

All three only surface with an actual GPU running the real model.

CC @Datta0
Thanks for finding and fixing the bugs @gaztrabisme — could you submit your commit for review? FYI @Datta0 @danielhanchen, the PR is ready for review. Thanks!
@devchilll Already pushed — commit f93888a is on the branch. Ready for review! 🙌
unsloth_zoo/empty_model.py (outdated)

```python
# LFM2 conv.conv weights — must be assigned directly to preserve nn.Conv1d module type
# (standard_layers uses nn.Linear for all entries, which would break Conv1d)
text_config_tmp = getattr(config, "text_config", config)
```
I'd recommend having a function like `_extract_lfm2_layer` to deal with these differences.
On it, will push a commit shortly.
unsloth_zoo/vllm_utils.py (outdated)

```python
if use_fused_qkv:
    # For some model types like phi3 vllm will expect fused qkv (e.g. Phi3, Phi3.5-mini-instruct, Phi4-mini-instruct)
if model_type == "lfm2":
    for kk in range(len(vllm_text_model.layers)):
```
I'd say

```
for kk ...:
    if lfm: extract_lfm; return
rest of the code with no `else`
```

is perhaps better?
Makes sense, will switch to early-return style.
@Datta0 No, you can't do an early return right there because there's common code after the `for kk` loop. The latest commit's `for kk: if lfm, extract lfm layer, continue` pattern does make sense in terms of extracting layers correctly.
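A small sketch (hypothetical shapes, mirroring the loop under discussion) of why `continue` works where an early `return` would not: the common code after the loop still has to run.

```python
def build_state_dict(layers, model_type):
    sd = {}
    for kk, _layer in enumerate(layers):
        if model_type == "lfm2":
            sd[f"layers.{kk}.conv"] = "w"   # LFM2 extraction
            continue                        # skip the standard body for this layer only
        sd[f"layers.{kk}.self_attn"] = "w"  # standard body
    sd["lm_head"] = "w"   # common code AFTER the loop; an early `return`
                          # inside the loop would never reach this
    return sd

out = build_state_dict(["a", "b"], "lfm2")
print(sorted(out))
```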
```python
# LFM2 conv modules are not BaseLayerWithLoRA — disable LoRA to avoid
# AssertionError during LoRA registration in vLLM's create_lora_manager.
if getattr(config, "model_type", "") == "lfm2":
```
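Based on the later comments in this thread, the guard plausibly continues by flipping the LoRA flag; a hedged sketch with a stand-in for `getattr(config, "model_type", "")`:

```python
model_type = "lfm2"   # stand-in for getattr(config, "model_type", "")

enable_lora = True
if model_type == "lfm2":
    # LFM2 conv modules are not BaseLayerWithLoRA, so LoRA is switched off
    # to avoid the AssertionError during LoRA registration.
    enable_lora = False

print(enable_lora)   # False
```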
Hold on, if we disable LoRA, what's the point of this anymore?
Sorry, I let Claude Code run amok on that fix. Looked into it properly — vLLM's `Lfm2ForCausalLM` doesn't define `supported_lora_modules` at all, so LoRA isn't supported for LFM2 in vLLM yet. `enable_lora = False` just avoids the crash. We could open a PR on vLLM to add LoRA support for LFM2 — @devchilll, want to work on that together?
Update: turns out vLLM already merged LoRA support for LFM2 (vllm-project/vllm#34921, merged Feb 20). It's not in any release yet (our 0.11.2 doesn't have it), but once Unsloth upgrades vLLM we can drop the `enable_lora = False` guard. So no action needed from us — just a version bump down the line.
1. `vllm_utils.py`: switch to early-return style — a single loop with `if lfm2: extract; continue` instead of an outer if/else with two loops
2. `empty_model.py`: extract LFM2-specific logic (`embedding_norm`, `conv.conv` weights) into a `_set_lfm2_modules()` helper function

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Fixes `UnboundLocalError` in `_get_vllm_state_dict` when using `fast_inference=True` with LFM2/Mamba hybrid models, e.g., `LiquidAI/LFM2.5-1.2B-Thinking`.

Problem

When loading LFM2 models with `fast_inference=True`, `_get_vllm_state_dict` crashes with an `UnboundLocalError`. This happens because the layer iteration loop only handles `self_attn` and `cross_attn` layer types. LFM2 is a hybrid architecture with short convolution layers that have neither attribute, so `prefix` is never assigned. Additionally, LFM2 uses different naming conventions: `out_proj` instead of `o_proj`, `feed_forward.w1/w2/w3` instead of `mlp.gate_proj/up_proj/down_proj`, and `embedding_norm` instead of `norm`.

Fix
- Added `short_conv` branch to extract conv `in_proj`, `out_proj`, and `conv` weights from non-attention layers
- Moved `o_proj`/`out_proj` extraction inside each attention branch so conv layers don't crash
- Added conditional MLP extraction: `.mlp` path (standard) vs `feed_forward` path with LFM2 `w1`/`w2`/`w3`
- Handle `embedding_norm` (LFM2) alongside `norm` (standard) for final model norm
- Added LFM2 templates to `get_model_layer_config` for HF model reconstruction

Changes

- `vllm_utils.py`: `_get_vllm_state_dict`
- `empty_model.py`: `get_model_layer_config` and `set_additional_modules`
CC @danielhanchen @Datta0