fix: handle LFM2/Mamba hybrid layers in _get_vllm_state_dict for fast_inference#504

Open
devchilll wants to merge 8 commits into unslothai:main from devchilll:fix/lfm-mamba-fast-inference

Conversation

@devchilll commented Feb 18, 2026

Summary

Fixes UnboundLocalError in _get_vllm_state_dict when using fast_inference=True with LFM2/Mamba hybrid models (e.g., LiquidAI/LFM2.5-1.2B-Thinking).

Problem

When loading LFM2 models with fast_inference=True, _get_vllm_state_dict crashes with:

UnboundLocalError: cannot access local variable 'prefix'

This happens because the layer-iteration loop only handles the self_attn and cross_attn layer types. LFM2 is a hybrid architecture with short-convolution layers that have neither attribute, so prefix is never assigned. Additionally, LFM2 uses different naming conventions: out_proj instead of o_proj, feed_forward.w1/w2/w3 instead of mlp.gate_proj/up_proj/down_proj, and embedding_norm instead of norm.
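A minimal sketch (illustrative only, not Unsloth's actual code) of how this branch structure leaves `prefix` unassigned for a layer that is neither attention type:

```python
class ShortConvLayer:          # stands in for an LFM2 short-convolution layer
    pass

class AttnLayer:               # stands in for a standard decoder layer
    self_attn = object()

def layer_prefix(layer):
    if hasattr(layer, "self_attn"):
        prefix = "self_attn"
    elif hasattr(layer, "cross_attn"):
        prefix = "cross_attn"
    # No fallback branch: a short-conv layer falls through without
    # assigning `prefix`, so the return statement raises UnboundLocalError
    return prefix

print(layer_prefix(AttnLayer()))       # "self_attn"
try:
    layer_prefix(ShortConvLayer())
except UnboundLocalError as e:
    print("UnboundLocalError:", e)
```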

Fix

  • Added short_conv branch to extract conv in_proj, out_proj, and conv weights from non-attention layers
  • Moved o_proj/out_proj extraction inside each attention branch so conv layers don't crash
  • Added conditional MLP extraction: standard mlp path vs LFM2's feed_forward path with w1/w2/w3
  • Handle embedding_norm (LFM2) alongside norm (standard) for final model norm
  • Added LFM2 layer templates to get_model_layer_config for HF model reconstruction
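The dispatch described above can be sketched roughly as follows. This is a hedged illustration, not the real extraction code: the attribute names (`self_attn`, `cross_attn`, `short_conv`) and the LFM2 name mapping follow this PR description and the changelog (w1 ≈ gate, w3 ≈ up, w2 ≈ down); the actual logic lives in `_get_vllm_state_dict`.

```python
def classify_layer(layer):
    """Return which extraction branch a decoder layer should take."""
    if hasattr(layer, "self_attn"):
        return "self_attn"
    if hasattr(layer, "cross_attn"):
        return "cross_attn"
    if hasattr(layer, "short_conv"):        # LFM2 hybrid conv layer
        return "short_conv"
    raise ValueError(f"Unhandled layer type: {type(layer).__name__}")

# Illustrative summary of the LFM2 naming differences (standard -> LFM2)
LFM2_NAMES = {
    "o_proj":        "out_proj",
    "mlp.gate_proj": "feed_forward.w1",
    "mlp.down_proj": "feed_forward.w2",
    "mlp.up_proj":   "feed_forward.w3",
    "norm":          "embedding_norm",
}
```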

Changes

  • vllm_utils.py: _get_vllm_state_dict
  • empty_model.py: get_model_layer_config and set_additional_modules

Related: unslothai/unsloth#4073

CC @danielhanchen @Datta0

@gemini-code-assist (Contributor)

Summary of Changes

Hello @devchilll, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug that prevented LFM2/Mamba hybrid models from being used with fast_inference due to an UnboundLocalError. It significantly expands the model state dictionary extraction and reconstruction logic to fully support the unique architectural components of these hybrid models, ensuring compatibility and correct weight loading for efficient inference. The changes enable broader support for diverse model architectures within the framework.

Highlights

  • Bug Fix: UnboundLocalError: Resolved an UnboundLocalError in _get_vllm_state_dict that occurred when processing LFM2/Mamba hybrid models with fast_inference=True, due to unhandled 'short_conv' layers.
  • LFM2/Mamba Hybrid Layer Support: Extended _get_vllm_state_dict to correctly extract weights from 'short_conv' layers, including 'in_proj', 'out_proj', and 'conv' components, crucial for LFM2/Mamba hybrid models.
  • Flexible Attention Output Projection: Updated attention layer weight extraction to support 'out_proj' in addition to 'o_proj', accommodating models like LFM2 that use 'out_proj'.
  • LFM2 Feed-Forward MLP Handling: Implemented support for LFM2's 'feed_forward' MLP structure, correctly extracting 'w1', 'w2', and 'w3' weights.
  • Enhanced Model Norm Handling: Improved model norm handling to recognize and process 'embedding_norm' for LFM2 models during both state dict reconstruction and Hugging Face model reconstruction.
  • Updated Model Layer Configuration: Integrated LFM2-specific layer and norm templates into 'get_model_layer_config' for proper architectural recognition and weight mapping.

Changelog
  • unsloth_zoo/empty_model.py
    • Updated set_embedding to conditionally handle embedding_norm for LFM2 models in addition to the standard norm.
    • Expanded get_model_layer_config to include LFM2-specific layer names for attention (out_proj, q_layernorm, k_layernorm), convolution (in_proj, out_proj, conv), and feed-forward (w1, w3, w2) modules.
    • Added LFM2-specific norm names (operator_norm, ffn_norm, q_layernorm, k_layernorm) to get_model_layer_config.
  • unsloth_zoo/vllm_utils.py
    • Modified _get_vllm_state_dict to check for o_proj or out_proj when extracting attention output projections, accommodating LFM2's out_proj.
    • Introduced a new branch in _get_vllm_state_dict to handle short_conv layers, extracting in_proj, out_proj, and conv weights, including bias handling for conv.
    • Refactored MLP extraction to differentiate between standard mlp and LFM2's feed_forward structure, correctly extracting w1, w3, and w2 weights for the latter.
    • Updated the final model norm extraction to check for embedding_norm if norm is not present, specifically for LFM2 models.
Activity
  • No human activity (comments, reviews) has been recorded on this pull request yet.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request effectively addresses the UnboundLocalError for LFM2/Mamba hybrid models in fast inference mode by adding support for short_conv and feed_forward layers in _get_vllm_state_dict. The changes also correctly handle different naming conventions for output projections (o_proj vs out_proj) and final model normalization (norm vs embedding_norm). My review includes a few suggestions to improve code clarity and reduce duplication, such as simplifying bias retrieval logic and refactoring norm handling. I also identified a potential issue with duplicate layer definitions that could lead to redundant processing.

devchilll force-pushed the fix/lfm-mamba-fast-inference branch from fe9a2a6 to 1f2025a on February 18, 2026 02:11
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe9a2a69da

devchilll force-pushed the fix/lfm-mamba-fast-inference branch 2 times, most recently from 36fdc5e to 3f316b6 on February 18, 2026 02:25
@Datta0 (Collaborator) left a comment

Hey @devchilll thanks for your contributions. A few comments

Also, could you run this function on both the new model (LFM in this case) and older models like Llama, Qwen (and VL), and Gemma (text and vision), to ensure we're not breaking anything?

…_inference

## Summary
Fixes `UnboundLocalError` in `_get_vllm_state_dict` when using `fast_inference=True` with LFM2/Mamba hybrid models (e.g., `LiquidAI/LFM2.5-1.2B-Thinking`).

## Problem
When loading LFM2 models with `fast_inference=True`, `_get_vllm_state_dict` crashes because it only handles `self_attn` and `cross_attn` layers, failing on LFM2's `short_conv` layers. LFM2 also uses different naming conventions (`out_proj` vs `o_proj`, `embedding_norm` vs `norm`).

## Fix
- Added `_extract_short_conv_layer` helper to extract conv `in_proj`, `out_proj`, and `conv` weights
- Updated `_get_vllm_state_dict` to handle `short_conv` via helper, clean up loop logic
- Moved `o_proj`/`out_proj` extraction inside attention branches
- Added conditional MLP extraction (`mlp` vs `feed_forward`)
- Handle `embedding_norm` for final model norm
- Added LFM2 templates to `get_model_layer_config` and fixed conv weight assignment in `set_additional_modules` to preserve `nn.Conv1d`

## Changes
- `vllm_utils.py`: Added internal helper, updated state dict extraction logic
- `empty_model.py`: Added LFM2 layer templates, specific norm handling

Related: unslothai/unsloth#4073

cc @danielhanchen
devchilll force-pushed the fix/lfm-mamba-fast-inference branch from 3f316b6 to 210dd50 on February 19, 2026 06:40
@Datta0 (Collaborator) commented Feb 23, 2026

Hey @devchilll , we'd ideally appreciate if you can find a cleaner or simpler way to handle this instead of doing if else everywhere that too only for one model

@gaztrabisme

Hey @devchilll, I ran an end-to-end test of this PR on an RTX 5080. Here's what I found:

Test Environment

  • GPU: NVIDIA GeForce RTX 5080 (16GB, sm_120a)
  • Docker image: unsloth/unsloth:2026.2.1-pt2.9.0-cu12.8-moe-optimized-training
  • PyTorch: 2.9.0+cu128
  • vLLM: 0.11.2
  • Transformers: 4.57.1
  • Model: LiquidAI/LFM2.5-1.2B-Thinking

What works

The core fix is solid — the original UnboundLocalError on prefix is gone. vLLM successfully:

  • Resolves the architecture as Lfm2ForCausalLM
  • Loads weights into the model
  • Captures CUDA graphs (both piecewise and full)
  • Allocates KV cache

The unit-level code structure is also correct — o_proj extraction is properly inside the attention branch, short_conv/feed_forward/embedding_norm conditionals are all in place.

Issues found

1. LoRA registration fails (blocker with default settings)

Unsloth defaults to enable_lora=True for fast_inference. vLLM's LoRA manager tries to register the conv module but it's not a BaseLayerWithLoRA instance:

AssertionError: Module model.layers.0.conv must be a BaseLayerWithLoRA instance,

This might need handling in vllm_lora_worker_manager.py to skip conv modules, or LFM2 models might need to load with enable_lora=False until vLLM adds LoRA support for conv layers.

2. Conv in_proj weight shape mismatch (blocker for inference)

When testing with LoRA disabled, the model loads but inference fails at the first conv layer:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (6x2048 and 1x3)

This happens at Lfm2ShortConv.in_proj(x) — the reconstructed weight is shape (1, 3) instead of the expected (6144, 2048). The get_state_dict(f"{conv_prefix}.in_proj", 0, state_dict, short_conv.in_proj, slice_weights=False) call seems to not be extracting the actual weight tensor correctly. This might be a difference in how vLLM 0.11.2 implements ShortConv vs what you were coding against.
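A simple sanity check makes the mismatch concrete. Assumption (inferred from the expected shape reported above, 6144 = 3 × 2048): LFM2's conv `in_proj` maps `hidden_size` to `3 * hidden_size`; `check_in_proj_shape` is a hypothetical helper, not part of the PR.

```python
def check_in_proj_shape(weight_shape, hidden_size):
    """Validate that a conv in_proj weight looks like (3 * hidden, hidden)."""
    expected = (3 * hidden_size, hidden_size)
    if tuple(weight_shape) != expected:
        raise ValueError(f"in_proj shape {weight_shape} != expected {expected}")
    return True

check_in_proj_shape((6144, 2048), 2048)    # the shape the model expects
try:
    check_in_proj_shape((1, 3), 2048)      # the broken extraction's output
except ValueError as e:
    print(e)
```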

3. lm_head.weight not initialized (warning)

During HF model reconstruction:

Some weights of Lfm2ForCausalLM were not initialized from the model checkpoint: ['lm_head.weight']

LFM2 likely uses tie_word_embeddings=True, so this might just need the tied embedding path to kick in properly.
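A sketch of why the warning would be benign under that assumption: with tied embeddings, the LM head aliases the input embedding matrix, so no separate `lm_head.weight` needs to come from the checkpoint. This is an illustrative stub, not the Transformers internals.

```python
class TinyLM:
    def __init__(self, vocab, hidden, tie_word_embeddings=True):
        self.embed_tokens = [[0.0] * hidden for _ in range(vocab)]
        # With tied embeddings the head shares the same matrix (no copy),
        # so a missing lm_head.weight in the checkpoint is expected.
        self.lm_head = self.embed_tokens if tie_word_embeddings else None

m = TinyLM(vocab=4, hidden=2)
print(m.lm_head is m.embed_tokens)   # True
```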

The remaining issues are around weight tensor extraction details and LoRA compatibility. Happy to re-test whenever you push updates!

@devchilll (Author)

Hey @gaztrabisme and @Datta0, thanks both so much! Will have capacity this week to take a closer look.

@devchilll (Author)

Hi, I've invited both @gaztrabisme and @Datta0 as collaborators! I am fixing the bugs @gaztrabisme found; expect a commit soon. Please feel free to pull the branch, run tests, and fix things too. Also, let me know if there's a better way to collaborate!

I'm using a GitHub Codespace as a remote instance, since it supports the right torch version for installing unsloth-zoo locally.

- Fix in_proj weight shape mismatch: extract merged weight directly from
  base layer instead of using get_state_dict which incorrectly handles
  MergedColumnParallelLinear, producing (1,3) instead of (3*hidden, hidden)
- Disable LoRA for LFM2 models: conv modules are not BaseLayerWithLoRA
  instances, causing AssertionError during LoRA registration
- lm_head.weight warning is benign: existing tie_word_embeddings path
  already handles it correctly (no code change needed)
- Update test_lfm_fix.py with LoRA exclusion unit test and improved docs
devchilll force-pushed the fix/lfm-mamba-fast-inference branch from 210dd50 to 4c6aa96 on February 24, 2026 03:22
devchilll force-pushed the fix/lfm-mamba-fast-inference branch from 813f88d to 6766c4e on February 24, 2026 04:20
- Move all LFM2-specific logic into _extract_lfm2_layer() function
- Use outer if/else dispatch by model_type instead of per-layer checks
- Standard model loop is now completely free of LFM2 conditionals
- Also fixes o_proj/mlp being outside self_attn/cross_attn guards in main
devchilll force-pushed the fix/lfm-mamba-fast-inference branch from 6766c4e to ef77e45 on February 24, 2026 04:25
@devchilll (Author) left a comment

Refactor

Moved all LFM2-specific logic out of the main _get_vllm_state_dict loop into a dedicated _extract_lfm2_layer function. The main loop now uses an outer if model_type == "lfm2" / else dispatch so:

  • LFM2 gets its own clean loop calling _extract_lfm2_layer
  • The standard model loop body is unchanged from main (just re-indented)
  • Zero elif hasattr(...) branches for LFM2 polluting the standard path

Easy to add other hybrid architectures (e.g. Mamba, RWKV) the same way later

Also extracted _extract_short_conv_layer as a helper for the conv layer weight extraction.

Bug Fixes

  • Conv in_proj weight shape mismatch
  • LoRA registration fails on conv modules
  • lm_head.weight not initialized warning - No code change needed.

CC @Datta0 @gaztrabisme

1. Disable LoRA in load_vllm() for LFM2 — the exclusion was only in
   _test_get_vllm_state_dict but not in the actual load_vllm() called
   by FastLanguageModel.from_pretrained. Conv modules are not
   BaseLayerWithLoRA, causing AssertionError during LoRA registration.

2. Fix attribute name: "short_conv" → "conv" in _extract_lfm2_layer —
   vLLM's Lfm2ShortConvDecoderLayer stores the ShortConv module as
   self.conv, not self.short_conv. The hasattr check was silently
   skipping all conv layers, leaving weights unextracted.

3. Unwrap ModelWeightParameter in set_additional_modules — vLLM 0.11.x
   wraps weights in ModelWeightParameter (tensor subclass) whose
   detach() returns plain Tensor, breaking nn.Parameter(). Use
   getattr(w, 'data', w) to unwrap, matching the existing pattern
   in convert_vllm_to_huggingface (line 1498).

Tested end-to-end: LFM2.5-1.2B-Thinking loads with fast_inference=True
and generates text correctly on RTX 5080.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gaztrabisme

Hey @devchilll — pulled your latest refactor, ran it through GPU integration testing on RTX 5080 (vLLM 0.11.2, torch 2.9.0+cu128). The refactor structure is clean 👍 Found and fixed 3 runtime bugs in f93888a:

1. LoRA not disabled in load_vllm()
The enable_lora = model_type not in ("mllama", "lfm2") exclusion was only in _test_get_vllm_state_dict, not in the actual load_vllm() that FastLanguageModel.from_pretrained calls. Added the check there too.

2. Wrong attribute name: short_convconv
_extract_lfm2_layer checks hasattr(layer, "short_conv") but vLLM's Lfm2ShortConvDecoderLayer stores the module as self.conv, not self.short_conv. This silently skipped all conv layer extraction, leaving weights uninitialized. Fixed to hasattr(layer, "conv") / layer.conv.

3. ModelWeightParameter unwrap in set_additional_modules
vLLM 0.11.x wraps weights in ModelWeightParameter (tensor subclass) whose detach() returns plain Tensor, breaking nn.Parameter(). Added getattr(w, 'data', w) to unwrap — same pattern already used in convert_vllm_to_huggingface at line 1498.
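The unwrap pattern in pure Python, for readers following along without a GPU. `WrappedWeight` is a stub standing in for vLLM's `ModelWeightParameter`; only the relevant behavior (exposing the raw tensor via `.data`) is mimicked.

```python
class WrappedWeight:
    """Stand-in for a wrapping tensor subclass: raw tensor lives in .data."""
    def __init__(self, data):
        self.data = data

def unwrap(w):
    # Wrapped weights yield .data; plain objects pass through unchanged.
    # (A real torch.Tensor also has .data, which is equally safe here.)
    return getattr(w, "data", w)

print(unwrap(WrappedWeight([1.0, 2.0])))  # [1.0, 2.0]
print(unwrap([3.0]))                      # [3.0]
```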

All three only surface with an actual GPU running the real from_pretrained path (not the test harness). After fixes, LFM2.5-1.2B-Thinking loads with fast_inference=True and generates correctly:

Generated: Hello, world! Ok, okay, world. It seems like you're just starting out. I'm here to assist...

CC @Datta0

@devchilll (Author) commented Feb 24, 2026

Thanks for finding and fixing the bugs, @gaztrabisme! Could you submit your commit for review?

FYI @Datta0 @danielhanchen, the PR is ready for review. Thanks!

@gaztrabisme commented Feb 24, 2026

@devchilll Already pushed — commit f93888a is on the branch. Ready for review! 🙌


# LFM2 conv.conv weights — must be assigned directly to preserve nn.Conv1d module type
# (standard_layers uses nn.Linear for all entries, which would break Conv1d)
text_config_tmp = getattr(config, "text_config", config)
Collaborator:

I'd recommend having a function like _extract_lfm2_layer to deal with these differences

Reply:

On it, will push a commit shortly.

if use_fused_qkv:
# For some model types like phi3 vllm will expect fused qkv (e.g. Phi3, Phi3.5-mini-instruct, Phi4-mini-instruct)
if model_type == "lfm2":
for kk in range(len(vllm_text_model.layers)):
Collaborator:

I'd say

for kk ...:
  if lfm: extract_lfm; return
  rest of the code with no `else`

is perhaps better?

Reply:

Makes sense, will switch to early-return style.

Author:

@Datta0 No, you can't do an early return right there, because there's common code after the for kk loop. The latest commit's pattern of "for kk: if lfm2, extract the LFM2 layer, continue" does make sense in terms of extracting layers correctly.
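The loop shape being discussed, sketched abstractly with stand-in functions (not the real extraction code): the LFM2 branch uses `continue` rather than `return`, so the common code after the loop still runs.

```python
def extract_all(layers, model_type):
    results = []
    for kk, layer in enumerate(layers):
        if model_type == "lfm2":
            results.append(("lfm2", kk))   # _extract_lfm2_layer stand-in
            continue                       # skip the standard body only
        results.append(("standard", kk))   # unchanged standard path
    results.append("common_tail")          # code after the loop still runs
    return results

print(extract_all([object(), object()], "lfm2"))
```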


# LFM2 conv modules are not BaseLayerWithLoRA — disable LoRA to avoid
# AssertionError during LoRA registration in vLLM's create_lora_manager.
if getattr(config, "model_type", "") == "lfm2":
Collaborator:

Wait on, if we disable LoRA whats the point of this anymore?

Reply:

Sorry, I let Claude Code run amok on that fix. Looked into it properly — vLLM's Lfm2ForCausalLM doesn't define supported_lora_modules at all, so LoRA isn't supported for LFM2 in vLLM yet. enable_lora = False just avoids the crash. We could open a PR on vLLM to add LoRA support for LFM2 — @devchilll want to work on that together?

Reply:

Update: turns out vLLM already merged LoRA support for LFM2 (vllm-project/vllm#34921, merged Feb 20). It's not in any release yet (our 0.11.2 doesn't have it), but once Unsloth upgrades vLLM we can drop the enable_lora = False guard. So no action needed from us — just a version bump down the line.

1. vllm_utils.py: switch to early-return style — single loop with
   `if lfm2: extract; continue` instead of outer if/else with two loops
2. empty_model.py: extract LFM2-specific logic (embedding_norm, conv.conv
   weights) into _set_lfm2_modules() helper function

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3 participants