fix: handle LFM2/Mamba hybrid layers in _get_vllm_state_dict for fast_inference#504

Open
devchilll wants to merge 8 commits into unslothai:main from devchilll:fix/lfm-mamba-fast-inference

Conversation

@devchilll commented Feb 18, 2026

Summary

Fixes UnboundLocalError in _get_vllm_state_dict when using fast_inference=True with LFM2/Mamba hybrid models (e.g., LiquidAI/LFM2.5-1.2B-Thinking).

Problem

When loading LFM2 models with fast_inference=True, _get_vllm_state_dict crashes with:

UnboundLocalError: cannot access local variable 'prefix'

This happens because the layer-iteration loop only handles the self_attn and cross_attn layer types. LFM2 is a hybrid architecture with short-convolution layers that have neither attribute, so prefix is never assigned. Additionally, LFM2 uses different naming conventions: out_proj instead of o_proj, feed_forward.w1/w2/w3 instead of mlp.gate_proj/up_proj/down_proj, and embedding_norm instead of norm.
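A minimal sketch (illustrative only, not Unsloth's actual code) of how this branch structure leaves `prefix` unassigned for a layer that is neither attention type:

```python
class ShortConvLayer:          # stands in for an LFM2 short-convolution layer
    pass

class AttnLayer:               # stands in for a standard decoder layer
    self_attn = object()

def layer_prefix(layer):
    if hasattr(layer, "self_attn"):
        prefix = "self_attn"
    elif hasattr(layer, "cross_attn"):
        prefix = "cross_attn"
    # No fallback branch: a short-conv layer falls through without
    # assigning `prefix`, so the return statement raises UnboundLocalError
    return prefix

print(layer_prefix(AttnLayer()))       # "self_attn"
try:
    layer_prefix(ShortConvLayer())
except UnboundLocalError as e:
    print("UnboundLocalError:", e)
```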

Fix

  • Added short_conv branch to extract conv in_proj, out_proj, and conv weights from non-attention layers
  • Moved o_proj/out_proj extraction inside each attention branch so conv layers don't crash
  • Added conditional MLP extraction: standard mlp path vs LFM2's feed_forward path with w1/w2/w3
  • Handle embedding_norm (LFM2) alongside norm (standard) for final model norm
  • Added LFM2 layer templates to get_model_layer_config for HF model reconstruction
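The dispatch described above can be sketched roughly as follows. This is a hedged illustration, not the real extraction code: the attribute names (`self_attn`, `cross_attn`, `short_conv`) and the LFM2 name mapping follow this PR description and the changelog (w1 ≈ gate, w3 ≈ up, w2 ≈ down); the actual logic lives in `_get_vllm_state_dict`.

```python
def classify_layer(layer):
    """Return which extraction branch a decoder layer should take."""
    if hasattr(layer, "self_attn"):
        return "self_attn"
    if hasattr(layer, "cross_attn"):
        return "cross_attn"
    if hasattr(layer, "short_conv"):        # LFM2 hybrid conv layer
        return "short_conv"
    raise ValueError(f"Unhandled layer type: {type(layer).__name__}")

# Illustrative summary of the LFM2 naming differences (standard -> LFM2)
LFM2_NAMES = {
    "o_proj":        "out_proj",
    "mlp.gate_proj": "feed_forward.w1",
    "mlp.down_proj": "feed_forward.w2",
    "mlp.up_proj":   "feed_forward.w3",
    "norm":          "embedding_norm",
}
```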

Changes

  • vllm_utils.py: _get_vllm_state_dict
  • empty_model.py: get_model_layer_config and set_additional_modules

Related: unslothai/unsloth#4073

CC @danielhanchen @Datta0

@gemini-code-assist (Contributor)

Summary of Changes

Hello @devchilll, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug that prevented LFM2/Mamba hybrid models from being used with fast_inference due to an UnboundLocalError. It significantly expands the model state dictionary extraction and reconstruction logic to fully support the unique architectural components of these hybrid models, ensuring compatibility and correct weight loading for efficient inference. The changes enable broader support for diverse model architectures within the framework.

Highlights

  • Bug Fix: UnboundLocalError: Resolved an UnboundLocalError in _get_vllm_state_dict that occurred when processing LFM2/Mamba hybrid models with fast_inference=True, due to unhandled 'short_conv' layers.
  • LFM2/Mamba Hybrid Layer Support: Extended _get_vllm_state_dict to correctly extract weights from 'short_conv' layers, including 'in_proj', 'out_proj', and 'conv' components, crucial for LFM2/Mamba hybrid models.
  • Flexible Attention Output Projection: Updated attention layer weight extraction to support 'out_proj' in addition to 'o_proj', accommodating models like LFM2 that use 'out_proj'.
  • LFM2 Feed-Forward MLP Handling: Implemented support for LFM2's 'feed_forward' MLP structure, correctly extracting 'w1', 'w2', and 'w3' weights.
  • Enhanced Model Norm Handling: Improved model norm handling to recognize and process 'embedding_norm' for LFM2 models during both state dict reconstruction and Hugging Face model reconstruction.
  • Updated Model Layer Configuration: Integrated LFM2-specific layer and norm templates into 'get_model_layer_config' for proper architectural recognition and weight mapping.

Changelog
  • unsloth_zoo/empty_model.py
    • Updated set_embedding to conditionally handle embedding_norm for LFM2 models in addition to the standard norm.
    • Expanded get_model_layer_config to include LFM2-specific layer names for attention (out_proj, q_layernorm, k_layernorm), convolution (in_proj, out_proj, conv), and feed-forward (w1, w3, w2) modules.
    • Added LFM2-specific norm names (operator_norm, ffn_norm, q_layernorm, k_layernorm) to get_model_layer_config.
  • unsloth_zoo/vllm_utils.py
    • Modified _get_vllm_state_dict to check for o_proj or out_proj when extracting attention output projections, accommodating LFM2's out_proj.
    • Introduced a new branch in _get_vllm_state_dict to handle short_conv layers, extracting in_proj, out_proj, and conv weights, including bias handling for conv.
    • Refactored MLP extraction to differentiate between standard mlp and LFM2's feed_forward structure, correctly extracting w1, w3, and w2 weights for the latter.
    • Updated the final model norm extraction to check for embedding_norm if norm is not present, specifically for LFM2 models.
Activity
  • No human activity (comments, reviews) has been recorded on this pull request yet.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request effectively addresses the UnboundLocalError for LFM2/Mamba hybrid models in fast inference mode by adding support for short_conv and feed_forward layers in _get_vllm_state_dict. The changes also correctly handle different naming conventions for output projections (o_proj vs out_proj) and final model normalization (norm vs embedding_norm). My review includes a few suggestions to improve code clarity and reduce duplication, such as simplifying bias retrieval logic and refactoring norm handling. I also identified a potential issue with duplicate layer definitions that could lead to redundant processing.

devchilll force-pushed the fix/lfm-mamba-fast-inference branch from fe9a2a6 to 1f2025a on February 18, 2026 02:11
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe9a2a69da

devchilll force-pushed the fix/lfm-mamba-fast-inference branch 2 times, most recently from 36fdc5e to 3f316b6 on February 18, 2026 02:25
@Datta0 (Collaborator) left a comment

Hey @devchilll thanks for your contributions. A few comments

Also, could you run this function on both the new model (LFM in this case) and older models like Llama, Qwen (and VL), and Gemma (text and vision), to ensure we're not breaking anything?

…_inference

## Summary
Fixes `UnboundLocalError` in `_get_vllm_state_dict` when using `fast_inference=True` with LFM2/Mamba hybrid models (e.g., `LiquidAI/LFM2.5-1.2B-Thinking`).

## Problem
When loading LFM2 models with `fast_inference=True`, `_get_vllm_state_dict` crashes because it only handles `self_attn` and `cross_attn` layers, failing on LFM2's `short_conv` layers. LFM2 also uses different naming conventions (`out_proj` vs `o_proj`, `embedding_norm` vs `norm`).

## Fix
- Added `_extract_short_conv_layer` helper to extract conv `in_proj`, `out_proj`, and `conv` weights
- Updated `_get_vllm_state_dict` to handle `short_conv` via helper, clean up loop logic
- Moved `o_proj`/`out_proj` extraction inside attention branches
- Added conditional MLP extraction (`mlp` vs `feed_forward`)
- Handle `embedding_norm` for final model norm
- Added LFM2 templates to `get_model_layer_config` and fixed conv weight assignment in `set_additional_modules` to preserve `nn.Conv1d`

## Changes
- `vllm_utils.py`: Added internal helper, updated state dict extraction logic
- `empty_model.py`: Added LFM2 layer templates, specific norm handling

Related: unslothai/unsloth#4073

cc @danielhanchen
devchilll force-pushed the fix/lfm-mamba-fast-inference branch from 3f316b6 to 210dd50 on February 19, 2026 06:40
@Datta0 (Collaborator) commented Feb 23, 2026

Hey @devchilll , we'd ideally appreciate if you can find a cleaner or simpler way to handle this instead of doing if else everywhere that too only for one model

@gaztrabisme

Hey @devchilll, I ran an end-to-end test of this PR on an RTX 5080. Here's what I found:

Test Environment

  • GPU: NVIDIA GeForce RTX 5080 (16GB, sm_120a)
  • Docker image: unsloth/unsloth:2026.2.1-pt2.9.0-cu12.8-moe-optimized-training
  • PyTorch: 2.9.0+cu128
  • vLLM: 0.11.2
  • Transformers: 4.57.1
  • Model: LiquidAI/LFM2.5-1.2B-Thinking

What works

The core fix is solid — the original UnboundLocalError on prefix is gone. vLLM successfully:

  • Resolves the architecture as Lfm2ForCausalLM
  • Loads weights into the model
  • Captures CUDA graphs (both piecewise and full)
  • Allocates KV cache

The unit-level code structure is also correct — o_proj extraction is properly inside the attention branch, short_conv/feed_forward/embedding_norm conditionals are all in place.

Issues found

1. LoRA registration fails (blocker with default settings)

Unsloth defaults to enable_lora=True for fast_inference. vLLM's LoRA manager tries to register the conv module but it's not a BaseLayerWithLoRA instance:

AssertionError: Module model.layers.0.conv must be a BaseLayerWithLoRA instance,

This might need handling in vllm_lora_worker_manager.py to skip conv modules, or LFM2 models might need to load with enable_lora=False until vLLM adds LoRA support for conv layers.

2. Conv in_proj weight shape mismatch (blocker for inference)

When testing with LoRA disabled, the model loads but inference fails at the first conv layer:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (6x2048 and 1x3)

This happens at Lfm2ShortConv.in_proj(x) — the reconstructed weight is shape (1, 3) instead of the expected (6144, 2048). The get_state_dict(f"{conv_prefix}.in_proj", 0, state_dict, short_conv.in_proj, slice_weights=False) call seems to not be extracting the actual weight tensor correctly. This might be a difference in how vLLM 0.11.2 implements ShortConv vs what you were coding against.
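A simple sanity check makes the mismatch concrete. Assumption (inferred from the expected shape reported above, 6144 = 3 × 2048): LFM2's conv `in_proj` maps `hidden_size` to `3 * hidden_size`; `check_in_proj_shape` is a hypothetical helper, not part of the PR.

```python
def check_in_proj_shape(weight_shape, hidden_size):
    """Validate that a conv in_proj weight looks like (3 * hidden, hidden)."""
    expected = (3 * hidden_size, hidden_size)
    if tuple(weight_shape) != expected:
        raise ValueError(f"in_proj shape {weight_shape} != expected {expected}")
    return True

check_in_proj_shape((6144, 2048), 2048)    # the shape the model expects
try:
    check_in_proj_shape((1, 3), 2048)      # the broken extraction's output
except ValueError as e:
    print(e)
```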

3. lm_head.weight not initialized (warning)

During HF model reconstruction:

Some weights of Lfm2ForCausalLM were not initialized from the model checkpoint: ['lm_head.weight']

LFM2 likely uses tie_word_embeddings=True, so this might just need the tied embedding path to kick in properly.
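A sketch of why the warning would be benign under that assumption: with tied embeddings, the LM head aliases the input embedding matrix, so no separate `lm_head.weight` needs to come from the checkpoint. This is an illustrative stub, not the Transformers internals.

```python
class TinyLM:
    def __init__(self, vocab, hidden, tie_word_embeddings=True):
        self.embed_tokens = [[0.0] * hidden for _ in range(vocab)]
        # With tied embeddings the head shares the same matrix (no copy),
        # so a missing lm_head.weight in the checkpoint is expected.
        self.lm_head = self.embed_tokens if tie_word_embeddings else None

m = TinyLM(vocab=4, hidden=2)
print(m.lm_head is m.embed_tokens)   # True
```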

The remaining issues are around weight tensor extraction details and LoRA compatibility. Happy to re-test whenever you push updates!

@devchilll (Author)

Hey @gaztrabisme and @Datta0, thanks both so much! Will have capacity this week to take a closer look.

@devchilll (Author)

Hi, I've invited both @gaztrabisme and @Datta0 as collaborators! I am fixing the bugs @gaztrabisme found; expect a commit soon. Please feel free to pull the branch, run tests, and fix things too. Also, let me know if there's a better way to collaborate!

I'm using a GitHub Codespace as a remote instance, since it supports the right torch version for installing unsloth-zoo locally.

- Fix in_proj weight shape mismatch: extract merged weight directly from
  base layer instead of using get_state_dict which incorrectly handles
  MergedColumnParallelLinear, producing (1,3) instead of (3*hidden, hidden)
- Disable LoRA for LFM2 models: conv modules are not BaseLayerWithLoRA
  instances, causing AssertionError during LoRA registration
- lm_head.weight warning is benign: existing tie_word_embeddings path
  already handles it correctly (no code change needed)
- Update test_lfm_fix.py with LoRA exclusion unit test and improved docs
devchilll force-pushed the fix/lfm-mamba-fast-inference branch from 210dd50 to 4c6aa96 on February 24, 2026 03:22
devchilll force-pushed the fix/lfm-mamba-fast-inference branch from 813f88d to 6766c4e on February 24, 2026 04:20
- Move all LFM2-specific logic into _extract_lfm2_layer() function
- Use outer if/else dispatch by model_type instead of per-layer checks
- Standard model loop is now completely free of LFM2 conditionals
- Also fixes o_proj/mlp being outside self_attn/cross_attn guards in main
devchilll force-pushed the fix/lfm-mamba-fast-inference branch from 6766c4e to ef77e45 on February 24, 2026 04:25
@devchilll (Author) left a comment

Refactor

Moved all LFM2-specific logic out of the main _get_vllm_state_dict loop into a dedicated _extract_lfm2_layer function. The main loop now uses an outer if model_type == "lfm2" / else dispatch so:

  • LFM2 gets its own clean loop calling _extract_lfm2_layer
  • The standard model loop body is unchanged from main (just re-indented)
  • Zero elif hasattr(...) branches for LFM2 polluting the standard path

Easy to add other hybrid architectures (e.g. Mamba, RWKV) the same way later

Also extracted _extract_short_conv_layer as a helper for the conv layer weight extraction.

Bug Fixes

  • Conv in_proj weight shape mismatch
  • LoRA registration fails on conv modules
  • lm_head.weight not initialized warning - No code change needed.

CC @Datta0 @gaztrabisme

1. Disable LoRA in load_vllm() for LFM2 — the exclusion was only in
   _test_get_vllm_state_dict but not in the actual load_vllm() called
   by FastLanguageModel.from_pretrained. Conv modules are not
   BaseLayerWithLoRA, causing AssertionError during LoRA registration.

2. Fix attribute name: "short_conv" → "conv" in _extract_lfm2_layer —
   vLLM's Lfm2ShortConvDecoderLayer stores the ShortConv module as
   self.conv, not self.short_conv. The hasattr check was silently
   skipping all conv layers, leaving weights unextracted.

3. Unwrap ModelWeightParameter in set_additional_modules — vLLM 0.11.x
   wraps weights in ModelWeightParameter (tensor subclass) whose
   detach() returns plain Tensor, breaking nn.Parameter(). Use
   getattr(w, 'data', w) to unwrap, matching the existing pattern
   in convert_vllm_to_huggingface (line 1498).

Tested end-to-end: LFM2.5-1.2B-Thinking loads with fast_inference=True
and generates text correctly on RTX 5080.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gaztrabisme

Hey @devchilll — pulled your latest refactor, ran it through GPU integration testing on RTX 5080 (vLLM 0.11.2, torch 2.9.0+cu128). The refactor structure is clean 👍 Found and fixed 3 runtime bugs in f93888a:

1. LoRA not disabled in load_vllm()
The enable_lora = model_type not in ("mllama", "lfm2") exclusion was only in _test_get_vllm_state_dict, not in the actual load_vllm() that FastLanguageModel.from_pretrained calls. Added the check there too.

2. Wrong attribute name: short_convconv
_extract_lfm2_layer checks hasattr(layer, "short_conv") but vLLM's Lfm2ShortConvDecoderLayer stores the module as self.conv, not self.short_conv. This silently skipped all conv layer extraction, leaving weights uninitialized. Fixed to hasattr(layer, "conv") / layer.conv.

3. ModelWeightParameter unwrap in set_additional_modules
vLLM 0.11.x wraps weights in ModelWeightParameter (tensor subclass) whose detach() returns plain Tensor, breaking nn.Parameter(). Added getattr(w, 'data', w) to unwrap — same pattern already used in convert_vllm_to_huggingface at line 1498.
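The unwrap pattern in pure Python, for readers following along without a GPU. `WrappedWeight` is a stub standing in for vLLM's `ModelWeightParameter`; only the relevant behavior (exposing the raw tensor via `.data`) is mimicked.

```python
class WrappedWeight:
    """Stand-in for a wrapping tensor subclass: raw tensor lives in .data."""
    def __init__(self, data):
        self.data = data

def unwrap(w):
    # Wrapped weights yield .data; plain objects pass through unchanged.
    # (A real torch.Tensor also has .data, which is equally safe here.)
    return getattr(w, "data", w)

print(unwrap(WrappedWeight([1.0, 2.0])))  # [1.0, 2.0]
print(unwrap([3.0]))                      # [3.0]
```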

All three only surface with an actual GPU running the real from_pretrained path (not the test harness). After fixes, LFM2.5-1.2B-Thinking loads with fast_inference=True and generates correctly:

Generated: Hello, world! Ok, okay, world. It seems like you're just starting out. I'm here to assist...

CC @Datta0

@devchilll (Author) commented Feb 24, 2026

Thanks for finding and fixing the bugs, @gaztrabisme! Could you submit your commit for review?

FYI @Datta0 @danielhanchen, the PR is ready for review. Thanks!

@gaztrabisme commented Feb 24, 2026

@devchilll Already pushed — commit f93888a is on the branch. Ready for review! 🙌


# LFM2 conv.conv weights — must be assigned directly to preserve nn.Conv1d module type
# (standard_layers uses nn.Linear for all entries, which would break Conv1d)
text_config_tmp = getattr(config, "text_config", config)
Collaborator:

I'd recommend having a function like _extract_lfm2_layer to deal with these differences

Reply:

On it, will push a commit shortly.

if use_fused_qkv:
# For some model types like phi3 vllm will expect fused qkv (e.g. Phi3, Phi3.5-mini-instruct, Phi4-mini-instruct)
if model_type == "lfm2":
for kk in range(len(vllm_text_model.layers)):
Collaborator:

I'd say

for kk ...:
  if lfm: extract_lfm; return
  rest of the code with no `else`

is perhaps better?

Reply:

Makes sense, will switch to early-return style.

Author:

@Datta0 No, you can't do an early return right there, because there's common code after the for kk loop. The latest commit's pattern of "for kk: if lfm2, extract the LFM2 layer, continue" does make sense in terms of extracting layers correctly.
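The loop shape being discussed, sketched abstractly with stand-in functions (not the real extraction code): the LFM2 branch uses `continue` rather than `return`, so the common code after the loop still runs.

```python
def extract_all(layers, model_type):
    results = []
    for kk, layer in enumerate(layers):
        if model_type == "lfm2":
            results.append(("lfm2", kk))   # _extract_lfm2_layer stand-in
            continue                       # skip the standard body only
        results.append(("standard", kk))   # unchanged standard path
    results.append("common_tail")          # code after the loop still runs
    return results

print(extract_all([object(), object()], "lfm2"))
```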


# LFM2 conv modules are not BaseLayerWithLoRA — disable LoRA to avoid
# AssertionError during LoRA registration in vLLM's create_lora_manager.
if getattr(config, "model_type", "") == "lfm2":
Collaborator:

Wait on, if we disable LoRA whats the point of this anymore?

Reply:

Sorry, I let Claude Code run amok on that fix. Looked into it properly — vLLM's Lfm2ForCausalLM doesn't define supported_lora_modules at all, so LoRA isn't supported for LFM2 in vLLM yet. enable_lora = False just avoids the crash. We could open a PR on vLLM to add LoRA support for LFM2 — @devchilll want to work on that together?

Reply:

Update: turns out vLLM already merged LoRA support for LFM2 (vllm-project/vllm#34921, merged Feb 20). It's not in any release yet (our 0.11.2 doesn't have it), but once Unsloth upgrades vLLM we can drop the enable_lora = False guard. So no action needed from us — just a version bump down the line.

1. vllm_utils.py: switch to early-return style — single loop with
   `if lfm2: extract; continue` instead of outer if/else with two loops
2. empty_model.py: extract LFM2-specific logic (embedding_norm, conv.conv
   weights) into _set_lfm2_modules() helper function

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3 participants