
[Feature Request] fast inference for LFM (and Mamba models) #4073

@gaztrabisme


Bug Description

When using FastLanguageModel.from_pretrained() with fast_inference=True on an LFM2.5 model (LiquidAI/LFM2.5-1.2B-Thinking, architecture Lfm2ForCausalLM), the model loads into vLLM successfully but crashes during state dict extraction.

Error

File "unsloth_zoo/vllm_utils.py", line 1122, in _get_vllm_state_dict
    get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)
                       ^^^^^^
UnboundLocalError: cannot access local variable 'prefix' where it is not associated with a value

Root Cause

In _get_vllm_state_dict, the layer-iteration loop only assigns prefix inside the if hasattr(layer, "self_attn") and elif hasattr(layer, "cross_attn") branches, but the get_state_dict(f"{prefix}.o_proj", ...) call sits at the loop-body level, outside both branches.

LFM2/Mamba layers use mixer (or similar) instead of self_attn/cross_attn, so neither branch executes and prefix is never assigned.

for kk in range(len(vllm_text_model.layers)):
    layer = vllm_text_model.layers[kk]
    if hasattr(layer, "self_attn"):
        prefix = f"..."  # set here
        # ...
    elif hasattr(layer, "cross_attn"):
        prefix = f"..."  # set here
        # ...
    # Mamba layers fall through — prefix never set
    get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)  # CRASH
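The failure mode can be reproduced standalone, without vLLM or Unsloth. The stand-in layer classes below are hypothetical, but the control flow mirrors the loop above: prefix is only bound in the attention branches, so a model whose first layer is a Mamba-style mixer hits the same UnboundLocalError.

```python
class AttnLayer:
    """Stand-in for a transformer layer (has self_attn)."""
    def __init__(self):
        self.self_attn = object()

class MambaLayer:
    """Stand-in for an LFM2/Mamba layer (has mixer, no self_attn/cross_attn)."""
    def __init__(self):
        self.mixer = object()

def extract_prefixes(layers):
    prefixes = []
    for kk, layer in enumerate(layers):
        if hasattr(layer, "self_attn"):
            prefix = f"model.layers.{kk}.self_attn"
        elif hasattr(layer, "cross_attn"):
            prefix = f"model.layers.{kk}.cross_attn"
        # Mamba layers fall through: `prefix` is never bound this iteration
        prefixes.append(f"{prefix}.o_proj")
    return prefixes

# Attention-only stacks work fine:
extract_prefixes([AttnLayer(), AttnLayer()])

# A stack starting with a Mamba-style layer crashes as in the traceback:
try:
    extract_prefixes([MambaLayer()])
except UnboundLocalError as e:
    print(type(e).__name__)
```

Note a second hazard hiding in the same pattern: if a Mamba layer comes after an attention layer, prefix silently keeps its stale value from the previous iteration instead of crashing, which would extract the wrong weights.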

Environment

  • Unsloth: 2026.2.1
  • vLLM: 0.15.1
  • PyTorch: 2.9.1+cu128
  • CUDA: 12.8
  • GPU: NVIDIA GeForce RTX 5080 (Blackwell, sm_120a)
  • Model: LiquidAI/LFM2.5-1.2B-Thinking (Lfm2ForCausalLM)

Steps to Reproduce

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="LiquidAI/LFM2.5-1.2B-Thinking",
    max_seq_length=4096,
    load_in_4bit=False,
    fast_inference=True,
)

Notes

  • vLLM itself handles LFM2 fine — model loads as Lfm2ForCausalLM, CUDA graphs are captured, KV cache is allocated. The crash is only in Unsloth's _get_vllm_state_dict wrapper.
  • fast_inference=False works as expected (bypasses vLLM entirely).
  • There is no FastLfm2Model class in Unsloth — LFM2 falls through to the generic FastModel/FastBaseModel path, which does attempt vLLM initialization.

Suggested Fix

Add handling for Mamba/SSM layers in the loop — either skip them with continue or add an elif hasattr(layer, "mixer") branch that extracts the correct state dict for Mamba layers.
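One possible shape of the skip variant, as a runnable sketch against hypothetical stand-in layer classes (the real loop lives in unsloth_zoo/vllm_utils.py inside _get_vllm_state_dict; collect_attn_prefixes and the class names here are illustrative, not actual Unsloth APIs):

```python
class AttnLayer:
    """Stand-in for a transformer layer (has self_attn)."""
    def __init__(self):
        self.self_attn = object()

class MixerLayer:
    """Stand-in for an LFM2/Mamba layer (has mixer)."""
    def __init__(self):
        self.mixer = object()

def collect_attn_prefixes(layers):
    """Only attention layers contribute an o_proj prefix; Mamba/SSM layers
    (and any other unrecognized layer type) are skipped instead of
    dereferencing an unbound or stale `prefix`."""
    out = []
    for kk, layer in enumerate(layers):
        if hasattr(layer, "self_attn"):
            prefix = f"model.layers.{kk}.self_attn"
        elif hasattr(layer, "cross_attn"):
            prefix = f"model.layers.{kk}.cross_attn"
        else:
            # Mamba/SSM layer: skip the attention-specific extraction.
            # A fuller fix would branch on hasattr(layer, "mixer") and
            # extract the SSM parameters under their own prefix.
            continue
        out.append(f"{prefix}.o_proj")
    return out

print(collect_attn_prefixes([MixerLayer(), AttnLayer(), MixerLayer()]))
# -> ['model.layers.1.self_attn.o_proj']
```

The continue in the else branch also closes the stale-prefix hole: a Mamba layer following an attention layer can no longer silently reuse the previous iteration's prefix.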
