Prefix cache reuse is broken for all hybrid-architecture models (sliding window, SSM/Mamba) #980

@skcadri

Description

Summary

Prompt prefix caching — the mechanism that reuses computed KV states across requests sharing a common prefix — only works for pure full-attention models. Any model using sliding window attention, Mamba/SSM layers, or mixed attention types silently falls back to full prompt recomputation on every request. This makes multi-turn conversations unusably slow for the majority of modern open-weight models.

This is not a single bug but a systemic gap in how mlx-lm handles non-standard cache types. I'm filing this as a unifying issue because the symptoms are spread across many separate reports that all trace back to the same root causes.

Empirical Evidence

Tested on Mac Studio M3 Ultra (512GB unified memory), LM Studio 0.4.6, mlx-engine. Three sequential requests with identical system prompts, different user messages, max_tokens=30:

MiniMax M2.5 (pure attention, MoE) — caching works

| Request | Time | Speedup |
|----------|--------|----------|
| 1 (cold) | 29.33s | baseline |
| 2 (warm) | 6.15s | 4.8x |
| 3 (warm) | 2.79s | 10.5x |

GPT-OSS 120B (sliding_attention + full_attention hybrid) — no caching

| Request | Time | Speedup |
|----------|-------|----------|
| 1 (cold) | 1.54s | baseline |
| 2 (warm) | 1.77s | none |
| 3 (warm) | 1.67s | none |

Qwen 3.5 9B (attention + Mamba/SSM hybrid) — no caching

| Request | Time | Speedup |
|----------|-------|----------|
| 1 (cold) | 5.02s | baseline |
| 2 (warm) | 7.76s | none (slower) |
| 3 (warm) | 8.00s | none (slower) |

MiniMax shows clear prefix reuse (29s → 3s). GPT-OSS and Qwen 3.5 show zero improvement — the full prompt is recomputed every turn.
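For reference, the speedup column above is just the cold-request time divided by each warm-request time; a small helper (hypothetical, not part of mlx-lm or LM Studio) reproduces those factors from the measured timings:

```python
def speedups(times):
    """Speedup of each warm request relative to the cold (first) request."""
    cold = times[0]
    return [round(cold / t, 1) for t in times[1:]]

# Times measured above (seconds)
assert speedups([29.33, 6.15, 2.79]) == [4.8, 10.5]  # MiniMax M2.5: caching works
assert speedups([1.54, 1.77, 1.67]) == [0.9, 0.9]    # GPT-OSS 120B: no reuse
assert speedups([5.02, 7.76, 8.00]) == [0.6, 0.6]    # Qwen 3.5 9B: slower warm
```

A factor at or below 1x on warm requests is the signature of full recomputation.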

Root Cause Analysis

There are two distinct failure modes, both stemming from the assumption that all layers use identical, trimmable KV caches:

1. Sliding window models → RotatingKVCache can't be trimmed

Models like GPT-OSS 120B and Gemma 3 27B alternate between sliding window and full attention layers:

```
# GPT-OSS 120B config.json
"layer_types": ["sliding_attention", "full_attention", "sliding_attention", "full_attention", ...]
"sliding_window": 128

# Gemma 3 27B: 5:1 pattern
5x local attention (window=1024) + 1x global attention, repeated
```

Sliding window layers use RotatingKVCache (circular buffer). When the cache wrapper attempts to trim to a common prefix for reuse, the circular buffer state can't be meaningfully trimmed — so the entire cache is erased and recomputed from scratch. See lmstudio-ai/mlx-engine#177.
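To make the constraint concrete, here is a toy model of the trim check (an illustration of the circular-buffer limitation, not mlx-lm's actual RotatingKVCache code):

```python
def can_trim(window, total_seen, prefix_len):
    """A rotating cache keeps only the last `window` tokens. Trimming back to
    `prefix_len` tokens needs the KV entries for the window ending at that
    boundary -- but everything before (total_seen - window) has already been
    overwritten by the circular buffer."""
    if total_seen <= window:
        return True              # buffer never wrapped: behaves like a plain cache
    return prefix_len >= total_seen  # once wrapped, only a no-op "trim" is possible

# GPT-OSS: window = 128. After a 4000-token prompt, trimming back to a
# 3900-token shared prefix is impossible: those states are gone.
assert can_trim(window=128, total_seen=100, prefix_len=50)        # short prompt: fine
assert not can_trim(window=128, total_seen=4000, prefix_len=3900)  # wrapped: erased
```

Once any layer's cache fails this check, the wrapper discards everything and prefills from token zero.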

2. SSM/Mamba hybrid models → non-trimmable state

Models like Qwen 3.5 (all sizes) use attention + Mamba layers:

```
# Qwen 3.5: hybrid attention + SSM
Attention layers: standard KVCache (trimmable)
Mamba/SSM layers: recurrent state (NOT trimmable)
```

The Mamba state is fundamentally different from a KV cache — it's a compressed recurrent state that can't be split at an arbitrary token boundary. Additionally, the KVCache.make_mask() interface requires window_size and return_array arguments that don't apply to SSM state, causing TypeError on multi-turn prefill (see QwenLM/Qwen3.5#37).
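A toy sketch of the difference (illustrative classes only, not mlx-lm's real cache implementations): a KV cache stores one entry per token, so dropping a suffix is trivial, while a recurrent state has folded every token into a single fixed-size value with nothing left to cut:

```python
class ToyKVCache:
    """Per-token KV entries: trimming means dropping the last n entries."""
    def __init__(self):
        self.entries = []
    def update(self, tokens):
        self.entries.extend(tokens)
    def trim(self, n):
        del self.entries[len(self.entries) - n:]  # drop last n tokens

class ToySSMState:
    """A recurrent state folds every token into one fixed-size value;
    there is no per-token structure left to cut at a prefix boundary."""
    def __init__(self):
        self.state = 0
    def update(self, tokens):
        for t in tokens:
            self.state = self.state * 2 + t  # stand-in for the SSM recurrence
    def trim(self, n):
        raise TypeError("recurrent state cannot be trimmed to a token boundary")

kv, ssm = ToyKVCache(), ToySSMState()
for c in (kv, ssm):
    c.update([1, 2, 3, 4])
kv.trim(2)                  # fine: cache now covers tokens [1, 2]
assert kv.entries == [1, 2]
try:
    ssm.trim(2)             # there is no "state after token 2" to recover
except TypeError:
    pass
```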

Affected Models

Every popular hybrid-architecture model is affected. This covers the majority of modern open-weight models:

| Model | Architecture | Cache behavior |
|-------|--------------|----------------|
| Qwen 3.5 (all sizes) | Attention + Mamba/SSM | Broken: crashes or no reuse |
| GPT-OSS 120B / 20B | Sliding + full attention | Broken: full recompute |
| Gemma 3 (all sizes) | 5:1 sliding + global | Broken: full recompute |
| Llama 4 Scout/Maverick | iRoPE chunked (8K) + NoPE | Likely broken |
| Qwen2.5-VL | Partial sliding window | Likely broken |
| MiniMax M2.5 | Pure full attention (MoE) | Works |

As of March 2026, MiniMax M2.5 appears to be the only major model where prefix caching works correctly on MLX.

Impact

This is particularly painful for agentic workloads where:

  • System prompts are large (tool definitions, personas, instructions)
  • Conversations are multi-turn (each turn should only process new tokens)
  • Multiple agents share the same model (each request recomputes from scratch)

Without prefix caching, a 40K-token context takes ~200 seconds to process vs ~5 seconds with cache reuse. For agentic frameworks running on local MLX models, this is the difference between usable and unusable.
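As a rough model of those numbers (the ~200 tokens/s prefill rate is implied by the figures above, not separately measured):

```python
def prefill_time(context_tokens, cached_tokens, tok_per_sec=200):
    """Time to process a prompt given how many leading tokens are already
    cached. tok_per_sec=200 is an illustrative prefill rate implied by the
    ~200 s figure above, not a measured constant."""
    return (context_tokens - cached_tokens) / tok_per_sec

assert prefill_time(40_000, 0) == 200.0      # no cache: full recompute every turn
assert prefill_time(40_000, 39_000) == 5.0   # only the 1000 new tokens this turn
```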

Proposed Solution

Implement per-layer cache logic instead of assuming uniform cache types:

  1. For sliding window layers: Either make RotatingKVCache trimmable to a prefix boundary, or maintain a parallel standard cache for the prefix portion that gets replayed into the rotating buffer on reuse.

  2. For SSM/Mamba layers: Use make_prompt_cache(model) (which correctly creates ArrayCache for linear attention layers) instead of uniform KVCache() allocation. The workaround in QwenLM/Qwen3.5#37 ("[Bug] Prefill Failure of Qwen3.5 Model Using KV Cache in the mlx-lm Framework") demonstrates this works for Qwen 3.5 at the code level.

  3. Cache type introspection: The cache wrapper should inspect what type of cache each layer requires and handle trim/reuse differently per type.
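Taken together, the three steps amount to making each layer's cache capabilities explicit. A toy sketch of that per-layer allocation and trim dispatch (illustrative names and fields, not mlx-lm's actual API):

```python
class KVCache:
    """Plain per-token cache: trimmable."""
    trimmable = True
    def __init__(self):
        self.tokens = []
    def trim(self, n):
        del self.tokens[len(self.tokens) - n:]

class RotatingKVCache(KVCache):
    trimmable = False  # circular buffer: see failure mode 1

class SSMCache:
    trimmable = False  # recurrent state: see failure mode 2

def make_prompt_cache(layer_types):
    """Allocate the right cache type per layer instead of uniform KVCache()."""
    table = {"full_attention": KVCache,
             "sliding_attention": RotatingKVCache,
             "mamba": SSMCache}
    return [table[t]() for t in layer_types]

def reusable_prefix(caches, common_prefix_len):
    """The reusable prefix is bounded by the least-capable layer."""
    if all(c.trimmable for c in caches):
        return common_prefix_len
    return 0  # any non-trimmable layer forces full recompute (today's behavior)

caches = make_prompt_cache(["sliding_attention", "full_attention"])
assert reusable_prefix(caches, 3900) == 0        # GPT-OSS-style hybrid: no reuse
caches = make_prompt_cache(["full_attention"] * 4)
assert reusable_prefix(caches, 3900) == 3900     # pure attention: reuse works
```

A real fix would replace the `return 0` fallback with the per-type strategies from points 1 and 2 rather than giving up on the whole stack.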

PR #923 proposes a partial fix for Qwen 3.5 specifically. The RotatingKVCache trim issue has PRs at lmstudio-ai/mlx-engine#188 and #192. These should be unified into a comprehensive solution.
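The prefix-replay idea from point 1 can be sketched as follows (illustrative only: mlx-lm's RotatingKVCache stores MLX arrays, not Python lists):

```python
def rebuild_rotating_state(prefix_kv, window):
    """Sketch of the replay idea: keep the full per-token KV for the shared
    prefix in a parallel plain cache, then reconstruct a rotating buffer's
    state at any prefix boundary by taking the last `window` entries."""
    return prefix_kv[max(0, len(prefix_kv) - window):]

prefix = list(range(10))  # stand-in for per-token KV entries of a 10-token prefix
assert rebuild_rotating_state(prefix, window=4) == [6, 7, 8, 9]
assert rebuild_rotating_state(prefix[:3], window=4) == [0, 1, 2]  # shorter than window
```

The cost is extra memory for the parallel prefix cache on sliding-window layers, in exchange for never having to recompute the shared prefix.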

Environment

  • macOS 26.2.0 (Tahoe), Mac Studio M3 Ultra, 512GB unified memory
  • LM Studio 0.4.6 (build 1)
  • Models tested: MiniMax-M2.5-MLX-8bit, gpt-oss-120b-mlx-8bit, Qwen3.5-9B (MLX)
