Prefix cache reuse is broken for all hybrid-architecture models (sliding window, SSM/Mamba)
Summary
Prompt prefix caching — the mechanism that reuses computed KV states across requests sharing a common prefix — only works for pure full-attention models. Any model using sliding window attention, Mamba/SSM layers, or mixed attention types silently falls back to full prompt recomputation on every request. This makes multi-turn conversations unusably slow for the majority of modern open-weight models.
This is not a single bug but a systemic gap in how mlx-lm handles non-standard cache types. I'm filing this as a unifying issue because the symptoms are spread across many separate reports that all trace back to the same root causes.
Empirical Evidence
Tested on Mac Studio M3 Ultra (512GB unified memory), LM Studio 0.4.6, mlx-engine. Three sequential requests with identical system prompts, different user messages, max_tokens=30:
MiniMax M2.5 (pure attention, MoE) — caching works
| Request | Time | Speedup |
|---|---|---|
| 1 (cold) | 29.33s | — |
| 2 (warm) | 6.15s | 4.8x |
| 3 (warm) | 2.79s | 10.5x |
GPT-OSS 120B (sliding_attention + full_attention hybrid) — no caching
| Request | Time | Speedup |
|---|---|---|
| 1 (cold) | 1.54s | — |
| 2 (warm) | 1.77s | none |
| 3 (warm) | 1.67s | none |
Qwen 3.5 9B (attention + Mamba/SSM hybrid) — no caching
| Request | Time | Speedup |
|---|---|---|
| 1 (cold) | 5.02s | — |
| 2 (warm) | 7.76s | none (slower) |
| 3 (warm) | 8.00s | none (slower) |
MiniMax shows clear prefix reuse (29s → 3s). GPT-OSS and Qwen 3.5 show zero improvement — the full prompt is recomputed every turn.
Root Cause Analysis
There are two distinct failure modes, both stemming from the assumption that all layers use identical, trimmable KV caches:
1. Sliding window models → RotatingKVCache can't be trimmed
Models like GPT-OSS 120B and Gemma 3 27B alternate between sliding window and full attention layers:
```
# GPT-OSS 120B config.json
"layer_types": ["sliding_attention", "full_attention", "sliding_attention", "full_attention", ...]
"sliding_window": 128
```

```
# Gemma 3 27B — 5:1 pattern
5x local attention (window=1024) + 1x global attention, repeated
```
Sliding window layers use RotatingKVCache (circular buffer). When the cache wrapper attempts to trim to a common prefix for reuse, the circular buffer state can't be meaningfully trimmed — so the entire cache is erased and recomputed from scratch. See lmstudio-ai/mlx-engine#177.
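To see why a circular buffer defeats prefix trimming, here is a minimal toy sketch. The class name, window size, and token counts are illustrative assumptions, not mlx-lm's implementation:

```python
class RotatingBuffer:
    """Toy sliding-window KV cache: keeps states for only the last `window` tokens."""

    def __init__(self, window):
        self.window = window
        self.entries = []           # token indices whose KV states are still held

    def append(self, token_index):
        self.entries.append(token_index)
        if len(self.entries) > self.window:
            self.entries.pop(0)     # the oldest state is discarded for good

    def oldest(self):
        return self.entries[0]

    def can_trim_to(self, prefix_len):
        # A cache trimmed to `prefix_len` must still hold the states for the
        # last `window` tokens of the prefix. Once later tokens rotate in,
        # those prefix states have been overwritten.
        oldest_needed = max(0, prefix_len - self.window)
        return self.oldest() <= oldest_needed


# Process a 1000-token prompt with a 128-token window.
buf = RotatingBuffer(window=128)
for t in range(1000):
    buf.append(t)

evicted_before = buf.oldest()        # 872: states for tokens 0..871 are gone
reusable = buf.can_trim_to(500)      # False: a 500-token shared prefix was evicted
```

Since the prefix states are unrecoverable, the only safe fallback for the wrapper is exactly what the issue describes: erase the whole cache and recompute.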
2. SSM/Mamba hybrid models → non-trimmable state
Models like Qwen 3.5 (all sizes) use attention + Mamba layers:
```
# Qwen 3.5 — hybrid attention + SSM
Attention layers: standard KVCache (trimmable)
Mamba/SSM layers: recurrent state (NOT trimmable)
```
The Mamba state is fundamentally different from a KV cache — it's a compressed recurrent state that can't be split at an arbitrary token boundary. Additionally, the KVCache.make_mask() interface requires window_size and return_array arguments that don't apply to SSM state, causing TypeError on multi-turn prefill (see QwenLM/Qwen3.5#37).
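The structural difference can be shown with two toy classes (the names and the decay constant are illustrative, not mlx-lm's types): a per-token KV list can be sliced at any boundary, while a recurrent state folds every token into one fixed-size value that cannot be split back apart.

```python
class ToyKVCache:
    """Per-token storage: trimming to a prefix is just list slicing."""

    def __init__(self):
        self.keys = []                  # one entry per token

    def append(self, k):
        self.keys.append(k)

    def trim(self, n):
        self.keys = self.keys[:n]       # safe at any token boundary


class ToySSMState:
    """One fixed-size state that folds in every token seen so far."""

    def __init__(self):
        self.state = 0.0
        self.tokens_seen = 0

    def update(self, x):
        # All history is mixed into a single value; the contribution of any
        # individual token cannot be separated back out afterwards.
        self.state = 0.9 * self.state + x
        self.tokens_seen += 1

    def trim(self, n):
        raise NotImplementedError("recurrent state has no token boundaries")
```

Trimming the KV cache to a shared prefix works as expected; calling the same operation on the recurrent state can only fail, which is why a uniform "trim every layer" pass breaks on hybrid models.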
Affected Models
Every popular hybrid-architecture model is affected. This covers the majority of modern open-weight models:
| Model | Architecture | Cache behavior |
|---|---|---|
| Qwen 3.5 (all sizes) | Attention + Mamba/SSM | Broken — crashes or no reuse |
| GPT-OSS 120B / 20B | Sliding + full attention | Broken — full recompute |
| Gemma 3 (all sizes) | 5:1 sliding + global | Broken — full recompute |
| Llama 4 Scout/Maverick | iRoPE chunked (8K) + NoPE | Likely broken |
| Qwen2.5-VL | Partial sliding window | Likely broken |
| MiniMax M2.5 | Pure full attention (MoE) | Works |
As of March 2026, MiniMax M2.5 appears to be the only major model where prefix caching works correctly on MLX.
Impact
This is particularly painful for agentic workloads where:
- System prompts are large (tool definitions, personas, instructions)
- Conversations are multi-turn (each turn should only process new tokens)
- Multiple agents share the same model (each request recomputes from scratch)
Without prefix caching, a 40K-token context takes ~200 seconds to process vs ~5 seconds with cache reuse. For agentic frameworks running on local MLX models, this is the difference between usable and unusable.
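The 40K-token figure follows from simple arithmetic. The ~200 tokens/s prefill throughput and the per-turn token count below are illustrative assumptions; real numbers vary by model, quantization, and hardware:

```python
# Back-of-envelope cost of one turn with and without prefix reuse.
context_tokens = 40_000    # shared system prompt + conversation history
new_turn_tokens = 1_000    # tokens added by the latest user turn
prefill_tps = 200          # assumed prompt-processing throughput (tokens/s)

cold_seconds = context_tokens / prefill_tps     # full recompute every turn
warm_seconds = new_turn_tokens / prefill_tps    # only the new tokens processed
# cold_seconds = 200.0, warm_seconds = 5.0
```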
Proposed Solution
Implement per-layer cache logic instead of assuming uniform cache types:
- For sliding window layers: Either make RotatingKVCache trimmable to a prefix boundary, or maintain a parallel standard cache for the prefix portion that gets replayed into the rotating buffer on reuse.
- For SSM/Mamba layers: Use make_prompt_cache(model) (which correctly creates ArrayCache for linear attention layers) instead of uniform KVCache() allocation. The workaround in QwenLM/Qwen3.5#37 demonstrates this works for Qwen 3.5 at the code level.
- Cache type introspection: The cache wrapper should inspect what type of cache each layer requires and handle trim/reuse differently per type.
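A per-layer dispatch could look roughly like the following. The class and method names (KVCache, RotatingKVCache, SSMCache, can_trim_to) are illustrative assumptions sketching the idea, not mlx-lm's actual API:

```python
class KVCache:
    """Standard per-token cache; `n` is the number of cached tokens."""

    def __init__(self, n):
        self.n = n

    def trim(self, k):
        self.n = min(self.n, k)         # per-token granularity: always safe


class RotatingKVCache(KVCache):
    """Sliding-window cache: only the last `window` token states survive."""

    def __init__(self, n, window):
        super().__init__(n)
        self.window = window

    def can_trim_to(self, k):
        oldest_kept = max(0, self.n - self.window)
        return oldest_kept <= max(0, k - self.window)


class SSMCache:
    """Recurrent state: no per-token structure to trim."""


def trim_to_prefix(layer_caches, prefix_len):
    """True iff every layer could be cut back to the shared prefix."""
    # Check every layer first, then trim, so a failing layer
    # leaves nothing half-trimmed.
    for c in layer_caches:
        if isinstance(c, SSMCache):
            return False                # recurrent state: must recompute
        if isinstance(c, RotatingKVCache) and not c.can_trim_to(prefix_len):
            return False                # the prefix already rotated out
    for c in layer_caches:
        c.trim(prefix_len)
    return True
```

With this shape, a pure-attention model trims successfully, while a hybrid model degrades to a clean "recompute" signal instead of crashing or silently corrupting state.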
PR #923 proposes a partial fix for Qwen 3.5 specifically. The RotatingKVCache trim issue has PRs at lmstudio-ai/mlx-engine#188 and #192. These should be unified into a comprehensive solution.
Related Issues
- Caching doesn't seem to be working for Qwen3.5 #903 — Caching doesn't work for Qwen3.5
- KV cache cross-contamination between concurrent requests in mlx_lm.server #965 — KV cache cross-contamination between concurrent requests
- Strange cache behavior with 0.31.0 in server mode #975 — Strange cache behavior in server mode (cache corruption)
- Prompt caching returns different logits for repeat prompts #259 — Prompt caching returns different logits for repeat prompts
- Hybrid cache for Qwen3.5 #923 — PR: Qwen3.5 cache fix (not merged)
- MLX RotatingKVCache trim behavior causes context overflow policies to always erase the whole cache lmstudio-ai/mlx-engine#177 — RotatingKVCache trim erases entire cache
- Cache reuse and cache fixes lmstudio-ai/mlx-engine#188 and #192 — PRs for RotatingKVCache fix (not merged)
- Qwen3.5-35B-A3B: KV cache reuse not supported — full prompt recompute on every request lmstudio-ai/lmstudio-bug-tracker#1563 — Qwen3.5 full recompute in LM Studio
- KV Caching broken for MLX on Mac LM Studio 0.3.35(Build 1) lmstudio-ai/lmstudio-bug-tracker#1319 — KV caching broken in LM Studio 0.3.35
- [Bug] Prefill Failure of Qwen3.5 Model Using KV Cache in the mlx‑lm Framework QwenLM/Qwen3.5#37 — Prefill failure with KV cache (workaround found)
- Eval bug: Qwen3.5 always re-processes the full prompt ggml-org/llama.cpp#19858 — Qwen3.5 always re-processes full prompt (fixed in b8212)
Environment
- macOS 26.2.0 (Tahoe), Mac Studio M3 Ultra, 512GB unified memory
- LM Studio 0.4.6 (build 1)
- Models tested: MiniMax-M2.5-MLX-8bit, gpt-oss-120b-mlx-8bit, Qwen3.5-9B (MLX)