Prefix cache reuse is broken for all hybrid-architecture models (sliding window, SSM/Mamba)
Summary
Prompt prefix caching — the mechanism that reuses computed KV states across requests sharing a common prefix — only works for pure full-attention models. Any model using sliding window attention, Mamba/SSM layers, or mixed attention types silently falls back to full prompt recomputation on every request. This makes multi-turn conversations unusably slow for the majority of modern open-weight models.
This is not a single bug but a systemic gap in how mlx-lm handles non-standard cache types. I'm filing this as a unifying issue because the symptoms are spread across many separate reports that all trace back to the same root causes.
Empirical Evidence
Tested on Mac Studio M3 Ultra (512GB unified memory), LM Studio 0.4.6, mlx-engine. Three sequential requests with identical system prompts, different user messages, max_tokens=30:
MiniMax M2.5 (pure attention, MoE) — caching works
| Request | Time | Speedup |
|---|---|---|
| 1 (cold) | 29.33s | — |
| 2 (warm) | 6.15s | 4.8x |
| 3 (warm) | 2.79s | 10.5x |
GPT-OSS 120B (sliding_attention + full_attention hybrid) — no caching
| Request | Time | Speedup |
|---|---|---|
| 1 (cold) | 1.54s | — |
| 2 (warm) | 1.77s | none |
| 3 (warm) | 1.67s | none |
Qwen 3.5 9B (attention + Mamba/SSM hybrid) — no caching
| Request | Time | Speedup |
|---|---|---|
| 1 (cold) | 5.02s | — |
| 2 (warm) | 7.76s | none (slower) |
| 3 (warm) | 8.00s | none (slower) |
MiniMax shows clear prefix reuse (29s → 3s). GPT-OSS and Qwen 3.5 show zero improvement — the full prompt is recomputed every turn.
Root Cause Analysis
There are two distinct failure modes, both stemming from the assumption that all layers use identical, trimmable KV caches:
1. Sliding window models → RotatingKVCache can't be trimmed
Models like GPT-OSS 120B and Gemma 3 27B alternate between sliding window and full attention layers:
```
# GPT-OSS 120B config.json
"layer_types": ["sliding_attention", "full_attention", "sliding_attention", "full_attention", ...]
"sliding_window": 128
```

```
# Gemma 3 27B — 5:1 pattern
5x local attention (window=1024) + 1x global attention, repeated
```
Sliding window layers use RotatingKVCache (circular buffer). When the cache wrapper attempts to trim to a common prefix for reuse, the circular buffer state can't be meaningfully trimmed — so the entire cache is erased and recomputed from scratch. See lmstudio-ai/mlx-engine#177.
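To see why a circular buffer defeats prefix trimming, here is a minimal toy sketch. The class name, window size, and token counts are illustrative assumptions, not mlx-lm's implementation:

```python
class RotatingBuffer:
    """Toy sliding-window KV cache: keeps states for only the last `window` tokens."""

    def __init__(self, window):
        self.window = window
        self.entries = []           # token indices whose KV states are still held

    def append(self, token_index):
        self.entries.append(token_index)
        if len(self.entries) > self.window:
            self.entries.pop(0)     # the oldest state is discarded for good

    def oldest(self):
        return self.entries[0]

    def can_trim_to(self, prefix_len):
        # A cache trimmed to `prefix_len` must still hold the states for the
        # last `window` tokens of the prefix. Once later tokens rotate in,
        # those prefix states have been overwritten.
        oldest_needed = max(0, prefix_len - self.window)
        return self.oldest() <= oldest_needed


# Process a 1000-token prompt with a 128-token window.
buf = RotatingBuffer(window=128)
for t in range(1000):
    buf.append(t)

evicted_before = buf.oldest()        # 872: states for tokens 0..871 are gone
reusable = buf.can_trim_to(500)      # False: a 500-token shared prefix was evicted
```

Since the prefix states are unrecoverable, the only safe fallback for the wrapper is exactly what the issue describes: erase the whole cache and recompute.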
2. SSM/Mamba hybrid models → non-trimmable state
Models like Qwen 3.5 (all sizes) use attention + Mamba layers:
```
# Qwen 3.5 — hybrid attention + SSM
Attention layers: standard KVCache (trimmable)
Mamba/SSM layers: recurrent state (NOT trimmable)
```
The Mamba state is fundamentally different from a KV cache — it's a compressed recurrent state that can't be split at an arbitrary token boundary. Additionally, the KVCache.make_mask() interface requires window_size and return_array arguments that don't apply to SSM state, causing TypeError on multi-turn prefill (see QwenLM/Qwen3.5#37).
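The structural difference can be shown with two toy classes (the names and the decay constant are illustrative, not mlx-lm's types): a per-token KV list can be sliced at any boundary, while a recurrent state folds every token into one fixed-size value that cannot be split back apart.

```python
class ToyKVCache:
    """Per-token storage: trimming to a prefix is just list slicing."""

    def __init__(self):
        self.keys = []                  # one entry per token

    def append(self, k):
        self.keys.append(k)

    def trim(self, n):
        self.keys = self.keys[:n]       # safe at any token boundary


class ToySSMState:
    """One fixed-size state that folds in every token seen so far."""

    def __init__(self):
        self.state = 0.0
        self.tokens_seen = 0

    def update(self, x):
        # All history is mixed into a single value; the contribution of any
        # individual token cannot be separated back out afterwards.
        self.state = 0.9 * self.state + x
        self.tokens_seen += 1

    def trim(self, n):
        raise NotImplementedError("recurrent state has no token boundaries")
```

Trimming the KV cache to a shared prefix works as expected; calling the same operation on the recurrent state can only fail, which is why a uniform "trim every layer" pass breaks on hybrid models.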
Affected Models
Every popular hybrid-architecture model is affected. This covers the majority of modern open-weight models:
| Model | Architecture | Cache behavior |
|---|---|---|
| Qwen 3.5 (all sizes) | Attention + Mamba/SSM | Broken — crashes or no reuse |
| GPT-OSS 120B / 20B | Sliding + full attention | Broken — full recompute |
| Gemma 3 (all sizes) | 5:1 sliding + global | Broken — full recompute |
| Llama 4 Scout/Maverick | iRoPE chunked (8K) + NoPE | Likely broken |
| Qwen2.5-VL | Partial sliding window | Likely broken |
| MiniMax M2.5 | Pure full attention (MoE) | Works |
As of March 2026, MiniMax M2.5 appears to be the only major model where prefix caching works correctly on MLX.
Impact
This is particularly painful for agentic workloads where:
- System prompts are large (tool definitions, personas, instructions)
- Conversations are multi-turn (each turn should only process new tokens)
- Multiple agents share the same model (each request recomputes from scratch)
Without prefix caching, a 40K-token context takes ~200 seconds to process vs ~5 seconds with cache reuse. For agentic frameworks running on local MLX models, this is the difference between usable and unusable.
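The 40K-token figure follows from simple arithmetic. The ~200 tokens/s prefill throughput and the per-turn token count below are illustrative assumptions; real numbers vary by model, quantization, and hardware:

```python
# Back-of-envelope cost of one turn with and without prefix reuse.
context_tokens = 40_000    # shared system prompt + conversation history
new_turn_tokens = 1_000    # tokens added by the latest user turn
prefill_tps = 200          # assumed prompt-processing throughput (tokens/s)

cold_seconds = context_tokens / prefill_tps     # full recompute every turn
warm_seconds = new_turn_tokens / prefill_tps    # only the new tokens processed
# cold_seconds = 200.0, warm_seconds = 5.0
```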
Proposed Solution
Implement per-layer cache logic instead of assuming uniform cache types:
- For sliding window layers: Either make RotatingKVCache trimmable to a prefix boundary, or maintain a parallel standard cache for the prefix portion that gets replayed into the rotating buffer on reuse.
- For SSM/Mamba layers: Use make_prompt_cache(model) (which correctly creates ArrayCache for linear attention layers) instead of uniform KVCache() allocation. The workaround in QwenLM/Qwen3.5#37 demonstrates this works for Qwen 3.5 at the code level.
- Cache type introspection: The cache wrapper should inspect what type of cache each layer requires and handle trim/reuse differently per type.
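A per-layer dispatch could look roughly like the following. The class and method names (KVCache, RotatingKVCache, SSMCache, can_trim_to) are illustrative assumptions sketching the idea, not mlx-lm's actual API:

```python
class KVCache:
    """Standard per-token cache; `n` is the number of cached tokens."""

    def __init__(self, n):
        self.n = n

    def trim(self, k):
        self.n = min(self.n, k)         # per-token granularity: always safe


class RotatingKVCache(KVCache):
    """Sliding-window cache: only the last `window` token states survive."""

    def __init__(self, n, window):
        super().__init__(n)
        self.window = window

    def can_trim_to(self, k):
        oldest_kept = max(0, self.n - self.window)
        return oldest_kept <= max(0, k - self.window)


class SSMCache:
    """Recurrent state: no per-token structure to trim."""


def trim_to_prefix(layer_caches, prefix_len):
    """True iff every layer could be cut back to the shared prefix."""
    # Check every layer first, then trim, so a failing layer
    # leaves nothing half-trimmed.
    for c in layer_caches:
        if isinstance(c, SSMCache):
            return False                # recurrent state: must recompute
        if isinstance(c, RotatingKVCache) and not c.can_trim_to(prefix_len):
            return False                # the prefix already rotated out
    for c in layer_caches:
        c.trim(prefix_len)
    return True
```

With this shape, a pure-attention model trims successfully, while a hybrid model degrades to a clean "recompute" signal instead of crashing or silently corrupting state.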
PR #923 proposes a partial fix for Qwen 3.5 specifically. The RotatingKVCache trim issue has PRs at lmstudio-ai/mlx-engine#188 and #192. These should be unified into a comprehensive solution.
Related Issues
- Caching doesn't seem to be working for Qwen3.5 #903 — Caching doesn't work for Qwen3.5
- KV cache cross-contamination between concurrent requests in mlx_lm.server #965 — KV cache cross-contamination between concurrent requests
- Strange cache behavior with 0.31.0 in server mode #975 — Strange cache behavior in server mode (cache corruption)
- Prompt caching returns different logits for repeat prompts #259 — Prompt caching returns different logits for repeat prompts
- Hybrid cache for Qwen3.5 #923 — PR: Qwen3.5 cache fix (not merged)
- MLX RotatingKVCache trim behavior causes context overflow policies to always erase the whole cache lmstudio-ai/mlx-engine#177 — RotatingKVCache trim erases entire cache
- Cache reuse and cache fixes lmstudio-ai/mlx-engine#188 and #192 — PRs for RotatingKVCache fix (not merged)
- Qwen3.5-35B-A3B: KV cache reuse not supported — full prompt recompute on every request lmstudio-ai/lmstudio-bug-tracker#1563 — Qwen3.5 full recompute in LM Studio
- KV Caching broken for MLX on Mac LM Studio 0.3.35(Build 1) lmstudio-ai/lmstudio-bug-tracker#1319 — KV caching broken in LM Studio 0.3.35
- [Bug] Prefill Failure of Qwen3.5 Model Using KV Cache in the mlx‑lm Framework QwenLM/Qwen3.5#37 — Prefill failure with KV cache (workaround found)
- Eval bug: Qwen3.5 always re-processes the full prompt ggml-org/llama.cpp#19858 — Qwen3.5 always re-processes full prompt (fixed in b8212)
Environment
- macOS 26.2.0 (Tahoe), Mac Studio M3 Ultra, 512GB unified memory
- LM Studio 0.4.6 (build 1)
- Models tested: MiniMax-M2.5-MLX-8bit, gpt-oss-120b-mlx-8bit, Qwen3.5-9B (MLX)