Add TurboQuant KV cache backend #160

Open
timonharz wants to merge 2 commits into ml-explore:main from timonharz:codex/turboquant-mlx-vlm-858

Conversation

@timonharz commented Mar 25, 2026

Summary

This PR ports the TurboQuant KV-cache backend from mlx-vlm PR #858 into mlx-swift-lm, with the implementation centered in MLXLMCommon so it works for both MLXLLM and the text-decoder side of MLXVLM.

The behavior matches the upstream intent:

  • fractional kvBits automatically use TurboQuant
  • integer kvBits continue to use uniform quantization by default
  • kvQuantizationScheme = .turboQuant can explicitly force TurboQuant for integer bit widths
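
The three selection rules above can be sketched as a small pure function. This is only an illustration: `SketchKVScheme`, `SketchBackend`, and `selectBackend` are hypothetical names, not the types the PR adds to MLXLMCommon, and the sketch assumes the non-TurboQuant default is a plain uniform scheme.

```swift
// Hypothetical stand-in for the PR's KVQuantizationScheme; case names are assumed.
enum SketchKVScheme {
    case uniform      // assumed default: uniform quantization
    case turboQuant   // explicit request for TurboQuant
}

// Hypothetical result type describing which cache backend would be chosen.
enum SketchBackend: Equatable {
    case unquantized
    case uniform(bits: Int)
    case turboQuant(bits: Float)
}

// Applies the rules from the summary:
//   - no kvBits            -> no quantization
//   - fractional kvBits    -> TurboQuant
//   - integer kvBits       -> uniform, unless the scheme forces TurboQuant
func selectBackend(kvBits: Float?, scheme: SketchKVScheme = .uniform) -> SketchBackend {
    guard let bits = kvBits else { return .unquantized }
    let isFractional = bits != bits.rounded()
    if isFractional || scheme == .turboQuant {
        return .turboQuant(bits: bits)
    }
    return .uniform(bits: Int(bits))
}
```

Under these assumptions, `selectBackend(kvBits: 3.5)` picks TurboQuant, `selectBackend(kvBits: 4)` stays on uniform quantization, and `selectBackend(kvBits: 4, scheme: .turboQuant)` forces TurboQuant at an integer width.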

What changed

  • Added a new shared TurboQuantKVCache backend in MLXLMCommon
  • Ported the active TurboQuant codec/runtime stack, including split-codec support for .5 bit widths
  • Extended GenerateParameters:
    • kvBits is now Float?
    • added the KVQuantizationScheme type
    • added the kvQuantizationScheme property
  • Updated dynamic KV-cache quantization to recurse into nested CacheList contents
  • Preserved existing skips for unsupported cache types such as RotatingKVCache, MambaCache, and plain ArraysCache
  • Integrated TurboQuant into the shared attention/cache-update path
  • Patched model-specific attention implementations that bypass the shared helper, including GPTOSS and MiMoV2Flash
  • Kept the existing “no attention sinks on quantized caches” behavior for TurboQuant as well
  • Extended prompt-cache save/load to serialize and restore TurboQuantKVCache
  • Updated docs/examples to cover fractional kvBits and explicit quantization scheme selection
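
The nested-cache behavior in the list above (recurse into CacheList contents, leave unsupported cache types alone) can be illustrated with a toy model. `SketchCache` and `quantizeRecursively` are hypothetical; the real cache classes in MLXLMCommon are not modeled here, and each enum case only stands in for a cache type by name.

```swift
// Toy model of a cache tree; each case stands in for a real cache type.
enum SketchCache: Equatable {
    case plain                 // a standard KV cache that can be quantized
    case quantized             // a cache that has already been converted
    case rotating              // stands in for RotatingKVCache (skipped)
    case mamba                 // stands in for MambaCache (skipped)
    case list([SketchCache])   // stands in for CacheList (recursed into)
}

// Converts every quantizable leaf, descends into nested lists,
// and returns unsupported or already-quantized caches unchanged.
func quantizeRecursively(_ cache: SketchCache) -> SketchCache {
    switch cache {
    case .plain:
        return .quantized
    case .list(let children):
        return .list(children.map(quantizeRecursively))
    case .quantized, .rotating, .mamba:
        return cache
    }
}
```

In this sketch a list containing a plain cache, a rotating cache, and a nested list comes back with only the plain leaves converted, which matches the skip behavior the PR says it preserves.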

Tests

Added coverage for:

  • TurboQuant codec behavior
  • cache selection for fractional bits and explicit .turboQuant
  • nested cache quantization behavior
  • prompt-cache round-tripping for TurboQuantKVCache

@N1k1tung

> Runtime test execution is still blocked in this environment because MLX fails to load default.metallib, so I could not complete the full TurboQuant test run here.

kinda explains it was in codex sandbox =)

@davidkoski added the swift-format (Swift format failure in CI) label Mar 27, 2026