KV cache memory is a bottleneck at long context lengths. NexusQuant offers training-free 7–10x KV compression via E8 lattice quantization + attention-aware token eviction (up to 17x with token merging).
Integration points:
- After prefill, compress the KV cache in-place before storing in the paged block pool
- Use attention mask to exclude evicted tokens during generation
- API: with nexusquant_evict(model): model.generate(...)
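To make the eviction step concrete, here is a minimal sketch of attention-aware token eviction with NumPy. The scoring policy (rank cached tokens by accumulated attention mass, always keep a few initial "sink" tokens) and the names `evict_tokens`, `keep_ratio`, and `sink` are illustrative assumptions, not NexusQuant's actual API; the real policy may differ.

```python
import numpy as np

def evict_tokens(attn_weights, keep_ratio=0.5, sink=4):
    """Rank cached tokens by accumulated attention and keep the top ones.

    attn_weights: (num_queries, seq_len) post-softmax attention from recent
    queries to every cached token. Returns the kept indices and a boolean
    mask that excludes evicted tokens during generation.
    """
    seq_len = attn_weights.shape[1]
    score = attn_weights.sum(axis=0)       # total attention each token received
    score[:sink] = np.inf                  # always retain initial sink tokens
    n_keep = max(sink, int(seq_len * keep_ratio))
    keep = np.sort(np.argsort(-score)[:n_keep])  # retained indices, in order
    mask = np.zeros(seq_len, dtype=bool)
    mask[keep] = True
    return keep, mask
```

The boolean mask can then be folded into the attention mask (e.g. as an additive -inf bias on evicted positions) so evicted entries never participate in subsequent decoding steps.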
Why this matters for lmdeploy:
lmdeploy's TurboMind engine already supports INT4/INT8 KV quant. NexusQuant's E8 lattice VQ is a natural extension — it achieves higher compression than INT4 while maintaining quality, and is drop-in since it doesn't change tensor shapes (only precision).
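For intuition, the core of E8 lattice quantization is snapping each 8-dimensional block of a K/V vector to the nearest E8 lattice point. The nearest-point decoder below (round per coordinate, fix parity, take the closer of the two cosets D8 and D8 + 1/2) is the standard textbook algorithm; the per-block `scale` wrapper is only an assumed sketch of how a KV quantizer might use it, not NexusQuant's actual scheme.

```python
import numpy as np

def _nearest_d8(x):
    # Nearest point of D8 = {z in Z^8 : sum(z) even}: round each coordinate,
    # then fix parity by re-rounding the worst coordinate the other way.
    r = np.rint(x)
    if int(r.sum()) % 2 != 0:
        i = int(np.argmax(np.abs(x - r)))
        r[i] += 1.0 if x[i] >= r[i] else -1.0
    return r

def e8_round(x):
    # E8 = D8 ∪ (D8 + 1/2); decode in both cosets and keep the closer point.
    a = _nearest_d8(x)
    b = _nearest_d8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b

def quantize_kv_block(v, scale=0.5):
    # Hypothetical per-block scheme: scale an 8-dim slice of a K/V vector
    # into the lattice's working range, snap to the nearest E8 point, and
    # rescale. The tensor shape is unchanged; only precision drops.
    return e8_round(np.asarray(v, dtype=np.float64) / scale) * scale
```

Because the output has the same shape and dtype layout as the input, this kind of quantizer can sit behind an existing paged KV block pool without touching attention kernels, which is the "drop-in" property claimed above.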
Validated results:
- Mistral-7B: 7x compression, -2.26% PPL vs baseline
- Llama-3-8B: 5.3x compression, -0.002% PPL
- Training-free, no calibration data required
Library details:
- pip install nexusquant-kv
Would you be interested in exploring this as a compression backend for TurboMind? Happy to help with the integration and provide benchmarks on your target models.