This repository is for our survey paper:
Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization
Jiantong Jiang1, Peiyu Yang1, Rui Zhang2, Feng Liu1
1The University of Melbourne, 2Huazhong University of Science and Technology
This repository records papers on system-aware, serving-time, KV-centric optimization methods that improve system metrics without retraining or architecture modification (we refer to this scope as sKis). We systematize recent advances through a distinct system behavior-oriented taxonomy, organizing existing efforts into three behavioral dimensions:
- Temporal – when is the KV cache accessed or computed?
- Spatial – where is the KV cache placed and migrated?
- Structural – how is the KV cache represented and managed?

Grounded in this taxonomy, we analyze cross-behavior co-design affinity and behavior-objective effects, revealing overlooked regions and concrete open challenges.
The survey and the repository are still under active development and will be updated regularly.
If you would like to include your paper in this survey and repository, please feel free to submit a pull request. You can generate the markdown row for each paper by filling in the first part of `generate.py` and running `python generate.py`. Alternatively, you can open an issue with the paper's title and a brief summary highlighting its key techniques. You can also contact us via email.
Please let us know if you find a mistake or have any suggestions! We greatly appreciate your feedback on this repository and the survey!
If you find this resource helpful for your work, please consider giving us a star and citing our research.
- Temporal – Execution & Scheduling
- Spatial – Placement & Migration
- Structural – Representation & Retention
- KV Cache Compression (KVCC) (including quantization, low-rank approximation, and structural compression)
- KV Cache Retention Management (KVRM) (including allocation, reuse, and eviction)
- Cross-behavior Co-design Affinity
- Behavior-objective Effects
These methods act on when KV-related work is computed, scheduled, and executed, in order to improve latency and throughput. We divide these methods into three categories: KV-centric scheduling, pipelining & overlapping, and hardware-aware execution.
KV-centric scheduling methods explicitly integrate KV characteristics into runtime scheduling decisions.
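As a concrete illustration of one pattern from the table below (KV reuse-aware request-level scheduling), here is a minimal Python sketch that prioritizes requests whose prompts hit a shared prefix cache. The `PrefixCache` and `Request` helpers are hypothetical and not taken from any listed system.

```python
# Minimal sketch of KV reuse-aware request scheduling: rank waiting requests
# by how many prompt tokens already have KV entries in a shared prefix cache.
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_tokens: list  # token ids

class PrefixCache:
    """Maps cached prompt prefixes (as tuples of token ids) to stored KV blocks."""
    def __init__(self):
        self._prefixes = set()

    def insert(self, tokens):
        for i in range(1, len(tokens) + 1):
            self._prefixes.add(tuple(tokens[:i]))

    def matched_len(self, tokens):
        # Length of the longest cached prefix of this prompt.
        n = 0
        for i in range(1, len(tokens) + 1):
            if tuple(tokens[:i]) in self._prefixes:
                n = i
        return n

def schedule(waiting, cache):
    # Serve requests with the highest cache-hit ratio first: fewer prompt tokens
    # need fresh prefill, and their hot KV blocks get reused before eviction.
    return sorted(
        waiting,
        key=lambda r: cache.matched_len(r.prompt_tokens) / len(r.prompt_tokens),
        reverse=True,
    )

cache = PrefixCache()
cache.insert([1, 2, 3, 4])                        # KV of a previous request's prompt
waiting = [Request(0, [9, 8, 7]), Request(1, [1, 2, 3, 5])]
print([r.rid for r in schedule(waiting, cache)])  # -> [1, 0]
```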
| Paper | Type | Code |
|---|---|---|
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection [Link] Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Tianfu Wang, Kun Fu, Zheng Wang, Hui Xiong |
Token-level attention compute scheduling | TokenSelect |
RefreshKV: Updating Small KV Cache During Long-form Generation [Link] Fangyuan Xu, Tanya Goyal, Eunsol Choi |
Token-level attention compute scheduling | RefreshKV |
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression [Link] Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov |
Token-level attention compute scheduling | RocketKV |
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [Link] Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze |
Kernel-level workload scheduling across CUDA thread blocks; Also belongs to allocation & reuse (structural) | FlashInfer π |
Mooncake: Trading More Storage for Less Computation – A KVCache-centric Architecture for Serving LLM Chatbot [Link] Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu |
KV reuse-aware request-level scheduling; Also belongs to HW-aware execution | Mooncake π |
Loki: Low-rank Keys for Efficient Sparse Attention [Link] Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele |
Token-level attention compute scheduling | Loki |
SGLang: Efficient Execution of Structured Language Model Programs [Link] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng |
KV reuse-aware request-level scheduling; Also belongs to allocation & reuse (structural) | SGLang π |
LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism [Link] Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, Xin Jin |
KV usage-aware request-level scheduling | LoongServe |
| Fast Inference for Augmented Large Language Models [Link] Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, Michael Mitzenmacher |
KV usage-aware request-level scheduling | |
| LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management [Link] Yi Xiong, Hao Wu, Changxu Shao, Ziqing Wang, Rui Zhang, Yuhong Guo, Junping Zhao, Ke Zhang, Zhenxuan Pan |
KV usage-aware request-level scheduling; Also belongs to memory hierarchy KV orchestration (spatial) | |
SparQ Attention: Bandwidth-Efficient LLM Inference [Link] Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr |
Token-level attention compute scheduling | SparQ Attention |
QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference [Link] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han |
Token-level attention compute scheduling | Quest |
MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving [Link] Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang |
KV usage-aware request-level scheduling; Also belongs to HW-aware execution | MuxServe |
Preble: Efficient Distributed Prompt Scheduling for LLM Serving [Link] Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang |
KV reuse-aware request-level scheduling | Preble |
| Inference without interference: Disaggregate LLM inference for mixed downstream workloads [Link] Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan |
KV usage-aware request-level scheduling; Also belongs to HW-aware execution |
Pipelining and overlapping methods hide latency by concurrently executing KV-related compute, communication, and I/O. They are often embedded in broader serving systems.
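The common pattern is to issue KV transfers on a side stream while compute proceeds on the main stream. Below is a minimal PyTorch sketch of that pattern, assuming a CUDA GPU and pinned host buffers; it illustrates the general overlap idea, not any listed system's implementation.

```python
# Minimal sketch of overlapping a CPU->GPU KV transfer with GPU compute using a
# separate CUDA stream.
import torch

def attention_step(q, k, v):
    # Placeholder for the current layer's attention compute.
    return torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1) @ v

if torch.cuda.is_available():
    dev = torch.device("cuda")
    copy_stream = torch.cuda.Stream()

    q = torch.randn(1, 8, 128, 64, device=dev)
    k = torch.randn(1, 8, 128, 64, device=dev)
    v = torch.randn(1, 8, 128, 64, device=dev)
    # The next layer's offloaded KV lives in pinned host memory so the copy is async.
    k_next_cpu = torch.randn(1, 8, 128, 64, pin_memory=True)
    v_next_cpu = torch.randn(1, 8, 128, 64, pin_memory=True)

    with torch.cuda.stream(copy_stream):            # issue the prefetch on a side stream
        k_next = k_next_cpu.to(dev, non_blocking=True)
        v_next = v_next_cpu.to(dev, non_blocking=True)

    out = attention_step(q, k, v)                   # default stream computes meanwhile
    torch.cuda.current_stream().wait_stream(copy_stream)  # sync before using k_next/v_next
    out_next = attention_step(out, k_next, v_next)
```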
| Paper | Type | Code |
|---|---|---|
KVPR: Efficient LLM inference with i/o-aware KV cache partial recomputation [Link] Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram |
GPU KV recompute ∥ KV transfer (CPU→GPU) | KVPR |
| PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving [Link] Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli |
GPU KV prefetch (HBM→L2) ∥ GPU collective communication; Also belongs to memory hierarchy KV orchestration (spatial) | |
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference [Link] Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu |
CPU attention compute ∥ GPU linear ops; Also belongs to HW-aware execution | NEO |
| Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching [Link] Yanhao Dong, Yubo Miao, Weinan Li, Xiao Zheng, Chao Wang, Feng Lyu |
GPU KV prefetch (HBM→L2) ∥ GPU attention compute; Also belongs to memory hierarchy KV orchestration (spatial) | |
Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention [Link] Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo |
KV load/store (CPU↔GPU) ∥ GPU compute; Also belongs to memory hierarchy KV orchestration (spatial) | |
| FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines [Link] Jiaao He, Jidong Zhai |
CPU R-part compute ∥ GPU S-part compute; Also belongs to HW-aware execution | |
Improving Throughput-Oriented LLM Inference with CPU Computations [Link] Daon Park, Bernhard Egger |
CPU MHSA compute ∥ FFN data transfer (CPU↔GPU); Also belongs to HW-aware execution | Heterogen |
Hardware-aware execution methods adapt KV cache-related operations to the underlying heterogeneous hardware.
Disaggregated inference separates the heterogeneous computation phases of LLM inference and maps them to distinct hardware resources to reduce interference and improve utilization.
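A minimal sketch of the routing side of prefill/decode disaggregation follows; the pool and handle names are hypothetical, and real systems additionally transfer or remotely access the prefill-produced KV cache.

```python
# Minimal sketch of prefill/decode (PD) disaggregation at the routing level:
# prefill jobs go to one GPU pool, decode jobs to another, and the KV cache
# produced by prefill is handed over to the decode worker.
from collections import deque

prefill_pool = deque(["gpu0", "gpu1"])   # compute-bound phase
decode_pool = deque(["gpu2", "gpu3"])    # memory-bandwidth-bound phase

def dispatch(phase):
    pool = prefill_pool if phase == "prefill" else decode_pool
    worker = pool[0]
    pool.rotate(-1)                      # simple round-robin within each pool
    return worker

def serve(request):
    p_worker = dispatch("prefill")
    kv_handle = f"kv://{p_worker}/{request['id']}"   # where the prompt's KV cache lives
    d_worker = dispatch("decode")
    # In a real system the KV cache is transferred or remotely read here
    # (e.g., over NVLink/RDMA) before token-by-token decoding starts.
    return {"prefill": p_worker, "decode": d_worker, "kv": kv_handle}

print(serve({"id": 42, "prompt": "hello"}))
```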
| Paper | Type | Code |
|---|---|---|
Mooncake: Trading More Storage for Less Computation – A KVCache-centric Architecture for Serving LLM Chatbot [Link] Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu |
PD disaggregation; Also belongs to KV-centric scheduling | Mooncake π |
DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving [Link] Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic |
Decoupling prefill and decode to different GPUs; Also belongs to memory hierarchy KV orchestration (spatial) | DéjàVu |
MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving [Link] Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang |
Colocating PD jobs of multiple LLMs within each GPU via SM partitioning; Also belongs to KV-centric scheduling | MuxServe |
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving [Link] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang |
Decoupling prefill and decode to different GPUs; Also belongs to compute device KV orchestration (spatial) | DistServe |
| Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache [Link] Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin |
Disaggregating at the operator level; Also belongs to compute device KV orchestration (spatial) | |
Splitwise: Efficient generative LLM inference using phase splitting [Link] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini |
Decoupling prefill and decode to different GPUs; Also belongs to compute device KV orchestration (spatial) | Splitwise (integrated into vLLM) |
| Inference without interference: Disaggregate LLM inference for mixed downstream workloads [Link] Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan |
Decoupling prefill and decode to different GPUs; Also belongs to KV-centric scheduling |
Compute offloading relocates part of the computation to auxiliary devices to relieve GPU bottlenecks, exploiting hardware heterogeneity and workload characteristics.
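A minimal PyTorch sketch of one common offloading rule of thumb: for long contexts, move the small decode-step query to the CPU where the offloaded KV cache lives, rather than moving the large KV cache to the GPU. The threshold and structure are illustrative assumptions, not any listed method's policy.

```python
# Minimal sketch of CPU compute offloading: attention over an offloaded KV cache is
# executed on the CPU next to the data, instead of copying the KV cache to the GPU.
import torch

def attention(q, k, v):
    return torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1) @ v

def decode_attention(q, k_cpu, v_cpu, offload_threshold=4096):
    seq_len = k_cpu.shape[-2]
    if seq_len >= offload_threshold or not torch.cuda.is_available():
        # Long context: ship the single query vector to the CPU (cheap) rather than
        # shipping the whole KV cache to the GPU (expensive).
        out = attention(q.cpu(), k_cpu, v_cpu)
        return out.to(q.device)
    # Short context: copying the KV cache is cheap enough to keep attention on GPU.
    return attention(q, k_cpu.to(q.device), v_cpu.to(q.device))

q = torch.randn(1, 8, 1, 64)          # one decode-step query
k = torch.randn(1, 8, 8192, 64)       # offloaded KV cache in CPU DRAM
v = torch.randn(1, 8, 8192, 64)
print(decode_attention(q, k, v).shape)   # torch.Size([1, 8, 1, 64])
```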
| Paper | Type | Code |
|---|---|---|
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference [Link] Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu |
CPU offloading (attention and KV caches); Also belongs to pipelining & overlapping | NEO |
MagicPIG: LSH Sampling for Efficient LLM Generation [Link] Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen |
CPU offloading (attention and retrieval) | MagicPIG |
PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System [Link] Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gómez-Luna, Huawei Li, Xiaowei Li, Ying Wang, Onur Mutlu |
PIM-based offloading | ASPLOS |
TwinPilots: A New Computing Paradigm for GPU-CPU Parallel LLM Inference [Link] Chengye Yu, Tianyu Wang, Zili Shao, Linjie Zhu, Xu Zhou, Song Jiang |
CPU offloading | |
| InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference [Link] Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang |
CSD Offloading; Also belongs to compute device KV orchestration (spatial) | |
AttAcc! unleashing the power of PIM for batched transformer-based generative model inference [Link] Jaehyun Park, Jaewan Choi, Kwanhee Kyung, Michael Jaemin Kim, Yongsuk Kwon, Nam Sung Kim, Jung Ho Ahn |
PIM-based offloading; Also belongs to compute device KV orchestration (spatial) | AttAcc |
| FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines [Link] Jiaao He, Jidong Zhai |
CPU offloading (attention and KV caches); Also belongs to pipelining & overlapping | |
Improving Throughput-Oriented LLM Inference with CPU Computations [Link] Daon Park, Bernhard Egger |
CPU offloading with dynamic GPU-CPU division; Also belongs to pipelining & overlapping | Heterogen |
These works optimize where KV data is stored or transferred to balance memory and I/O pressure. We divide these methods into two categories: memory hierarchy KV orchestration, and compute device KV orchestration.
Memory hierarchy KV orchestration methods distribute KV caches across memory hierarchies.
These methods manage KV caches across fast but limited GPU HBM, and larger but slower alternatives like CPU DRAM or SSD.
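A minimal sketch of a three-tier KV store (GPU HBM, CPU DRAM, SSD) with promotion on access follows; the class and its simple spill policy are illustrative assumptions, not a specific system's design.

```python
# Minimal sketch of a tiered KV store: hot entries in GPU HBM, warm in CPU DRAM,
# cold on SSD, with promotion back to HBM on access.
import os, tempfile
import torch

class TieredKVStore:
    def __init__(self, hbm_capacity=2):
        self.hbm, self.dram = {}, {}
        self.hbm_capacity = hbm_capacity
        self.disk_dir = tempfile.mkdtemp()
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def put(self, key, kv):
        if len(self.hbm) < self.hbm_capacity:
            self.hbm[key] = kv.to(self.device)
        else:
            self.dram[key] = kv.cpu()        # spill; a real system also spills DRAM -> SSD

    def evict_to_disk(self, key):
        kv = self.dram.pop(key)
        path = os.path.join(self.disk_dir, f"{key}.pt")
        torch.save(kv, path)
        return path

    def get(self, key):
        if key in self.hbm:
            return self.hbm[key]
        if key in self.dram:                 # promote DRAM -> HBM on access
            self.hbm[key] = self.dram.pop(key).to(self.device)
            return self.hbm[key]
        path = os.path.join(self.disk_dir, f"{key}.pt")
        return torch.load(path).to(self.device)   # SSD -> HBM reload

store = TieredKVStore()
for i in range(3):
    store.put(i, torch.randn(2, 8, 128, 64))
print(store.get(2).shape)
```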
| Paper | Type | Code |
|---|---|---|
KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows [Link] Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, Yipeng Shen, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, Yufei Ding |
Importance-aware methods | |
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval [Link] Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu |
Importance-aware methods | RetrievalAttention |
| LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference [Link] Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, Junchen Jiang |
System cost-driven decision; Also belongs to compute device KV orchestration | LMCache π |
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation [Link] Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin |
System cost-driven decision | |
| SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning [Link] Lingkun Long, Rubing Yang, Yushi Huang, Desheng Hui, Ao Zhou, Jianlei Yang |
Importance-aware methods | |
SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs [Link] Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han |
Importance-aware methods | |
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference [Link] Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen |
Importance-aware methods; Also belongs to KV cache compression (structural) | ShadowKV |
PQCache: Product Quantization-based KVCache for Long Context LLM Inference [Link] Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui |
Importance-aware methods | PQCache |
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression [Link] Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, Minyi Guo |
Importance-aware methods | ClusterKV |
Stateful Large Language Model Serving with Pensieve [Link] Lingfan Yu, Jinkun Lin, Jinyang Li |
Importance-aware methods | |
IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language Model Inference [Link] Weijian Chen, Shuibing He, Haoyang Qu, Ruidong Zhang, Siling Yang, Ping Chen, Yi Zheng, Baoxing Huai, Gang Chen |
Importance-aware methods | |
ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction [Link] Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, Yun Liang |
Importance-aware methods | ArkVale |
InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [Link] Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun |
Importance-aware methods | InfLLM |
| FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving [Link] Ao Shen, Zhiyao Li, Mingyu Gao |
System cost-driven decision; Also belongs to allocation & reuse (structural) | |
| LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management [Link] Yi Xiong, Hao Wu, Changxu Shao, Ziqing Wang, Rui Zhang, Yuhong Guo, Junping Zhao, Ke Zhang, Zhenxuan Pan |
System cost-driven decision; Also belongs to KV-centric scheduling (temporal) | |
DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving [Link] Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic |
System cost-driven decision; Also belongs to HW-aware execution (temporal) | DéjàVu |
Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention [Link] Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo |
System cost-driven decision; Also belongs to pipelining & overlapping (temporal) | |
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [Link] Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim |
Importance-aware methods | InfiniGen |
ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching [Link] Youpeng Zhao, Di Wu, Jun Wang |
Importance-aware methods | |
Distributed Inference and Fine-tuning of Large Language Models Over The Internet [Link] Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel |
System cost-driven decision; Also belongs to compute device KV orchestration | FastServe |
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [Link] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang |
System cost-driven decision; Also belongs to KV cache compression (structural) | FlexLLMGen π |
These methods migrate KV entries between on-chip L1/L2 caches and off-chip HBM to hide latency.
| Paper | Type | Code |
|---|---|---|
| PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving [Link] Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli |
Prefetching KV caches from HBM to the L2 cache; Also belongs to pipelining & overlapping (temporal) | |
| Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching [Link] Yanhao Dong, Yubo Miao, Weinan Li, Xiao Zheng, Chao Wang, Feng Lyu |
Prefetching KV caches from HBM to the L2 cache; Also belongs to pipelining & overlapping (temporal) |
Compute device KV orchestration methods place and move KV caches across compute-capable devices like GPUs, CPUs, and storage-attached processors, to enable distributed or heterogeneous serving.
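A minimal sketch of device-level KV placement and migration follows, with hash-based initial placement and an explicit migrate step; the device discovery and placement rule are illustrative assumptions rather than any listed system's policy.

```python
# Minimal sketch of placing and migrating KV cache blocks across compute devices
# (two GPUs when available, otherwise CPU plus whatever GPU exists).
import torch

def available_devices():
    n = torch.cuda.device_count()
    if n >= 2:
        return [f"cuda:{i}" for i in range(n)]
    return ["cpu", "cuda:0" if n else "cpu"]

class KVPlacement:
    def __init__(self):
        self.devices = available_devices()
        self.location = {}                       # block_id -> device index

    def place(self, block_id, kv):
        # Hash-based initial placement spreads blocks across devices.
        idx = hash(block_id) % len(self.devices)
        self.location[block_id] = idx
        return kv.to(self.devices[idx])

    def migrate(self, block_id, kv, target_idx):
        # E.g., co-locate a session's KV with the device that will decode it next.
        self.location[block_id] = target_idx
        return kv.to(self.devices[target_idx], non_blocking=True)

pl = KVPlacement()
blk = pl.place("session42/block0", torch.randn(8, 128, 64))
blk = pl.migrate("session42/block0", blk, target_idx=0)
print(pl.devices, pl.location)
```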
| Paper | Type | Code |
|---|---|---|
| LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference [Link] Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, Junchen Jiang |
KV cache placement & migration across GPUs; Also belongs to memory hierarchy KV orchestration | LMCache π |
| InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference [Link] Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang |
KV cache placement & migration across GPUs and CSDs; Also belongs to HW-aware execution (temporal) | |
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving [Link] Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang |
Remote KV cache transmission in distributed networked setups; Also belongs to KV cache compression (structural) | CacheGen |
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving [Link] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang |
KV cache placement & migration across GPUs; Also belongs to HW-aware execution (temporal) | DistServe |
| Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache [Link] Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin |
KV cache placement & migration across GPUs; Also belongs to HW-aware execution (temporal) | |
Splitwise: Efficient generative LLM inference using phase splitting [Link] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini |
KV cache placement & migration across GPUs; Also belongs to HW-aware execution (temporal) | Splitwise (integrated into vLLM) |
AttAcc! unleashing the power of PIM for batched transformer-based generative model inference [Link] Jaehyun Park, Jaewan Choi, Kwanhee Kyung, Michael Jaemin Kim, Yongsuk Kwon, Nam Sung Kim, Jung Ho Ahn |
KV cache placement & migration across GPUs and PIM devices; Also belongs to HW-aware execution (temporal) | AttAcc |
Distributed Inference and Fine-tuning of Large Language Models Over The Internet [Link] Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel |
KV cache placement & migration across GPUs; Also belongs to memory hierarchy KV orchestration | FastServe |
These methods target how KV data is represented and maintained for memory efficiency. We divide these methods into two categories: KV cache compression, and KV cache retention management.
KV cache compression methods directly reduce the memory footprint of KV caches.
Quantization compresses floating-point KV tensors into lower-precision formats. One recurring insight is asymmetric KV quantization: keys and values exhibit distinct outlier patterns and quantization sensitivities. A second insight is that outliers play a crucial role in low-bit quantization.
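A minimal PyTorch sketch of the asymmetric pattern above: per-channel scales for keys and per-token scales for values, a common asymmetric choice among the listed methods, using plain 4-bit round-to-nearest. This is illustrative and not any listed paper's exact recipe (which typically adds grouping, outlier handling, or residuals).

```python
# Minimal sketch of asymmetric KV quantization: keys quantized per channel
# (their outliers tend to be channel-aligned), values quantized per token.
import torch

def quantize(x, dim, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

k = torch.randn(1, 8, 256, 64)       # [batch, heads, tokens, head_dim]
v = torch.randn(1, 8, 256, 64)

k_q, k_scale = quantize(k, dim=-2)   # per-channel: reduce over the token axis
v_q, v_scale = quantize(v, dim=-1)   # per-token: reduce over the channel axis

k_err = (dequantize(k_q, k_scale) - k).abs().mean()
v_err = (dequantize(v_q, v_scale) - v).abs().mean()
print(f"mean abs error  K: {k_err:.4f}  V: {v_err:.4f}")
```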
| Paper | Type | Code |
|---|---|---|
NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache [Link] Donghyun Son, Euntae Choi, Sungjoo Yoo |
VQ-based method | |
Accurate KV Cache Quantization with Outlier Tokens Tracing [Link] Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang |
Mixed-precision asymmetric KV quantization | OTT |
CommVQ: Commutative Vector Quantization for KV Cache Compression [Link] Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan |
VQ-based method | CommVQ |
QServe: W4A8KV4 quantization and system co-design for efficient LLM serving [Link] Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han |
Fixed-precision quantization | OmniServe |
| SQuat: Subspace-orthogonal KV cache quantization [Link] Hao Wang, Ligong Han, Kai Xu, Akash Srivastava |
Fixed-precision asymmetric KV quantization | SQuat |
VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference [Link] Zihan Liu, Xinhao Luo, Junxian Guo, Wentao Ni, Yangjie Zhou, Yue Guan, Cong Guo, Weihao Cui, Yu Feng, Minyi Guo, Yuhao Zhu, Minjia Zhang, Jingwen Leng, Chen Jin |
VQ-based method | |
QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead [Link] Amir Zandieh, Majid Daliri, Insu Han |
Fixed-precision quantization | QJL |
ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification [Link] Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang |
Mixed-precision asymmetric KV quantization | ZipCache |
KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization [Link] Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava |
VQ-based method | |
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [Link] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami |
Mixed-precision quantization | KVQuant |
SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models [Link] Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin |
Mixed-precision quantization | SKVQ |
| GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM [Link] Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao |
Fixed-precision asymmetric KV quantization | GEAR |
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [Link] Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen |
Mixed-precision quantization | DecoQuant |
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving [Link] Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang |
Mixed-precision quantization; Also belongs to compute device KV orchestration (spatial) | CacheGen |
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [Link] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu |
Mixed-precision asymmetric KV quantization | KIVI |
Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving [Link] Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci |
Mixed-precision quantization | Atom |
| QAQ: Quality Adaptive Quantization for LLM KV Cache [Link] Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang |
Mixed-precision quantization | QAQ |
| No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization [Link] June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee |
Mixed-precision quantization | |
| WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [Link] Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie |
Mixed-precision quantization | |
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [Link] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang |
Fixed-precision quantization; Also belongs to memory hierarchy KV orchestration (spatial) | FlexLLMGen π |
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models [Link] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han |
Fixed-precision quantization | SmoothQuant π |
Low-rank compression exploits hidden-dimension redundancy by factorizing KV tensors into compact components. Methods differ in the approximation target, granularity, and rank setting.
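A minimal PyTorch sketch of the basic idea: factorize a cached key matrix with a truncated SVD at a fixed rank. Real key caches are far more compressible than the random matrix used here, and the listed methods choose targets and ranks more carefully.

```python
# Minimal sketch of low-rank KV compression: store rank-r factors A and B
# instead of the full key matrix.
import torch

def lowrank_compress(k, rank):
    # k: [tokens, head_dim]; keep A [tokens, rank] and B [rank, head_dim].
    u, s, vh = torch.linalg.svd(k, full_matrices=False)
    a = u[:, :rank] * s[:rank]          # absorb singular values into the left factor
    b = vh[:rank, :]
    return a, b

def lowrank_reconstruct(a, b):
    return a @ b

k = torch.randn(4096, 128)
a, b = lowrank_compress(k, rank=32)
ratio = (a.numel() + b.numel()) / k.numel()
err = torch.linalg.norm(lowrank_reconstruct(a, b) - k) / torch.linalg.norm(k)
print(f"storage ratio {ratio:.2f}, relative error {err:.3f}")
```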
| Paper | Type | Code |
|---|---|---|
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference [Link] Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen |
Target cached K tensors, layer-wise, fixed rank; Also belongs to memory hierarchy KV orchestration (spatial) | ShadowKV |
| ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [Link] Xianglong Yan, Zhiteng Li, Tianao Zhang, Haotong Qin, Linghe Kong, Yulun Zhang, Xiaokang Yang |
Target cached KV tensors, head-group-wise for keys and layer-wise for values, budgeted-driven rank | ReCalKV |
Palu: KV-Cache Compression with Low-Rank Projection [Link] Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S. Abdelfattah, Kai-Chiang Wu |
Target KV projection weights, head-group-wise, searched rank | Palu |
| xKV: Cross-Layer SVD for KV-Cache Compression [Link] Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed S. Abdelfattah |
Target KV tensors, layer-group-wise, fixed rank | xKV |
| Eigen Attention: Attention in Low-Rank Space for KV Cache Compression [Link] Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy |
Target QKV attention subspace, layer-wise, budget-driven rank | EigenAttn |
| LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [Link] Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen |
Target KV projection weights, layer-wise, progressive rank |
Unlike value-level compression (e.g., quantization and low-rank approximation), structural compression reduces KV memory by modifying cache organization (e.g., layer, head, channel, token).
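A minimal PyTorch sketch of one structural operation from the table below, intra-layer merging of cache entries with highly similar keys; the greedy adjacent-pair rule and mean pooling are illustrative simplifications of the listed merging methods.

```python
# Minimal sketch of structural compression via token merging: pairs of adjacent
# cache entries with highly similar keys are merged (mean-pooled) into one slot.
import torch

def merge_similar_tokens(k, v, sim_threshold=0.9):
    # k, v: [tokens, head_dim]
    keep_k, keep_v = [k[0]], [v[0]]
    for i in range(1, k.shape[0]):
        sim = torch.nn.functional.cosine_similarity(k[i], keep_k[-1], dim=0)
        if sim > sim_threshold:
            keep_k[-1] = (keep_k[-1] + k[i]) / 2   # merge into the previous slot
            keep_v[-1] = (keep_v[-1] + v[i]) / 2
        else:
            keep_k.append(k[i])
            keep_v.append(v[i])
    return torch.stack(keep_k), torch.stack(keep_v)

k = torch.randn(16, 64)
k[5] = k[4] + 0.01 * torch.randn(64)              # make one pair nearly identical
v = torch.randn(16, 64)
k2, v2 = merge_similar_tokens(k, v)
print(k.shape[0], "->", k2.shape[0], "cache slots")
```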
| Paper | Type | Code |
|---|---|---|
ClusterAttn: KV Cache Compression under Intrinsic Attention Clustering [Link] Minwei Zhang, Haifeng Sun, Jingyu Wang, Shaolong Li, Wanyi Ning, Qi Qi, Zirui Zhuang, Jianxin Liao |
Structural pruning on prompt tokens guided by attention score | |
ThinK: Thinner Key Cache by Query-Driven Pruning [Link] Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo |
Structural pruning on key channels guided by query-driven signal | ThinK |
D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models [Link] Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, Mi Zhang |
Intra-layer structural merging guided by similarity; Also belongs to eviction | D2O |
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models [Link] Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang |
Cross-layer structural merging guided by layer similarity | |
| KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [Link] Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen |
Cross-layer structural merging guided by layer dissimilarity | KVSharer |
CHAI: Clustered Head Attention for Efficient LLM Inference [Link] Saurabh Agarwal, Bilge Acun, Basil Hosmer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu |
Structural pruning on heads guided by head attention score | CHAI |
CaM: Cache Merging for Memory-efficient LLMs Inference [Link] Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, Rongrong Ji |
Intra-layer structural merging guided by attention score | CaM |
| Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks [Link] Zheng Wang, Boxiao Jin, Zhongzhi Yu, Minjia Zhang |
Intra-layer structural merging guided by key similarity |
These methods manage the retention of the KV cache during serving.
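As a concrete example of the structure-aware methods listed below, here is a minimal sketch of block-based (paged) KV allocation with a per-request block table; the block size and allocator interface are illustrative assumptions.

```python
# Minimal sketch of block-based (paged) KV allocation: logical token positions map
# to fixed-size physical blocks through a block table, so memory is allocated on
# demand and freed blocks are recycled across requests.
BLOCK_SIZE = 16

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.block_tables = {}              # request id -> list of physical block ids

    def append_token(self, rid, position):
        table = self.block_tables.setdefault(rid, [])
        if position % BLOCK_SIZE == 0:      # current block is full (or first token)
            table.append(self.free.pop())
        block = table[position // BLOCK_SIZE]
        return block, position % BLOCK_SIZE # physical slot for this token's KV entry

    def release(self, rid):
        self.free.extend(self.block_tables.pop(rid, []))

alloc = BlockAllocator(num_blocks=64)
for pos in range(40):                       # a 40-token request uses ceil(40/16)=3 blocks
    alloc.append_token("req-0", pos)
print(len(alloc.block_tables["req-0"]), "blocks in use")
alloc.release("req-0")
print(len(alloc.free), "blocks free")
```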
| Paper | Type | Code |
|---|---|---|
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [Link] Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze |
Structure-aware method; Also belongs to KV-centric scheduling (temporal) | FlashInfer π |
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention [Link] Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar |
Structure-aware method | vAttention |
| MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool [Link] Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan |
Semantics-guided method | |
SGLang: Efficient Execution of Structured Language Model Programs [Link] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng |
Structure-aware method; Also belongs to KV-centric scheduling (temporal) | SGLang π |
| FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving [Link] Ao Shen, Zhiyao Li, Mingyu Gao |
Structure-aware method; Also belongs to memory hierarchy KV orchestration (spatial) | |
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition [Link] Lu Ye, Ze Tao, Yong Huang, Yang Li |
Structure-aware method | Chunk Attention |
| vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [Link] Jiale Xu, Rui Zhang, Cong Guo, Weiming Hu, Zihan Liu, Feiyang Wu, Yu Feng, Shixuan Sun, Changxu Shao, Yuhong Guo, Junping Zhao, Ke Zhang, Minyi Guo, Jingwen Leng |
Structure-aware method | |
| LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference [Link] Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi |
Semantics-guided method | |
Prompt Cache: Modular Attention Reuse for Low-Latency Inference [Link] In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong |
Structure-aware method | Prompt Cache |
Efficient Memory Management for Large Language Model Serving with PagedAttention [Link] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica |
Structure-aware method | vllm π |
KV cache eviction discards less critical KV entries (i.e., tokens) based on certain rules.
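Many entries below combine a recency window with accumulated attention mass ("heavy hitters"). A minimal PyTorch sketch of that combined rule follows, with an illustrative flat budget rather than any listed paper's budget policy.

```python
# Minimal sketch of a "recent + heavy hitter" eviction policy: keep the most recent
# tokens plus the tokens with the largest accumulated attention scores, drop the rest.
import torch

def evict(k, v, attn_history, budget, recent=8):
    # k, v: [tokens, head_dim]; attn_history: [tokens] accumulated attention mass.
    t = k.shape[0]
    if t <= budget:
        return k, v
    recent_ids = torch.arange(t - recent, t)
    old_scores = attn_history[: t - recent]
    n_heavy = budget - recent
    heavy_ids = torch.topk(old_scores, n_heavy).indices
    keep = torch.cat([torch.sort(heavy_ids).values, recent_ids])
    return k[keep], v[keep]

t = 64
k, v = torch.randn(t, 64), torch.randn(t, 64)
attn_history = torch.rand(t)
k2, v2 = evict(k, v, attn_history, budget=24)
print(k2.shape[0], "of", t, "tokens retained")
```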
| Paper | Type | Code |
|---|---|---|
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference [Link] Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou |
Eviction policy: plug-in; budget policy: adaptive (head-wise, attention sparsity) | AdaKV |
DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs [Link] Xiabin Zhou, Wenbin Wang, Minyan Zeng, Jiaxian Guo, Xuebo Liu, Li Shen, Min Zhang, Liang Ding |
Eviction policy: recent + attention w.r.t. instruction tokens; budget policy: adaptive (layer-wise, task-aware) | |
EvolKV: Evolutionary KV Cache Compression for LLM Inference [Link] Bohan Yu, Yekun Chai |
Eviction policy: plug-in; budget policy: adaptive (layer-wise, evolutionary search) | |
DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction [Link] Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen |
Eviction policy: recent + relative significance of attention scores; budget policy: adaptive (head-wise, sparsity pattern) | |
| KVCompose: Efficient Structured KV Cache Compression with Composite Tokens [Link] Dmitry Akulov, Mohamed Sana, Antonio De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed |
Eviction policy: aggregated attention & form composite token; budget policy: adaptive (layer-wise, composite importance) | |
LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models [Link] Dachuan Shi, Yonggan Fu, Xiangchi Yuan, Zhongzhi Yu, Haoran You, Sixu Li, Xin Dong, Jan Kautz, Pavlo Molchanov, Yingyan (Celine) Lin |
Eviction policy: ladder pattern based; budget policy: preset (layer-wise, ladder) | LaCache |
SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator [Link] Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang |
Recent + sink + separator tokens | SepLLM |
D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models [Link] Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, Mi Zhang |
Eviction policy: recent + sink + H2 & recall via merging; budget policy: adaptive (layer-wise, attention density); Also belongs to structural compression | D2O |
CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences [Link] Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li |
Eviction policy: recent + mean & variance of attention scores; budget policy: adaptive (layer-wise, layer preference) | CAKE |
SnapKV: LLM Knows What You are Looking for Before Generation [Link] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen |
Observation window-based identification | SnapKV |
A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression [Link] Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini |
Key L2 norm | l2compress |
Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters [Link] Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe |
Sink + attention & value L1 norm | VATP |
Transformers are Multi-State RNNs [Link] Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, Roy Schwartz |
Drop lowest attention score token at each step | TOVA |
| BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference [Link] Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He |
Recent + sink + segmented local H2 | BUZZ |
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference [Link] Dongjie Yang, Xiaodong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao |
Eviction policy: recent + PvC (via ensemble attention); budget policy: preset (layer-wise, pyramid) | PyramidInfer |
NACL: A General and Effective KV Cache Eviction Framework for LLM at Inference Time [Link] Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu |
Attention w.r.t. proxy token & randomness | NACL |
| PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [Link] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, Wen Xiao |
Eviction policy: observation window-based identification; budget policy: preset (layer-wise, pyramid) | KVCache-Factory π |
Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference [Link] Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath |
Recent + key (Gumbel-softmax scores) | Keyformer |
Efficient Streaming Language Models with Attention Sinks [Link] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis |
Recent + sink | StreamingLLM π |
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs [Link] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao |
Hybrid (special token/punctuation/locality/H2) | FastGen π |
| On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference [Link] Siyu Ren, Kenny Q. Zhu |
Mean & standard deviation of attention scores | EasyKV |
Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time [Link] Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava |
Recent + attention scores | |
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models [Link] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher RΓ©, Clark Barrett, Zhangyang Wang, Beidi Chen |
Recent + H2 |
The figure below (behavior-behavior co-design affinity network) visualizes cross-behavior co-occurrence in the literature. Node size reflects research density; edge thickness scales with behavior co-occurrence frequency. We found that HAE-CDO is the strongest cross-dimension co-design pattern.
Please check our paper (Section 6) for more details!
The table below summarizes behavior-objective effects across the surveyed methods.
Please check our paper (Section 6) for detailed analysis!
@article{jiang2025towards,
title = {Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization},
author = {Jiang, Jiantong and Yang, Peiyu and Zhang, Rui and Liu, Feng},
journal = {Authorea Preprints},
year = {2025},
publisher = {Authorea},
url = {http://dx.doi.org/10.36227/techrxiv.176046306.66521015/v1},
doi = {10.36227/techrxiv.176046306.66521015/v1},
}



