
Awesome KV Cache Optimization

DOI

This repository is for our survey paper:

Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization
Jiantong Jiang1, Peiyu Yang1, Rui Zhang2, Feng Liu1
1The University of Melbourne, 2Huazhong University of Science and Technology

This repository collects papers on system-aware, serving-time, KV-centric optimization methods that improve system metrics without retraining or architecture modification (a scope we call sKis). We systematize recent advances through a distinct system behavior-oriented taxonomy, organizing existing efforts along three behavioral dimensions:
🔷 Temporal — when is KV cache accessed or computed?
🔷 Spatial — where is KV cache placed and migrated?
🔷 Structural — how is KV cache represented and managed?

🧠 Grounded in this taxonomy, we analyze cross-behavior co-design affinity and behavior–objective effects, revealing overlooked regions and concrete open challenges.

Contributing

The survey and the repository are still under active development and will be updated regularly.

🙋 If you would like to include your paper in this survey and repository, please feel free to submit a pull request. You can generate the markdown row for each paper by filling in the first part of generate.py and running python generate.py. Alternatively, you can open an issue with the paper's title and a brief summary highlighting its key techniques. You can also contact us via email.

πŸ™‹πŸ»β€β™€οΈ Please let us know if you find out a mistake or have any suggestions! We greatly appreciate your feedback regarding this repository or survey!

🌟 If you find this resource helpful for your work, please consider giving us a star and citing our research.


Quick Index


Temporal — Execution & Scheduling

These methods act on when KV-related work is computed or scheduled to improve latency and throughput. We divide them into three categories: KV-centric scheduling, pipelining & overlapping, and hardware-aware execution.

KV-Centric Scheduling

KV-centric scheduling methods explicitly integrate KV characteristics into runtime scheduling decisions.
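
As a toy illustration of the token-level variant of this idea (a hedged sketch of the general pattern, not any specific paper's algorithm; tensor shapes and the token budget are assumptions for the example), the snippet below scores cached keys against the current query and runs exact attention only over a top-k subset of tokens:

```python
import torch

def select_kv_subset(q, k_cache, v_cache, budget):
    """Score cached keys against the current query and keep only the
    top-`budget` tokens for exact attention (toy token-level selection)."""
    # q: (head_dim,); k_cache / v_cache: (n_tokens, head_dim)
    approx_scores = k_cache @ q                                   # cheap proxy for attention weights
    top_idx = torch.topk(approx_scores, k=min(budget, k_cache.size(0))).indices
    return k_cache[top_idx], v_cache[top_idx]

def sparse_attention_step(q, k_cache, v_cache, budget=256):
    k_sel, v_sel = select_kv_subset(q, k_cache, v_cache, budget)
    attn = torch.softmax((k_sel @ q) / k_sel.size(-1) ** 0.5, dim=0)
    return attn @ v_sel                                           # attention over the selected subset only
```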

Paper Type Code
Publish
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection [Link]
Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Tianfu Wang, Kun Fu, Zheng Wang, Hui Xiong
Token-level attention compute scheduling stars

TokenSelect
Publish
RefreshKV: Updating Small KV Cache During Long-form Generation [Link]
Fangyuan Xu, Tanya Goyal, Eunsol Choi
Token-level attention compute scheduling stars

RefreshKV
Publish
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression [Link]
Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov
Token-level attention compute scheduling stars

RocketKV
Publish Award
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [Link]
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze
Kernel-level workload scheduling across CUDA thread blocks; Also belongs to allocation & reuse (structural) stars

FlashInfer 🌟
Publish Award
Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot [Link]
Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu
KV reuse-aware request-level scheduling; Also belongs to HW-aware execution stars

Mooncake 🌟
Publish
Loki: Low-rank Keys for Efficient Sparse Attention [Link]
Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele
Token-level attention compute scheduling stars

Loki
Publish
SGLang: Efficient Execution of Structured Language Model Programs [Link]
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng
KV reuse-aware request-level scheduling; Also belongs to allocation & reuse (structural) stars

SGLang 🌟
Publish
LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism [Link]
Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, Xin Jin
KV usage-aware request-level scheduling stars

LoongServe
Fast Inference for Augmented Large Language Models [Link]
Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, Michael Mitzenmacher
KV usage-aware request-level scheduling
LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management [Link]
Yi Xiong, Hao Wu, Changxu Shao, Ziqing Wang, Rui Zhang, Yuhong Guo, Junping Zhao, Ke Zhang, Zhenxuan Pan
KV usage-aware request-level scheduling; Also belongs to memory hierarchy KV orchestration (spatial)
Publish
SparQ Attention: Bandwidth-Efficient LLM Inference [Link]
Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr
Token-level attention compute scheduling stars

SparQ Attention
Publish
QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference [Link]
Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han
Token-level attention compute scheduling stars

Quest
Publish
MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving [Link]
Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang
KV usage-aware request-level scheduling; Also belongs to HW-aware execution stars

MuxServe
Publish
Preble: Efficient Distributed Prompt Scheduling for LLM Serving [Link]
Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang
KV reuse-aware request-level scheduling stars

Preble
Inference without interference: Disaggregate LLM inference for mixed downstream workloads [Link]
Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan
KV usage-aware request-level scheduling; Also belongs to HW-aware execution

↑ Back to Index ↑

Pipelining & Overlapping

Pipelining and overlapping methods hide latency by concurrently executing KV-related compute, communication, and I/O. They are often embedded in broader serving systems.
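
A minimal sketch of the overlapping idea, assuming PyTorch, a CUDA device, per-layer KV tensors offloaded to pinned CPU memory (`kv_cpu`), and layers callable as `layer(x, kv)`; these names are illustrative, and real systems overlap at much finer granularity:

```python
import torch

def overlapped_decode(layers, kv_cpu, x):
    """Toy pipelining sketch: prefetch the next layer's offloaded KV cache to
    the GPU on a side stream while the current layer computes."""
    copy_stream = torch.cuda.Stream()
    kv_gpu = kv_cpu[0].to("cuda", non_blocking=True)      # kv_cpu: list of pinned per-layer KV tensors
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            with torch.cuda.stream(copy_stream):           # issue next layer's copy on the side stream
                next_kv = kv_cpu[i + 1].to("cuda", non_blocking=True)
        x = layer(x, kv_gpu)                               # compute on the default stream overlaps the copy
        if i + 1 < len(layers):
            torch.cuda.current_stream().wait_stream(copy_stream)   # ensure the prefetch has landed
            kv_gpu = next_kv
    return x
```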

Paper Type Code
Publish
KVPR: Efficient LLM inference with i/o-aware KV cache partial recomputation [Link]
Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram
GPU KV recompute ↔ KV transfer (CPU↔GPU) stars

KVPR
PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving [Link]
Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli
GPU KV prefetch (HBM→L2) ↔ GPU collective communication; Also belongs to memory hierarchy KV orchestration (spatial)
Publish
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference [Link]
Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu
CPU attention compute ↔ GPU linear ops; Also belongs to HW-aware execution stars

NEO
Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching [Link]
Yanhao Dong, Yubo Miao, Weinan Li, Xiao Zheng, Chao Wang, Feng Lyu
GPU KV prefetch (HBM→L2) ↔ GPU attention compute; Also belongs to memory hierarchy KV orchestration (spatial)
Publish
Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention [Link]
Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo
KV load/store (CPU↔GPU) ↔ GPU compute; Also belongs to memory hierarchy KV orchestration (spatial)
FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines [Link]
Jiaao He, Jidong Zhai
CPU R-part compute ↔ GPU S-part compute; Also belongs to HW-aware execution
Publish
Improving Throughput-Oriented LLM Inference with CPU Computations [Link]
Daon Park, Bernhard Egger
CPU MHSA compute ↔ FFN data transfer (CPU→GPU); Also belongs to HW-aware execution GitLab
Heterogen

↑ Back to Index ↑

Hardware-aware Execution

Hardware-aware execution methods adapt KV cache-related operations to the underlying heterogeneous hardware.

Disaggregated Inference

Disaggregated inference separates the heterogeneous computation phases of LLM inference and maps them to distinct hardware resources to reduce interference and improve utilization.

Paper Type Code
Publish Award
Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot [Link]
Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu
PD disaggregation; Also belongs to KV-centric scheduling stars

Mooncake 🌟
Publish
DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving [Link]
Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic
Decoupling prefill and decode to different GPUs; Also belongs to memory hierarchy KV orchestration (spatial) stars

DéjàVu
Publish
MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving [Link]
Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang
Colocating PD jobs of multiple LLMs within each GPU via SM partitioning; Also belongs to KV-centric scheduling stars

MuxServe
Publish
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving [Link]
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang
Decoupling prefill and decode to different GPUs; Also belongs to compute device KV orchestration (spatial) stars

DistServe
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache [Link]
Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin
Disaggregating at the operator level; Also belongs to compute device KV orchestration (spatial)
Publish
Splitwise: Efficient generative LLM inference using phase splitting [Link]
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini
Decoupling prefill and decode to different GPUs; Also belongs to compute device KV orchestration (spatial)

Splitwise (integrated into vLLM)
Inference without interference: Disaggregate LLM inference for mixed downstream workloads [Link]
Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan
Decoupling prefill and decode to different GPUs; Also belongs to KV-centric scheduling

Compute Offloading

Compute offloading relocates part of the computation to auxiliary devices to reduce GPU bottlenecks, exploiting hardware heterogeneity and workload characteristics.
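
As a rough sketch of the CPU-offloading flavor (assumptions: PyTorch, a single query vector, and a KV cache already resident in CPU DRAM; not any specific paper's design), the point is that only small activations cross the PCIe link while the memory-bound attention stays next to the large cache:

```python
import torch

def cpu_offloaded_attention(q_gpu, k_cpu, v_cpu):
    """Toy compute-offloading sketch: the KV cache lives in CPU DRAM and the
    memory-bound attention runs on CPU cores; only tiny tensors cross PCIe."""
    # q_gpu: (head_dim,) on GPU; k_cpu / v_cpu: (n_tokens, head_dim) in CPU DRAM
    q = q_gpu.cpu()                                         # ship the small query, not the large KV
    attn = torch.softmax((q @ k_cpu.T) / k_cpu.size(-1) ** 0.5, dim=-1)
    out = attn @ v_cpu
    return out.to(q_gpu.device)                             # send the small output back to the GPU
```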

Paper Type Code
Publish
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference [Link]
Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu
CPU offloading (attention and KV caches); Also belongs to pipelining & overlapping stars

NEO
Publish Award
MagicPIG: LSH Sampling for Efficient LLM Generation [Link]
Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen
CPU offloading (attention and retrieval) stars

MagicPIG
Publish
PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System [Link]
Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gómez-Luna, Huawei Li, Xiaowei Li, Ying Wang, Onur Mutlu
PIM-based offloading ASPLOS
Publish
TwinPilots: A New Computing Paradigm for GPU-CPU Parallel LLM Inference [Link]
Chengye Yu, Tianyu Wang, Zili Shao, Linjie Zhu, Xu Zhou, Song Jiang
CPU offloading
InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference [Link]
Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang
CSD Offloading; Also belongs to compute device KV orchestration (spatial)
Publish
AttAcc! unleashing the power of PIM for batched transformer-based generative model inference [Link]
Jaehyun Park, Jaewan Choi, Kwanhee Kyung, Michael Jaemin Kim, Yongsuk Kwon, Nam Sung Kim, Jung Ho Ahn
PIM-based offloading; Also belongs to compute device KV orchestration (spatial) stars

AttAcc
FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines [Link]
Jiaao He, Jidong Zhai
CPU offloading (attention and KV caches); Also belongs to pipelining & overlapping
Publish
Improving Throughput-Oriented LLM Inference with CPU Computations [Link]
Daon Park, Bernhard Egger
CPU offloading with dynamic GPU-CPU division; Also belongs to pipelining & overlapping GitLab
Heterogen

↑ Back to Index ↑


Spatial — Placement & Migration

These works optimize where KV data is stored or transferred to balance memory and I/O pressure. We divide these methods into two categories: memory hierarchy KV orchestration and compute device KV orchestration.

Memory Hierarchy KV Orchestration

Memory hierarchy KV orchestration methods distribute KV caches across the memory hierarchy.

Cross-device Memory Hierarchy

These methods manage KV caches across fast but limited GPU HBM and larger but slower alternatives such as CPU DRAM or SSD.
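
A toy sketch of the importance-aware flavor, assuming PyTorch, a CUDA device, and an externally supplied per-token importance signal (e.g., accumulated attention); real systems operate on blocks/pages rather than single tokens and overlap the copies with compute:

```python
import torch

class TieredKVCache:
    """Toy two-tier KV store: 'hot' tokens live in GPU HBM, the rest are
    spilled to pinned CPU DRAM and recalled when they become important."""

    def __init__(self, k, v, gpu_budget):
        # k, v: (n_tokens, head_dim) full caches; keep only a budgeted subset on GPU.
        self.k_cpu, self.v_cpu = k.cpu().pin_memory(), v.cpu().pin_memory()
        self.gpu_budget = gpu_budget
        hot = torch.arange(min(gpu_budget, k.size(0)))
        self.hot_idx = hot
        self.k_gpu, self.v_gpu = self.k_cpu[hot].cuda(), self.v_cpu[hot].cuda()

    def update_hot_set(self, importance):
        """Re-rank tokens by an importance signal (e.g., accumulated attention)
        and pull the top-ranked tokens into the GPU tier."""
        k = min(self.gpu_budget, importance.numel())
        self.hot_idx = torch.topk(importance, k=k).indices.cpu()
        self.k_gpu = self.k_cpu[self.hot_idx].to("cuda", non_blocking=True)
        self.v_gpu = self.v_cpu[self.hot_idx].to("cuda", non_blocking=True)
```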

Paper Type Code
Publish
KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows [Link]
Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, Yipeng Shen, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, Yufei Ding
Importance-aware methods
Publish
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval [Link]
Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu
Importance-aware methods stars

RetrievalAttention
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference [Link]
Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, Junchen Jiang
System cost-driven decision; Also belongs to compute device KV orchestration stars

LMCache 🌟
Publish
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation [Link]
Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin
System cost-driven decision
SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning [Link]
Lingkun Long, Rubing Yang, Yushi Huang, Desheng Hui, Ao Zhou, Jianlei Yang
Importance-aware methods
Publish
SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs [Link]
Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han
Importance-aware methods
Publish Award
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference [Link]
Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen
Importance-aware methods; Also belongs to KV cache compression (structural) stars

ShadowKV
Publish
PQCache: Product Quantization-based KVCache for Long Context LLM Inference [Link]
Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui
Importance-aware methods stars

PQCache
Publish
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression [Link]
Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, Minyi Guo
Importance-aware methods stars

ClusterKV
Publish
Stateful Large Language Model Serving with Pensieve [Link]
Lingfan Yu, Jinkun Lin, Jinyang Li
Importance-aware methods
Publish
IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language Model Inference [Link]
Weijian Chen, Shuibing He, Haoyang Qu, Ruidong Zhang, Siling Yang, Ping Chen, Yi Zheng, Baoxing Huai, Gang Chen
Importance-aware methods
Publish
ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction [Link]
Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, Yun Liang
Importance-aware methods stars

ArkVale
Publish
InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [Link]
Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun
Importance-aware methods stars

InfLLM
FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving [Link]
Ao Shen, Zhiyao Li, Mingyu Gao
System cost-driven decision; Also belongs to allocation & reuse (structural)
LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management [Link]
Yi Xiong, Hao Wu, Changxu Shao, Ziqing Wang, Rui Zhang, Yuhong Guo, Junping Zhao, Ke Zhang, Zhenxuan Pan
System cost-driven decision; Also belongs to KV-centric scheduling (temporal)
Publish
DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving [Link]
Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic
System cost-driven decision; Also belongs to HW-aware execution (temporal) stars

DéjàVu
Publish
Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention [Link]
Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo
System cost-driven decision; Also belongs to pipelining & overlapping (temporal)
Publish
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [Link]
Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim
Importance-aware methods stars

InfiniGen
Publish
ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching [Link]
Youpeng Zhao, Di Wu, Jun Wang
Importance-aware methods
Publish
Distributed Inference and Fine-tuning of Large Language Models Over The Internet [Link]
Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel
System cost-driven decision; Also belongs to compute device KV orchestration stars

FastServe
Publish
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [Link]
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
System cost-driven decision; Also belongs to KV cache compression (structural) stars

FlexLLMGen 🌟

Intra-GPU Memory Hierarchy

These methods migrate KV entries between on-chip L1/L2 caches and off-chip HBM to hide latency.

Paper Type Code
PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving [Link]
Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli
Prefetching KV caches from HBM to the L2 cache; Also belongs to pipelining & overlapping (temporal)
Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching [Link]
Yanhao Dong, Yubo Miao, Weinan Li, Xiao Zheng, Chao Wang, Feng Lyu
Prefetching KV caches from HBM to the L2 cache; Also belongs to pipelining & overlapping (temporal)

↑ Back to Index ↑

Compute Device KV Orchestration

Compute device KV orchestration methods place and move KV caches across compute-capable devices such as GPUs, CPUs, and storage-attached processors to enable distributed or heterogeneous serving.

Paper Type Code
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference [Link]
Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, Junchen Jiang
KV cache placement & migration across GPUs; Also belongs to memory hierarchy KV orchestration stars

LMCache 🌟
InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference [Link]
Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang
KV cache placement & migration across GPUs and CSDs; Also belongs to HW-aware execution (temporal)
Publish
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving [Link]
Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang
Remote KV cache transmission in distributed networked setups; Also belongs to KV cache compression (structural) stars

CacheGen
Publish
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving [Link]
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang
KV cache placement & migration across GPUs; Also belongs to HW-aware execution (temporal) stars

DistServe
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache [Link]
Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin
KV cache placement & migration across GPUs; Also belongs to HW-aware execution (temporal)
Publish
Splitwise: Efficient generative LLM inference using phase splitting [Link]
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini
KV cache placement & migration across GPUs; Also belongs to HW-aware execution (temporal)

Splitwise (integrated into vLLM)
Publish
AttAcc! unleashing the power of PIM for batched transformer-based generative model inference [Link]
Jaehyun Park, Jaewan Choi, Kwanhee Kyung, Michael Jaemin Kim, Yongsuk Kwon, Nam Sung Kim, Jung Ho Ahn
KV cache placement & migration across GPUs and PIM devices; Also belongs to HW-aware execution (temporal) stars

AttAcc
Publish
Distributed Inference and Fine-tuning of Large Language Models Over The Internet [Link]
Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel
KV cache placement & migration across GPUs; Also belongs to memory hierarchy KV orchestration stars

FastServe

↑ Back to Index ↑


Structural — Representation & Retention

These methods target how KV data is represented and maintained for memory efficiency. We divide these methods into two categories: KV cache compression and KV cache retention management.

KV Cache Compression

KV cache compression methods directly reduce the size of the KV cache.

Quantization

Quantization compresses floating-point KV tensors into lower-precision formats. One recurring insight is asymmetric KV quantization: keys and values exhibit distinct outlier patterns and quantization sensitivities. A second insight is that outliers play a crucial role in low-bit quantization.
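
A minimal sketch of the asymmetric idea, assuming PyTorch; the bit width and the per-channel-key / per-token-value split follow the common observation in this line of work (e.g., KIVI-style schemes) and are illustrative rather than any paper's exact recipe:

```python
import torch

def quantize(x, dim, n_bits=2):
    """Uniform asymmetric (zero-point) quantization with groups along `dim`."""
    qmax = 2 ** n_bits - 1
    xmin = x.amin(dim=dim, keepdim=True)
    scale = (x.amax(dim=dim, keepdim=True) - xmin).clamp(min=1e-6) / qmax
    q = ((x - xmin) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, xmin                      # dequantize later as q * scale + xmin

# Keys tend to show channel-wise outliers while values do not, so the two are
# quantized along different axes (the asymmetric-KV observation, illustrated).
k = torch.randn(1024, 128)                     # (n_tokens, head_dim)
v = torch.randn(1024, 128)
k_q, k_scale, k_zero = quantize(k, dim=0)      # per-channel groups (reduce over tokens)
v_q, v_scale, v_zero = quantize(v, dim=1)      # per-token groups (reduce over channels)
```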

Paper Type Code
Publish
NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache [Link]
Donghyun Son, Euntae Choi, Sungjoo Yoo
VQ-based method
Publish
Accurate KV Cache Quantization with Outlier Tokens Tracing [Link]
Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
Mixed-precision asymmetric KV quantization

OTT
Publish
CommVQ: Commutative Vector Quantization for KV Cache Compression [Link]
Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan
VQ-based method

CommVQ
Publish
QServe: W4A8KV4 quantization and system co-design for efficient LLM serving [Link]
Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han
Fixed-precision quantization

OmniServe
SQuat: Subspace-orthogonal KV cache quantization [Link]
Hao Wang, Ligong Han, Kai Xu, Akash Srivastava
Fixed-precision asymmetric KV quantization

SQuat
Publish
VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference [Link]
Zihan Liu, Xinhao Luo, Junxian Guo, Wentao Ni, Yangjie Zhou, Yue Guan, Cong Guo, Weihao Cui, Yu Feng, Minyi Guo, Yuhao Zhu, Minjia Zhang, Jingwen Leng, Chen Jin
VQ-based method
Publish
QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead [Link]
Amir Zandieh, Majid Daliri, Insu Han
Fixed-precision quantization

QJL
Publish
ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification [Link]
Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang
Mixed-precision asymmetric KV quantization

ZipCache
Publish
KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization [Link]
Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava
VQ-based method
Publish
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [Link]
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
Mixed-precision quantization

KVQuant
Publish
SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models [Link]
Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin
Mixed-precision quantization

SKVQ
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM [Link]
Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao
Fixed-precision asymmetric KV quantization stars

GEAR
Publish
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [Link]
Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen
Mixed-precision quantization

DecoQuant
Publish
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving [Link]
Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang
Mixed-precision quantization; Also belongs to compute device KV orchestration (spatial)

CacheGen
Publish
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [Link]
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
Mixed-precision asymmetric KV quantization

KIVI
Publish
Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving [Link]
Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci
Mixed-precision quantization

Atom
QAQ: Quality Adaptive Quantization for LLM KV Cache [Link]
Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang
Mixed-precision quantization

QAQ
No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization [Link]
June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee
Mixed-precision quantization
WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [Link]
Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie
Mixed-precision quantization
Publish
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [Link]
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
Fixed-precision quantization; Also belongs to memory hierarchy KV orchestration (spatial)

FlexLLMGen 🌟
Publish
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models [Link]
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han
Fixed-precision quantization

SmoothQuant 🌟

Low-rank Approximation

Low-rank compression exploits hidden-dimension redundancy by factorizing KV tensors into compact components. Methods differ in the approximation target, granularity, and rank setting.
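
A minimal sketch of the core operation, assuming PyTorch and taking cached keys as the approximation target with a fixed rank; real methods differ in what they factorize (projection weights vs. cached tensors) and how they choose the rank:

```python
import torch

def low_rank_keys(k_cache, rank):
    """Toy low-rank factorization of a cached key tensor K ~= A @ B,
    storing (n x r) + (r x d) values instead of (n x d)."""
    # k_cache: (n_tokens, head_dim)
    U, S, Vh = torch.linalg.svd(k_cache, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # (n_tokens, rank)
    B = Vh[:rank]                     # (rank, head_dim)
    return A, B                       # reconstruct on demand as A @ B
```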

Paper Type Code
Publish Award
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference [Link]
Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen
Target cached K tensors, layer-wise, fixed rank; Also belongs to memory hierarchy KV orchestration (spatial)

ShadowKV
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [Link]
Xianglong Yan, Zhiteng Li, Tianao Zhang, Haotong Qin, Linghe Kong, Yulun Zhang, Xiaokang Yang
Target cached KV tensors, head-group-wise for keys and layer-wise for values, budgeted-driven rank

ReCalKV
Publish
Palu: KV-Cache Compression with Low-Rank Projection [Link]
Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S. Abdelfattah, Kai-Chiang Wu
Target KV projection weights, head-group-wise, searched rank

Palu
xKV: Cross-Layer SVD for KV-Cache Compression [Link]
Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed S. Abdelfattah
Target KV tensors, layer-group-wise, fixed rank

xKV
Eigen Attention: Attention in Low-Rank Space for KV Cache Compression [Link]
Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy
Target QKV attention subspace, layer-wise, budget-driven rank

EigenAttn
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [Link]
Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen
Target KV projection weights, layer-wise, progressive rank

Structural Compression

Unlike value-level compression (e.g., quantization and low-rank approximation), structural compression reduces KV memory by modifying the cache's organization at the layer, head, channel, or token level.
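
As a toy example on the token axis (assuming PyTorch and an externally supplied importance signal; plain averaging is a simplification of the weighted merging used in practice), the sketch below folds the least important token into its most similar neighbor instead of dropping it outright:

```python
import torch

def merge_least_important_token(k, v, importance):
    """Toy intra-layer merging: fold the least important token's KV entry
    into its most similar neighbor (by key cosine similarity) via averaging."""
    k, v = k.clone(), v.clone()                           # avoid mutating the caller's cache
    victim = importance.argmin()
    sim = torch.nn.functional.cosine_similarity(k[victim].unsqueeze(0), k, dim=-1)
    sim[victim] = -1.0                                    # never merge a token with itself
    target = sim.argmax()
    k[target] = (k[target] + k[victim]) / 2               # merge, then drop the victim entry
    v[target] = (v[target] + v[victim]) / 2
    keep = torch.arange(k.size(0)) != victim
    return k[keep], v[keep]
```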

Paper Type Code
Publish
ClusterAttn: KV Cache Compression under Intrinsic Attention Clustering [Link]
Minwei Zhang, Haifeng Sun, Jingyu Wang, Shaolong Li, Wanyi Ning, Qi Qi, Zirui Zhuang, Jianxin Liao
Structural pruning on prompt tokens guided by attention score
Publish Award
ThinK: Thinner Key Cache by Query-Driven Pruning [Link]
Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo
Structural pruning on key channels guided by query-driven signal

ThinK
Publish
D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models [Link]
Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, Mi Zhang
Intra-layer structural merging guided by similarity; Also belongs to eviction

D2O
Publish
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models [Link]
Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang
Cross-layer structural merging guided by layer similarity
KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [Link]
Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen
Cross-layer structural merging guided by layer dissimilarity

KVSharer
Publish
CHAI: Clustered Head Attention for Efficient LLM Inference [Link]
Saurabh Agarwal, Bilge Acun, Basil Hosmer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu
Structural pruning on heads guided by head attention score

CHAI
Publish
CaM: Cache Merging for Memory-efficient LLMs Inference [Link]
Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, Rongrong Ji
Intra-layer structural merging guided by attention score

CaM
Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks [Link]
Zheng Wang, Boxiao Jin, Zhongzhi Yu, Minjia Zhang
Intra-layer structural merging guided by key similarity

↑ Back to Index ↑

KV Cache Retention Management

These methods manage the retention of the KV cache during serving.

Allocation & Reuse
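
Allocation & reuse methods organize KV memory into units that can be allocated on demand and shared across requests (e.g., paged blocks and reusable prefixes). A minimal, PagedAttention-flavored sketch of the block-table bookkeeping (pure-Python bookkeeping only; actual tensor storage, prefix sharing, and copy-on-write are omitted, and all names here are illustrative):

```python
class PagedKVAllocator:
    """Toy paged KV allocator: KV memory is carved into fixed-size physical
    blocks; each sequence keeps a block table instead of a contiguous slab."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}     # seq_id -> list of physical block ids
        self.num_tokens = {}       # seq_id -> tokens written so far

    def append_token(self, seq_id):
        n = self.num_tokens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:                  # last block is full (or first token)
            table.append(self.free_blocks.pop())      # grab a free physical block
        self.num_tokens[seq_id] = n + 1
        return table[-1], n % self.block_size         # physical (block id, in-block offset)

    def free(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)
```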

Paper Type Code
Publish Award
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [Link]
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze
Structure-aware method; Also belongs to KV-centric scheduling (temporal)

FlashInfer 🌟
Publish
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention [Link]
Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar
Structure-aware method

vAttention
MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool [Link]
Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan
Semantics-guided method
Publish
SGLang: Efficient Execution of Structured Language Model Programs [Link]
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng
Structure-aware method; Also belongs to KV-centric scheduling (temporal)

SGLang 🌟
FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving [Link]
Ao Shen, Zhiyao Li, Mingyu Gao
Structure-aware method; Also belongs to memory hierarchy KV orchestration (spatial)
Publish
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition [Link]
Lu Ye, Ze Tao, Yong Huang, Yang Li
Structure-aware method

Chunk Attention
vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [Link]
Jiale Xu, Rui Zhang, Cong Guo, Weiming Hu, Zihan Liu, Feiyang Wu, Yu Feng, Shixuan Sun, Changxu Shao, Yuhong Guo, Junping Zhao, Ke Zhang, Minyi Guo, Jingwen Leng
Structure-aware method
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference [Link]
Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi
Semantics-guided method
Publish
Prompt Cache: Modular Attention Reuse for Low-Latency Inference [Link]
In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong
Structure-aware method

Prompt Cache
Publish
Efficient Memory Management for Large Language Model Serving with PagedAttention [Link]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
Structure-aware method

vllm 🌟

Eviction

KV cache eviction discards less critical KV entries (i.e., tokens) according to rules such as recency, attention-based importance, or preset per-layer budgets.
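
A toy sketch combining two recurring ingredients from the table below, a recency window plus accumulated-attention "heavy hitters" (assuming PyTorch and an externally maintained `acc_attention` score per cached token; the budgets are illustrative, not any paper's setting):

```python
import torch

def evict(k, v, acc_attention, budget, recent=32):
    """Toy 'recent + heavy hitter' eviction: always keep the most recent tokens,
    then fill the remaining budget with the most-attended older tokens."""
    n = k.size(0)
    keep_recent = torch.arange(max(n - recent, 0), n)          # the recency window is always kept
    older_scores = acc_attention[: max(n - recent, 0)]
    n_hh = max(budget - keep_recent.numel(), 0)
    heavy = torch.topk(older_scores, k=min(n_hh, older_scores.numel())).indices
    keep = torch.cat([heavy, keep_recent]).sort().values       # preserve original token order
    return k[keep], v[keep]
```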

Paper Type Code
Publish
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference [Link]
Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou
Eviction policy: plug-in; budget policy: adaptive (head-wise, attention sparsity)

AdaKV
Publish
DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs [Link]
Xiabin Zhou, Wenbin Wang, Minyan Zeng, Jiaxian Guo, Xuebo Liu, Li Shen, Min Zhang, Liang Ding
Eviction policy: recent + attention w.r.t. instruction tokens; budget policy: adaptive (layer-wise, task-aware)
Publish
EvolKV: Evolutionary KV Cache Compression for LLM Inference [Link]
Bohan Yu, Yekun Chai
Eviction policy: plug-in; budget policy: adaptive (layer-wise, evolutionary search)
Publish
DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction [Link]
Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen
Eviction policy: recent + relative significance of attention scores; budget policy: adaptive (head-wise, sparsity pattern)
KVCompose: Efficient Structured KV Cache Compression with Composite Tokens [Link]
Dmitry Akulov, Mohamed Sana, Antonio De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed
Eviction policy: aggregated attention & form composite token; budget policy: adaptive (layer-wise, composite importance)
Publish
LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models [Link]
Dachuan Shi, Yonggan Fu, Xiangchi Yuan, Zhongzhi Yu, Haoran You, Sixu Li, Xin Dong, Jan Kautz, Pavlo Molchanov, Yingyan (Celine) Lin
Eviction policy: ladder pattern based; budget policy: preset (layer-wise, ladder)

LaCache
Publish
SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator [Link]
Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang
Recent + sink + separator tokens

SepLLM
Publish
D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models [Link]
Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, Mi Zhang
Eviction policy: recent + sink + H2 & recall via merging; budget policy: adaptive (layer-wise, attention density); Also belongs to structural compression

D2O
Publish
CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences [Link]
Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li
Eviction policy: recent + mean & variance of attention scores; budget policy: adaptive (layer-wise, layer preference)

CAKE
Publish
SnapKV: LLM Knows What You are Looking for Before Generation [Link]
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen
Observation window-based identification

SnapKV
Publish
A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression [Link]
Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini
Key L2 norm

l2compress
Publish
Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters [Link]
Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe
Sink + attention & value L1 norm

VATP
Publish
Transformers are Multi-State RNNs [Link]
Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, Roy Schwartz
Drop lowest attention score token at each step

TOVA
BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference [Link]
Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He
Recent + sink + segmented local H2

BUZZ
Publish
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference [Link]
Dongjie Yang, Xiaodong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao
Eviction policy: recent + PvC (via ensemble attention); budget policy: preset (layer-wise, pyramid)

PyramidInfer
Publish
NACL: A General and Effective KV Cache Eviction Framework for LLM at Inference Time [Link]
Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu
Attention w.r.t. proxy token & randomness

NACL
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [Link]
Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, Wen Xiao
Eviction policy: observation window-based identification; budget policy: preset (layer-wise, pyramid)

KVCache-Factory 🌟
Publish
Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference [Link]
Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath
Recent + key (Gumbel-softmax scores)

Keyformer
Publish
Efficient Streaming Language Models with Attention Sinks [Link]
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis
Recent + sink

StreamingLLM 🌟
Publish Award
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs [Link]
Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao
Hybrid (special token/punctuation/locality/H2)

FastGen 🌟
On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference [Link]
Siyu Ren, Kenny Q. Zhu
Mean & standard deviation of attention scores

EasyKV
Publish
Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time [Link]
Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava
Recent + attention scores
Publish
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models [Link]
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher RΓ©, Clark Barrett, Zhangyang Wang, Beidi Chen
Recent + H2

↑ Back to Index ↑


Cross-behavior Co-design Affinity

The figure below (behavior-behavior co-design affinity network) visualizes cross-behavior co-occurrence in the literature. Node size reflects research density; edge thickness scales with co-occurrence frequency. We find that hardware-aware execution (HAE) paired with compute device KV orchestration (CDO) is the strongest cross-dimension co-design pattern.

Please check our paper (Section 6) for more details!

↑ Back to Index ↑


Behavior-objective Effects

The table below (behavior $\times$ objective matrix) marks each behavior's impact on each serving objective as direct (●) or indirect (○); a star (★) flags direct cells where $\geq 70\%$ of papers report such gains. Side bars show research density per row and column. Objectives cover latency, throughput, GPU memory, interconnect I/O, and energy. We also include quality impact ($\downarrow$) to capture accuracy degradation as a trade-off.

Please check our paper (Section 6) for detailed analysis!

↑ Back to Index ↑

Citation

@article{jiang2025towards,
  title = {Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization},
  author = {Jiang, Jiantong and Yang, Peiyu and Zhang, Rui and Liu, Feng},
  journal = {Authorea Preprints},
  year = {2025},
  publisher = {Authorea},
  url = {http://dx.doi.org/10.36227/techrxiv.176046306.66521015/v1},
  doi = {10.36227/techrxiv.176046306.66521015/v1},
}

Contributors

Star History

Star History Chart
