This repository is for our survey paper:
Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization
Jiantong Jiang1, Peiyu Yang1, Rui Zhang2, Feng Liu1
1The University of Melbourne, 2Huazhong University of Science and Technology
This repository records papers on system-aware, serving-time, KV-centric optimization methods that improve system metrics without retraining or architecture modification (we refer to this scope as sKis). We systematize recent advances through a distinct system behavior-oriented taxonomy, organizing existing efforts into three behavioral dimensions:
- Temporal – when is the KV cache accessed or computed?
- Spatial – where is the KV cache placed and migrated?
- Structural – how is the KV cache represented and managed?

Grounded in this taxonomy, we analyze cross-behavior co-design affinity and behavior-objective effects, revealing overlooked regions and concrete open challenges.
The survey and the repository are still under active development and will be updated regularly.
If you would like to include your paper in this survey and repository, please feel free to submit a pull request. You can generate the markdown row for each paper by filling in the first part of `generate.py` and running `python generate.py`. Alternatively, you can open an issue with the paper's title and a brief summary highlighting its key techniques. You can also contact us via email.
Please let us know if you find a mistake or have any suggestions! We greatly appreciate your feedback on this repository and the survey!
If you find this resource helpful for your work, please consider giving us a star and citing our research.
- Temporal – Execution & Scheduling
- Spatial – Placement & Migration
- Structural – Representation & Retention
- KV Cache Compression (KVCC) (including quantization, low-rank approximation, and structural compression)
- KV Cache Retention Management (KVRM) (including allocation, reuse, and eviction)
- Cross-behavior Co-design Affinity
- Behavior-objective Effects
These methods act on when KV-related work is computed, scheduled, and executed, in order to improve latency and throughput. We divide these methods into three categories: KV-centric scheduling, pipelining & overlapping, and hardware-aware execution.
KV-centric scheduling methods explicitly integrate KV characteristics into runtime scheduling decisions.
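As a concrete illustration of one pattern from the table below (KV reuse-aware request-level scheduling), here is a minimal Python sketch that prioritizes requests whose prompts hit a shared prefix cache. The `PrefixCache` and `Request` helpers are hypothetical and not taken from any listed system.

```python
# Minimal sketch of KV reuse-aware request scheduling: rank waiting requests
# by how many prompt tokens already have KV entries in a shared prefix cache.
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_tokens: list  # token ids

class PrefixCache:
    """Maps cached prompt prefixes (as tuples of token ids) to stored KV blocks."""
    def __init__(self):
        self._prefixes = set()

    def insert(self, tokens):
        for i in range(1, len(tokens) + 1):
            self._prefixes.add(tuple(tokens[:i]))

    def matched_len(self, tokens):
        # Length of the longest cached prefix of this prompt.
        n = 0
        for i in range(1, len(tokens) + 1):
            if tuple(tokens[:i]) in self._prefixes:
                n = i
        return n

def schedule(waiting, cache):
    # Serve requests with the highest cache-hit ratio first: fewer prompt tokens
    # need fresh prefill, and their hot KV blocks get reused before eviction.
    return sorted(
        waiting,
        key=lambda r: cache.matched_len(r.prompt_tokens) / len(r.prompt_tokens),
        reverse=True,
    )

cache = PrefixCache()
cache.insert([1, 2, 3, 4])                        # KV of a previous request's prompt
waiting = [Request(0, [9, 8, 7]), Request(1, [1, 2, 3, 5])]
print([r.rid for r in schedule(waiting, cache)])  # -> [1, 0]
```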
| Paper | Type | Code |
|---|---|---|
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection [Link] Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Tianfu Wang, Kun Fu, Zheng Wang, Hui Xiong |
Token-level attention compute scheduling | TokenSelect |
RefreshKV: Updating Small KV Cache During Long-form Generation [Link] Fangyuan Xu, Tanya Goyal, Eunsol Choi |
Token-level attention compute scheduling | RefreshKV |
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression [Link] Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov |
Token-level attention compute scheduling | RocketKV |
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [Link] Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze |
Kernel-level workload scheduling across CUDA thread blocks; Also belongs to allocation & reuse (structural) | FlashInfer π |
Mooncake: Trading More Storage for Less Computation – A KVCache-centric Architecture for Serving LLM Chatbot [Link] Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu |
KV reuse-aware request-level scheduling; Also belongs to HW-aware execution | Mooncake π |
Loki: Low-rank Keys for Efficient Sparse Attention [Link] Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele |
Token-level attention compute scheduling | Loki |
SGLang: Efficient Execution of Structured Language Model Programs [Link] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng |
KV reuse-aware request-level scheduling; Also belongs to allocation & reuse (structural) | SGLang π |
LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism [Link] Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, Xin Jin |
KV usage-aware request-level scheduling | LoongServe |
| Fast Inference for Augmented Large Language Models [Link] Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, Michael Mitzenmacher |
KV usage-aware request-level scheduling | |
| LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management [Link] Yi Xiong, Hao Wu, Changxu Shao, Ziqing Wang, Rui Zhang, Yuhong Guo, Junping Zhao, Ke Zhang, Zhenxuan Pan |
KV usage-aware request-level scheduling; Also belongs to memory hierarchy KV orchestration (spatial) | |
SparQ Attention: Bandwidth-Efficient LLM Inference [Link] Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr |
Token-level attention compute scheduling | SparQ Attention |
QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference [Link] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han |
Token-level attention compute scheduling | Quest |
MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving [Link] Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang |
KV usage-aware request-level scheduling; Also belongs to HW-aware execution | MuxServe |
Preble: Efficient Distributed Prompt Scheduling for LLM Serving [Link] Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang |
KV reuse-aware request-level scheduling | Preble |
| Inference without interference: Disaggregate LLM inference for mixed downstream workloads [Link] Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan |
KV usage-aware request-level scheduling; Also belongs to HW-aware execution |
Pipelining and overlapping methods hide latency by concurrently executing KV-related compute, communication, and I/O. They are often embedded in broader serving systems.
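The common pattern is to issue KV transfers on a side stream while compute proceeds on the main stream. Below is a minimal PyTorch sketch of that pattern, assuming a CUDA GPU and pinned host buffers; it illustrates the general overlap idea, not any listed system's implementation.

```python
# Minimal sketch of overlapping a CPU->GPU KV transfer with GPU compute using a
# separate CUDA stream.
import torch

def attention_step(q, k, v):
    # Placeholder for the current layer's attention compute.
    return torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1) @ v

if torch.cuda.is_available():
    dev = torch.device("cuda")
    copy_stream = torch.cuda.Stream()

    q = torch.randn(1, 8, 128, 64, device=dev)
    k = torch.randn(1, 8, 128, 64, device=dev)
    v = torch.randn(1, 8, 128, 64, device=dev)
    # The next layer's offloaded KV lives in pinned host memory so the copy is async.
    k_next_cpu = torch.randn(1, 8, 128, 64, pin_memory=True)
    v_next_cpu = torch.randn(1, 8, 128, 64, pin_memory=True)

    with torch.cuda.stream(copy_stream):            # issue the prefetch on a side stream
        k_next = k_next_cpu.to(dev, non_blocking=True)
        v_next = v_next_cpu.to(dev, non_blocking=True)

    out = attention_step(q, k, v)                   # default stream computes meanwhile
    torch.cuda.current_stream().wait_stream(copy_stream)  # sync before using k_next/v_next
    out_next = attention_step(out, k_next, v_next)
```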
| Paper | Type | Code |
|---|---|---|
KVPR: Efficient LLM inference with i/o-aware KV cache partial recomputation [Link] Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram |
GPU KV recompute ∥ KV transfer (CPU→GPU) | KVPR |
| PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving [Link] Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli |
GPU KV prefetch (HBM→L2) ∥ GPU collective communication; Also belongs to memory hierarchy KV orchestration (spatial) | |
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference [Link] Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu |
CPU attention compute ∥ GPU linear ops; Also belongs to HW-aware execution | NEO |
| Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching [Link] Yanhao Dong, Yubo Miao, Weinan Li, Xiao Zheng, Chao Wang, Feng Lyu |
GPU KV prefetch (HBM→L2) ∥ GPU attention compute; Also belongs to memory hierarchy KV orchestration (spatial) | |
Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention [Link] Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo |
KV load/store (CPU↔GPU) ∥ GPU compute; Also belongs to memory hierarchy KV orchestration (spatial) | |
| FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines [Link] Jiaao He, Jidong Zhai |
CPU R-part compute ∥ GPU S-part compute; Also belongs to HW-aware execution | |
Improving Throughput-Oriented LLM Inference with CPU Computations [Link] Daon Park, Bernhard Egger |
CPU MHSA compute ∥ FFN data transfer (CPU↔GPU); Also belongs to HW-aware execution | Heterogen |
Hardware-aware execution methods adapt KV cache-related operations to the underlying heterogeneous hardware.
Disaggregated inference separates the heterogeneous computation phases of LLM inference and maps them to distinct hardware resources to reduce interference and improve utilization.
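A minimal sketch of the routing side of prefill/decode disaggregation follows; the pool and handle names are hypothetical, and real systems additionally transfer or remotely access the prefill-produced KV cache.

```python
# Minimal sketch of prefill/decode (PD) disaggregation at the routing level:
# prefill jobs go to one GPU pool, decode jobs to another, and the KV cache
# produced by prefill is handed over to the decode worker.
from collections import deque

prefill_pool = deque(["gpu0", "gpu1"])   # compute-bound phase
decode_pool = deque(["gpu2", "gpu3"])    # memory-bandwidth-bound phase

def dispatch(phase):
    pool = prefill_pool if phase == "prefill" else decode_pool
    worker = pool[0]
    pool.rotate(-1)                      # simple round-robin within each pool
    return worker

def serve(request):
    p_worker = dispatch("prefill")
    kv_handle = f"kv://{p_worker}/{request['id']}"   # where the prompt's KV cache lives
    d_worker = dispatch("decode")
    # In a real system the KV cache is transferred or remotely read here
    # (e.g., over NVLink/RDMA) before token-by-token decoding starts.
    return {"prefill": p_worker, "decode": d_worker, "kv": kv_handle}

print(serve({"id": 42, "prompt": "hello"}))
```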
| Paper | Type | Code |
|---|---|---|
Mooncake: Trading More Storage for Less Computation – A KVCache-centric Architecture for Serving LLM Chatbot [Link] Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu |
PD disaggregation; Also belongs to KV-centric scheduling | Mooncake π |
DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving [Link] Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic |
Decoupling prefill and decode to different GPUs; Also belongs to memory hierarchy KV orchestration (spatial) | DéjàVu |
MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving [Link] Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang |
Colocating PD jobs of multiple LLMs within each GPU via SM partitioning; Also belongs to KV-centric scheduling | MuxServe |
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving [Link] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang |
Decoupling prefill and decode to different GPUs; Also belongs to compute device KV orchestration (spatial) | DistServe |
| Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache [Link] Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin |
Disaggregating at the operator level; Also belongs to compute device KV orchestration (spatial) | |
Splitwise: Efficient generative LLM inference using phase splitting [Link] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini |
Decoupling prefill and decode to different GPUs; Also belongs to compute device KV orchestration (spatial) | Splitwise (integrated into vLLM) |
| Inference without interference: Disaggregate LLM inference for mixed downstream workloads [Link] Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan |
Decoupling prefill and decode to different GPUs; Also belongs to KV-centric scheduling |
Compute offloading relocates part of the computation to auxiliary devices to relieve GPU bottlenecks, exploiting hardware heterogeneity and workload characteristics.
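A minimal PyTorch sketch of one common offloading rule of thumb: for long contexts, move the small decode-step query to the CPU where the offloaded KV cache lives, rather than moving the large KV cache to the GPU. The threshold and structure are illustrative assumptions, not any listed method's policy.

```python
# Minimal sketch of CPU compute offloading: attention over an offloaded KV cache is
# executed on the CPU next to the data, instead of copying the KV cache to the GPU.
import torch

def attention(q, k, v):
    return torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1) @ v

def decode_attention(q, k_cpu, v_cpu, offload_threshold=4096):
    seq_len = k_cpu.shape[-2]
    if seq_len >= offload_threshold or not torch.cuda.is_available():
        # Long context: ship the single query vector to the CPU (cheap) rather than
        # shipping the whole KV cache to the GPU (expensive).
        out = attention(q.cpu(), k_cpu, v_cpu)
        return out.to(q.device)
    # Short context: copying the KV cache is cheap enough to keep attention on GPU.
    return attention(q, k_cpu.to(q.device), v_cpu.to(q.device))

q = torch.randn(1, 8, 1, 64)          # one decode-step query
k = torch.randn(1, 8, 8192, 64)       # offloaded KV cache in CPU DRAM
v = torch.randn(1, 8, 8192, 64)
print(decode_attention(q, k, v).shape)   # torch.Size([1, 8, 1, 64])
```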
| Paper | Type | Code |
|---|---|---|
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference [Link] Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu |
CPU offloading (attention and KV caches); Also belongs to pipelining & overlapping | NEO |
MagicPIG: LSH Sampling for Efficient LLM Generation [Link] Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen |
CPU offloading (attention and retrieval) | MagicPIG |
PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System [Link] Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gómez-Luna, Huawei Li, Xiaowei Li, Ying Wang, Onur Mutlu |
PIM-based offloading | ASPLOS |
TwinPilots: A New Computing Paradigm for GPU-CPU Parallel LLM Inference [Link] Chengye Yu, Tianyu Wang, Zili Shao, Linjie Zhu, Xu Zhou, Song Jiang |
CPU offloading | |
| InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference [Link] Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang |
CSD Offloading; Also belongs to compute device KV orchestration (spatial) | |
AttAcc! unleashing the power of PIM for batched transformer-based generative model inference [Link] Jaehyun Park, Jaewan Choi, Kwanhee Kyung, Michael Jaemin Kim, Yongsuk Kwon, Nam Sung Kim, Jung Ho Ahn |
PIM-based offloading; Also belongs to compute device KV orchestration (spatial) | AttAcc |
| FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines [Link] Jiaao He, Jidong Zhai |
CPU offloading (attention and KV caches); Also belongs to pipelining & overlapping | |
Improving Throughput-Oriented LLM Inference with CPU Computations [Link] Daon Park, Bernhard Egger |
CPU offloading with dynamic GPU-CPU division; Also belongs to pipelining & overlapping | Heterogen |
These works optimize where KV data is stored or transferred to balance memory and I/O pressure. We divide these methods into two categories: memory hierarchy KV orchestration, and compute device KV orchestration.
Memory hierarchy KV orchestration methods distribute KV caches across memory hierarchies.
These methods manage KV caches across fast but limited GPU HBM, and larger but slower alternatives like CPU DRAM or SSD.
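A minimal sketch of a three-tier KV store (GPU HBM, CPU DRAM, SSD) with promotion on access follows; the class and its simple spill policy are illustrative assumptions, not a specific system's design.

```python
# Minimal sketch of a tiered KV store: hot entries in GPU HBM, warm in CPU DRAM,
# cold on SSD, with promotion back to HBM on access.
import os, tempfile
import torch

class TieredKVStore:
    def __init__(self, hbm_capacity=2):
        self.hbm, self.dram = {}, {}
        self.hbm_capacity = hbm_capacity
        self.disk_dir = tempfile.mkdtemp()
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def put(self, key, kv):
        if len(self.hbm) < self.hbm_capacity:
            self.hbm[key] = kv.to(self.device)
        else:
            self.dram[key] = kv.cpu()        # spill; a real system also spills DRAM -> SSD

    def evict_to_disk(self, key):
        kv = self.dram.pop(key)
        path = os.path.join(self.disk_dir, f"{key}.pt")
        torch.save(kv, path)
        return path

    def get(self, key):
        if key in self.hbm:
            return self.hbm[key]
        if key in self.dram:                 # promote DRAM -> HBM on access
            self.hbm[key] = self.dram.pop(key).to(self.device)
            return self.hbm[key]
        path = os.path.join(self.disk_dir, f"{key}.pt")
        return torch.load(path).to(self.device)   # SSD -> HBM reload

store = TieredKVStore()
for i in range(3):
    store.put(i, torch.randn(2, 8, 128, 64))
print(store.get(2).shape)
```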
| Paper | Type | Code |
|---|---|---|
KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows [Link] Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, Yipeng Shen, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, Yufei Ding |
Importance-aware methods | |
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval [Link] Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu |
Importance-aware methods | RetrievalAttention |
| LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference [Link] Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, Junchen Jiang |
System cost-driven decision; Also belongs to compute device KV orchestration | LMCache π |
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation [Link] Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin |
System cost-driven decision | |
| SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning [Link] Lingkun Long, Rubing Yang, Yushi Huang, Desheng Hui, Ao Zhou, Jianlei Yang |
Importance-aware methods | |
SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs [Link] Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han |
Importance-aware methods | |
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference [Link] Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen |
Importance-aware methods; Also belongs to KV cache compression (structural) | ShadowKV |
PQCache: Product Quantization-based KVCache for Long Context LLM Inference [Link] Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui |
Importance-aware methods | PQCache |
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression [Link] Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, Minyi Guo |
Importance-aware methods | ClusterKV |
Stateful Large Language Model Serving with Pensieve [Link] Lingfan Yu, Jinkun Lin, Jinyang Li |
Importance-aware methods | |
IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language Model Inference [Link] Weijian Chen, Shuibing He, Haoyang Qu, Ruidong Zhang, Siling Yang, Ping Chen, Yi Zheng, Baoxing Huai, Gang Chen |
Importance-aware methods | |
ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction [Link] Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, Yun Liang |
Importance-aware methods | ArkVale |
InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [Link] Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun |
Importance-aware methods | InfLLM |
| FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving [Link] Ao Shen, Zhiyao Li, Mingyu Gao |
System cost-driven decision; Also belongs to allocation & reuse (structural) | |
| LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management [Link] Yi Xiong, Hao Wu, Changxu Shao, Ziqing Wang, Rui Zhang, Yuhong Guo, Junping Zhao, Ke Zhang, Zhenxuan Pan |
System cost-driven decision; Also belongs to KV-centric scheduling (temporal) | |
DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving [Link] Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic |
System cost-driven decision; Also belongs to HW-aware execution (temporal) | DéjàVu |
Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention [Link] Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo |
System cost-driven decision; Also belongs to pipelining & overlapping (temporal) | |
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [Link] Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim |
Importance-aware methods | InfiniGen |
ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching [Link] Youpeng Zhao, Di Wu, Jun Wang |
Importance-aware methods | |
Distributed Inference and Fine-tuning of Large Language Models Over The Internet [Link] Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel |
System cost-driven decision; Also belongs to compute device KV orchestration | FastServe |
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [Link] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang |
System cost-driven decision; Also belongs to KV cache compression (structural) | FlexLLMGen π |
These methods migrate KV entries between on-chip L1/L2 caches and off-chip HBM to hide latency.
| Paper | Type | Code |
|---|---|---|
| PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving [Link] Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli |
Prefetching KV caches from HBM to the L2 cache; Also belongs to pipelining & overlapping (temporal) | |
| Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching [Link] Yanhao Dong, Yubo Miao, Weinan Li, Xiao Zheng, Chao Wang, Feng Lyu |
Prefetching KV caches from HBM to the L2 cache; Also belongs to pipelining & overlapping (temporal) |
Compute device KV orchestration methods place and move KV caches across compute-capable devices like GPUs, CPUs, and storage-attached processors, to enable distributed or heterogeneous serving.
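A minimal sketch of device-level KV placement and migration follows, with hash-based initial placement and an explicit migrate step; the device discovery and placement rule are illustrative assumptions rather than any listed system's policy.

```python
# Minimal sketch of placing and migrating KV cache blocks across compute devices
# (two GPUs when available, otherwise CPU plus whatever GPU exists).
import torch

def available_devices():
    n = torch.cuda.device_count()
    if n >= 2:
        return [f"cuda:{i}" for i in range(n)]
    return ["cpu", "cuda:0" if n else "cpu"]

class KVPlacement:
    def __init__(self):
        self.devices = available_devices()
        self.location = {}                       # block_id -> device index

    def place(self, block_id, kv):
        # Hash-based initial placement spreads blocks across devices.
        idx = hash(block_id) % len(self.devices)
        self.location[block_id] = idx
        return kv.to(self.devices[idx])

    def migrate(self, block_id, kv, target_idx):
        # E.g., co-locate a session's KV with the device that will decode it next.
        self.location[block_id] = target_idx
        return kv.to(self.devices[target_idx], non_blocking=True)

pl = KVPlacement()
blk = pl.place("session42/block0", torch.randn(8, 128, 64))
blk = pl.migrate("session42/block0", blk, target_idx=0)
print(pl.devices, pl.location)
```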
| Paper | Type | Code |
|---|---|---|
| LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference [Link] Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, Junchen Jiang |
KV cache placement & migration across GPUs; Also belongs to memory hierarchy KV orchestration | LMCache π |
| InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference [Link] Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang |
KV cache placement & migration across GPUs and CSDs; Also belongs to HW-aware execution (temporal) | |
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving [Link] Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang |
Remote KV cache transmission in distributed networked setups; Also belongs to KV cache compression (structural) | CacheGen |
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving [Link] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang |
KV cache placement & migration across GPUs; Also belongs to HW-aware execution (temporal) | DistServe |
| Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache [Link] Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin |
KV cache placement & migration across GPUs; Also belongs to HW-aware execution (temporal) | |
Splitwise: Efficient generative LLM inference using phase splitting [Link] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini |
KV cache placement & migration across GPUs; Also belongs to HW-aware execution (temporal) | Splitwise (integrated into vLLM) |
AttAcc! unleashing the power of PIM for batched transformer-based generative model inference [Link] Jaehyun Park, Jaewan Choi, Kwanhee Kyung, Michael Jaemin Kim, Yongsuk Kwon, Nam Sung Kim, Jung Ho Ahn |
KV cache placement & migration across GPUs and PIM devices; Also belongs to HW-aware execution (temporal) | AttAcc |
Distributed Inference and Fine-tuning of Large Language Models Over The Internet [Link] Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel |
KV cache placement & migration across GPUs; Also belongs to memory hierarchy KV orchestration | FastServe |
These methods target how KV data is represented and maintained for memory efficiency. We divide these methods into two categories: KV cache compression, and KV cache retention management.
KV cache compression methods directly reduce the memory footprint of KV caches.
Quantization compresses floating-point KV tensors into lower-precision formats. One recurring insight is asymmetric KV quantization: keys and values exhibit distinct outlier patterns and quantization sensitivities. A second insight is that outliers play a crucial role in low-bit quantization.
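A minimal PyTorch sketch of the asymmetric pattern above: per-channel scales for keys and per-token scales for values, a common asymmetric choice among the listed methods, using plain 4-bit round-to-nearest. This is illustrative and not any listed paper's exact recipe (which typically adds grouping, outlier handling, or residuals).

```python
# Minimal sketch of asymmetric KV quantization: keys quantized per channel
# (their outliers tend to be channel-aligned), values quantized per token.
import torch

def quantize(x, dim, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

k = torch.randn(1, 8, 256, 64)       # [batch, heads, tokens, head_dim]
v = torch.randn(1, 8, 256, 64)

k_q, k_scale = quantize(k, dim=-2)   # per-channel: reduce over the token axis
v_q, v_scale = quantize(v, dim=-1)   # per-token: reduce over the channel axis

k_err = (dequantize(k_q, k_scale) - k).abs().mean()
v_err = (dequantize(v_q, v_scale) - v).abs().mean()
print(f"mean abs error  K: {k_err:.4f}  V: {v_err:.4f}")
```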
| Paper | Type | Code |
|---|---|---|
NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache [Link] Donghyun Son, Euntae Choi, Sungjoo Yoo |
VQ-based method | |
Accurate KV Cache Quantization with Outlier Tokens Tracing [Link] Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang |
Mixed-precision asymmetric KV quantization | OTT |
CommVQ: Commutative Vector Quantization for KV Cache Compression [Link] Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan |
VQ-based method | CommVQ |
QServe: W4A8KV4 quantization and system co-design for efficient LLM serving [Link] Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han |
Fixed-precision quantization | OmniServe |
| SQuat: Subspace-orthogonal KV cache quantization [Link] Hao Wang, Ligong Han, Kai Xu, Akash Srivastava |
Fixed-precision asymmetric KV quantization | SQuat |
VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference [Link] Zihan Liu, Xinhao Luo, Junxian Guo, Wentao Ni, Yangjie Zhou, Yue Guan, Cong Guo, Weihao Cui, Yu Feng, Minyi Guo, Yuhao Zhu, Minjia Zhang, Jingwen Leng, Chen Jin |
VQ-based method | |
QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead [Link] Amir Zandieh, Majid Daliri, Insu Han |
Fixed-precision quantization | QJL |
ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification [Link] Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang |
Mixed-precision asymmetric KV quantization | ZipCache |
KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization [Link] Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava |
VQ-based method | |
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [Link] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami |
Mixed-precision quantization | KVQuant |
SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models [Link] Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin |
Mixed-precision quantization | SKVQ |
| GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM [Link] Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao |
Fixed-precision asymmetric KV quantization | GEAR |
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [Link] Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen |
Mixed-precision quantization | DecoQuant |
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving [Link] Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang |
Mixed-precision quantization; Also belongs to compute device KV orchestration (spatial) | CacheGen |
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [Link] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu |
Mixed-precision asymmetric KV quantization | KIVI |
Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving [Link] Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci |
Mixed-precision quantization | Atom |
| QAQ: Quality Adaptive Quantization for LLM KV Cache [Link] Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang |
Mixed-precision quantization | QAQ |
| No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization [Link] June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee |
Mixed-precision quantization | |
| WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [Link] Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie |
Mixed-precision quantization | |
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [Link] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang |
Fixed-precision quantization; Also belongs to memory hierarchy KV orchestration (spatial) | FlexLLMGen π |
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models [Link] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han |
Fixed-precision quantization | SmoothQuant π |
Low-rank compression exploits hidden-dimension redundancy by factorizing KV tensors into compact components. Methods differ in the approximation target, granularity, and rank setting.
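A minimal PyTorch sketch of the basic idea: factorize a cached key matrix with a truncated SVD at a fixed rank. Real key caches are far more compressible than the random matrix used here, and the listed methods choose targets and ranks more carefully.

```python
# Minimal sketch of low-rank KV compression: store rank-r factors A and B
# instead of the full key matrix.
import torch

def lowrank_compress(k, rank):
    # k: [tokens, head_dim]; keep A [tokens, rank] and B [rank, head_dim].
    u, s, vh = torch.linalg.svd(k, full_matrices=False)
    a = u[:, :rank] * s[:rank]          # absorb singular values into the left factor
    b = vh[:rank, :]
    return a, b

def lowrank_reconstruct(a, b):
    return a @ b

k = torch.randn(4096, 128)
a, b = lowrank_compress(k, rank=32)
ratio = (a.numel() + b.numel()) / k.numel()
err = torch.linalg.norm(lowrank_reconstruct(a, b) - k) / torch.linalg.norm(k)
print(f"storage ratio {ratio:.2f}, relative error {err:.3f}")
```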
| Paper | Type | Code |
|---|---|---|
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference [Link] Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen |
Target cached K tensors, layer-wise, fixed rank; Also belongs to memory hierarchy KV orchestration (spatial) | ShadowKV |
| ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [Link] Xianglong Yan, Zhiteng Li, Tianao Zhang, Haotong Qin, Linghe Kong, Yulun Zhang, Xiaokang Yang |
Target cached KV tensors, head-group-wise for keys and layer-wise for values, budgeted-driven rank | ReCalKV |
Palu: KV-Cache Compression with Low-Rank Projection [Link] Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S. Abdelfattah, Kai-Chiang Wu |
Target KV projection weights, head-group-wise, searched rank | Palu |
| xKV: Cross-Layer SVD for KV-Cache Compression [Link] Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed S. Abdelfattah |
Target KV tensors, layer-group-wise, fixed rank | xKV |
| Eigen Attention: Attention in Low-Rank Space for KV Cache Compression [Link] Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy |
Target QKV attention subspace, layer-wise, budget-driven rank | EigenAttn |
| LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [Link] Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen |
Target KV projection weights, layer-wise, progressive rank |
Unlike value-level compression (e.g., quantization and low-rank approximation), structural compression reduces KV memory by modifying cache organization (e.g., layer, head, channel, token).
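A minimal PyTorch sketch of one structural operation from the table below, intra-layer merging of cache entries with highly similar keys; the greedy adjacent-pair rule and mean pooling are illustrative simplifications of the listed merging methods.

```python
# Minimal sketch of structural compression via token merging: pairs of adjacent
# cache entries with highly similar keys are merged (mean-pooled) into one slot.
import torch

def merge_similar_tokens(k, v, sim_threshold=0.9):
    # k, v: [tokens, head_dim]
    keep_k, keep_v = [k[0]], [v[0]]
    for i in range(1, k.shape[0]):
        sim = torch.nn.functional.cosine_similarity(k[i], keep_k[-1], dim=0)
        if sim > sim_threshold:
            keep_k[-1] = (keep_k[-1] + k[i]) / 2   # merge into the previous slot
            keep_v[-1] = (keep_v[-1] + v[i]) / 2
        else:
            keep_k.append(k[i])
            keep_v.append(v[i])
    return torch.stack(keep_k), torch.stack(keep_v)

k = torch.randn(16, 64)
k[5] = k[4] + 0.01 * torch.randn(64)              # make one pair nearly identical
v = torch.randn(16, 64)
k2, v2 = merge_similar_tokens(k, v)
print(k.shape[0], "->", k2.shape[0], "cache slots")
```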
| Paper | Type | Code |
|---|---|---|
ClusterAttn: KV Cache Compression under Intrinsic Attention Clustering [Link] Minwei Zhang, Haifeng Sun, Jingyu Wang, Shaolong Li, Wanyi Ning, Qi Qi, Zirui Zhuang, Jianxin Liao |
Structural pruning on prompt tokens guided by attention score | |
ThinK: Thinner Key Cache by Query-Driven Pruning [Link] Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo |
Structural pruning on key channels guided by query-driven signal | ThinK |
D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models [Link] Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, Mi Zhang |
Intra-layer structural merging guided by similarity; Also belongs to eviction | D2O |
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models [Link] Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang |
Cross-layer structural merging guided by layer similarity | |
| KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [Link] Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen |
Cross-layer structural merging guided by layer dissimilarity | KVSharer |
CHAI: Clustered Head Attention for Efficient LLM Inference [Link] Saurabh Agarwal, Bilge Acun, Basil Hosmer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu |
Structural pruning on heads guided by head attention score | CHAI |
CaM: Cache Merging for Memory-efficient LLMs Inference [Link] Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, Rongrong Ji |
Intra-layer structural merging guided by attention score | CaM |
| Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks [Link] Zheng Wang, Boxiao Jin, Zhongzhi Yu, Minjia Zhang |
Intra-layer structural merging guided by key similarity |
These methods manage the retention of the KV cache during serving.
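As a concrete example of the structure-aware methods listed below, here is a minimal sketch of block-based (paged) KV allocation with a per-request block table; the block size and allocator interface are illustrative assumptions.

```python
# Minimal sketch of block-based (paged) KV allocation: logical token positions map
# to fixed-size physical blocks through a block table, so memory is allocated on
# demand and freed blocks are recycled across requests.
BLOCK_SIZE = 16

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.block_tables = {}              # request id -> list of physical block ids

    def append_token(self, rid, position):
        table = self.block_tables.setdefault(rid, [])
        if position % BLOCK_SIZE == 0:      # current block is full (or first token)
            table.append(self.free.pop())
        block = table[position // BLOCK_SIZE]
        return block, position % BLOCK_SIZE # physical slot for this token's KV entry

    def release(self, rid):
        self.free.extend(self.block_tables.pop(rid, []))

alloc = BlockAllocator(num_blocks=64)
for pos in range(40):                       # a 40-token request uses ceil(40/16)=3 blocks
    alloc.append_token("req-0", pos)
print(len(alloc.block_tables["req-0"]), "blocks in use")
alloc.release("req-0")
print(len(alloc.free), "blocks free")
```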
| Paper | Type | Code |
|---|---|---|
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [Link] Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze |
Structure-aware method; Also belongs to KV-centric scheduling (temporal) | FlashInfer π |
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention [Link] Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar |
Structure-aware method | vAttention |
| MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool [Link] Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan |
Semantics-guided method | |
SGLang: Efficient Execution of Structured Language Model Programs [Link] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng |
Structure-aware method; Also belongs to KV-centric scheduling (temporal) | SGLang π |
| FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving [Link] Ao Shen, Zhiyao Li, Mingyu Gao |
Structure-aware method; Also belongs to memory hierarchy KV orchestration (spatial) | |
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition [Link] Lu Ye, Ze Tao, Yong Huang, Yang Li |
Structure-aware method | Chunk Attention |
| vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [Link] Jiale Xu, Rui Zhang, Cong Guo, Weiming Hu, Zihan Liu, Feiyang Wu, Yu Feng, Shixuan Sun, Changxu Shao, Yuhong Guo, Junping Zhao, Ke Zhang, Minyi Guo, Jingwen Leng |
Structure-aware method | |
| LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference [Link] Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi |
Semantics-guided method | |
Prompt Cache: Modular Attention Reuse for Low-Latency Inference [Link] In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong |
Structure-aware method | Prompt Cache |
Efficient Memory Management for Large Language Model Serving with PagedAttention [Link] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica |
Structure-aware method | vllm π |
KV cache eviction discards less critical KV entries (i.e., tokens) based on certain rules.
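Many entries below combine a recency window with accumulated attention mass ("heavy hitters"). A minimal PyTorch sketch of that combined rule follows, with an illustrative flat budget rather than any listed paper's budget policy.

```python
# Minimal sketch of a "recent + heavy hitter" eviction policy: keep the most recent
# tokens plus the tokens with the largest accumulated attention scores, drop the rest.
import torch

def evict(k, v, attn_history, budget, recent=8):
    # k, v: [tokens, head_dim]; attn_history: [tokens] accumulated attention mass.
    t = k.shape[0]
    if t <= budget:
        return k, v
    recent_ids = torch.arange(t - recent, t)
    old_scores = attn_history[: t - recent]
    n_heavy = budget - recent
    heavy_ids = torch.topk(old_scores, n_heavy).indices
    keep = torch.cat([torch.sort(heavy_ids).values, recent_ids])
    return k[keep], v[keep]

t = 64
k, v = torch.randn(t, 64), torch.randn(t, 64)
attn_history = torch.rand(t)
k2, v2 = evict(k, v, attn_history, budget=24)
print(k2.shape[0], "of", t, "tokens retained")
```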
| Paper | Type | Code |
|---|---|---|
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference [Link] Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou |
Eviction policy: plug-in; budget policy: adaptive (head-wise, attention sparsity) | AdaKV |
DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs [Link] Xiabin Zhou, Wenbin Wang, Minyan Zeng, Jiaxian Guo, Xuebo Liu, Li Shen, Min Zhang, Liang Ding |
Eviction policy: recent + attention w.r.t. instruction tokens; budget policy: adaptive (layer-wise, task-aware) | |
EvolKV: Evolutionary KV Cache Compression for LLM Inference [Link] Bohan Yu, Yekun Chai |
Eviction policy: plug-in; budget policy: adaptive (layer-wise, evolutionary search) | |
DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction [Link] Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen |
Eviction policy: recent + relative significance of attention scores; budget policy: adaptive (head-wise, sparsity pattern) | |
| KVCompose: Efficient Structured KV Cache Compression with Composite Tokens [Link] Dmitry Akulov, Mohamed Sana, Antonio De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed |
Eviction policy: aggregated attention & form composite token; budget policy: adaptive (layer-wise, composite importance) | |
LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models [Link] Dachuan Shi, Yonggan Fu, Xiangchi Yuan, Zhongzhi Yu, Haoran You, Sixu Li, Xin Dong, Jan Kautz, Pavlo Molchanov, Yingyan (Celine) Lin |
Eviction policy: ladder pattern based; budget policy: preset (layer-wise, ladder) | LaCache |
SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator [Link] Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang |
Recent + sink + separator tokens | SepLLM |
D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models [Link] Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, Mi Zhang |
Eviction policy: recent + sink + H2 & recall via merging; budget policy: adaptive (layer-wise, attention density); Also belongs to structural compression | D2O |
CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences [Link] Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li |
Eviction policy: recent + mean & variance of attention scores; budget policy: adaptive (layer-wise, layer preference) | CAKE |
SnapKV: LLM Knows What You are Looking for Before Generation [Link] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen |
Observation window-based identification | SnapKV |
A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression [Link] Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini |
Key L2 norm | l2compress |
Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters [Link] Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe |
Sink + attention & value L1 norm | VATP |
Transformers are Multi-State RNNs [Link] Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, Roy Schwartz |
Drop lowest attention score token at each step | TOVA |
| BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference [Link] Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He |
Recent + sink + segmented local H2 | BUZZ |
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference [Link] Dongjie Yang, Xiaodong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao |
Eviction policy: recent + PvC (via ensemble attention); budget policy: preset (layer-wise, pyramid) | PyramidInfer |
NACL: A General and Effective KV Cache Eviction Framework for LLM at Inference Time [Link] Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu |
Attention w.r.t. proxy token & randomness | NACL |
| PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [Link] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, Wen Xiao |
Eviction policy: observation window-based identification; budget policy: preset (layer-wise, pyramid) | KVCache-Factory π |
Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference [Link] Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath |
Recent + key (Gumbel-softmax scores) | Keyformer |
Efficient Streaming Language Models with Attention Sinks [Link] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis |
Recent + sink | StreamingLLM π |
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs [Link] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao |
Hybrid (special token/punctuation/locality/H2) | FastGen π |
| On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference [Link] Siyu Ren, Kenny Q. Zhu |
Mean & standard deviation of attention scores | EasyKV |
Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time [Link] Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava |
Recent + attention scores | |
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models [Link] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher RΓ©, Clark Barrett, Zhangyang Wang, Beidi Chen |
Recent + H2 |
The figure below (behavior-behavior co-design affinity network) visualizes cross-behavior co-occurrence in the literature. Node size reflects research density; edge thickness scales with behavior co-occurrence frequency. We found that HAE-CDO is the strongest cross-dimension co-design pattern.
Please check our paper (Section 6) for more details!
The table below summarizes behavior-objective effects across the surveyed methods.
Please check our paper (Section 6) for detailed analysis!
@article{jiang2025towards,
title = {Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization},
author = {Jiang, Jiantong and Yang, Peiyu and Zhang, Rui and Liu, Feng},
journal = {Authorea Preprints},
year = {2025},
publisher = {Authorea},
url = {http://dx.doi.org/10.36227/techrxiv.176046306.66521015/v1},
doi = {10.36227/techrxiv.176046306.66521015/v1},
}



