- Sparsity and Pruning
- Quantization
- Knowledge Distillation
- Low-Rank Decomposition
- KV Cache Compression
- Speculative Decoding
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2023 | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ICML 2023 | Link | Link |
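For orientation, the sketch below shows plain one-shot magnitude pruning in PyTorch. It is a minimal illustration only, not the Hessian-based weight updates SparseGPT actually uses, and the layer size and sparsity ratio are arbitrary.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude entries so roughly `sparsity` of the weights are removed."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask

# Toy example: prune a random linear layer to 50% sparsity.
layer = torch.nn.Linear(256, 256)
pruned = magnitude_prune(layer.weight.data, sparsity=0.5)
print(f"density after pruning: {(pruned != 0).float().mean():.2f}")
```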
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2023 | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 2023 | Link | Link |
2025 | OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting | ICLR 2025 | Link | Link |
2025 | SpinQuant: LLM quantization with learned rotations | ICLR 2025 | Link | Link |
2022 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 2023 | Link | Link |
2023 | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | MLSys 2024 | Link | Link |
2024 | QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks | ICML 2024 | Link | Link |
2025 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | MLSys 2025 | Link | Link |
2024 | QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | NeurIPS 2024 | Link | Link |
2024 | Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | MLSys 2024 | Link | Link |
2024 | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | ICLR 2024 | Link | Link |
2023 | QuIP: 2-Bit Quantization of Large Language Models With Guarantees | NeurIPS 2023 | Link | Link |
2022 | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | NeurIPS 2022 | Link | Link |
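For orientation, here is a minimal sketch of per-channel symmetric round-to-nearest weight quantization. The listed methods (GPTQ, AWQ, SmoothQuant, the rotation-based approaches, etc.) all add calibration, scaling, or rotation machinery on top of this baseline; the bit width and tensor shapes below are illustrative only.

```python
import torch

def quantize_per_channel(weight: torch.Tensor, n_bits: int = 8):
    """Symmetric round-to-nearest quantization with one scale per output channel."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax      # [out_features, 1]
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Toy check of the reconstruction error on a random weight matrix.
w = torch.randn(128, 512)
q, scale = quantize_per_channel(w, n_bits=8)
print(f"mean abs error: {(dequantize(q, scale) - w).abs().mean():.5f}")
```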
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2024 | Q-VLM: Post-training Quantization for Large Vision Language Models | NeurIPS 2024 | Link | Link |
Year | Title | Venue | Task | Paper | Code |
---|---|---|---|---|---|
2025 | SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models | ICLR 2025 | T2I | Link | Link |
2025 | ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation | ICLR 2025 | Image Generation | Link | Link |
2023 | Post-training Quantization on Diffusion Models | CVPR 2023 | T2I, T2V | Link | Link |
2023 | Q-Diffusion: Quantizing Diffusion Models | ICCV 2023 | Image Generation | Link | Link |
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2025 | LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | ICLR 2025 | Link | Link |
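As background, a standard temperature-scaled logit-distillation loss is sketched below. It is a generic Hinton-style KD example, not the MoE-specific objective used in LLaVA-MoD.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable to the hard-label loss.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Toy example with random logits for a batch of 4 and a vocabulary of 100.
student = torch.randn(4, 100, requires_grad=True)
teacher = torch.randn(4, 100)
loss = distillation_loss(student, teacher)
loss.backward()
print(loss.item())
```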
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2024 | Compressing Large Language Models using Low Rank and Low Precision Decomposition | NeurIPS 2024 | Link | Link |
2022 | Compressible-composable NeRF via Rank-residual Decomposition | NeurIPS 2022 | Link | Link |
2024 | Unified Low-rank Compression Framework for Click-through Rate Prediction | KDD 2024 | Link | Link |
2025 | Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models | ICML 2025 | Link | Link |
2024 | SliceGPT: Compress Large Language Models by Deleting Rows and Columns | ICLR 2024 | Link | Link |
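As background, the sketch below factorizes a weight matrix with a truncated SVD into two thin matrices. The rank is arbitrary, and the listed methods choose and refine the decomposition in more sophisticated ways.

```python
import torch

def low_rank_factorize(weight: torch.Tensor, rank: int):
    """Approximate W (out x in) as A @ B with A: out x rank, B: rank x in."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    B = Vh[:rank, :]
    return A, B

w = torch.randn(512, 512)
A, B = low_rank_factorize(w, rank=64)
rel_err = torch.linalg.norm(w - A @ B) / torch.linalg.norm(w)
print(f"params: {w.numel()} -> {A.numel() + B.numel()}, relative error: {rel_err:.3f}")
```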
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2023 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | NeurIPS 2023 | Link | Link |
2023 | Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | NeurIPS 2023 | Link | Link |
2023 | Efficient Streaming Language Models with Attention Sinks | ICLR 2024 | Link | Link |
2024 | SnapKV: LLM Knows What You are Looking for Before Generation | NeurIPS 2024 | Link | Link |
2024 | InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory | NeurIPS 2024 | Link | Link |
2024 | Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | ICLR 2024 | Link | Link |
2024 | Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | MLSys 2024 | Link | Link |
2025 | R-KV: Redundancy-aware KV Cache Compression for Reasoning Models | - | Link | Link |
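Many of the eviction methods above (e.g. H2O, Scissorhands, SnapKV, Keyformer) keep only the cached tokens with the highest accumulated attention scores. The sketch below illustrates that general idea with an arbitrary budget and scoring; it is not the scoring rule of any specific paper.

```python
import torch

def evict_kv(keys, values, attn_scores, budget: int):
    """Keep the `budget` cached tokens with the highest accumulated attention.

    keys, values: [seq_len, num_heads, head_dim]
    attn_scores:  [seq_len] accumulated attention each cached token has received
    """
    if keys.shape[0] <= budget:
        return keys, values
    keep = torch.topk(attn_scores, budget).indices.sort().values  # preserve token order
    return keys[keep], values[keep]

# Toy example: shrink a 1024-token cache to a 256-token budget.
seq_len, num_heads, head_dim = 1024, 8, 64
k = torch.randn(seq_len, num_heads, head_dim)
v = torch.randn(seq_len, num_heads, head_dim)
scores = torch.rand(seq_len)
k_small, v_small = evict_kv(k, v, scores, budget=256)
print(k_small.shape)  # torch.Size([256, 8, 64])
```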
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2024 | PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | - | Link | Link |
2025 | LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models | ICML 2025 | Link | Link |
2025 | CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences | ICLR 2025 | Link | Link |
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2024 | MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | NeurIPS 2024 | Link | Link |
2024 | CaM: Cache Merging for Memory-efficient LLMs Inference | ICML 2024 | Link | Link |
2024 | D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models | ICLR 2025 | Link | Link |
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2024 | IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact | ACL 2024 | Link | Link |
2024 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | ICML 2024 | Link | Link |
2024 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | NeurIPS 2024 | Link | Link |
2024 | SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | COLM 2024 | Link | Link |
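The KV-cache quantization papers above differ mainly in grouping granularity and outlier handling (per-channel keys vs. per-token values, pivot-token protection, sliding windows). As a baseline picture, here is a sketch of plain asymmetric per-token round-to-nearest quantization of a cache tensor, with an illustrative bit width.

```python
import torch

def quantize_kv_per_token(x: torch.Tensor, n_bits: int = 2):
    """Asymmetric round-to-nearest quantization with one (scale, zero-point) per token.

    x: [seq_len, num_heads * head_dim], quantized along the last dimension.
    """
    qmax = 2 ** n_bits - 1
    x_min = x.amin(dim=-1, keepdim=True)
    x_max = x.amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((x - x_min) / scale), 0, qmax)
    return q.to(torch.uint8), scale, x_min

def dequantize_kv(q, scale, x_min):
    return q.float() * scale + x_min

# Toy check on a random cache of 1024 tokens with 8 heads of dimension 64.
cache = torch.randn(1024, 8 * 64)
q, scale, zero = quantize_kv_per_token(cache, n_bits=2)
print((dequantize_kv(q, scale, zero) - cache).abs().mean())
```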
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2024 | Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting | NeurIPS 2024 | Kangaroo | code |
2024 | EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees | EMNLP 2024 | EAGLE2 | code |
2025 | Learning Harmonized Representations for Speculative Sampling | ICLR 2025 | HASS | code |
2025 | Parallel Speculative Decoding with Adaptive Draft Length | ICLR 2025 | PEARL | code |
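As context for these entries, speculative decoding drafts several tokens with a small model and verifies them with the target model in a single forward pass. The sketch below uses greedy verification and random toy stand-in models, which is simpler than the rejection-sampling verification the papers above build on.

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, prefix: torch.Tensor, num_draft: int = 4):
    """Draft `num_draft` tokens greedily, then keep the longest prefix the target model agrees with.

    Both models map a token-id sequence [1, seq_len] to logits [1, seq_len, vocab].
    Greedy verification only; the listed papers use rejection sampling to match the target distribution.
    """
    # 1) Draft tokens autoregressively with the cheap model.
    seq = prefix.clone()
    for _ in range(num_draft):
        next_tok = draft_model(seq)[:, -1].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_tok], dim=-1)

    # 2) Verify all drafted positions with one target-model forward pass.
    target_pred = target_model(seq)[:, :-1].argmax(dim=-1)   # target's next-token prediction at each position
    drafted = seq[:, prefix.shape[1]:]
    target_for_draft = target_pred[:, prefix.shape[1] - 1:]

    accepted = prefix
    for i in range(num_draft):
        if drafted[0, i] != target_for_draft[0, i]:
            # First mismatch: take the target model's token instead and stop.
            return torch.cat([accepted, target_for_draft[:, i:i + 1]], dim=-1)
        accepted = torch.cat([accepted, drafted[:, i:i + 1]], dim=-1)
    return accepted

# Toy models: random logits over a 50-token vocabulary (in real use: a small and a large LM).
vocab = 50
draft_model = lambda ids: torch.randn(ids.shape[0], ids.shape[1], vocab)
target_model = lambda ids: torch.randn(ids.shape[0], ids.shape[1], vocab)
prefix = torch.randint(0, vocab, (1, 8))
print(speculative_step(draft_model, target_model, prefix).shape)
```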