# Awesome-Efficient-LLM

## Taxonomy and Papers


## Sparsity and Pruning

| Year | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| 2023 | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ICML 2023 | Link | Link |
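
For orientation, here is a minimal sketch of magnitude-based unstructured pruning, the simple baseline that one-shot methods such as SparseGPT improve on with Hessian-aware weight updates. The function name, array shapes, and sparsity level are illustrative only.

```python
# Minimal sketch of magnitude-based unstructured pruning (a common baseline;
# SparseGPT itself uses a more sophisticated Hessian-based one-shot solver).
import numpy as np

def magnitude_prune(weight: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries of `weight` to reach `sparsity`."""
    k = int(weight.size * sparsity)              # number of weights to remove
    if k == 0:
        return weight.copy()
    threshold = np.partition(np.abs(weight).ravel(), k - 1)[k - 1]
    mask = np.abs(weight) > threshold            # keep only large-magnitude weights
    return weight * mask

w = np.random.randn(1024, 1024).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.5)      # roughly half the entries become zero
print(1.0 - np.count_nonzero(w_pruned) / w_pruned.size)
```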

## Quantization

### LLM Quantization

| Year | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| 2023 | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 2023 | Link | Link |
| 2025 | OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting | ICLR 2025 | Link | Link |
| 2025 | SpinQuant: LLM quantization with learned rotations | ICLR 2025 | Link | Link |
| 2022 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 2023 | Link | Link |
| 2023 | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | MLSys 2024 | Link | Link |
| 2024 | QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks | ICML 2024 | Link | Link |
| 2025 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | MLSys 2025 | Link | Link |
| 2024 | QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | NeurIPS 2024 | Link | Link |
| 2024 | Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | MLSys 2024 | Link | Link |
| 2024 | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | ICLR 2024 | Link | Link |
| 2023 | QuIP: 2-Bit Quantization of Large Language Models With Guarantees | NeurIPS 2023 | Link | Link |
| 2022 | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | NeurIPS 2022 | Link | Link |
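
As a reference point, the sketch below shows plain round-to-nearest (RTN) per-output-channel weight quantization; methods such as GPTQ, AWQ, and OmniQuant start from this setup and add calibration data, error compensation, or learned transforms. Function names, shapes, and the bit-width are illustrative.

```python
# Minimal sketch of round-to-nearest (RTN) symmetric per-row weight quantization,
# the baseline that calibration-based PTQ methods improve upon.
import numpy as np

def quantize_rtn(weight: np.ndarray, n_bits: int = 4):
    """Symmetric per-row quantization: returns int codes and per-row scales."""
    qmax = 2 ** (n_bits - 1) - 1                          # e.g. 7 for 4-bit
    scale = np.abs(weight).max(axis=1, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-8)                       # avoid division by zero
    q = np.clip(np.round(weight / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(8, 16).astype(np.float32)
q, s = quantize_rtn(w, n_bits=4)
print(np.abs(w - dequantize(q, s)).mean())                # mean quantization error
```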

### VLM Quantization

| Year | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| 2024 | Q-VLM: Post-training Quantization for Large Vision Language Models | NeurIPS 2024 | Link | Link |

### DiT Quantization

| Year | Title | Venue | Task | Paper | Code |
|------|-------|-------|------|-------|------|
| 2025 | SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models | ICLR 2025 | T2I | Link | Link |
| 2025 | ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation | ICLR 2025 | Image Generation | Link | Link |
| 2023 | Post-training Quantization on Diffusion Models | CVPR 2023 | T2I, T2V | Link | Link |
| 2023 | Q-Diffusion: Quantizing Diffusion Models | ICCV 2023 | Image Generation | Link | Link |

## Knowledge Distillation

| Year | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| 2025 | LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | ICLR 2025 | Link | Link |
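
For context, this is a minimal sketch of the classic temperature-scaled logit-distillation loss; the listed work builds MoE-specific objectives on top of this basic idea. The batch size, vocabulary size, and temperature are placeholders.

```python
# Minimal sketch of logit distillation: KL divergence between temperature-softened
# teacher and student distributions (the textbook Hinton-style KD loss).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over temperature-softened distributions."""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = (t * (np.log(t + 1e-9) - np.log(s + 1e-9))).sum(axis=-1)
    return (temperature ** 2) * kl.mean()

teacher = np.random.randn(4, 32000)   # [batch, vocab] logits from the large model
student = np.random.randn(4, 32000)   # logits from the small model being trained
print(distillation_loss(student, teacher))
```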

## Low-Rank Decomposition

| Year | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| 2024 | Compressing Large Language Models using Low Rank and Low Precision Decomposition | NeurIPS 2024 | Link | Link |
| 2022 | Compressible-composable NeRF via Rank-residual Decomposition | NeurIPS 2022 | Link | Link |
| 2024 | Unified Low-rank Compression Framework for Click-through Rate Prediction | KDD 2024 | Link | Link |
| 2025 | Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models | ICML 2025 | Link | Link |
| 2024 | SliceGPT: Compress Large Language Models by Deleting Rows and Columns | ICLR 2024 | Link | Link |
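
The common starting point for these methods is a truncated SVD of a weight matrix; a minimal sketch with illustrative shapes and rank is below. The papers above add quantization, calibration data, or structured variants on top of this factorization.

```python
# Minimal sketch of low-rank weight factorization via truncated SVD:
# W (d_out x d_in) is approximated by A @ B with far fewer parameters.
import numpy as np

def low_rank_factorize(weight: np.ndarray, rank: int):
    """Return A (d_out x r) and B (r x d_in) such that A @ B approximates `weight`."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]          # fold singular values into the left factor
    b = vt[:rank, :]
    return a, b

w = np.random.randn(1024, 1024).astype(np.float32)
a, b = low_rank_factorize(w, rank=64)
print(a.shape, b.shape)                                  # (1024, 64) (64, 1024): 8x fewer parameters
print(np.linalg.norm(w - a @ b) / np.linalg.norm(w))     # relative reconstruction error
```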

## KV Cache Compression

### Token Eviction (also known as Token Selection)

| Year | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| 2023 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | NeurIPS 2023 | Link | Link |
| 2023 | Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | NeurIPS 2023 | Link | Link |
| 2023 | Efficient Streaming Language Models with Attention Sinks | ICLR 2024 | Link | Link |
| 2024 | SnapKV: LLM Knows What You are Looking for Before Generation | NeurIPS 2024 | Link | Link |
| 2024 | InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory | NeurIPS 2024 | Link | Link |
| 2024 | Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | ICLR 2024 | Link | Link |
| 2024 | Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | MLSys 2024 | Link | Link |
| 2025 | R-KV: Redundancy-aware KV Cache Compression for Reasoning Models | | Link | Link |
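
Most of these methods score cached tokens and keep only a budgeted subset. Below is a minimal heavy-hitter-style sketch in the spirit of H2O (accumulated attention mass plus a recent window); the scoring rule, budget, and shapes are illustrative and not any single paper's exact policy.

```python
# Minimal sketch of heavy-hitter token eviction: keep a recent window plus the
# tokens with the largest accumulated attention scores; evict the rest.
import numpy as np

def select_kept_tokens(attn_scores: np.ndarray, budget: int, recent: int):
    """attn_scores: [num_queries, seq_len] attention weights for one head.
    Returns sorted indices of the tokens to keep in the KV cache."""
    seq_len = attn_scores.shape[1]
    if seq_len <= budget:
        return np.arange(seq_len)
    recent_ids = np.arange(seq_len - recent, seq_len)           # always keep recent tokens
    accumulated = attn_scores[:, : seq_len - recent].sum(axis=0)
    num_heavy = budget - recent
    heavy_ids = np.argsort(accumulated)[-num_heavy:]            # heavy hitters
    return np.sort(np.concatenate([heavy_ids, recent_ids]))

scores = np.random.rand(16, 1000)        # toy attention weights
keep = select_kept_tokens(scores, budget=128, recent=32)
print(keep.shape)                         # (128,): cache shrinks from 1000 to 128 tokens
```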

### Budget Allocation

| Year | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| 2024 | PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | | Link | Link |
| 2025 | LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models | ICML 2025 | Link | Link |
| 2025 | CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences | ICLR 2025 | Link | Link |

### KV Cache Merging

| Year | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| 2024 | MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | CaM: Cache Merging for Memory-efficient LLMs Inference | ICML 2024 | Link | Link |
| 2024 | D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models | ICLR 2025 | Link | Link |

### KV Cache Quantization

| Year | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| 2024 | IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact | ACL 2024 | Link | Link |
| 2024 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | ICML 2024 | Link | Link |
| 2024 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | NeurIPS 2024 | Link | Link |
| 2024 | SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | COLM 2024 | Link | Link |
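
A minimal sketch of asymmetric low-bit KV quantization is given below, following the per-channel-key / per-token-value layout observed in KIVI; real implementations add grouping, a full-precision residual window, and fused dequantization kernels. Shapes and the 2-bit setting are illustrative.

```python
# Minimal sketch of asymmetric (zero-point) low-bit KV cache quantization:
# keys are quantized per channel, values per token.
import numpy as np

def asym_quantize(x: np.ndarray, n_bits: int, axis: int):
    """Asymmetric quantization with min/max statistics reduced along `axis`."""
    qmax = 2 ** n_bits - 1
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    scale = np.maximum((x_max - x_min) / qmax, 1e-8)
    q = np.clip(np.round((x - x_min) / scale), 0, qmax).astype(np.uint8)
    return q, scale, x_min

def dequantize(q, scale, x_min):
    return q.astype(np.float32) * scale + x_min

keys = np.random.randn(512, 128).astype(np.float32)      # [tokens, head_dim]
values = np.random.randn(512, 128).astype(np.float32)
k_q, k_s, k_z = asym_quantize(keys, n_bits=2, axis=0)     # per-channel stats for keys
v_q, v_s, v_z = asym_quantize(values, n_bits=2, axis=1)   # per-token stats for values
print(np.abs(keys - dequantize(k_q, k_s, k_z)).mean())
```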

## Speculative Decoding

| Year | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| 2024 | Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting | NeurIPS 2024 | Kangaroo | code |
| 2024 | EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees | EMNLP 2024 | EAGLE2 | code |
| 2025 | Learning Harmonized Representations for Speculative Sampling | ICLR 2025 | HASS | code |
| 2025 | Parallel Speculative Decoding with Adaptive Draft Length | ICLR 2025 | PEARL | code |
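
These methods share a draft-then-verify loop; the sketch below shows the standard acceptance/resampling rule, with toy probability vectors standing in for real draft- and target-model outputs. How the draft distribution is produced (early exit, draft heads, parallel drafting) is where the listed papers differ.

```python
# Minimal sketch of speculative decoding verification: accept each draft token
# with probability min(1, p_target/p_draft); on rejection, resample from the
# target's residual distribution. Probabilities here are toy placeholders.
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(draft_tokens, draft_probs, target_probs):
    """draft_probs / target_probs: [num_draft, vocab] distributions at each draft step.
    Returns the accepted prefix; on rejection, appends one resampled token and stops."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[tok] / max(q[tok], 1e-9)):
            accepted.append(tok)                     # target model agrees often enough
        else:
            residual = np.maximum(p - q, 0.0)        # resample from the leftover mass
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            return accepted
    return accepted

vocab, k = 100, 4
draft_probs = rng.dirichlet(np.ones(vocab), size=k)
target_probs = rng.dirichlet(np.ones(vocab), size=k)
draft_tokens = [int(np.argmax(q)) for q in draft_probs]
print(verify_draft(draft_tokens, draft_probs, target_probs))
```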
