- Sparsity and Pruning
- Quantization
- Knowledge Distillation
- Low-Rank Decomposition
- KV Cache Compression
- Speculative Decoding
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2023 | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ICML 2023 | Link | Link |
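For orientation, the sketch below shows plain one-shot magnitude pruning in PyTorch. It is a minimal illustration only, not the Hessian-based weight updates SparseGPT actually uses, and the layer size and sparsity ratio are arbitrary.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude entries so roughly `sparsity` of the weights are removed."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask

# Toy example: prune a random linear layer to 50% sparsity.
layer = torch.nn.Linear(256, 256)
pruned = magnitude_prune(layer.weight.data, sparsity=0.5)
print(f"density after pruning: {(pruned != 0).float().mean():.2f}")
```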
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2023 | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 2023 | Link | Link |
2025 | OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting | ICLR 2025 | Link | Link |
2025 | SpinQuant: LLM quantization with learned rotations | ICLR 2025 | Link | Link |
2022 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 2023 | Link | Link |
2023 | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | MLSys 2024 | Link | Link |
2024 | QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks | ICML 2024 | Link | Link |
2025 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | MLSys 2025 | Link | Link |
2024 | QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | NeurIPS 2024 | Link | Link |
2024 | Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | MLSys 2024 | Link | Link |
2024 | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | ICLR 2024 | Link | Link |
2023 | QuIP: 2-Bit Quantization of Large Language Models With Guarantees | NeurIPS 2023 | Link | Link |
2022 | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | NeurIPS 2022 | Link | Link |
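For orientation, here is a minimal sketch of per-channel symmetric round-to-nearest weight quantization. The listed methods (GPTQ, AWQ, SmoothQuant, the rotation-based approaches, etc.) all add calibration, scaling, or rotation machinery on top of this baseline; the bit width and tensor shapes below are illustrative only.

```python
import torch

def quantize_per_channel(weight: torch.Tensor, n_bits: int = 8):
    """Symmetric round-to-nearest quantization with one scale per output channel."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax      # [out_features, 1]
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Toy check of the reconstruction error on a random weight matrix.
w = torch.randn(128, 512)
q, scale = quantize_per_channel(w, n_bits=8)
print(f"mean abs error: {(dequantize(q, scale) - w).abs().mean():.5f}")
```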
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2024 | Q-VLM: Post-training Quantization for Large Vision Language Models | NeurIPS 2024 | Link | Link |
Year | Title | Venue | Task | Paper | Code |
---|---|---|---|---|---|
2025 | SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models | ICLR 2025 | T2I | Link | Link |
2025 | ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation | ICLR 2025 | Image Generation | Link | Link |
2023 | Post-training Quantization on Diffusion Models | CVPR 2023 | T2I, T2V | Link | Link |
2023 | Q-Diffusion: Quantizing Diffusion Models | ICCV 2023 | Image Generation | Link | Link |
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2025 | LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | ICLR 2025 | Link | Link |
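As background, a standard temperature-scaled logit-distillation loss is sketched below. It is a generic Hinton-style KD example, not the MoE-specific objective used in LLaVA-MoD.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable to the hard-label loss.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Toy example with random logits for a batch of 4 and a vocabulary of 100.
student = torch.randn(4, 100, requires_grad=True)
teacher = torch.randn(4, 100)
loss = distillation_loss(student, teacher)
loss.backward()
print(loss.item())
```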
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2024 | Compressing Large Language Models using Low Rank and Low Precision Decomposition | NeurIPS 2024 | Link | Link |
2022 | Compressible-composable NeRF via Rank-residual Decomposition | NeurIPS 2022 | Link | Link |
2024 | Unified Low-rank Compression Framework for Click-through Rate Prediction | KDD 2024 | Link | Link |
2025 | Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models | ICML 2025 | Link | Link |
2024 | SliceGPT: Compress Large Language Models by Deleting Rows and Columns | ICLR 2024 | Link | Link |
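As background, the sketch below factorizes a weight matrix with a truncated SVD into two thin matrices. The rank is arbitrary, and the listed methods choose and refine the decomposition in more sophisticated ways.

```python
import torch

def low_rank_factorize(weight: torch.Tensor, rank: int):
    """Approximate W (out x in) as A @ B with A: out x rank, B: rank x in."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    B = Vh[:rank, :]
    return A, B

w = torch.randn(512, 512)
A, B = low_rank_factorize(w, rank=64)
rel_err = torch.linalg.norm(w - A @ B) / torch.linalg.norm(w)
print(f"params: {w.numel()} -> {A.numel() + B.numel()}, relative error: {rel_err:.3f}")
```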
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2023 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | NeurIPS 2023 | Link | Link |
2023 | Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | NeurIPS 2023 | Link | Link |
2023 | Efficient Streaming Language Models with Attention Sinks | ICLR 2024 | Link | Link |
2024 | SnapKV: LLM Knows What You are Looking for Before Generation | NeurIPS 2024 | Link | Link |
2024 | InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory | NeurIPS 2024 | Link | Link |
2024 | Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | ICLR 2024 | Link | Link |
2024 | Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | MLSys 2024 | Link | Link |
2025 | R-KV: Redundancy-aware KV Cache Compression for Reasoning Models | - | Link | Link |
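Many of the eviction methods above (e.g. H2O, Scissorhands, SnapKV, Keyformer) keep only the cached tokens with the highest accumulated attention scores. The sketch below illustrates that general idea with an arbitrary budget and scoring; it is not the scoring rule of any specific paper.

```python
import torch

def evict_kv(keys, values, attn_scores, budget: int):
    """Keep the `budget` cached tokens with the highest accumulated attention.

    keys, values: [seq_len, num_heads, head_dim]
    attn_scores:  [seq_len] accumulated attention each cached token has received
    """
    if keys.shape[0] <= budget:
        return keys, values
    keep = torch.topk(attn_scores, budget).indices.sort().values  # preserve token order
    return keys[keep], values[keep]

# Toy example: shrink a 1024-token cache to a 256-token budget.
seq_len, num_heads, head_dim = 1024, 8, 64
k = torch.randn(seq_len, num_heads, head_dim)
v = torch.randn(seq_len, num_heads, head_dim)
scores = torch.rand(seq_len)
k_small, v_small = evict_kv(k, v, scores, budget=256)
print(k_small.shape)  # torch.Size([256, 8, 64])
```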
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2024 | PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | - | Link | Link |
2025 | LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models | ICML 2025 | Link | Link |
2025 | CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences | ICLR 2025 | Link | Link |
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2024 | MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | NeurIPS 2024 | Link | Link |
2024 | CaM: Cache Merging for Memory-efficient LLMs Inference | ICML 2024 | Link | Link |
2024 | D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models | ICLR 2025 | Link | Link |
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2024 | IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact | ACL 2024 | Link | Link |
2024 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | ICML 2024 | Link | Link |
2024 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | NeurIPS 2024 | Link | Link |
2024 | SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | COLM 2024 | Link | Link |
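The KV-cache quantization papers above differ mainly in grouping granularity and outlier handling (per-channel keys vs. per-token values, pivot-token protection, sliding windows). As a baseline picture, here is a sketch of plain asymmetric per-token round-to-nearest quantization of a cache tensor, with an illustrative bit width.

```python
import torch

def quantize_kv_per_token(x: torch.Tensor, n_bits: int = 2):
    """Asymmetric round-to-nearest quantization with one (scale, zero-point) per token.

    x: [seq_len, num_heads * head_dim], quantized along the last dimension.
    """
    qmax = 2 ** n_bits - 1
    x_min = x.amin(dim=-1, keepdim=True)
    x_max = x.amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((x - x_min) / scale), 0, qmax)
    return q.to(torch.uint8), scale, x_min

def dequantize_kv(q, scale, x_min):
    return q.float() * scale + x_min

# Toy check on a random cache of 1024 tokens with 8 heads of dimension 64.
cache = torch.randn(1024, 8 * 64)
q, scale, zero = quantize_kv_per_token(cache, n_bits=2)
print((dequantize_kv(q, scale, zero) - cache).abs().mean())
```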
Year | Title | Venue | Paper | Code |
---|---|---|---|---|
2024 | Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting | NeurIPS 2024 | Kangaroo | code |
2024 | EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees | EMNLP 2024 | EAGLE2 | code |
2025 | Learning Harmonized Representations for Speculative Sampling | ICLR 2025 | HASS | code |
2025 | Parallel Speculative Decoding with Adaptive Draft Length | ICLR 2025 | PEARL | code |
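As context for these entries, speculative decoding drafts several tokens with a small model and verifies them with the target model in a single forward pass. The sketch below uses greedy verification and random toy stand-in models, which is simpler than the rejection-sampling verification the papers above build on.

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, prefix: torch.Tensor, num_draft: int = 4):
    """Draft `num_draft` tokens greedily, then keep the longest prefix the target model agrees with.

    Both models map a token-id sequence [1, seq_len] to logits [1, seq_len, vocab].
    Greedy verification only; the listed papers use rejection sampling to match the target distribution.
    """
    # 1) Draft tokens autoregressively with the cheap model.
    seq = prefix.clone()
    for _ in range(num_draft):
        next_tok = draft_model(seq)[:, -1].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_tok], dim=-1)

    # 2) Verify all drafted positions with one target-model forward pass.
    target_pred = target_model(seq)[:, :-1].argmax(dim=-1)   # target's next-token prediction at each position
    drafted = seq[:, prefix.shape[1]:]
    target_for_draft = target_pred[:, prefix.shape[1] - 1:]

    accepted = prefix
    for i in range(num_draft):
        if drafted[0, i] != target_for_draft[0, i]:
            # First mismatch: take the target model's token instead and stop.
            return torch.cat([accepted, target_for_draft[:, i:i + 1]], dim=-1)
        accepted = torch.cat([accepted, drafted[:, i:i + 1]], dim=-1)
    return accepted

# Toy models: random logits over a 50-token vocabulary (in real use: a small and a large LM).
vocab = 50
draft_model = lambda ids: torch.randn(ids.shape[0], ids.shape[1], vocab)
target_model = lambda ids: torch.randn(ids.shape[0], ids.shape[1], vocab)
prefix = torch.randint(0, vocab, (1, 8))
print(speculative_step(draft_model, target_model, prefix).shape)
```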