| Name | Paper | Venue | Material | Keywords | Rating |
|------|-------|-------|----------|----------|--------|
| GPT-3 | Language Models are Few-Shot Learners | NeurIPS 20 | | LLM / Pre-training | ⭐️⭐️⭐️⭐️ |
| LLaMA | LLaMA: Open and Efficient Foundation Language Models | arXiv 23 | Code | LLM / Pre-training | ⭐️⭐️⭐️⭐️ |
| Llama 2 | Llama 2: Open Foundation and Fine-Tuned Chat Models | arXiv 23 | Model | LLM / Pre-training / Fine-tuning / Safety | ⭐️⭐️⭐️⭐️ |
| MQA | Fast Transformer Decoding: One Write-Head is All You Need | arXiv 19 | | Multi-Query Attention | ⭐️⭐️⭐️ |
| GQA | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | arXiv 23 | | Grouped-Query Attention | ⭐️⭐️⭐️⭐️ |
| RoPE | RoFormer: Enhanced Transformer with Rotary Position Embedding | arXiv 21 | | Rotary Position Embedding | ⭐️⭐️⭐️⭐️ |
| Megatron-LM | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | SC 21 | Code | Tensor Parallelism / Pipeline Parallelism | ⭐️⭐️⭐️⭐️⭐️ |
| Alpa | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI 22 | Code | Automatic Parallelism | ⭐️⭐️⭐️ |
| GPipe | GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | NeurIPS 19 | | Pipeline Parallelism | ⭐️⭐️⭐️ |
| Google's Practice | Efficiently Scaling Transformer Inference | MLSys 23 | | Partitioning | ⭐️⭐️⭐️⭐️ |
| FlashAttention | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS 22 | Code | Memory Hierarchy / Softmax Tiling | ⭐️⭐️⭐️⭐️⭐️ |
| Orca | Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI 22 | Code | Continuous Batching | ⭐️⭐️⭐️⭐️⭐️ |
| PagedAttention | Efficient Memory Management for Large Language Model Serving with PagedAttention | SOSP 23 | Code | GPU Memory Paging | ⭐️⭐️⭐️⭐️⭐️ |
| FlexGen | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | ICML 23 | Code | Offloading | ⭐️⭐️⭐️ |
| Speculative Decoding | Fast Inference from Transformers via Speculative Decoding | ICML 23 | | Speculative Decoding | ⭐️⭐️⭐️⭐️ |
| LLM.int8() | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | NeurIPS 22 | Code | Mixed-Precision Quantization | ⭐️⭐️⭐️⭐️ |
| ZeroQuant | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | NeurIPS 22 | Code | Group-wise and Token-wise Quantization | ⭐️⭐️⭐️⭐️ |
| SmoothQuant | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 23 | | Quantization by Scaling | ⭐️⭐️⭐️⭐️ |
| AWQ | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | arXiv 23 | Code | Activation-aware Scaling | ⭐️⭐️⭐️⭐️ |
| GPTQ | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 23 | Code | Optimal Brain Quantization | ⭐️⭐️⭐️⭐️ |
| FP8 | FP8 Formats for Deep Learning | arXiv 22 | | FP8 Format | ⭐️⭐️⭐️ |
| Wanda | A Simple and Effective Pruning Approach for Large Language Models | ICLR 24 | Code | Pruning by Weights and Activations | ⭐️⭐️⭐️⭐️ |
| Deja Vu | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML 23 | Code | Pruning Based on Contextual Sparsity | ⭐️⭐️⭐️ |
| PowerInfer | PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | arXiv 23 | Code | Deja Vu + CPU Offloading | ⭐️⭐️⭐️ |
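A few of the entries above admit short illustrative sketches. GQA's core idea is that each key/value head is shared by a *group* of query heads, interpolating between standard multi-head attention (one KV head per query head) and MQA (one KV head for all). A minimal sketch of the head-to-group mapping (function name and sizes are mine, not from the paper):

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Map a query head to the KV head it shares under GQA grouping.

    n_kv_heads == n_q_heads -> standard multi-head attention (MHA)
    n_kv_heads == 1         -> multi-query attention (MQA)
    """
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```

For example, with 8 query heads and 2 KV heads, query heads 0-3 attend with KV head 0 and heads 4-7 with KV head 1, shrinking the KV cache by 4x.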
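RoPE (RoFormer) rotates each 2-D pair of query/key features by an angle proportional to the token position, so query-key dot products depend only on the *relative* distance between tokens. A toy pure-Python sketch of that property (tiny dimension, not the paper's implementation):

```python
import math

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to vector x at token position pos.

    Each adjacent pair (x[2i], x[2i+1]) is rotated by the angle pos * theta_i,
    where theta_i decreases geometrically with the pair index.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)                      # per-pair frequency
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out += [x[i] * c - x[i + 1] * s,              # 2-D rotation of the pair
                x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Because each pair is rotated by an orthogonal matrix, `dot(rope(q, m), rope(k, n))` depends only on `n - m`: shifting both positions by the same offset leaves the score unchanged, which is exactly what makes RoPE a relative position encoding.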
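PagedAttention manages the KV cache the way an OS manages virtual memory: a sequence's tokens live in fixed-size physical blocks allocated on demand, and a per-sequence block table maps logical positions to physical blocks, eliminating contiguous over-allocation. A toy allocator sketch (class name, block size, and pool size are made up for illustration):

```python
class BlockTable:
    """Toy KV-cache paging: logical token slots -> fixed-size physical blocks."""

    def __init__(self, block_size=4, num_blocks=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of free physical block ids
        self.table = []                      # logical block idx -> physical block id

    def slot_for_token(self, pos):
        """Return (physical_block, offset) for token position pos,
        allocating new blocks only when the sequence crosses a block boundary."""
        logical = pos // self.block_size
        while logical >= len(self.table):    # grow on demand, one block at a time
            self.table.append(self.free.pop(0))
        return self.table[logical], pos % self.block_size
```

With `block_size=4`, a 5-token sequence occupies exactly two physical blocks instead of a pre-reserved maximum-length region, which is the source of the paper's memory savings.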
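Speculative decoding lets a cheap draft model propose tokens that the target model then verifies in parallel; a draft token is kept with probability `min(1, p(x)/q(x))`, where `p` is the target distribution and `q` the draft distribution. A sketch of just that acceptance rule (function name is mine; on rejection the method resamples from the residual distribution, which is omitted here):

```python
import random

def speculative_accept(p, q, token, rng=random.random):
    """Accept a draft token with probability min(1, p[token] / q[token]).

    p: target-model probabilities, q: draft-model probabilities.
    This acceptance rule is what keeps the output distribution identical
    to sampling from the target model alone.
    """
    return rng() < min(1.0, p[token] / q[token])
```

Intuitively, tokens the target model likes at least as much as the draft did (`p >= q`) are always kept, so easy continuations are generated several tokens per target-model pass.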