| Name | Paper | Venue | Material | Keywords | Rating |
|------|-------|-------|----------|----------|--------|
| GPT-3 | Language Models are Few-Shot Learners | NeurIPS 20 | | LLM / Pre-training | ⭐️⭐️⭐️⭐️ |
| LLaMA | LLaMA: Open and Efficient Foundation Language Models | arXiv 23 | Code | LLM / Pre-training | ⭐️⭐️⭐️⭐️ |
| Llama 2 | Llama 2: Open Foundation and Fine-Tuned Chat Models | arXiv 23 | Model | LLM / Pre-training / Fine-tuning / Safety | ⭐️⭐️⭐️⭐️ |
| MQA | Fast Transformer Decoding: One Write-Head is All You Need | arXiv 19 | | Multi-Query Attention | ⭐️⭐️⭐️ |
| GQA | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | arXiv 23 | | Grouped-Query Attention | ⭐️⭐️⭐️⭐️ |
| RoPE | RoFormer: Enhanced Transformer with Rotary Position Embedding | arXiv 21 | | Rotary Position Embedding | ⭐️⭐️⭐️⭐️ |
| Megatron-LM | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | SC 21 | Code | Tensor Parallelism / Pipeline Parallelism | ⭐️⭐️⭐️⭐️⭐️ |
| Alpa | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI 22 | Code | Automatic Parallelism | ⭐️⭐️⭐️ |
| GPipe | GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | NeurIPS 19 | | Pipeline Parallelism | ⭐️⭐️⭐️ |
| Google's Practice | Efficiently Scaling Transformer Inference | MLSys 23 | | Partitioning | ⭐️⭐️⭐️⭐️ |
| FlashAttention | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS 22 | Code | Memory Hierarchy / Softmax Tiling | ⭐️⭐️⭐️⭐️⭐️ |
| Orca | Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI 22 | Code | Continuous Batching | ⭐️⭐️⭐️⭐️⭐️ |
| PagedAttention | Efficient Memory Management for Large Language Model Serving with PagedAttention | SOSP 23 | Code | GPU Memory Paging | ⭐️⭐️⭐️⭐️⭐️ |
| FlexGen | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | ICML 23 | Code | Offloading | ⭐️⭐️⭐️ |
| Speculative Decoding | Fast Inference from Transformers via Speculative Decoding | ICML 23 | | Speculative Decoding | ⭐️⭐️⭐️⭐️ |
| LLM.int8() | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | NeurIPS 22 | Code | Mixed-Precision Quantization | ⭐️⭐️⭐️⭐️ |
| ZeroQuant | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | NeurIPS 22 | Code | Group-wise and Token-wise Quantization | ⭐️⭐️⭐️⭐️ |
| SmoothQuant | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 23 | | Quantization by Scaling | ⭐️⭐️⭐️⭐️ |
| AWQ | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | arXiv 23 | Code | Activation-aware Scaling | ⭐️⭐️⭐️⭐️ |
| GPTQ | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 23 | Code | Optimal Brain Quantization | ⭐️⭐️⭐️⭐️ |
| FP8 | FP8 Formats for Deep Learning | arXiv 22 | | FP8 Format | ⭐️⭐️⭐️ |
| Wanda | A Simple and Effective Pruning Approach for Large Language Models | ICLR 24 | Code | Pruning by Weights and Activations | ⭐️⭐️⭐️⭐️ |
| Deja Vu | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML 23 | Code | Pruning Based on Contextual Sparsity | ⭐️⭐️⭐️ |
| PowerInfer | PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | arXiv 23 | Code | Deja Vu + CPU Offloading | ⭐️⭐️⭐️ |
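A few of the entries above admit short illustrative sketches. GQA's core idea is that each key/value head is shared by a *group* of query heads, interpolating between standard multi-head attention (one KV head per query head) and MQA (one KV head for all). A minimal sketch of the head-to-group mapping (function name and sizes are mine, not from the paper):

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Map a query head to the KV head it shares under GQA grouping.

    n_kv_heads == n_q_heads -> standard multi-head attention (MHA)
    n_kv_heads == 1         -> multi-query attention (MQA)
    """
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```

For example, with 8 query heads and 2 KV heads, query heads 0-3 attend with KV head 0 and heads 4-7 with KV head 1, shrinking the KV cache by 4x.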
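RoPE (RoFormer) rotates each 2-D pair of query/key features by an angle proportional to the token position, so query-key dot products depend only on the *relative* distance between tokens. A toy pure-Python sketch of that property (tiny dimension, not the paper's implementation):

```python
import math

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to vector x at token position pos.

    Each adjacent pair (x[2i], x[2i+1]) is rotated by the angle pos * theta_i,
    where theta_i decreases geometrically with the pair index.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)                      # per-pair frequency
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out += [x[i] * c - x[i + 1] * s,              # 2-D rotation of the pair
                x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Because each pair is rotated by an orthogonal matrix, `dot(rope(q, m), rope(k, n))` depends only on `n - m`: shifting both positions by the same offset leaves the score unchanged, which is exactly what makes RoPE a relative position encoding.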
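PagedAttention manages the KV cache the way an OS manages virtual memory: a sequence's tokens live in fixed-size physical blocks allocated on demand, and a per-sequence block table maps logical positions to physical blocks, eliminating contiguous over-allocation. A toy allocator sketch (class name, block size, and pool size are made up for illustration):

```python
class BlockTable:
    """Toy KV-cache paging: logical token slots -> fixed-size physical blocks."""

    def __init__(self, block_size=4, num_blocks=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of free physical block ids
        self.table = []                      # logical block idx -> physical block id

    def slot_for_token(self, pos):
        """Return (physical_block, offset) for token position pos,
        allocating new blocks only when the sequence crosses a block boundary."""
        logical = pos // self.block_size
        while logical >= len(self.table):    # grow on demand, one block at a time
            self.table.append(self.free.pop(0))
        return self.table[logical], pos % self.block_size
```

With `block_size=4`, a 5-token sequence occupies exactly two physical blocks instead of a pre-reserved maximum-length region, which is the source of the paper's memory savings.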
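Speculative decoding lets a cheap draft model propose tokens that the target model then verifies in parallel; a draft token is kept with probability `min(1, p(x)/q(x))`, where `p` is the target distribution and `q` the draft distribution. A sketch of just that acceptance rule (function name is mine; on rejection the method resamples from the residual distribution, which is omitted here):

```python
import random

def speculative_accept(p, q, token, rng=random.random):
    """Accept a draft token with probability min(1, p[token] / q[token]).

    p: target-model probabilities, q: draft-model probabilities.
    This acceptance rule is what keeps the output distribution identical
    to sampling from the target model alone.
    """
    return rng() < min(1.0, p[token] / q[token])
```

Intuitively, tokens the target model likes at least as much as the draft did (`p >= q`) are always kept, so easy continuations are generated several tokens per target-model pass.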