This repository aims to consolidate resources for learning about systems for LLMs. I have attempted to compile a list of resources (blogs/papers) that are essential for building foundational knowledge of the field. This list is by no means exhaustive. The criteria for a resource to appear here are:
- It is simple (not necessarily easy!) to follow
- It is fundamental to the domain of systems and LLMs, i.e., it is either widely adopted or explores a critical idea
- It is good for someone starting in the area or someone with intermediate knowledge in the field
- What Every Developer Should Know About GPU Computing
- GPU Glossary
  - A starting point for understanding GPUs and the terms used in GPU programming
- Domain specific architectures for AI inference
  - A primer on what a good GPU architecture looks like
- From Online Softmax to FlashAttention
  - Derivation of FlashAttention, starting from online softmax (see the short sketch after this group of links)
- ELI5: FlashAttention
- Making Deep Learning Go Brrrr From First Principles
- Fast Inference from Transformers via Speculative Decoding
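To make the FlashAttention derivation above easier to follow, here is a minimal NumPy sketch of the online-softmax recurrence it builds on (the chunked loop and variable names are my own illustration, not taken from the post):

```python
import numpy as np

def online_softmax(scores, chunk=4):
    """Softmax over `scores` in a single streaming pass over chunks,
    without ever holding the exponentials of all scores at once."""
    m = -np.inf   # running maximum seen so far
    l = 0.0       # running sum of exp(score - m)
    for i in range(0, len(scores), chunk):
        s = scores[i:i + chunk]
        m_new = max(m, s.max())
        # rescale the old sum to the new maximum, then fold in the new chunk
        l = l * np.exp(m - m_new) + np.exp(s - m_new).sum()
        m = m_new
    return np.exp(scores - m) / l  # a second pass normalizes

x = np.random.randn(16)
assert np.allclose(online_softmax(x), np.exp(x - x.max()) / np.exp(x - x.max()).sum())
```

FlashAttention applies the same rescaling trick to a running attention-output accumulator, which is what lets it process key/value tiles one at a time without materializing the full score matrix.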
IMO, understanding parameter arithmetic is the key to performance optimization in LLMs.
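As a taste of what that arithmetic looks like, here is a rough batch-1 decode estimate; the model size and hardware numbers are assumptions picked for illustration, not measurements:

```python
# Back-of-the-envelope decode-step estimate (all numbers are illustrative assumptions).
params = 7e9            # a 7B-parameter model
bytes_per_param = 2     # fp16/bf16 weights
hbm_bandwidth = 2.0e12  # ~2 TB/s memory bandwidth (assumed accelerator)
peak_flops = 300e12     # ~300 TFLOP/s half-precision compute (assumed accelerator)

# Generating one token for one sequence reads every weight once
# and performs roughly 2 FLOPs per weight (one multiply, one add).
time_memory = (params * bytes_per_param) / hbm_bandwidth   # time to stream the weights
time_compute = (2 * params) / peak_flops                   # time to do the math

print(f"memory-bound estimate : {time_memory * 1e3:.2f} ms/token")   # ~7 ms
print(f"compute-bound estimate: {time_compute * 1e3:.3f} ms/token")  # ~0.05 ms
# The memory term dominates, so batch-1 decoding is memory-bound; batching more
# sequences amortizes the weight reads, which is the lever most of the posts below pull.
```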
- Transformer Inference Arithmetic
  - Understanding compute-bound vs. memory-bound workloads
- How is LLaMa.cpp possible?
  - A real-world example of what it means to be memory-bound
- LLM Inference Economics from First Principles
  - Finally merges the parameter arithmetic above with real-world costs
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- The Ultra-Scale Playbook: Training LLMs on GPU Clusters
  - One of the best resources for understanding distributed training
- How to Scale Your Model
- Visualizing 6D Mesh Parallelism
- 1.5x faster MoE training with custom MXFP8 kernels
- Accelerate ND-Parallel
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency
  - An introduction to batching in LLMs
- Large Transformer Model Inference Optimization
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- Throughput is all you need
  - A primer on how to think about throughput in LLM systems. Covers continuous batching, PagedAttention, and the basics of the vLLM orchestrator
- Flash-Decoding for long-context inference
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation
- One Kernel for All Your GPUs
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- Tiled Matrix Multiplication (a toy tiling sketch follows this list)
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Outperforming cuBLAS on H100: a Worklog
- Writing Speed-of-Light Flash Attention for 5090 in CUDA C++
- How To Write A Fast Matrix Multiplication From Scratch With Tensor Cores
- GPUs Go Brrr
- Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B
- Inside NVIDIA GPUs: Anatomy of high performance matmul kernels
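As a companion to the matmul worklogs above, here is a toy blocked matmul in NumPy that only illustrates the tiling/data-reuse idea; real kernels stage these tiles through shared memory and registers, which the worklogs cover in detail:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiply: each output tile accumulates partial products
    one K-tile at a time, so loaded tiles are reused across a block of outputs."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                C[i:i + tile, j:j + tile] += A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
    return C

A = np.random.rand(128, 96)
B = np.random.rand(96, 64)
assert np.allclose(tiled_matmul(A, B), A @ B)
```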
List of labs working on LLM systems