This repository aims to consolidate resources for learning about systems for LLMs. I have attempted to compile a list of resources (blogs/papers) that are essential for building foundational knowledge of the field. This is by no means exhaustive. The criteria for a resource to be included in this list are:
- It is simple (not necessarily easy!) to follow
- It is fundamental to the domain of systems and LLMs, i.e., it is either widely adopted or explores a critical idea
- It is good for someone starting in the area or someone with intermediate knowledge in the field
- What Every Developer Should Know About GPU Computing
- GPU Glossary
  - A starting point for understanding GPUs and the terminology used in GPU programming
- Domain specific architectures for AI inference
  - A primer on what a good GPU architecture looks like
- From Online Softmax to FlashAttention
  - A derivation of FlashAttention, starting from online softmax (see the sketch after this list)
- ELI5: FlashAttention
- Making Deep Learning Go Brrrr From First Principles
- Fast Inference from Transformers via Speculative Decoding
- PyTorch and CPU-GPU Synchronizations
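To make the online-softmax idea behind FlashAttention concrete, here is a minimal NumPy sketch (my own illustration, not code from the posts above): it computes the softmax normalizer in a single streaming pass by keeping a running maximum and a running sum, the same rescaling trick FlashAttention applies block-by-block to attention scores.

```python
import numpy as np

def online_softmax(x):
    """One-pass softmax: keep a running max m and a running normalizer d,
    rescaling d whenever the max changes. FlashAttention applies the same
    rescaling to tiles of attention scores so the full score matrix never
    has to be materialized."""
    m = -np.inf   # running maximum seen so far
    d = 0.0       # running sum of exp(x_i - m)
    for xi in x:
        m_new = max(m, xi)
        d = d * np.exp(m - m_new) + np.exp(xi - m_new)
        m = m_new
    return np.exp(np.asarray(x) - m) / d

# Sanity check against the standard two-pass softmax.
x = np.random.randn(16)
ref = np.exp(x - x.max()); ref /= ref.sum()
assert np.allclose(online_softmax(x), ref)
```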
IMO, understanding parameter arithmetic is the key to performance optimization in LLMs.
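As a rough back-of-the-envelope sketch of what that arithmetic looks like (all numbers are round figures I picked for illustration, not taken from the posts below: a 7B-parameter model in FP16 on a GPU with about 2 TB/s of HBM bandwidth and about 300 TFLOP/s of FP16 compute):

```python
# Back-of-the-envelope decode arithmetic; hardware numbers are assumed
# round figures, for illustration only.
params = 7e9                 # model parameters
bytes_per_param = 2          # FP16 weights
hbm_bandwidth = 2e12         # bytes/s of HBM bandwidth (assumed)
peak_flops = 3e14            # FP16 FLOP/s (assumed)

# Batch size 1, ignoring the KV cache: each decoded token streams all weights
# from HBM once and performs roughly 2 FLOPs per parameter (multiply + add).
t_memory = params * bytes_per_param / hbm_bandwidth   # ~7 ms per token
t_compute = 2 * params / peak_flops                   # ~0.05 ms per token

print(f"memory-bound time per token : {t_memory * 1e3:.2f} ms")
print(f"compute-bound time per token: {t_compute * 1e3:.3f} ms")
```

Because the memory time dominates by roughly two orders of magnitude, single-stream decode is memory bound; batching more requests amortizes the weight reads and pushes the workload toward the compute-bound regime, which is what the resources below unpack.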
- Transformer Inference Arithmetic
  - Understanding compute-bound vs. memory-bound workloads
- How is LLaMa.cpp possible?
  - A real-world example of what it means to be memory bound
- LLM Inference Economics from First Principles
  - Ties parameter arithmetic to real-world serving costs
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- The Ultra-Scale Playbook: Training LLMs on GPU Clusters
  - One of the best resources for understanding distributed training
- How to Scale Your Model
- Visualizing 6D Mesh Parallelism
- 1.5x faster MoE training with custom MXFP8 kernels
- Accelerate ND-Parallel
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency
  - An introduction to batching in LLMs
- Continuous batching from first principles
- Large Transformer Model Inference Optimization
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- Optimizing AI Inference at Character.AI
- Throughput is all you need
  - A primer on how to think about throughput in LLM systems; covers continuous batching, paged attention, and the basics of the vLLM orchestrator
- Flash-Decoding for long-context inference
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation
- One Kernel for All Your GPUs
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- Inside Kaiju
- Tiled Matrix Multiplication
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Outperforming cuBLAS on H100: a Worklog
- Writing Speed-of-Light Flash Attention for 5090 in CUDA C++
- How To Write A Fast Matrix Multiplication From Scratch With Tensor Cores
- GPUs Go Brrr
- Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B
- Inside NVIDIA GPUs: Anatomy of high performance matmul kernels
- Matrix Multiplication on Blackwell: Part 1 - Introduction
- Dissecting FlashInfer - A Systems Perspective on High-Performance LLM Inference
- Notes About Nvidia GPU Shared Memory Banks
- Chasing 6+ TB/s: an MXFP8 quantizer on Blackwell
  - Notes on packing quantization scales into the format required by a downstream GEMM
- Implementing a fast Tensor Core matmul on the Ada Architecture
- GPU networking basics
- A Beginner's Guide to Interconnects in AI Datacenters
- Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
- Faster LLMs
  - Multiple lectures from industry leaders on serving LLMs as applications, and on how that differs from serving traditional ML models and regular web services
List of labs working on LLM systems