This repository aims to consolidate resources for learning about systems for LLMs. I have attempted to compile a list of resources (blogs/papers) that are essential for building foundational knowledge of the field. This is by no means exhaustive. The criteria for a resource to be included in this list are:
- It is simple (not necessarily easy!) to follow
- It is fundamental to the domain of systems and LLMs, i.e., it is either widely adopted or explores a critical idea
- It is good for someone starting in the area or someone with intermediate knowledge in the field
- What Every Developer Should Know About GPU Computing
- GPU Glossary
  - A starting point for understanding GPUs and the terminology used in GPU programming
- Domain specific architectures for AI inference
  - A primer on what a good GPU architecture looks like
- From Online Softmax to FlashAttention
  - A derivation of FlashAttention, starting from online softmax (see the sketch after this list)
- ELI5: FlashAttention
- Making Deep Learning Go Brrrr From First Principles
- Fast Inference from Transformers via Speculative Decoding
- PyTorch and CPU-GPU Synchronizations
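To make the online-softmax idea behind FlashAttention concrete, here is a minimal NumPy sketch (my own illustration, not code from the posts above): it computes the softmax normalizer in a single streaming pass by keeping a running maximum and a running sum, the same rescaling trick FlashAttention applies block-by-block to attention scores.

```python
import numpy as np

def online_softmax(x):
    """One-pass softmax: keep a running max m and a running normalizer d,
    rescaling d whenever the max changes. FlashAttention applies the same
    rescaling to tiles of attention scores so the full score matrix never
    has to be materialized."""
    m = -np.inf   # running maximum seen so far
    d = 0.0       # running sum of exp(x_i - m)
    for xi in x:
        m_new = max(m, xi)
        d = d * np.exp(m - m_new) + np.exp(xi - m_new)
        m = m_new
    return np.exp(np.asarray(x) - m) / d

# Sanity check against the standard two-pass softmax.
x = np.random.randn(16)
ref = np.exp(x - x.max()); ref /= ref.sum()
assert np.allclose(online_softmax(x), ref)
```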
IMO, understanding parameter arithmetic is the key to performance optimization in LLMs.
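As a rough back-of-the-envelope sketch of what that arithmetic looks like (all numbers are round figures I picked for illustration, not taken from the posts below: a 7B-parameter model in FP16 on a GPU with about 2 TB/s of HBM bandwidth and about 300 TFLOP/s of FP16 compute):

```python
# Back-of-the-envelope decode arithmetic; hardware numbers are assumed
# round figures, for illustration only.
params = 7e9                 # model parameters
bytes_per_param = 2          # FP16 weights
hbm_bandwidth = 2e12         # bytes/s of HBM bandwidth (assumed)
peak_flops = 3e14            # FP16 FLOP/s (assumed)

# Batch size 1, ignoring the KV cache: each decoded token streams all weights
# from HBM once and performs roughly 2 FLOPs per parameter (multiply + add).
t_memory = params * bytes_per_param / hbm_bandwidth   # ~7 ms per token
t_compute = 2 * params / peak_flops                   # ~0.05 ms per token

print(f"memory-bound time per token : {t_memory * 1e3:.2f} ms")
print(f"compute-bound time per token: {t_compute * 1e3:.3f} ms")
```

Because the memory time dominates by roughly two orders of magnitude, single-stream decode is memory bound; batching more requests amortizes the weight reads and pushes the workload toward the compute-bound regime, which is what the resources below unpack.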
- Transformer Inference Arithmetic
  - Understanding compute-bound vs. memory-bound workloads
- How is LLaMa.cpp possible?
  - A real-world example of what it means to be memory bound
- LLM Inference Economics from First Principles
  - Ties parameter arithmetic to real-world serving costs
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- The Ultra-Scale Playbook: Training LLMs on GPU Clusters
  - One of the best resources for understanding distributed training
- How to Scale Your Model
- Visualizing 6D Mesh Parallelism
- 1.5x faster MoE training with custom MXFP8 kernels
- Accelerate ND-Parallel
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency
  - An introduction to batching in LLMs
- Continuous batching from first principles
- Large Transformer Model Inference Optimization
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- Optimizing AI Inference at Character.AI
- Throughput is all you need
  - A primer on how to think about throughput in LLM systems; covers continuous batching, paged attention, and the basics of the vLLM orchestrator
- Flash-Decoding for long-context inference
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation
- One Kernel for All Your GPUs
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- Inside Kaiju
- Tiled Matrix Multiplication
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Outperforming cuBLAS on H100: a Worklog
- Writing Speed-of-Light Flash Attention for 5090 in CUDA C++
- How To Write A Fast Matrix Multiplication From Scratch With Tensor Cores
- GPUs Go Brrr
- Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B
- Inside NVIDIA GPUs: Anatomy of high performance matmul kernels
- Matrix Multiplication on Blackwell: Part 1 - Introduction
- Dissecting FlashInfer - A Systems Perspective on High-Performance LLM Inference
- Notes About Nvidia GPU Shared Memory Banks
- Chasing 6+ TB/s: an MXFP8 quantizer on Blackwell
  - Notes on packing quantization scales into the format required by a downstream GEMM
- Implementing a fast Tensor Core matmul on the Ada Architecture
- GPU networking basics
- A Beginner's Guide to Interconnects in AI Datacenters
- Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
- Faster LLMs
  - Multiple lectures from industry leaders on serving LLMs as applications, and on how that differs from serving traditional ML models and regular web services
List of labs working on LLM systems