Awesome AI System

This repo is inspired by awesome tensor compilers.

Contents

Paper-Code

Researcher

| Name | University | Homepage |
| --- | --- | --- |
| Ion Stoica | UC Berkeley | Website |
| Joseph E. Gonzalez | UC Berkeley | Website |
| Matei Zaharia | UC Berkeley | Website |
| Zhihao Jia | CMU | Website |
| Tianqi Chen | CMU | Website |
| Stephanie Wang | UW | Website |
| Xingda Wei | SJTU | Website |
| Zeyu Min | SJTU | Website |
| Xin Jin | PKU | Website |
| Harry Xu | UCLA | Website |
| Anand Iyer | Georgia Tech | Website |
| Ravi Netravali | Princeton | Website |
| Christos Kozyrakis | Stanford | Website |
| Christopher Ré | Stanford | Website |
| Tri Dao | Princeton | Website |
| Mosharaf Chowdhury | UMich | Website |
| Shivaram Venkataraman | Wisc | Website |
| Hao Zhang | UCSD | Website |
| Yiying Zhang | UCSD | Website |
| Ana Klimovic | ETH | Website |
| Fan Lai | UIUC | Website |
| Lianmin Zheng | UC Berkeley | Website |
| Ying Sheng | Stanford | Website |
| Zhuohan Li | UC Berkeley | Website |
| Woosuk Kwon | UC Berkeley | Website |
| Zihao Ye | University of Washington | Website |
| Amey Agrawal | Georgia Tech | Website |

LLM Serving Framework

| Title | Github |
| --- | --- |
| MLC LLM | Star |
| TensorRT-LLM | Star |
| xFasterTransformer | Star |
| CTranslate2 (low latency) | Star |
| llama2.c | Star |

LLM Evaluation Platform

| Title | Github | Website |
| --- | --- | --- |
| FastChat | Star | Website |

LLM Inference (System Side)

| Title | Paper | Github | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| XSched: Preemptive Scheduling for Diverse XPUs | arXiv | Star | - | OSDI'25 |
| TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference | arXiv | Star | - | arXiv'25 |
| ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production | arXiv | Star | - | arXiv'25 |
| Resource Multiplexing in Tuning and Serving Large Language Models | arXiv | Star | - | ATC'25 |
| RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference | arXiv | Star | - | arXiv, May 2025 |
| SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting | arXiv | Star | - | ISCA'25 |
| LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading | arXiv | Star | - | ISCA'25 |
| Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving | arXiv | Star | - | SIGMOD'25 |
| Marconi: Prefix Caching for the Era of Hybrid LLMs | arXiv | Star | - | MLSys'25 |
| SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs | arXiv | Star | - | EuroSys'25 Best Paper |
| NeuStream: Bridging Deep Learning Serving and Stream Processing | arXiv | Star | - | EuroSys'25 |
| Towards End-to-End Optimization of LLM-based Applications with Ayo | arXiv | Star | - | ASPLOS'25 |
| NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | arXiv | Star | - | MLSys'25 |
| CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | arXiv | Star | - | EuroSys'25 Best Paper |
| Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow | arXiv | Star | - | ASPLOS'25 |
| GLINTHAWK: A Two-Tiered Architecture for High-Throughput LLM Inference | arXiv | Star | - | arXiv, Jan 2025 |
| Queue Management for SLO-Oriented Large Language Model Serving | arXiv | Star | - | SoCC'24 |
| NanoFlow: Towards Optimal Large Language Model Serving Throughput | arXiv | Star | - | OSDI'25 |
| PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | arXiv | Star | - | SOSP'24 |
| LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | arXiv | Star | - | SOSP'24 |
| Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | arXiv | Star | - | MLSys'24 |
| LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference | arXiv | Star | - | ISCA'24 |
| Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference | arXiv | Star | - | ISCA'24 |
| Prompt Cache: Modular Attention Reuse for Low-Latency Inference | arXiv | Star | - | MLSys'24 |
| Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | arXiv | Star | - | OSDI'24 |
| DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | arXiv | Star | - | OSDI'24 |
| Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving | arXiv | Star | - | July 2024 |
| Llumnix: Dynamic Scheduling for Large Language Model Serving | arXiv | Star | - | OSDI'24 |
| Parrot: Efficient Serving of LLM-based Application with Semantic Variables | arXiv | Star | - | OSDI'24 |
| CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming | arXiv | Star | - | SIGCOMM'24 |
| Efficiently Programming Large Language Models using SGLang | arXiv | Star | - | NeurIPS'24 |
| Efficient Memory Management for Large Language Model Serving with PagedAttention | arXiv | Star | - | SOSP'23 |
| SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification | arXiv | Star | - | Dec 2023 |
| Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference | - | Star | - | PPoPP'24 |
| Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | arXiv | Star | - | VLDB'24 |

RAG And ANNS

| Title | Paper | Github | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| LEANN: A Low-Storage Vector Index | arXiv | Star | - | arXiv'25 |
| OdinANN: Direct Insert for Consistently Stable Performance in Billion-Scale Graph-Based Vector Search | arXiv | Star | - | FAST'26 |
| Achieving Low-Latency Graph-Based Vector Search via Aligning Best-First Search Algorithm with SSD | arXiv | Star | - | OSDI'25 |
| Quake: Adaptive Indexing for Vector Search | arXiv | Star | - | OSDI'25 |
| Hermes: Algorithm-System Co-design for Efficient Retrieval Augmented Generation At-Scale | arXiv | Star | - | ISCA'25 |
| PathWeaver: A High-Throughput Multi-GPU System for Graph-Based Approximate Nearest Neighbor Search | arXiv | Star | - | ATC'25 |
| In-Storage Acceleration of Retrieval Augmented Generation as a Service (Artifact Evaluation README) | arXiv | Star | - | ISCA'25 |
| RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving | arXiv | Star | - | ISCA'25 |

RLHF

| Title | Paper | Github | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| Optimizing RLHF Training for Large Language Models with Stage Fusion | arXiv | Star | - | NSDI'25 |
| HybridFlow: A Flexible and Efficient RLHF Framework | arXiv | Star | - | EuroSys'25 |
| ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation | arXiv | Star | - | June 2024 |
| OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework | arXiv | Star | - | May 2024 |

Video

| Title | Paper | Github | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| Katz: Efficient Workflow Serving for Diffusion Models with Many Adapters | arXiv | Star | - | ATC'25 |
| PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via Pool-Based Pipeline Parallelism | arXiv | Star | - | Nov 2024 |
| xDiT: An Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism | arXiv | Star | - | Nov 2024 |
| FastVideo | arXiv | Star | - | Dec 2024 |

LLM Inference (AI Side)

| Title | Paper | Github | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| InferCept: Efficient Intercept Support for Augmented Large Language Model Inference | arXiv | Star | - | ICML'24 |
| Online Speculative Decoding | arXiv | Star | - | ICML'24 |
| MuxServe: Flexible Spatial-Temporal Multiplexing for LLM Serving | arXiv | Star | - | ICML'24 |
| BitDelta: Your Fine-Tune May Only Be Worth One Bit | arXiv | Star | - | Feb 2024 |
| Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads | arXiv | Star | - | Jan 2024 |
| LLMCompiler: An LLM Compiler for Parallel Function Calling | arXiv | Star | - | Dec 2023 |
| Mamba: Linear-Time Sequence Modeling with Selective State Spaces | arXiv | Star | - | Dec 2023 |
| Teaching LLMs memory management for unbounded context | arXiv | Star | - | Oct 2023 |
| Break the Sequential Dependency of LLM Inference Using Lookahead Decoding | arXiv | Star | - | Feb 2024 |
| EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation | arXiv | Star | - | Jan 2024 |

LLM MoE

| Title | Paper | Github | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference | arXiv | Star | - | ISCA'24 |
| SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models | arXiv | Star | - | MLSys'24 |
| ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling | arXiv | Star | - | EuroSys'24 |

LoRA

| Title | Paper | Github | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving | arXiv | Star | - | OSDI'24 |
| S-LoRA: Serving Thousands of Concurrent LoRA Adapters | arXiv | Star | - | Nov 2023 |
| Punica: Serving multiple LoRA finetuned LLM as one | arXiv | Star | - | Oct 2023 |

Framework

Parallelism Training

Training

Communication

Serving-Inference

MoE

GPU Cluster Management

Schedule and Resource Management

Optimization

GNN

Fine-Tune

Energy

Misc

Contribute

Contributions to this repository are welcome. Open an issue or send a pull request.
