Awesome AI System

This repo is inspired by awesome tensor compilers.

Contents

Paper-Code

Researcher

| Name | University | Homepage |
| --- | --- | --- |
| Ion Stoica | UC Berkeley | Website |
| Joseph E. Gonzalez | UC Berkeley | Website |
| Matei Zaharia | UC Berkeley | Website |
| Zhihao Jia | CMU | Website |
| Tianqi Chen | CMU | Website |
| Stephanie Wang | UW | Website |
| Xingda Wei | SJTU | Website |
| Zeyu Min | SJTU | Website |
| Xin Jin | PKU | Website |
| Harry Xu | UCLA | Website |
| Anand Iyer | Georgia Tech | Website |
| Ravi Netravali | Princeton | Website |
| Christos Kozyrakis | Stanford | Website |
| Christopher Ré | Stanford | Website |
| Tri Dao | Princeton | Website |
| Mosharaf Chowdhury | UMich | Website |
| Shivaram Venkataraman | Wisc | Website |
| Hao Zhang | UCSD | Website |
| Yiying Zhang | UCSD | Website |
| Ana Klimovic | ETH | Website |
| Fan Lai | UIUC | Website |
| Lianmin Zheng | UC Berkeley | Website |
| Ying Sheng | Stanford | Website |
| Zhuohan Li | UC Berkeley | Website |
| Woosuk Kwon | UC Berkeley | Website |
| Zihao Ye | University of Washington | Website |
| Amey Agrawal | Georgia Tech | Website |

LLM Serving Framework

| Title | Github |
| --- | --- |
| MLC LLM | Star |
| TensorRT-LLM | Star |
| xFasterTransformer | Star |
| CTranslate2 (low latency) | Star |
| llama2.c | Star |

LLM Evaluation Platform

| Title | Github | Website |
| --- | --- | --- |
| FastChat | Star | Website |

LLM Inference (System Side)

| Title | Paper | Github | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| XSched: Preemptive Scheduling for Diverse XPUs | arXiv | Star | - | OSDI'25 |
| TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference | arXiv | Star | - | arXiv'25 |
| ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production | arXiv | Star | - | arXiv'25 |
| Resource Multiplexing in Tuning and Serving Large Language Models | arXiv | Star | - | ATC'25 |
| RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference | arXiv | Star | - | arXiv, May 2025 |
| SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting | arXiv | Star | - | ISCA'25 |
| LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading | arXiv | Star | - | ISCA'25 |
| Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving | arXiv | Star | - | SIGMOD'25 |
| Marconi: Prefix Caching for the Era of Hybrid LLMs | arXiv | Star | - | MLSys'25 |
| SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs | arXiv | Star | - | EuroSys'25 Best Paper |
| NeuStream: Bridging Deep Learning Serving and Stream Processing | arXiv | Star | - | EuroSys'25 |
| Towards End-to-End Optimization of LLM-based Applications with Ayo | arXiv | Star | - | ASPLOS'25 |
| NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | arXiv | Star | - | MLSys'25 |
| CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | arXiv | Star | - | EuroSys'25 Best Paper |
| Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow | arXiv | Star | - | ASPLOS'25 |
| GLINTHAWK: A Two-Tiered Architecture for High-Throughput LLM Inference | arXiv | Star | - | arXiv, Jan 2025 |
| Queue Management for SLO-Oriented Large Language Model Serving | arXiv | Star | - | SoCC'24 |
| NanoFlow: Towards Optimal Large Language Model Serving Throughput | arXiv | Star | - | OSDI'25 |
| PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | arXiv | Star | - | SOSP'24 |
| LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | arXiv | Star | - | SOSP'24 |
| Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | arXiv | Star | - | MLSys'24 |
| LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference | arXiv | Star | - | ISCA'24 |
| Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference | arXiv | Star | - | ISCA'24 |
| Prompt Cache: Modular Attention Reuse for Low-Latency Inference | arXiv | Star | - | MLSys'24 |
| Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | arXiv | Star | - | OSDI'24 |
| DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | arXiv | Star | - | OSDI'24 |
| Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving | arXiv | Star | - | July 2024 |
| Llumnix: Dynamic Scheduling for Large Language Model Serving | arXiv | Star | - | OSDI'24 |
| Parrot: Efficient Serving of LLM-based Application with Semantic Variables | arXiv | Star | - | OSDI'24 |
| CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming | arXiv | Star | - | SIGCOMM'24 |
| Efficiently Programming Large Language Models using SGLang | arXiv | Star | - | NeurIPS'24 |
| Efficient Memory Management for Large Language Model Serving with PagedAttention | arXiv | Star | - | SOSP'23 |
| SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification | arXiv | Star | - | Dec 2023 |
| Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference | - | Star | - | PPoPP'24 |
| Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | arXiv | Star | - | VLDB'24 |

RAG And ANNS

| Title | Paper | Github | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| LEANN: A Low-Storage Vector Index | arXiv | Star | - | arXiv'25 |
| OdinANN: Direct Insert for Consistently Stable Performance in Billion-Scale Graph-Based Vector Search | arXiv | Star | - | FAST'26 |
| Achieving Low-Latency Graph-Based Vector Search via Aligning Best-First Search Algorithm with SSD | arXiv | Star | - | OSDI'25 |
| Quake: Adaptive Indexing for Vector Search | arXiv | Star | - | OSDI'25 |
| Hermes: Algorithm-System Co-design for Efficient Retrieval Augmented Generation At-Scale | arXiv | Star | - | ISCA'25 |
| PathWeaver: A High-Throughput Multi-GPU System for Graph-Based Approximate Nearest Neighbor Search | arXiv | Star | - | ATC'25 |
| In-Storage Acceleration of Retrieval Augmented Generation as a Service (Artifact Evaluation README) | arXiv | Star | - | ISCA'25 |
| RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving | arXiv | Star | - | ISCA'25 |

RLHF

| Title | Paper | Github | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| Optimizing RLHF Training for Large Language Models with Stage Fusion | arXiv | Star | - | NSDI'25 |
| HybridFlow: A Flexible and Efficient RLHF Framework | arXiv | Star | - | EuroSys'25 |
| ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation | arXiv | Star | - | June 2024 |
| OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework | arXiv | Star | - | May 2024 |

Video

| Title | Paper | Github | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| Katz: Efficient Workflow Serving for Diffusion Models with Many Adapters | arXiv | Star | - | ATC'25 |
| PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via Pool-Based Pipeline Parallelism | arXiv | Star | - | Nov 2024 |
| xDiT: An Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism | arXiv | Star | - | Nov 2024 |
| FastVideo | arXiv | Star | - | Dec 2024 |

LLM Inference (AI Side)

| Title | Paper | Github | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| InferCept: Efficient Intercept Support for Augmented Large Language Model Inference | arXiv | Star | - | ICML'24 |
| Online Speculative Decoding | arXiv | Star | - | ICML'24 |
| MuxServe: Flexible Spatial-Temporal Multiplexing for LLM Serving | arXiv | Star | - | ICML'24 |
| BitDelta: Your Fine-Tune May Only Be Worth One Bit | arXiv | Star | - | Feb 2024 |
| Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads | arXiv | Star | - | Jan 2024 |
| LLMCompiler: An LLM Compiler for Parallel Function Calling | arXiv | Star | - | Dec 2023 |
| Mamba: Linear-Time Sequence Modeling with Selective State Spaces | arXiv | Star | - | Dec 2023 |
| Teaching LLMs memory management for unbounded context | arXiv | Star | - | Oct 2023 |
| Break the Sequential Dependency of LLM Inference Using Lookahead Decoding | arXiv | Star | - | Feb 2024 |
| EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation | arXiv | Star | - | Jan 2024 |

LLM MoE

| Title | Paper | Github | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference | arXiv | Star | - | ISCA'24 |
| SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models | arXiv | Star | - | MLSys'24 |
| ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling | arXiv | Star | - | EuroSys'24 |

LoRA

| Title | Paper | Github | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving | arXiv | Star | - | OSDI'24 |
| S-LoRA: Serving Thousands of Concurrent LoRA Adapters | arXiv | Star | - | Nov 2023 |
| Punica: Serving multiple LoRA finetuned LLM as one | arXiv | Star | - | Oct 2023 |

Framework

Parallelism Training

Training

Communication

Serving-Inference

MoE

GPU Cluster Management

Schedule and Resource Management

Optimization

GNN

Fine-Tune

Energy

Misc

Contribute

Contributions to this repository are welcome. Open an issue or send a pull request.
