---
title: "Mini-SGLang: Efficient Inference Engine in a Nutshell"
author: "SGLang Team"
date: "December 17, 2025"
previewImg: /images/blog/minisgl/logo.png
---

We're excited to introduce **Mini-SGLang**, a lightweight yet high-performance inference framework for Large Language Models (LLMs). Derived from the [SGLang](https://github.com/sgl-project/sglang) project, Mini-SGLang is designed to demystify the complexities of modern serving systems. Despite its compact codebase, it retains the advanced features that define state-of-the-art performance, including **Radix Attention** for efficient KV cache reuse, **Chunked Prefill** for a controlled memory footprint, and **Tensor Parallelism** for scalable distributed serving. With an OpenAI-compatible API and out-of-the-box support for models like Llama-3 and Qwen3, Mini-SGLang serves as both a capable inference engine and a transparent reference implementation for researchers and developers.
The source code is available at [https://github.com/sgl-project/mini-sglang](https://github.com/sgl-project/mini-sglang).

<!-- ![Header](/images/blog/minisgl/logo.png) -->
## Motivation: Why Mini-SGLang?

Although SGLang has achieved state-of-the-art inference performance with a comprehensive feature set, its codebase has grown massive, reaching nearly 300k lines of Python code. To lower this complexity barrier for learners and researchers, we built Mini-SGLang with two main objectives: providing a learning resource and enabling fast prototyping for research.

### Educational Purposes

Mini-SGLang features a clean, highly modular codebase of only **5k lines of Python code**, which makes it significantly easier for beginners to understand the core components of a modern LLM serving engine.

Despite its simplicity, Mini-SGLang supports both online and offline inference and implements essential modern optimizations, including **Tensor Parallelism**, **Overlap Scheduling**, **Chunked Prefill**, **Radix Cache**, and **JIT CUDA kernels**. This makes it a comprehensive learning resource.

### Quick Research Prototype

Many ML and systems researchers struggle to integrate their optimizations into existing frameworks. On one hand, injecting new logic into a complex framework like SGLang is risky: it is easy to break implicit invariants of the system, which gives rise to subtle bugs. On the other hand, building an inference engine from scratch is tedious, requiring significant effort on infrastructure details (e.g., frontend servers, tokenization, NCCL communication) just to match state-of-the-art baselines.

Mini-SGLang strikes a balance. It offers an out-of-the-box, high-performance framework that is easy to inspect, extend and optimize. It handles the heavy lifting of infrastructure while remaining flexible enough for rapid prototyping. Additionally, Mini-SGLang provides **OpenAI-compatible benchmark utilities**, facilitating end-to-end performance analysis and comparison against other serving engines, such as [SGLang](https://github.com/sgl-project/sglang), [vLLM](https://github.com/vllm-project/vllm) and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). For kernel developers, Mini-SGLang also provides fine-grained **NVTX annotations**, which are valuable for kernel debugging and performance profiling.
## Features

Mini-SGLang shares the same high-level system architecture as SGLang, consisting of a frontend API server, a tokenizer server, and a backend scheduler for each GPU.

![system-design](/images/blog/minisgl/design.drawio.png)
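
Because the frontend exposes an OpenAI-compatible API, any standard OpenAI client can talk to a running Mini-SGLang server. The snippet below is a minimal illustration; the base URL, port, and model name are placeholders, so adjust them to match your own deployment.

```python
# Minimal sketch: querying a locally running, OpenAI-compatible endpoint.
# The base URL and port below are placeholders; point them at your Mini-SGLang server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",  # use the model name you launched the server with
    messages=[{"role": "user", "content": "Explain Radix Attention in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```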
### Overlap Scheduling

LLM inference is not just about GPU computation; a significant amount of work is handled by the CPU, including batch scheduling, memory management, and token processing. Without optimization, this CPU overhead can lead to GPU idling, hurting overall performance.

Mini-SGLang implements an **overlap scheduling** mechanism, similar to the one in SGLang, to mitigate this. By preparing the next batch of requests on the CPU while the GPU is busy with the current batch, it effectively hides the CPU overhead. As the Nsight Systems profiles below show, this keeps the GPU consistently utilized, eliminating GPU idleness and maximizing throughput. More technical details are available in our [previous blog post](https://lmsys.org/blog/2024-12-04-sglang-v0-4/).

![overlap](/images/blog/minisgl/overlap.png)

> An example of overlapped execution. The CPU execution overhead is fully hidden.

![no-overlap](/images/blog/minisgl/no-overlap.png)

> An example of non-overlapped execution. The CPU execution overhead leads to substantial GPU stalls.

To run an ablation study without overlap scheduling, set the environment variable `MINISGL_DISABLE_OVERLAP_SCHEDULING=1`.
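
As a rough illustration of the idea (this is not Mini-SGLang's actual scheduler code), the sketch below launches the GPU work for the current batch, performs the CPU-side preparation for the next batch while the GPU is busy, and only then synchronizes on the result:

```python
# Illustrative sketch of overlap scheduling, not Mini-SGLang's implementation.
# CUDA kernels launch asynchronously, so the CPU can prepare batch i+1
# while the GPU is still executing batch i; the synchronization happens
# only when the result of batch i is actually needed.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
weight = torch.randn(512, 512, device=device)

def gpu_forward(batch):
    # Stand-in for a model forward pass; returns without waiting for the GPU.
    return batch @ weight

def cpu_schedule(step):
    # Stand-in for CPU-side work: batch scheduling, memory management, tokenization.
    return torch.randn(64, 512)

next_batch = cpu_schedule(0).to(device, non_blocking=True)
for step in range(1, 8):
    batch = next_batch
    out = gpu_forward(batch)                                        # GPU starts immediately
    next_batch = cpu_schedule(step).to(device, non_blocking=True)   # overlapped CPU work
    token_ids = out.argmax(dim=-1).tolist()                         # sync point for this batch
```

In the real engine, the CPU-side work covers batch scheduling, memory management, and token processing rather than a toy `cpu_schedule` function, but the control flow follows the same pattern; setting `MINISGL_DISABLE_OVERLAP_SCHEDULING=1` serializes the two phases.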
### High-Performance Kernels

Mini-SGLang integrates state-of-the-art attention kernels to ensure top performance. It leverages [FlashAttention-3](https://github.com/Dao-AILab/flash-attention) for the prefill kernel and [FlashInfer](https://github.com/flashinfer-ai/flashinfer) for the decode kernel on the NVIDIA Hopper architecture.

Following [FlashInfer](https://github.com/flashinfer-ai/flashinfer) and [SGLang](https://github.com/sgl-project/sglang), Mini-SGLang also integrates just-in-time (JIT) compiled kernels for better runtime performance. We adopt [TVM FFI](https://github.com/apache/tvm-ffi) for the Python bindings, which is much faster than the default PyTorch binding interface due to its lightweight design.
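
The split between the two libraries follows the shape of the work: prefill runs causal attention over the whole prompt, while decode attends from a single new query token to the cached KV. The sketch below is illustrative only; it assumes the `flash-attn` and `flashinfer` Python packages are installed, and the exact import paths and tensor layouts vary across versions.

```python
# Illustrative prefill/decode dispatch; not Mini-SGLang's internal kernel wrappers.
# Assumes the flash-attn and flashinfer packages are installed; argument layouts
# are version-dependent, so treat the calls below as a sketch.
import flashinfer
from flash_attn import flash_attn_func  # FlashAttention-3 builds expose a similar entry point

def attention(q, k, v, is_prefill: bool):
    if is_prefill:
        # Prefill: causal attention over the whole prompt,
        # q/k/v shaped [batch, seq_len, num_heads, head_dim].
        return flash_attn_func(q, k, v, causal=True)
    # Decode: one query token attending to the cached KV,
    # q shaped [num_heads, head_dim], k/v shaped [kv_len, num_kv_heads, head_dim].
    return flashinfer.single_decode_with_kv_cache(q, k, v)
```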
### Interactive Shell Mode

For direct interaction and testing, Mini-SGLang includes a simple shell mode. This allows users to chat with LLMs directly from the command line, providing a convenient way to test models and observe their behavior without needing a separate client.

![Shell Example](/images/blog/minisgl/shell.png)
## Performance Benchmark

To evaluate the performance of Mini-SGLang, we conducted comprehensive experiments covering both offline throughput and online serving latency.

### Offline Inference Throughput

We evaluated Mini-SGLang's offline throughput against Nano-vLLM on a single NVIDIA H200 GPU. Following the methodology from [Nano-vLLM](https://github.com/GeeeekExplorer/nano-vllm/), we used the [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B/) model and also tested the larger [Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B/) model to assess performance at scale. We focused on Qwen3 models due to the current limitations of the Nano-vLLM baseline.

The throughput results (in tokens per second) are shown below:

![Offline-Benchmark](/images/blog/minisgl/offline.png)

The results show that Mini-SGLang consistently outperforms the Nano-vLLM baseline on both Qwen3 models, thanks to our **overlap scheduling** mechanism that effectively hides CPU overhead.

**Reproducibility**: The offline benchmark script is available at [this link](https://github.com/sgl-project/mini-sglang/blob/main/benchmark/offline/bench_nanovllm.py).
### Online Serving Latency

To assess real-world serving performance, we benchmarked Mini-SGLang against SGLang using a realistic workload from the [Qwen trace](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon/blob/main/qwen_traceA_blksz_16.jsonl). We replayed 1,000 requests to a [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) model deployed with 4-way tensor parallelism on 4 H200 GPUs. We measured throughput, 90th percentile (P90) Time To First Token (TTFT), and Time Between Tokens (TBT).

![Online-Benchmark](/images/blog/minisgl/online.png)

The results demonstrate that Mini-SGLang achieves nearly identical performance to SGLang, confirming that its lightweight design does not compromise on throughput or latency.

**Reproducibility**: Use the following commands to launch each system:

```bash
# Mini-SGLang
python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 --cache naive

# SGLang
python3 -m sglang.launch_server --model "Qwen/Qwen3-32B" --tp 4 \
    --disable-radix --port 1919 --decode-attention flashinfer
```

The online benchmark script is available at [this link](https://github.com/sgl-project/mini-sglang/blob/main/benchmark/online/bench_qwen.py).
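
Since both servers expose an OpenAI-compatible endpoint, a streaming client is enough to sanity-check TTFT and TBT by hand (the full benchmark script above replays the Qwen trace). The snippet below is a simplified illustration: it points at the SGLang port from the command above and treats each streamed chunk as one token, so adjust the base URL and model name for your own deployment.

```python
# Simplified TTFT/TBT measurement against an OpenAI-compatible server.
# Uses the SGLang port from the launch command above; adjust base_url as needed.
# Each streamed chunk is counted as one token for simplicity.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1919/v1", api_key="EMPTY")

start = time.perf_counter()
token_times = []
stream = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Summarize overlap scheduling in two sentences."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        token_times.append(time.perf_counter())

ttft = token_times[0] - start
tbt = [b - a for a, b in zip(token_times, token_times[1:])]
print(f"TTFT: {ttft * 1e3:.1f} ms, mean TBT: {sum(tbt) / len(tbt) * 1e3:.1f} ms")
```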
## Conclusion

Mini-SGLang distills the power of a state-of-the-art inference engine into a compact and understandable codebase. By retaining key optimizations like overlap scheduling and high-performance attention kernels, it delivers impressive performance while serving as an invaluable educational tool and a flexible platform for research.

We invite you to explore the [source code](https://github.com/sgl-project/mini-sglang), run the benchmarks, and see for yourself how Mini-SGLang makes high-performance LLM inference more accessible than ever.

## Acknowledgements

- We would like to thank the SGLang team and community for their generous support and feedback, especially Liangsheng Yin, Lianmin Zheng, and many others.
- We would like to thank [MisakaVan](https://github.com/MisakaVan) for his significant contributions to testing, documentation, and code improvements, and [Yi Pan](https://github.com/Conless) for the initial PyTorch implementation of the C++ NCCL communicator.
- We learned a lot from the system design of SGLang, FlashInfer, vLLM, and Nano-vLLM, which jointly helped make Mini-SGLang a clean yet robust system.