---
title: "Mini-SGLang: Efficient Inference Engine in a Nutshell"
author: "Ziyi Xu"
date: "December 17, 2025"
previewImg: /images/blog/minisgl/logo.png
---
We're excited to introduce **Mini-SGLang**, a lightweight yet high-performance inference framework for Large Language Models (LLMs). Derived from the [SGLang](https://github.com/sgl-project/sglang) project, Mini-SGLang is designed to demystify the complexities of modern serving systems. Despite its compact codebase, it retains the advanced features that define state-of-the-art performance, including **Radix Attention** for efficient KV cache reuse, **Chunked Prefill** for controlled memory footprint, **Overlap Scheduling** for reduced CPU overhead, and **Tensor Parallelism** for scalable distributed serving. With an OpenAI-compatible API and out-of-the-box support for models like Llama-3 and Qwen-3, Mini-SGLang serves as both a capable inference engine and a transparent reference implementation for researchers and developers.
The source code is available at [https://github.com/sgl-project/mini-sglang](https://github.com/sgl-project/mini-sglang).
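To give a feel for the OpenAI-compatible API, here is a minimal sketch of querying a locally running Mini-SGLang server with the official `openai` Python client. The port, model name, and endpoint details below are illustrative assumptions rather than documented defaults; see the repository for the actual launch instructions.

```python
# Minimal sketch: querying a Mini-SGLang server via its OpenAI-compatible
# endpoint. The base_url port and model name are assumptions for
# illustration; consult the repository for real launch commands.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",  # assumed local server address
    api_key="EMPTY",                       # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",  # whichever model the server was launched with
    messages=[{"role": "user", "content": "Explain Radix Attention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```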
Many ML and systems researchers struggle to integrate their optimizations into existing frameworks. On one hand, injecting new logic into a complex framework like SGLang is risky: it is easy to break the system's implicit invariants, giving rise to subtle bugs. On the other hand, building an inference engine from scratch is tedious, requiring significant effort on infrastructure details (e.g., frontend servers, tokenization, NCCL communication) just to match state-of-the-art baselines.
Mini-SGLang strikes a balance. It started as a research prototype we used to validate new system ideas quickly, without spending weeks navigating a full-scale codebase or re-implementing infrastructure from scratch. It offers an out-of-the-box, high-performance framework that is easy to inspect, extend, and optimize: it handles the heavy lifting of infrastructure while remaining flexible enough for rapid prototyping. Additionally, Mini-SGLang provides **OpenAI-compatible benchmark utilities**, facilitating end-to-end performance analysis and comparison against serving engines such as [SGLang](https://github.com/sgl-project/sglang), [vLLM](https://github.com/vllm-project/vllm), and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). For kernel developers, Mini-SGLang also provides fine-grained **NVTX annotations**, which are invaluable for kernel debugging and performance profiling.
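As an illustration of how such annotations look in a PyTorch-based engine, the sketch below wraps the stages of a single model step in `torch.cuda.nvtx.range` markers; the stage names and helper structure are hypothetical, not Mini-SGLang's actual internals. When the process is captured with Nsight Systems (`nsys profile`), each range appears as a labeled span on the timeline, making it easy to see which kernels run under which stage.

```python
# Hypothetical sketch of NVTX-annotated stages in one model step, in the
# spirit of Mini-SGLang's fine-grained annotations. The stage names and
# this function are illustrative; the real hooks live in the repository.
import torch

def model_step(model: torch.nn.Module, input_ids: torch.Tensor) -> torch.Tensor:
    with torch.cuda.nvtx.range("prepare_inputs"):  # CPU-side batch preparation
        input_ids = input_ids.to("cuda", non_blocking=True)
    with torch.cuda.nvtx.range("forward"):         # GPU kernels for this step
        logits = model(input_ids)
    with torch.cuda.nvtx.range("sample"):          # greedy sampling, for brevity
        return logits[:, -1, :].argmax(dim=-1)
```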
## Features
In our offline throughput benchmark (measured in tokens per second), Mini-SGLang consistently outperforms the Nano-vLLM baseline on both Qwen3 models, thanks to our **overlap scheduling** mechanism, which effectively hides CPU overhead.
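To make the idea concrete, here is a self-contained toy sketch of the overlap pattern, not Mini-SGLang's actual scheduler: because CUDA kernel launches are asynchronous, the CPU can prepare batch *i+1* while the GPU is still executing batch *i*, and only synchronizes when reading back results.

```python
# Toy illustration of overlap scheduling (not the real scheduler): the
# CPU schedules the next batch while the GPU executes the current one.
import torch

model = torch.nn.Linear(4096, 4096).cuda()  # stand-in for an LLM forward step
pending = [torch.randn(64, 4096, device="cuda") for _ in range(32)]

def schedule_next_batch():
    # In a real engine this is the expensive CPU work: request batching,
    # radix-cache matching, KV memory allocation. Here it is trivial.
    return pending.pop() if pending else None

batch = schedule_next_batch()
while batch is not None:
    out = model(batch)                  # kernel launch returns immediately (async)
    next_batch = schedule_next_batch()  # CPU work overlaps with GPU execution
    _ = out.sum().item()                # .item() synchronizes with the GPU
    batch = next_batch
```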
**Reproducibility**: The offline benchmark script is available at [this link](https://github.com/sgl-project/mini-sglang/blob/main/benchmark/offline/bench.py).