Commit 6fd22d6

minor: minor update author, content (#284)
1 parent 70c3879 commit 6fd22d6

1 file changed: 4 additions, 4 deletions


blog/2025-12-17-minisgl.md

@@ -1,11 +1,11 @@
 ---
 title: "Mini-SGLang: Efficient Inference Engine in a Nutshell"
-author: "SGLang Team"
+author: "Ziyi Xu"
 date: "December 17, 2025"
 previewImg: /images/blog/minisgl/logo.png
 ---

-We're excited to introduce **Mini-SGLang**, a lightweight yet high-performance inference framework for Large Language Models (LLMs). Derived from the [SGLang](https://github.com/sgl-project/sglang) project, Mini-SGLang is designed to demystify the complexities of modern serving systems. Despite its compact codebase, it retains the advanced features that define state-of-the-art performance, including **Radix Attention** for efficient KV cache reuse, **Chunked Prefill** for controlled memory footprint, and **Tensor Parallelism** for scalable distributed serving. With an OpenAI-compatible API and out-of-the-box support for models like Llama-3 and Qwen-3, Mini-SGLang serves as both a capable inference engine and a transparent reference implementation for researchers and developers.
+We're excited to introduce **Mini-SGLang**, a lightweight yet high-performance inference framework for Large Language Models (LLMs). Derived from the [SGLang](https://github.com/sgl-project/sglang) project, Mini-SGLang is designed to demystify the complexities of modern serving systems. Despite its compact codebase, it retains the advanced features that define state-of-the-art performance, including **Radix Attention** for efficient KV cache reuse, **Chunked Prefill** for controlled memory footprint, **Overlap Scheduling** for reduced CPU overhead, and **Tensor Parallelism** for scalable distributed serving. With an OpenAI-compatible API and out-of-the-box support for models like Llama-3 and Qwen-3, Mini-SGLang serves as both a capable inference engine and a transparent reference implementation for researchers and developers.

 The source code is available at [https://github.com/sgl-project/mini-sglang](https://github.com/sgl-project/mini-sglang).
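The new paragraph above highlights the OpenAI-compatible API. As a rough illustration (not taken from this commit), such an endpoint can be queried with the standard `openai` client; the port and model name below are placeholder assumptions:

```python
# Hypothetical sketch: querying an OpenAI-compatible endpoint with the
# standard `openai` client. The base_url port and model name are assumptions,
# not values documented in this commit.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "What is Radix Attention?"}],
)
print(response.choices[0].message.content)
```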

@@ -25,7 +25,7 @@ Despite its simplicity, Mini-SGLang supports both online and offline inference a

 Many ML and system researchers struggle to integrate their optimizations into existing frameworks. On one hand, injecting new logic into complex frameworks like SGLang is risky: you may easily break implicit invariants of the system, which gives rise to subtle bugs. On the other hand, building an inference engine from scratch is tedious, requiring significant effort to handle infrastructure details (e.g., frontend servers, tokenization, NCCL communication) just to match state-of-the-art baselines.

-Mini-SGLang strikes a balance. It offers an out-of-the-box, high-performance framework that is easy to inspect, extend and optimize. It handles the heavy lifting of infrastructure while being flexible enough for rapid prototyping. Additionally, Mini-SGLang provides **OpenAI-compatible benchmark utilities**, facilitating end-to-end performance analysis and comparison against various serving engines, such as [SGLang](https://github.com/sgl-project/sglang), [vLLM](https://github.com/vllm-project/vllm) and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). For kernel developers, Mini-SGLang also provides fine-grained **NVTX annotations**, which are very valuable for kernel debugging and performance profiling.
+Mini-SGLang strikes a balance. It started as a research prototype we used to validate new system ideas quickly, without spending weeks navigating a full-scale codebase or re-implementing infrastructure from scratch. It offers an out-of-the-box, high-performance framework that is easy to inspect, extend, and optimize. It handles the heavy lifting of infrastructure while remaining flexible enough for rapid prototyping. Additionally, Mini-SGLang provides **OpenAI-compatible benchmark utilities**, facilitating end-to-end performance analysis and comparison against various serving engines, such as [SGLang](https://github.com/sgl-project/sglang), [vLLM](https://github.com/vllm-project/vllm) and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). For kernel developers, Mini-SGLang also provides fine-grained **NVTX annotations**, which are very valuable for kernel debugging and performance profiling.

 ## Features
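For the NVTX annotations mentioned above, here is a minimal sketch of the general technique using PyTorch's built-in NVTX bindings; it is not Mini-SGLang's actual instrumentation, and `forward_step` is a hypothetical function:

```python
# Illustrative only: wrapping a region in an NVTX range so it appears as a
# named span in Nsight Systems timelines. `forward_step` is a hypothetical
# stand-in, not part of Mini-SGLang's API.
import torch

def forward_step(model, input_ids):
    torch.cuda.nvtx.range_push("forward_step")  # open a named NVTX range
    output = model(input_ids)
    torch.cuda.nvtx.range_pop()                 # close the range
    return output
```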

@@ -75,7 +75,7 @@ The throughput results (in tokens per second) are shown below:

 The results show that Mini-SGLang consistently outperforms the Nano-vLLM baseline on both Qwen3 models, thanks to our **overlap scheduling** mechanism that effectively hides CPU overhead.

-**Reproducibility**: The offline benchmark script is available at [this link](https://github.com/sgl-project/mini-sglang/blob/main/benchmark/offline/bench_nanovllm.py).
+**Reproducibility**: The offline benchmark script is available at [this link](https://github.com/sgl-project/mini-sglang/blob/main/benchmark/offline/bench.py).

 ### Online Serving Latency
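The overlap scheduling credited above hides CPU-side scheduling behind GPU execution. A toy sketch of the general idea follows; `schedule_batch` and `run_gpu_step` are illustrative stand-ins, not Mini-SGLang's scheduler API:

```python
# Toy illustration of overlap scheduling: while the GPU runs step i, the CPU
# prepares the batch for step i+1. The callables passed in are hypothetical
# stand-ins, not Mini-SGLang functions.
import threading

def serve_loop(schedule_batch, run_gpu_step, num_steps):
    next_batch = schedule_batch()  # CPU work for the first step
    for _ in range(num_steps):
        batch = next_batch
        gpu = threading.Thread(target=run_gpu_step, args=(batch,))
        gpu.start()                    # GPU executes step i...
        next_batch = schedule_batch()  # ...while the CPU schedules step i+1
        gpu.join()
```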
