---
title: "Mini-SGLang: Efficient Inference Engine in a Nutshell"
author: "SGLang Team"
date: "December 17, 2025"
previewImg: /images/blog/minisgl/logo.png
---

We're excited to introduce **Mini-SGLang**, a lightweight yet high-performance inference framework for Large Language Models (LLMs). Derived from the [SGLang](https://github.com/sgl-project/sglang) project, Mini-SGLang is designed to demystify the complexities of modern serving systems. Despite its compact codebase, it retains the advanced features that define state-of-the-art performance, including **Radix Attention** for efficient KV cache reuse, **Chunked Prefill** for controlled memory footprint, and **Tensor Parallelism** for scalable distributed serving. With an OpenAI-compatible API and out-of-the-box support for models like Llama-3 and Qwen-3, Mini-SGLang serves as both a capable inference engine and a transparent reference implementation for researchers and developers.

The source code is available at [https://github.com/sgl-project/mini-sglang](https://github.com/sgl-project/mini-sglang).
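
Because the API is OpenAI-compatible, any standard OpenAI client can talk to a running server. The snippet below is a minimal sketch; the port, model name, and API key are placeholders to adapt to your own launch configuration (a concrete launch command appears in the benchmark section below).

```python
# Minimal client for a running Mini-SGLang server via its OpenAI-compatible API.
# The base_url port, model name, and API key are placeholders for this example.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Explain KV cache reuse in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```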

<!--  -->

## Motivation: Why Mini-SGLang?

Although SGLang has achieved state-of-the-art inference performance with a comprehensive feature set, its codebase has grown massive, reaching nearly 300k lines of Python code. To lower the complexity barrier for learners and researchers, we built Mini-SGLang, focusing on two main objectives: providing learning resources and enabling fast prototyping for research.

### Educational Purposes

Mini-SGLang features a clean, highly modular codebase of only **5k lines of Python code**, which makes it significantly easier for beginners to understand the core components of a modern LLM serving engine.

Despite its simplicity, Mini-SGLang supports both online and offline inference and implements essential modern optimizations, including **Tensor Parallelism**, **Overlap Scheduling**, **Chunked Prefill**, **Radix Cache**, and **JIT CUDA kernels**. This makes it a comprehensive learning resource.
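
As a taste of one of these components, here is a simplified sketch of the idea behind the radix cache: requests that share a token-ID prefix reuse the KV cache already computed for that prefix. This is a per-token trie for illustration only, not Mini-SGLang's actual data structure (a real radix tree stores multi-token spans on its edges and also manages eviction).

```python
# Simplified illustration of prefix reuse in a radix cache (not Mini-SGLang's real code).
# Each trie node maps a token id to a child; a request reuses the KV cache for the
# longest prefix of its token ids that is already stored in the trie.

class RadixNode:
    def __init__(self):
        self.children: dict[int, "RadixNode"] = {}

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, token_ids: list[int]) -> int:
        """Return the number of leading tokens whose KV cache can be reused."""
        node, matched = self.root, 0
        for tid in token_ids:
            if tid not in node.children:
                break
            node = node.children[tid]
            matched += 1
        return matched

    def insert(self, token_ids: list[int]) -> None:
        """Record that KV cache now exists for this token sequence."""
        node = self.root
        for tid in token_ids:
            node = node.children.setdefault(tid, RadixNode())

cache = RadixCache()
cache.insert([1, 2, 3, 4])               # first request fills the cache
print(cache.match_prefix([1, 2, 3, 9]))  # -> 3 tokens reused, only 1 recomputed
```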

### Quick Research Prototype

Many ML and systems researchers struggle to integrate their optimizations into existing frameworks. On one hand, injecting new logic into a complex framework like SGLang is risky: it is easy to break implicit invariants of the system, giving rise to subtle bugs. On the other hand, building an inference engine from scratch is tedious, requiring significant effort on infrastructure details (e.g., frontend servers, tokenization, NCCL communication) just to match state-of-the-art baselines.

Mini-SGLang strikes a balance. It offers an out-of-the-box, high-performance framework that is easy to inspect, extend, and optimize. It handles the heavy lifting of infrastructure while remaining flexible enough for rapid prototyping. Additionally, Mini-SGLang provides **OpenAI-compatible benchmark utilities**, facilitating end-to-end performance analysis and comparison against other serving engines such as [SGLang](https://github.com/sgl-project/sglang), [vLLM](https://github.com/vllm-project/vllm), and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). For kernel developers, Mini-SGLang also provides fine-grained **NVTX annotations**, which are valuable for kernel debugging and performance profiling.
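
For readers unfamiliar with NVTX, the snippet below shows the general technique using PyTorch's built-in NVTX bindings; the range names are invented for this example and are not Mini-SGLang's actual markers.

```python
# NVTX ranges label spans of work so they show up in an Nsight Systems timeline.
# The range names here are made up for illustration, not Mini-SGLang's markers.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda")

torch.cuda.nvtx.range_push("prepare_batch")   # CPU-side preparation appears as a named range
x = x.contiguous()
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("model_forward")   # GPU kernels launched inside this range
y = model(x)
torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()
```

Profiling the script with `nsys profile python your_script.py` then shows these ranges alongside the CUDA kernels they enclose.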

## Features

Mini-SGLang shares the same high-level system architecture as SGLang, consisting of a frontend API server, a tokenizer server, and a backend scheduler for each GPU.

### Overlap Scheduling

LLM inference is not just about GPU computation; a significant amount of work is handled by the CPU, including batch scheduling, memory management, and token processing. Without optimization, this CPU overhead can lead to GPU idling, hurting overall performance.

Mini-SGLang implements an **overlap scheduling** mechanism, similar to the one in SGLang, to mitigate this. By preparing the next batch of requests on the CPU while the GPU is busy with the current batch, it effectively hides the CPU overhead. As the Nsight Systems profiles below show, this keeps the GPU consistently utilized, eliminating GPU idleness and maximizing throughput. More technical details are available in our [previous blog post](https://lmsys.org/blog/2024-12-04-sglang-v0-4/).

> An example of overlapped execution. CPU execution overhead is fully overlapped.

> An example of non-overlapped execution. CPU execution overhead leads to substantial GPU stalls.

To run an ablation study without overlap scheduling, set the environment variable `MINISGL_DISABLE_OVERLAP_SCHEDULING=1`.
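
The core idea can be conveyed with a short, self-contained sketch (a deliberate simplification of the real scheduler, which also has to track dependencies between steps and defer token processing): because CUDA kernel launches are asynchronous, the CPU can prepare batch `t+1` while the GPU is still executing the forward pass for batch `t`.

```python
# Conceptual sketch of overlap scheduling; NOT the real Mini-SGLang scheduler.
# CUDA kernel launches return immediately, so after launching step t's forward
# pass the CPU is free to prepare step t+1; only the final sync waits on the GPU.
import torch

model = torch.nn.Linear(4096, 4096).cuda()

def cpu_prepare_batch(step: int) -> torch.Tensor:
    # Stand-in for CPU-side scheduling: picking requests, managing KV pages, tokenization.
    return torch.randn(64, 4096, pin_memory=True)

batch = cpu_prepare_batch(0).to("cuda", non_blocking=True)
for step in range(8):
    out = model(batch)                        # asynchronous GPU work for the current step
    next_batch = cpu_prepare_batch(step + 1)  # CPU work overlaps with the GPU forward
    batch = next_batch.to("cuda", non_blocking=True)

torch.cuda.synchronize()
```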

### High-Performance Kernels

Mini-SGLang integrates state-of-the-art attention kernels to ensure top performance. On the NVIDIA Hopper architecture, it leverages [FlashAttention-3](https://github.com/Dao-AILab/flash-attention) for prefill kernels and [FlashInfer](https://github.com/flashinfer-ai/flashinfer) for decode kernels.

Following [FlashInfer](https://github.com/flashinfer-ai/flashinfer) and [SGLang](https://github.com/sgl-project/sglang), Mini-SGLang also integrates just-in-time (JIT) compiled kernels for better runtime performance. We adopt [TVM FFI](https://github.com/apache/tvm-ffi) for Python bindings, which is much faster than the default PyTorch interface thanks to its lightweight design.
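
Conceptually, the attention layer just routes each batch to the kernel suited to its phase. The sketch below is hypothetical: `prefill_attention` and `decode_attention` are placeholder wrappers standing in for the FlashAttention-3 and FlashInfer calls, not Mini-SGLang's real interfaces.

```python
# Hypothetical phase-based kernel dispatch; the two wrappers are placeholders, not real APIs.
import torch
import torch.nn.functional as F

def prefill_attention(q, k, v):
    # Placeholder for a FlashAttention-3 prefill call: long, causal query blocks.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

def decode_attention(q, k, v):
    # Placeholder for a FlashInfer decode call: one query token against a long KV cache.
    return F.scaled_dot_product_attention(q, k, v)

def attention(q, k, v, is_prefill: bool) -> torch.Tensor:
    # Prefill is compute-bound, decode is memory-bound; each phase gets its specialized kernel.
    return prefill_attention(q, k, v) if is_prefill else decode_attention(q, k, v)
```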

### Interactive Shell Mode

For direct interaction and testing, Mini-SGLang includes a simple shell mode. This allows users to chat with LLMs directly from the command line, providing a convenient way to test models and observe their behavior without needing a separate client.

## Performance Benchmark

To evaluate the performance of Mini-SGLang, we conducted comprehensive experiments covering both offline throughput and online serving latency.

### Offline Inference Throughput

We evaluated Mini-SGLang's offline throughput against Nano-vLLM on a single NVIDIA H200 GPU. Following the methodology from [Nano-vLLM](https://github.com/GeeeekExplorer/nano-vllm/), we used the [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B/) model and also tested the larger [Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B/) model to assess performance at scale. We focused on Qwen3 models due to the current limitations of the Nano-vLLM baseline.

The throughput results (in tokens per second) are shown below:
The results show that Mini-SGLang consistently outperforms the Nano-vLLM baseline on both Qwen3 models, thanks to our **overlap scheduling** mechanism that effectively hides CPU overhead.

**Reproducibility**: The offline benchmark script is available at [this link](https://github.com/sgl-project/mini-sglang/blob/main/benchmark/offline/bench_nanovllm.py).

### Online Serving Latency

To assess real-world serving performance, we benchmarked Mini-SGLang against SGLang using a realistic workload from the [Qwen trace](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon/blob/main/qwen_traceA_blksz_16.jsonl). We replayed 1,000 requests to a [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) model deployed with 4-way tensor parallelism on 4 H200 GPUs. We measured throughput, 90th percentile (P90) Time To First Token (TTFT), and Time Between Tokens (TBT).
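
For readers new to these metrics, the sketch below shows one common way to derive P90 TTFT and TBT from per-token arrival timestamps collected by a streaming client; it is a generic illustration, not the benchmark script itself.

```python
# Generic illustration of computing P90 TTFT and TBT from streaming timestamps.
# This is not the actual benchmark script; the toy timestamps are made up.

def request_metrics(send_time: float, token_times: list[float]) -> tuple[float, list[float]]:
    ttft = token_times[0] - send_time                                 # time to first token
    tbt = [t1 - t0 for t0, t1 in zip(token_times, token_times[1:])]   # gaps between tokens
    return ttft, tbt

def p90(values: list[float]) -> float:
    ordered = sorted(values)                       # nearest-rank 90th percentile
    return ordered[min(len(ordered) - 1, int(0.9 * len(ordered)))]

# Two toy requests: (send time, per-token arrival times), in seconds.
requests = [(0.00, [0.12, 0.15, 0.19, 0.22]), (0.01, [0.30, 0.33, 0.37])]
ttfts, tbts = [], []
for send, tokens in requests:
    ttft, tbt = request_metrics(send, tokens)
    ttfts.append(ttft)
    tbts.extend(tbt)
print(f"P90 TTFT = {p90(ttfts):.3f} s, P90 TBT = {p90(tbts):.3f} s")
```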

The results demonstrate that Mini-SGLang achieves nearly identical performance to SGLang, confirming that its lightweight design does not compromise on throughput or latency.

**Reproducibility**: Use the following commands to launch each system:

```bash
# Mini-SGLang
python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 --cache naive

# SGLang
python3 -m sglang.launch_server --model "Qwen/Qwen3-32B" --tp 4 \
  --disable-radix-cache --port 1919 --decode-attention-backend flashinfer
```

The online benchmark script is available at [this link](https://github.com/sgl-project/mini-sglang/blob/main/benchmark/online/bench_qwen.py).

## Conclusion

Mini-SGLang successfully distills the power of a state-of-the-art inference engine into a compact and understandable codebase. By retaining key optimizations like overlap scheduling and high-performance attention kernels, it delivers impressive performance while serving as an invaluable educational tool and a flexible platform for research.

We invite you to explore the [source code](https://github.com/sgl-project/mini-sglang), run the benchmarks, and see for yourself how Mini-SGLang makes high-performance LLM inference more accessible than ever.

## Acknowledgements
- We would like to thank the SGLang team and community for their generous support and feedback, especially Liangsheng Yin, Lianmin Zheng, and many others.
- We would like to thank [MisakaVan](https://github.com/MisakaVan) for his significant contributions to testing, documentation, and code improvements, and [Yi Pan](https://github.com/Conless) for the initial PyTorch implementation of the C++ NCCL communicator.
- We learned a lot from the system designs of SGLang, FlashInfer, vLLM, and Nano-vLLM, which jointly helped make Mini-SGLang a clean yet robust system.