## Background: Hybrid Inference for Sparse MoE Models
Modern Mixture-of-Experts (MoE) language models such as **DeepSeek-V3** contain hundreds of billions of parameters, but only a small subset of experts is activated per token.

This **sparse activation** pattern makes MoE models ideal for **CPU/GPU hybrid inference**: the sparsely activated experts can run efficiently on CPUs with large memory capacity, while the dense and compute-intensive components — attention and shared experts — execute on GPUs with higher bandwidth and throughput.

This hybrid design allows trillion-parameter models to be deployed on a single machine with limited GPU memory, enabling local inference for research and private applications.

We have already developed a proof-of-concept implementation, and the [roadmap](https://github.com/sgl-project/sglang/issues/11425) for full integration into SGLang is underway.
## Installation
To use KTransformers hybrid inference with SGLang, you need to install both SGLang and the KTransformers CPU kernels (`kt-kernel`).
### Prerequisites
Before installation, ensure your system meets the following requirements:

- **CUDA**: Version 12.1 or above with proper PATH configuration
- **Operating System**: Linux x86_64
- **Compiler**: gcc, g++ >= 11
- **Build Tools**: CMake >= 3.25 (Note: the default CMake on Ubuntu 22.04 LTS may be too old)
- **Python**: Python 3.11 (via Miniconda3 or Anaconda3)
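
A quick way to verify these prerequisites is to print the relevant tool versions (this assumes the tools are already on your `PATH`; the expected values in the comments mirror the list above):

```bash
# Manual sanity check against the requirements listed above
gcc --version | head -n 1      # expect gcc >= 11
g++ --version | head -n 1      # expect g++ >= 11
cmake --version | head -n 1    # expect CMake >= 3.25
nvcc --version | grep release  # expect CUDA 12.1 or newer
python --version               # expect Python 3.11
```
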
### Step 1: Install SGLang
Follow the official [SGLang installation guide](https://docs.sglang.ai/get_started/install.html) to install SGLang:
```bash
pip install "sglang[all]"
```
### Step 2: Install KTransformers CPU Kernels
The KTransformers CPU kernels (`kt-kernel`) provide AMX-optimized computation for hybrid inference. For detailed installation instructions and troubleshooting, refer to the [official kt-kernel installation guide](https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/README.md).
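
A typical from-source install could look like the sketch below; the exact build commands are an assumption here, so treat the kt-kernel README as the authoritative reference:

```bash
# Illustrative sketch only -- the official kt-kernel README documents the exact steps.
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers/kt-kernel
# Build and install into the active Python environment
# (assumes the prerequisites above: CMake >= 3.25, gcc/g++ >= 11, CUDA 12.1+).
pip install .
```
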
## Usage Example
### Downloading Models
The DeepSeek-R1 models optimized for KTransformers hybrid inference (including both GPU and CPU weights) can be downloaded from the [Approaching AI ModelScope profile](https://modelscope.cn/profile/ApproachingAI2024).
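
If you prefer the command line over the web interface, the ModelScope CLI can fetch the weights. The repository id below is a placeholder (check the profile page for the actual names), and the flags assume a recent `modelscope` release:

```bash
pip install modelscope
# '<namespace>/<model-repo>' is a placeholder -- substitute the actual repository from the profile page.
modelscope download --model '<namespace>/<model-repo>' --local_dir /path/to/weights
```
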
### Launching the Server
To launch an SGLang server with KTransformers hybrid inference enabled, you can use the following command:
```bash
python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 30000 \
  --model /path/to/gpu-weight \
  --kt-amx-weight-path /path/to/cpu-weight \
  --kt-cpuinfer 80 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 200 \
  --kt-amx-method AMXINT4 \
  --attention-backend triton \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --chunked-prefill-size 4096 \
  --max-running-requests 37 \
  --max-total-tokens 37000 \
  --served-model-name DeepSeek-R1-0528-FP8 \
  --enable-mixed-chunk \
  --tensor-parallel-size 8 \
  --enable-p2p-check \
  --disable-shared-experts-fusion
```
### Key Parameters

- `--kt-amx-weight-path`: Path to the CPU-optimized model weights. These weights are pre-quantized and formatted for efficient AMX computation.
- `--kt-cpuinfer`: Number of CPU cores dedicated to expert inference (e.g., 80 cores for dual-socket servers).
- `--kt-threadpool-count`: Number of thread pools for parallel CPU execution. Typically set to 2 for dual-socket NUMA configurations.
- `--kt-num-gpu-experts`: Number of "hot" experts to keep on GPU. More GPU experts reduce CPU compute pressure but require additional GPU memory. Adjust based on GPU capacity and workload patterns.
- `--kt-amx-method`: CPU kernel optimization method. Use `AMXINT4` for int4-quantized models to leverage Intel AMX instructions for maximum throughput.
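
As a rough starting point (a heuristic, not an official tuning rule), the two CPU-related values can be derived from the machine topology and then refined empirically:

```bash
# Heuristic sketch: one thread pool per socket (NUMA node), leaving a few physical
# cores per socket free for the OS, networking, and the GPU runtime.
SOCKETS=$(lscpu | awk -F: '/^Socket\(s\)/ {gsub(/ /, "", $2); print $2}')
CORES_PER_SOCKET=$(lscpu | awk -F: '/^Core\(s\) per socket/ {gsub(/ /, "", $2); print $2}')
KT_THREADPOOL_COUNT=$SOCKETS
KT_CPUINFER=$(( SOCKETS * (CORES_PER_SOCKET - 4) ))
echo "--kt-threadpool-count $KT_THREADPOOL_COUNT --kt-cpuinfer $KT_CPUINFER"
```
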
### Hardware Requirements
For optimal performance with KTransformers hybrid inference:

- **CPUs**: Modern Intel Xeon processors with AMX support (e.g., Sapphire Rapids or later) for maximum CPU expert throughput.
- **Memory**: Sufficient DDR5 memory to hold all expert weights (typically 500GB+ for DeepSeek-V3-sized models).
- **GPUs**: One or more GPUs with enough memory for attention layers, shared experts, and a subset of routed experts.
- **NUMA**: Dual-socket configurations benefit from NUMA-aware thread pool assignment (`--kt-threadpool-count 2`).
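
On Linux, you can confirm AMX support and inspect the NUMA layout with standard utilities before tuning the flags above:

```bash
# AMX-capable CPUs expose flags such as amx_tile, amx_int8, and amx_bf16
grep -o 'amx[_a-z0-9]*' /proc/cpuinfo | sort -u
# The NUMA node count informs --kt-threadpool-count
lscpu | grep -i 'numa node(s)'
```
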
After launching the server, you can send inference requests via the OpenAI-compatible API endpoint at `http://0.0.0.0:30000`.
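
For example, a minimal chat completion request against this server (using the served model name from the launch command above) looks like:

```bash
curl http://0.0.0.0:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-0528-FP8",
    "messages": [{"role": "user", "content": "Explain CPU/GPU hybrid inference in one paragraph."}],
    "max_tokens": 256
  }'
```
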
## Benchmark Results (Preview)
### Single-GPU + CPU Performance
We further evaluate the multi-GPU + CPU hybrid inference capability enabled by integrating KTransformers into SGLang.

The table above presents the total throughput (tokens/s) under different levels of concurrency and varying numbers of GPUs. As shown, under single-concurrency conditions, the 8-GPU configuration provides only a limited improvement over the 1-GPU setup (an increase of merely 26%). However, under 8-way concurrency, the same 8-GPU configuration achieves a **264% throughput** gain compared to 1 GPU, demonstrating excellent usability—each request achieves nearly 20 tokens per second on average. The improvement mainly comes from placing more experts on GPUs, which reduces CPU memory accesses under bandwidth bottlenecks.
#### ShareGPT Benchmark on RTX 4090 × 8 Setup
We further evaluated the SGLang + KTransformers integration on a consumer-grade GPU setup using **8× RTX 4090 GPUs** with an **Intel Xeon Platinum 8488C CPU**. The benchmark was conducted on **DeepSeek-R1-0528**, a large-scale MoE model from the DeepSeek-R1 series, using the ShareGPT dataset with 1000 conversation requests (301K input tokens, 188K output tokens).
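
A run of this kind can be driven with SGLang's serving benchmark script; the invocation below is an illustrative sketch (flag names assume a recent `sglang.bench_serving`, so check them against your installed version):

```bash
python -m sglang.bench_serving \
  --backend sglang \
  --host 0.0.0.0 \
  --port 30000 \
  --dataset-name sharegpt \
  --num-prompts 1000
```
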
This setup demonstrates that SGLang + KTransformers can effectively leverage consumer-grade GPUs for hybrid inference, achieving **over 300 tokens/s total throughput** on trillion-parameter MoE models. The relatively low inter-token latency (median 208ms) ensures smooth streaming generation for interactive applications.
## Acknowledgements
We would like to thank everyone in the community who helped make this effort possible.

**SGLang team and community:** Jingyi Chen, Shangming Cai, Lianmin Zheng, Yineng Zhang, and many others for their insightful review comments on this PR and for their work on the SGLang framework.