
Commit 1eb2f4a

Add installation and usage details for KTransformers (#225)
Parent: 7990316


blog/2025-10-22-KTransformers.md

Lines changed: 142 additions & 2 deletions
@@ -8,7 +8,7 @@ previewImg: /images/blog/ktransformers/primary.png
## Background: Hybrid Inference for Sparse MoE Models
Modern Mixture-of-Experts (MoE) language models such as **DeepSeek-V3** contain hundreds of billions of parameters, but only a small subset of experts are activated per token.

- This **sparse activation** pattern makes MoE models ideal for **CPU/GPU hybrid inference**: the sparsely activated experts can run efficiently on CPUs with large memory capacity, while the dense and compute-intensive components — attention and shared experts — execute on GPUs with higher bandwidth and throughput. This hybrid design allows trillion-parameter models to be deployed on a single machine with limited GPU memory, enabling local inference for research and private applications.
+ This **sparse activation** pattern makes MoE models ideal for **CPU/GPU hybrid inference**: the sparsely activated experts can run efficiently on CPUs with large memory capacity, while the dense and compute-intensive components — attention and shared experts — execute on GPUs with higher bandwidth and throughput.

<img src="/images/blog/ktransformers/heterogeneous_computing.png" style="display:block; margin: auto; width: 70%" />

@@ -56,6 +56,84 @@ With this joint design, users across diverse hardware configurations can fully u

We have already developed a proof-of-concept implementation, and the [roadmap](https://github.com/sgl-project/sglang/issues/11425) for full integration into SGLang is underway.

## Installation
To use KTransformers hybrid inference with SGLang, you need to install both SGLang and the KTransformers CPU kernels (`kt-kernel`).
### Prerequisites
Before installation, ensure your system meets the following requirements:
- **CUDA**: Version 12.1 or above with proper PATH configuration
- **Operating System**: Linux x86_64
- **Compiler**: gcc, g++ >= 11
- **Build Tools**: CMake >= 3.25 (Note: Ubuntu 22.04 LTS default CMake may be too old)
- **Python**: Python 3.11 (via Miniconda3 or Anaconda3)
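
Before building, it can help to verify the toolchain versions from a shell. The checks below are a minimal sketch; exact commands and output depend on your distribution:

```bash
# Quick sanity check of the prerequisites listed above (illustrative, not an official script).
uname -ms                          # expect: Linux x86_64
nvcc --version | grep release      # expect CUDA 12.1 or newer on the PATH
gcc --version | head -n 1          # expect gcc >= 11
g++ --version | head -n 1          # expect g++ >= 11
cmake --version | head -n 1        # expect CMake >= 3.25
python --version                   # expect Python 3.11
```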
### Step 1: Install SGLang
Follow the official [SGLang installation guide](https://docs.sglang.ai/get_started/install.html) to install SGLang:
```bash
pip install "sglang[all]"
```
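
To confirm the installation succeeded (a minimal check, assuming a standard pip environment):

```bash
python -c "import sglang; print(sglang.__version__)"
```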
### Step 2: Install KTransformers CPU Kernels
The KTransformers CPU kernels (`kt-kernel`) provide AMX-optimized computation for hybrid inference. For detailed installation instructions and troubleshooting, refer to the [official kt-kernel installation guide](https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/README.md).
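
As a rough sketch of a typical from-source flow (an assumption for illustration, not the authoritative procedure; the README above is the source of truth):

```bash
# Hypothetical from-source install of the kt-kernel CPU kernels -- follow the official README for the real steps.
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers/kt-kernel
pip install .   # assumes a pip-buildable package; requires the gcc/CMake toolchain listed in Prerequisites
```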
## Usage Example
### Downloading Models
The DeepSeek-R1 models optimized for KTransformers hybrid inference (including both GPU and CPU weights) can be downloaded from the [Approaching AI ModelScope profile](https://modelscope.cn/profile/ApproachingAI2024).
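
For example, the weights can be fetched with the ModelScope CLI. This is a sketch that assumes a recent `modelscope` release; the repository IDs are placeholders, so substitute the actual GPU-weight and CPU-weight repositories listed on that profile:

```bash
pip install modelscope
# Placeholder repository IDs -- replace with the real GPU-weight and CPU-weight repos from the profile.
modelscope download --model '<namespace>/<gpu-weight-repo>' --local_dir models/DeepSeek-R1-0528-GPU-weight
modelscope download --model '<namespace>/<cpu-weight-repo>' --local_dir models/DeepSeek-R1-0528-CPU-weight
```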
### Launching the Server
To launch an SGLang server with KTransformers hybrid inference enabled, you can use the following command:
```bash
python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 30000 \
  --model /path/to/gpu-weight \
  --kt-amx-weight-path /path/to/cpu-weight \
  --kt-cpuinfer 80 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 200 \
  --kt-amx-method AMXINT4 \
  --attention-backend triton \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --chunked-prefill-size 4096 \
  --max-running-requests 37 \
  --max-total-tokens 37000 \
  --served-model-name DeepSeek-R1-0528-FP8 \
  --enable-mixed-chunk \
  --tensor-parallel-size 8 \
  --enable-p2p-check \
  --disable-shared-experts-fusion
```
### Key Parameters
- `--kt-amx-weight-path`: Path to the CPU-optimized model weights. These weights are pre-quantized and formatted for efficient AMX computation.
- `--kt-cpuinfer`: Number of CPU cores dedicated to expert inference (e.g., 80 cores for dual-socket servers).
- `--kt-threadpool-count`: Number of thread pools for parallel CPU execution. Typically set to 2 for dual-socket NUMA configurations.
- `--kt-num-gpu-experts`: Number of "hot" experts to keep on GPU. More GPU experts reduce CPU compute pressure but require additional GPU memory. Adjust based on GPU capacity and workload patterns.
- `--kt-amx-method`: CPU kernel optimization method. Use `AMXINT4` for int4-quantized models to leverage Intel AMX instructions for maximum throughput.
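
For example, on a machine with a single GPU and less GPU memory, you would typically keep fewer experts on the GPU and shift more expert computation to the CPU. The values below are illustrative only (not tuned or validated) and reuse the placeholder weight paths from above:

```bash
# Illustrative single-GPU variant -- adjust expert placement and CPU thread counts to your hardware.
python -m sglang.launch_server \
  --model /path/to/gpu-weight \
  --kt-amx-weight-path /path/to/cpu-weight \
  --kt-cpuinfer 48 \
  --kt-threadpool-count 1 \
  --kt-num-gpu-experts 60 \
  --kt-amx-method AMXINT4 \
  --attention-backend triton \
  --trust-remote-code \
  --tensor-parallel-size 1
```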
### Hardware Requirements
For optimal performance with KTransformers hybrid inference:
- **CPUs**: Modern Intel Xeon processors with AMX support (e.g., Sapphire Rapids or later) for maximum CPU expert throughput.
- **Memory**: Sufficient DDR5 memory to hold all expert weights (typically 500GB+ for DeepSeek-V3-sized models).
- **GPUs**: One or more GPUs with enough memory for attention layers, shared experts, and a subset of routed experts.
- **NUMA**: Dual-socket configurations benefit from NUMA-aware thread pool assignment (`--kt-threadpool-count 2`).
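
To check whether the CPU exposes AMX and how many NUMA nodes are present, standard Linux tools are enough (a quick sketch):

```bash
# AMX support appears as amx_tile / amx_int8 / amx_bf16 in the CPU flags.
lscpu | grep -o 'amx[a-z0-9_]*' | sort -u
# NUMA topology; with two nodes, --kt-threadpool-count 2 is the usual choice.
lscpu | grep -i 'numa node'
```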
After launching the server, you can send inference requests via the OpenAI-compatible API endpoint at `http://0.0.0.0:30000`.
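
For example, a chat completion can be requested with `curl`; the model name below matches the `--served-model-name` passed at launch:

```bash
curl http://0.0.0.0:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-0528-FP8",
    "messages": [{"role": "user", "content": "Briefly explain CPU/GPU hybrid inference."}],
    "max_tokens": 128
  }'
```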
## Benchmark Results (Preview)

### Single-GPU + CPU Performance
@@ -80,6 +158,68 @@ We further evaluate the multi-GPU + CPU hybrid inference capability enabled by i

The table above presents the total throughput (tokens/s) under different levels of concurrency and varying numbers of GPUs. As shown, under single-concurrency conditions, the 8-GPU configuration provides only a limited improvement over the 1-GPU setup (an increase of merely 26%). However, under 8-way concurrency, the same 8-GPU configuration achieves a **264% throughput** gain compared to 1 GPU, demonstrating excellent usability—each request achieves nearly 20 tokens per second on average. The improvement mainly comes from placing more experts on GPUs, which reduces CPU memory accesses under bandwidth bottlenecks.

#### ShareGPT Benchmark on RTX 4090 × 8 Setup
We further evaluated the SGLang + KTransformers integration on a consumer-grade GPU setup using **8× RTX 4090 GPUs** with an **Intel Xeon Platinum 8488C CPU**. The benchmark was conducted on **DeepSeek-R1-0528**, a large-scale MoE model from the DeepSeek-R1 series, using the ShareGPT dataset with 1000 conversation requests (301K input tokens, 188K output tokens).

**System Configuration:**
- GPUs: 8× NVIDIA RTX 4090
- CPU: Intel Xeon Platinum 8488C
- Model: DeepSeek-R1-0528 (FP8 quantized MoE model)
- Dataset: ShareGPT (1000 requests)

**Benchmark Commands:**

First, launch the SGLang server:
```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 30000 \
  --model models/DeepSeek-R1-0528-GPU-weight \
  --kt-amx-weight-path models/DeepSeek-R1-0528-CPU-weight \
  --kt-cpuinfer 80 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 200 \
  --kt-amx-method AMXINT4 \
  --attention-backend triton \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --chunked-prefill-size 4096 \
  --max-running-requests 37 \
  --max-total-tokens 37000 \
  --served-model-name DeepSeek-R1-0528-FP8 \
  --enable-mixed-chunk \
  --tensor-parallel-size 8 \
  --enable-p2p-check \
  --disable-shared-experts-fusion
```
Then, run the benchmark in a separate terminal:
```bash
python -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --num-prompts 1000 \
  --model models/DeepSeek-R1-0528-GPU-weight
```
**Performance Results:**

| Metric | Value |
|--------|-------|
| Total Token Throughput | 302.71 tok/s |
| Output Token Throughput | 116.36 tok/s |
| Request Throughput | 0.62 req/s |
| Mean Inter-Token Latency (ITL) | 300.80 ms |
| Median Inter-Token Latency | 208.43 ms |
| P99 Inter-Token Latency | 1364.97 ms |

This setup demonstrates that SGLang + KTransformers can effectively leverage consumer-grade GPUs for hybrid inference, achieving **over 300 tokens/s total throughput** on a MoE model with hundreds of billions of parameters. The inter-token latency (median 208.43 ms) is low enough for smooth streaming generation in interactive applications.
## Acknowledgements

We would like to thank everyone in the community that helped make this effort possible.
@@ -88,7 +228,7 @@ We would like to thank everyone in the community that helped make this effort po

**Approaching AI**: Jiahao Wang, Ziwei Yuan, Yaochen Han, Jiaqi Liao, Xianglin Chen, Zhiyuan Ai, Yongsen Hu, Zhuo Wang, Daocheng Ye, Yanlong Wu, Yufeng Tian, Heng Guo, Hao Wu, Zirui Li, Yingqi Tian, Yue Qin, Xin Qu, Baijin Hao, Donghui Liu.

**SGLang team and community:** Jingyi Chen, Shangming Cai, Lianmin Zheng, Yineng Zhang, and many others for their insightful review comments on this PR and for their work on the SGLang framework.

## Related resources
