## Background: Hybrid Inference for Sparse MoE Models
Modern Mixture-of-Experts (MoE) language models such as **DeepSeek-V3** contain hundreds of billions of parameters, but only a small subset of experts is activated per token.

This **sparse activation** pattern makes MoE models ideal for **CPU/GPU hybrid inference**: the sparsely activated experts can run efficiently on CPUs with large memory capacity, while the dense and compute-intensive components — attention and shared experts — execute on GPUs with higher bandwidth and throughput.

This hybrid design allows trillion-parameter models to be deployed on a single machine with limited GPU memory, enabling local inference for research and private applications.

We have already developed a proof-of-concept implementation, and the [roadmap](https://github.com/sgl-project/sglang/issues/11425) for full integration into SGLang is underway.
## Installation
To use KTransformers hybrid inference with SGLang, you need to install both SGLang and the KTransformers CPU kernels (`kt-kernel`).
### Prerequisites
Before installation, ensure your system meets the following requirements:

- **CUDA**: Version 12.1 or above with proper PATH configuration
- **Operating System**: Linux x86_64
- **Compiler**: gcc, g++ >= 11
- **Build Tools**: CMake >= 3.25 (Note: the default CMake on Ubuntu 22.04 LTS may be too old)
- **Python**: Python 3.11 (via Miniconda3 or Anaconda3)
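
A quick way to verify these prerequisites is to print the relevant tool versions (this assumes the tools are already on your `PATH`; the expected values in the comments mirror the list above):

```bash
# Manual sanity check against the requirements listed above
gcc --version | head -n 1      # expect gcc >= 11
g++ --version | head -n 1      # expect g++ >= 11
cmake --version | head -n 1    # expect CMake >= 3.25
nvcc --version | grep release  # expect CUDA 12.1 or newer
python --version               # expect Python 3.11
```
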
### Step 1: Install SGLang
Follow the official [SGLang installation guide](https://docs.sglang.ai/get_started/install.html) to install SGLang:
```bash
pip install "sglang[all]"
```
### Step 2: Install KTransformers CPU Kernels
The KTransformers CPU kernels (`kt-kernel`) provide AMX-optimized computation for hybrid inference. For detailed installation instructions and troubleshooting, refer to the [official kt-kernel installation guide](https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/README.md).
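
A typical from-source install could look like the sketch below; the exact build commands are an assumption here, so treat the kt-kernel README as the authoritative reference:

```bash
# Illustrative sketch only -- the official kt-kernel README documents the exact steps.
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers/kt-kernel
# Build and install into the active Python environment
# (assumes the prerequisites above: CMake >= 3.25, gcc/g++ >= 11, CUDA 12.1+).
pip install .
```
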
## Usage Example
### Downloading Models
The DeepSeek-R1 models optimized for KTransformers hybrid inference (including both GPU and CPU weights) can be downloaded from the [Approaching AI ModelScope profile](https://modelscope.cn/profile/ApproachingAI2024).
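
If you prefer the command line over the web interface, the ModelScope CLI can fetch the weights. The repository id below is a placeholder (check the profile page for the actual names), and the flags assume a recent `modelscope` release:

```bash
pip install modelscope
# '<namespace>/<model-repo>' is a placeholder -- substitute the actual repository from the profile page.
modelscope download --model '<namespace>/<model-repo>' --local_dir /path/to/weights
```
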
### Launching the Server
To launch an SGLang server with KTransformers hybrid inference enabled, you can use the following command:
```bash
python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 30000 \
  --model /path/to/gpu-weight \
  --kt-amx-weight-path /path/to/cpu-weight \
  --kt-cpuinfer 80 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 200 \
  --kt-amx-method AMXINT4 \
  --attention-backend triton \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --chunked-prefill-size 4096 \
  --max-running-requests 37 \
  --max-total-tokens 37000 \
  --served-model-name DeepSeek-R1-0528-FP8 \
  --enable-mixed-chunk \
  --tensor-parallel-size 8 \
  --enable-p2p-check \
  --disable-shared-experts-fusion
```
### Key Parameters

- `--kt-amx-weight-path`: Path to the CPU-optimized model weights. These weights are pre-quantized and formatted for efficient AMX computation.
- `--kt-cpuinfer`: Number of CPU cores dedicated to expert inference (e.g., 80 cores for dual-socket servers).
- `--kt-threadpool-count`: Number of thread pools for parallel CPU execution. Typically set to 2 for dual-socket NUMA configurations.
- `--kt-num-gpu-experts`: Number of "hot" experts to keep on GPU. More GPU experts reduce CPU compute pressure but require additional GPU memory. Adjust based on GPU capacity and workload patterns.
- `--kt-amx-method`: CPU kernel optimization method. Use `AMXINT4` for int4-quantized models to leverage Intel AMX instructions for maximum throughput.
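
As a rough starting point (a heuristic, not an official tuning rule), the two CPU-related values can be derived from the machine topology and then refined empirically:

```bash
# Heuristic sketch: one thread pool per socket (NUMA node), leaving a few physical
# cores per socket free for the OS, networking, and the GPU runtime.
SOCKETS=$(lscpu | awk -F: '/^Socket\(s\)/ {gsub(/ /, "", $2); print $2}')
CORES_PER_SOCKET=$(lscpu | awk -F: '/^Core\(s\) per socket/ {gsub(/ /, "", $2); print $2}')
KT_THREADPOOL_COUNT=$SOCKETS
KT_CPUINFER=$(( SOCKETS * (CORES_PER_SOCKET - 4) ))
echo "--kt-threadpool-count $KT_THREADPOOL_COUNT --kt-cpuinfer $KT_CPUINFER"
```
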
### Hardware Requirements
For optimal performance with KTransformers hybrid inference:

- **CPUs**: Modern Intel Xeon processors with AMX support (e.g., Sapphire Rapids or later) for maximum CPU expert throughput.
- **Memory**: Sufficient DDR5 memory to hold all expert weights (typically 500GB+ for DeepSeek-V3-sized models).
- **GPUs**: One or more GPUs with enough memory for attention layers, shared experts, and a subset of routed experts.
- **NUMA**: Dual-socket configurations benefit from NUMA-aware thread pool assignment (`--kt-threadpool-count 2`).
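
On Linux, you can confirm AMX support and inspect the NUMA layout with standard utilities before tuning the flags above:

```bash
# AMX-capable CPUs expose flags such as amx_tile, amx_int8, and amx_bf16
grep -o 'amx[_a-z0-9]*' /proc/cpuinfo | sort -u
# The NUMA node count informs --kt-threadpool-count
lscpu | grep -i 'numa node(s)'
```
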
After launching the server, you can send inference requests via the OpenAI-compatible API endpoint at `http://0.0.0.0:30000`.
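
For example, a minimal chat completion request against this server (using the served model name from the launch command above) looks like:

```bash
curl http://0.0.0.0:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-0528-FP8",
    "messages": [{"role": "user", "content": "Explain CPU/GPU hybrid inference in one paragraph."}],
    "max_tokens": 256
  }'
```
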
## Benchmark Results (Preview)
### Single-GPU + CPU Performance
We further evaluate the multi-GPU + CPU hybrid inference capability enabled by integrating KTransformers into SGLang.

The table above presents the total throughput (tokens/s) under different levels of concurrency and varying numbers of GPUs. As shown, under single-concurrency conditions, the 8-GPU configuration provides only a limited improvement over the 1-GPU setup (an increase of merely 26%). However, under 8-way concurrency, the same 8-GPU configuration achieves a **264% throughput** gain compared to 1 GPU, demonstrating excellent usability—each request achieves nearly 20 tokens per second on average. The improvement mainly comes from placing more experts on GPUs, which reduces CPU memory accesses under bandwidth bottlenecks.
#### ShareGPT Benchmark on RTX 4090 × 8 Setup
We further evaluated the SGLang + KTransformers integration on a consumer-grade GPU setup using **8× RTX 4090 GPUs** with an **Intel Xeon Platinum 8488C CPU**. The benchmark was conducted on **DeepSeek-R1-0528**, a large-scale MoE model from the DeepSeek-R1 series, using the ShareGPT dataset with 1000 conversation requests (301K input tokens, 188K output tokens).
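
A run of this kind can be driven with SGLang's serving benchmark script; the invocation below is an illustrative sketch (flag names assume a recent `sglang.bench_serving`, so check them against your installed version):

```bash
python -m sglang.bench_serving \
  --backend sglang \
  --host 0.0.0.0 \
  --port 30000 \
  --dataset-name sharegpt \
  --num-prompts 1000
```
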
This setup demonstrates that SGLang + KTransformers can effectively leverage consumer-grade GPUs for hybrid inference, achieving **over 300 tokens/s total throughput** on trillion-parameter MoE models. The relatively low inter-token latency (median 208ms) ensures smooth streaming generation for interactive applications.
## Acknowledgements
We would like to thank everyone in the community who helped make this effort possible.

**SGLang team and community:** Jingyi Chen, Shangming Cai, Lianmin Zheng, Yineng Zhang, and many others for their insightful review comments on this PR and for their work on the SGLang framework.