Commit 670c488 (1 parent fcf8882): [docs]: Add deepseek-v3.2 run tutorial (#1659)
1 file changed: +177, -0
# Running DeepSeek V3.2 with SGLang and KT-Kernel

This tutorial demonstrates how to run DeepSeek V3.2 model inference using SGLang integrated with KT-Kernel for CPU-GPU heterogeneous inference. This setup enables efficient deployment of large MoE models by offloading experts to CPU.

## Table of Contents

- [Hardware Requirements](#hardware-requirements)
- [Prerequisites](#prerequisites)
- [Step 1: Download Model Weights](#step-1-download-model-weights)
- [Step 2: Quantize CPU Weights](#step-2-quantize-cpu-weights)
- [Step 3: Launch SGLang Server](#step-3-launch-sglang-server)
- [Step 4: Send Inference Requests](#step-4-send-inference-requests)
## Hardware Requirements

**Minimum Configuration:**

- **GPU**: NVIDIA L20 48GB (or equivalent with at least 27GB of VRAM available)
- **CPU**: Intel Xeon with AMX support (e.g., Sapphire Rapids; see the check below)
- **RAM**: At least 350GB of system memory for INT4 quantization
- **Storage**: ~1TB for model weights (FP8 + INT4 quantized)

**Tested Configuration:**

- **GPU**: NVIDIA L20 48GB
- **CPU**: Intel(R) Xeon(R) Platinum 8488C
- **RAM**: 2TB DDR5
- **OS**: Linux (Ubuntu 20.04+ recommended)
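As a quick, optional check (not part of the original tutorial), you can confirm that your CPU actually exposes AMX before continuing; on Sapphire Rapids-class Xeons the flags `amx_tile`, `amx_int8`, and `amx_bf16` should appear:

```bash
# List AMX-related CPU flags; an empty result means the CPU (or the
# kernel/VM configuration) does not expose AMX.
lscpu | grep -o 'amx[a-z_0-9]*' | sort -u

# Equivalent check reading /proc/cpuinfo directly.
grep -o 'amx[a-z_0-9]*' /proc/cpuinfo | sort -u
```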
## Prerequisites

Before starting, ensure you have:

1. **KT-Kernel installed** - Follow the [installation guide](./kt-kernel_intro.md#installation)
2. **SGLang installed** - Follow the [SGLang integration steps](./kt-kernel_intro.md#integration-with-sglang)
3. **CUDA toolkit** - Compatible with your GPU (CUDA 11.8+ recommended)
4. **Hugging Face CLI** - For downloading models:

   ```bash
   pip install huggingface-hub
   ```
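Optionally, a few sanity checks before moving on (a minimal sketch; it assumes `nvcc` is on your PATH and that your SGLang build exposes `__version__`):

```bash
# Confirm the CUDA toolkit, SGLang, and the Hugging Face CLI are visible
# from the environment you will launch the server from.
nvcc --version
python -c "import sglang; print(sglang.__version__)"
huggingface-cli --help > /dev/null && echo "huggingface-cli OK"
```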
## Step 1: Download Model Weights

DeepSeek V3.2 requires downloading the following model repositories:

1. **DeepSeek-V3.2**
2. **DeepSeek-V3.2-Speciale**

```bash
# Create a directory for models
mkdir -p /path/to/models
cd /path/to/models

# Download DeepSeek-V3.2 (FP8 weights for GPU)
huggingface-cli download deepseek-ai/DeepSeek-V3.2 \
    --local-dir /path/to/deepseek-v3.2

# Download DeepSeek-V3.2-Speciale (if needed)
huggingface-cli download deepseek-ai/DeepSeek-V3.2-Speciale \
    --local-dir /path/to/deepseek-v3.2-speciale
```

**Note:** Replace `/path/to/models` with your actual storage path throughout this tutorial.
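A rough, optional way to confirm the download completed (paths follow the placeholders above; the exact shard count and size depend on the repository contents):

```bash
# Total size on disk and number of downloaded safetensors shards.
du -sh /path/to/deepseek-v3.2
ls /path/to/deepseek-v3.2/*.safetensors | wc -l
```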
## Step 2: Quantize CPU Weights

Convert the FP8 GPU weights to INT4 quantized CPU weights using the provided conversion script.

### Conversion Command

For a 2-NUMA system with 60 physical cores:

```bash
cd /path/to/ktransformers/kt-kernel

python scripts/convert_cpu_weights.py \
    --input-path /path/to/deepseek-v3.2 \
    --input-type fp8 \
    --output /path/to/deepseek-v3.2-INT4 \
    --quant-method int4 \
    --cpuinfer-threads 60 \
    --threadpool-count 2 \
    --no-merge-safetensor
```
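The `--cpuinfer-threads` and `--threadpool-count` values above match the example machine (60 physical cores across 2 NUMA nodes). To pick values for your own system, you can inspect the topology with standard Linux tooling, for example:

```bash
# In the example above, --threadpool-count matches the NUMA node count and
# --cpuinfer-threads the number of physical cores (Sockets x Cores per socket,
# hyper-threads excluded).
lscpu | grep -E 'NUMA node\(s\)|Socket\(s\)|Core\(s\) per socket'
```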
## Step 3: Launch SGLang Server

Start the SGLang server with KT-Kernel integration for CPU-GPU heterogeneous inference.

### Launch Command

For a single NVIDIA L20 48GB + 2-NUMA CPU system:

```bash
python -m sglang.launch_server \
    --host 0.0.0.0 \
    --port 30000 \
    --model /path/to/deepseek-v3.2 \
    --kt-weight-path /path/to/deepseek-v3.2-INT4 \
    --kt-cpuinfer 60 \
    --kt-threadpool-count 2 \
    --kt-num-gpu-experts 1 \
    --attention-backend triton \
    --trust-remote-code \
    --mem-fraction-static 0.98 \
    --chunked-prefill-size 4096 \
    --max-running-requests 32 \
    --max-total-tokens 40000 \
    --served-model-name DeepSeek-V3.2 \
    --enable-mixed-chunk \
    --tensor-parallel-size 1 \
    --enable-p2p-check \
    --disable-shared-experts-fusion \
    --kt-method AMXINT4
```
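Model loading can take a while. A minimal sketch for confirming the server is up before sending traffic; it assumes the `/health` and `/v1/models` endpoints exposed by recent SGLang releases and the host/port used in the launch command above:

```bash
# Prints 200 once the server is healthy.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:30000/health

# Should list the served model name (DeepSeek-V3.2 in this setup).
curl -s http://localhost:30000/v1/models
```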
### Resource Usage

- **GPU VRAM:** ~27GB (for 1 GPU expert per layer + attention)
- **System RAM:** ~350GB (for INT4 quantized CPU experts)
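To compare these figures against your own machine while the server is running, one option (assuming `nvidia-smi` and `free` are available) is:

```bash
# GPU memory in use vs. total.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
# System memory usage, in gibibytes.
free -g
```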
## Step 4: Send Inference Requests

Once the server is running, you can send inference requests using the OpenAI-compatible API.

### Basic Chat Completion Request

```bash
curl -s http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "DeepSeek-V3.2",
        "stream": false,
        "messages": [
            {"role": "user", "content": "hi"}
        ]
    }'
```
### Example Response

```json
{
  "id": "adbb44f6aafb4b58b167e42fbbb1eed3",
  "object": "chat.completion",
  "created": 1764675126,
  "model": "DeepSeek-V3.2",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hi there! 👋 \n\nThanks for stopping by! How can I help you today? Feel free to ask me anything - I'm here to assist with questions, explanations, conversations, or whatever you need! 😊\n\nIs there something specific on your mind, or would you like to know more about what I can do?",
        "reasoning_content": null,
        "tool_calls": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": 1
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 72,
    "completion_tokens": 67,
    "prompt_tokens_details": null,
    "reasoning_tokens": 0
  },
  "metadata": {
    "weight_version": "default"
  }
}
```
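For interactive use you may prefer streaming. Below is a sketch of the same request with `"stream": true`; per standard OpenAI-compatible behavior the server then returns Server-Sent Events chunks instead of a single JSON body (the prompt text is just an example):

```bash
curl -N http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "DeepSeek-V3.2",
        "stream": true,
        "messages": [
            {"role": "user", "content": "Explain CPU-GPU heterogeneous inference in one paragraph."}
        ]
    }'
```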
## Additional Resources

- [KT-Kernel Documentation](../../../kt-kernel/README.md)
- [DeepSeek V3.2 Model Card](https://huggingface.co/deepseek-ai/DeepSeek-V3.2)
- [SGLang GitHub](https://github.com/sgl-project/sglang)
