---
title: "Deploying Kimi K2 with PD Disaggregation and Large-Scale Expert Parallelism on 128 H200 GPUs"
author: "The Mooncake Team"
date: "July 20, 2025"
previewImg: /images/blog/k2_large_scale/preview.jpg
---

## 1️⃣ Introduction: Deploying the Most Advanced Open-Source MoE Model

**Kimi K2 is currently the most advanced open-source Mixture-of-Experts (MoE) model available.**

Released by Moonshot AI in 2025, it features:

- **1 trillion total parameters**
- **32 billion activated parameters per token**
- **384 experts with dynamic routing**
- **Multi-head Latent Attention (MLA)** for long-context support

Kimi K2 achieves strong performance in **frontier knowledge, math, and coding**, and is optimized for **agentic tasks**—not just answering questions but taking multi-step actions.

Moonshot AI open-sourced two versions:

- **Kimi-K2-Base**: The foundation model for research and fine-tuning
- **Kimi-K2-Instruct**: A post-trained model for general-purpose chat and agentic applications

For more details, please refer to the [official Kimi K2 release](https://moonshotai.github.io/Kimi-K2/).

---

### Why Large-Scale Deployment Matters

Given the model's architecture, large-scale deployment is what lets us fully leverage the hardware and drive down cost:

- **Serve More Requests, Faster:** Higher throughput, lower latency, more concurrent sessions, and shorter queues.
- **Lower $/Token:** Saturate the hardware and amortize model loading; efficiency improves with scale.

However, large-scale deployment of trillion-parameter MoE models presents unique challenges:

- **Computational sparsity in MoE layers** necessitates large batch sizes to keep matrix operations compute-bound. Large-scale Expert Parallelism (EP) scales parallelism strategies across more GPUs, aggregates requests from multiple devices, reduces per-GPU memory pressure, and frees up VRAM for larger KV caches—effectively increasing the batch size.
- **Cross-node communication** consumes a significant share of runtime and must be carefully optimized
- **Sparse expert activation** leads to load imbalance across GPUs

Efficient deployment of Kimi K2 on **128 H200 GPUs** therefore requires rethinking both system design and deployment workflows.

In this blog post, we explain how we addressed these challenges using **OME** and **SGLang**.

---

## 2️⃣ Background: From DeepSeek R1 to Kimi K2

In May 2025, we published [Deploying DeepSeek R1 with PD Disaggregation and Large-Scale EP](https://lmsys.org/blog/2025-05-05-large-scale-ep/), where we demonstrated:

- **Prefill-Decode (PD) Disaggregation** to separate compute-heavy and latency-sensitive tasks
- **Large-Scale Expert Parallelism (EP)** to handle MoE routing across 96 GPUs
- **5× throughput improvement** compared to vanilla tensor parallelism on H100s

At the same time, our [OME blog](https://lmsys.org/blog/2025-07-08-ome/) introduced **model-driven deployment**, bridging the operational gap between:

- **ML Engineers**, who design complex serving strategies
- **Production Engineers**, who need simple and reliable deployments

The OME insight—that the model should drive deployment, not vice versa—proved productive for scaling to Kimi K2's 1T-parameter architecture. The transition required adapting DeepSeek R1's PD Disaggregation and EP design to Kimi K2's 384 experts while maintaining high performance.

---

## 3️⃣ Our Solution: OME + SGLang PD Disaggregation + Large-Scale Expert Parallelism

For Kimi K2, we combined the strengths of **OME** and **SGLang** to create an optimized, scalable deployment pipeline.

### Model-Driven Deployment with OME

OME (Open Model Engine) simplifies the deployment of advanced models like Kimi K2 by abstracting away the complexity of parallelism, sharding, scaling, and runtime configuration. With a declarative configuration model, OME enables production teams to deploy and manage large models without manual tuning or custom scripting.

**OME Installation**

Install OME directly from the OCI registry using the following commands:

```bash
# Step 1: Install OME CRDs
helm upgrade --install ome-crd oci://ghcr.io/moirai-internal/charts/ome-crd --namespace ome --create-namespace

# Step 2: Install OME core resources
helm upgrade --install ome oci://ghcr.io/moirai-internal/charts/ome-resources --namespace ome
```

For detailed setup instructions, refer to the official [OME installation guide](https://docs.sglang.ai/ome/docs/installation/).
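
Before moving on, it can help to confirm that the control plane came up cleanly. A minimal sanity check, assuming the default `ome` namespace used in the commands above:

```bash
# Both Helm releases should show as deployed,
# and the OME controller pods should reach Running.
helm list -n ome
kubectl get pods -n ome
```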

**Registering the Kimi K2 Model**

To enable OME to manage the Kimi K2 model family, apply the following ClusterBaseModel resource:

```bash
kubectl apply -f https://raw.githubusercontent.com/sgl-project/ome/refs/heads/main/config/models/moonshotai/Kimi-K2-Instruct.yaml
```

Note: You may download the YAML file and customize the `path` field to specify where the model should be stored locally. OME will download the model directly from Hugging Face with optimized parallelism and automatically verify the artifact checksum to ensure integrity.
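
For example, to customize the storage location, you can fetch the manifest, edit it, and apply your local copy. This is a sketch; the `path` field follows the note above, and the full schema is defined by the ClusterBaseModel CRD:

```bash
# Download the sample ClusterBaseModel manifest
curl -LO https://raw.githubusercontent.com/sgl-project/ome/refs/heads/main/config/models/moonshotai/Kimi-K2-Instruct.yaml

# Edit the path field to point at your local model storage
# (e.g., a shared NVMe or network volume), then apply the modified copy.
kubectl apply -f Kimi-K2-Instruct.yaml
```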

**Installing the Latest SGLang Serving Runtime for Kimi K2**

```bash
kubectl apply -f https://raw.githubusercontent.com/sgl-project/ome/refs/heads/main/config/runtimes/srt/kimi-k2-pd-rt.yaml
```

**Deploying the Model**

Once the model and runtime are registered, deploy the inference endpoint using:

```bash
kubectl apply -f https://raw.githubusercontent.com/sgl-project/ome/refs/heads/main/config/samples/isvc/moonshotai/kimi-k2-pd.yaml
```

With these declarative resources in place, OME automatically handles model downloading, runtime orchestration, and endpoint provisioning—enabling scalable, production-grade inference for the Kimi K2 model family.
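
To watch the rollout, you can poll the namespace until all components are up. This is a sketch: the `kimi-k2-instruct` namespace and service name match the sample manifest referenced above and the port-forward command in the next step, so adjust them if you customized the InferenceService:

```bash
# Prefill, decode, and router pods should eventually reach Running
kubectl get pods -n kimi-k2-instruct -w

# The service fronting the SGLang router
kubectl get svc -n kimi-k2-instruct
```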

**Interacting with the Model**

This command forwards local port 8080 to the model service on port 80:

```bash
kubectl port-forward -n kimi-k2-instruct service/kimi-k2-instruct 8080:80
```

Leave this running in one terminal; it routes your local http://localhost:8080 to the SGLang router. Once the port-forward is active, run the following in a second terminal:

```bash
curl -s -X POST http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer None' \
  -d '{
    "text": "The future of AI is",
    "max_new_tokens": 50,
    "temperature": 0.7
  }'
```
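
SGLang also exposes OpenAI-compatible endpoints; assuming your router version forwards them, the same port-forward can serve chat-style requests. The snippet below is a sketch: the `model` value is an assumption, so query `/v1/models` on the endpoint to see the exact served model name:

```bash
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer None' \
  -d '{
    "model": "moonshotai/Kimi-K2-Instruct",
    "messages": [{"role": "user", "content": "Summarize PD disaggregation in one sentence."}],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```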

---

### OME Advantages & PD + DeepEP + Router Insights

OME (Open Model Engine) offers a declarative, production-ready framework for deploying large models like Kimi K2. It abstracts the complexities of GPU topology, distributed configuration, and runtime tuning—eliminating the need for custom orchestration logic. With a single ClusterServingRuntime definition, teams can launch optimized multi-node inference workloads at scale.

This configuration demonstrates a powerful setup leveraging **Prefill-Decode (PD) disaggregation** and **Large-Scale EP**, enabling:

- **Disaggregated scaling** of prefill and decode workloads with independent resource control
- **Low-latency decode** via `deepep-mode=low_latency` and token-aware dispatch tuning
- **Advanced expert routing** with `ep-dispatch-algorithm=dynamic` and `enable-eplb` (see the flag sketch after this list)
- **RDMA acceleration** for high-throughput KV cache transfer
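
For intuition, here is a rough sketch of how these options map onto SGLang launch flags for a decode worker. This is illustrative only: the flag names mirror the options listed above, and the authoritative, complete arguments live in the `kimi-k2-pd-rt.yaml` ClusterServingRuntime applied earlier.

```bash
# Illustrative decode-side flags only; the real multi-node command
# (distributed init, RDMA/NIC settings, parallel sizes) is defined in the
# ClusterServingRuntime manifest and rendered by OME.
python3 -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2-Instruct \
  --disaggregation-mode decode \
  --deepep-mode low_latency \
  --ep-dispatch-algorithm dynamic \
  --enable-eplb \
  --trust-remote-code
```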

The deployment is orchestrated by a lightweight **SGLang Router**, which provides:

- **Dynamic service discovery** for prefill and decode nodes via label selectors
- **Auto-scaling capabilities** independent of engine and decoder workloads
- **Least-privilege routing model**—ideal for secure production environments
- **Optimized load balancing** tailored for disaggregated serving patterns

Together, OME and the SGLang Router form a robust foundation for large-scale, low-latency, and maintainable inference infrastructure.

### Prefill-Decode Disaggregation

We separate inference into two independent components:

| Stage | Role |
| --- | --- |
| **Prefill** | Handles large prompt ingestion (e.g., 2000-token inputs). This is compute-bound and benefits from large batch parallelism. |
| **Decode** | Handles autoregressive generation (e.g., 100-token outputs). This is latency-sensitive and optimized for high-throughput outputs. |

Prefill and Decode are deployed as independent services, each scaled and optimized separately.

---

### Large-Scale Expert Parallelism (EP)

Kimi K2 activates a subset of its **384 experts** per token. We implemented:

- **96 redundant experts on decode nodes** to balance MoE routing
- **NUMA-aware GPU grouping** for optimal NVLink and PCIe utilization on H200 clusters

This design minimizes load imbalance and ensures even GPU utilization across the 128-card cluster.

---

## 4️⃣ Performance: 2000-Input, 100-Output Benchmark

We benchmarked Kimi K2 with a typical LLM serving workload on **128 H200 GPUs in a 1P1D configuration (one prefill group of 4 nodes and one decode group of 12 nodes)**:

| Metric | Value |
| --- | --- |
| **Input Length** | 2000 tokens |
| **Output Length** | 100 tokens |
| **Decode Batch Size** | 480 |

We use the same benchmark setup as in the DeepSeek R1 deployment blog. Longer outputs for agentic scenarios are left as future work.
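
For reference, a workload of this shape can be generated with SGLang's serving benchmark pointed at the endpoint from the previous section. Treat this as a sketch: the flag names follow the `sglang.bench_serving` utility at the time of writing, so check `python3 -m sglang.bench_serving --help` for your version.

```bash
# Random-length workload approximating 2000-token inputs / 100-token outputs.
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 8080 \
  --dataset-name random \
  --random-input-len 2000 \
  --random-output-len 100 \
  --num-prompts 2000
```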

Note: The prefill-to-decode ratio is workload-dependent. We prioritized decode nodes to maximize the KV cache pool size, which is critical for scaling the batch size to 480.

---

### Cluster-Level Performance (128 × H200 GPUs)

| Metric | Value |
| --- | --- |
| **Prefill Throughput** | **896k tokens/sec** |
| **Decode Throughput** | **384k tokens/sec** |
| **Cost per 1M Output Tokens** | **~$0.21** (at $2.30/hour per H200) |

The cost figure follows directly from the cluster numbers: 128 GPUs × $2.30/hour ≈ $294/hour, while 384k output tokens/sec corresponds to roughly 1.38B output tokens/hour, which works out to about $0.21 per million output tokens.

---

### Comparison to DeepSeek R1 Deployment

| Model | Experts | GPUs | Prefill Throughput (tokens/sec per node) | Decode Throughput (tokens/sec per node) |
| --- | --- | --- | --- | --- |
| **DeepSeek R1** | 256 | 96 × H100 | 52.3k | 22.3k |
| **Kimi K2** | 384 | 128 × H200 | 56k | 24k |

Despite Kimi K2's larger MoE and more complex routing, our deployment achieves:

- **Balanced expert activation**, using the expert-parallel load balancer (EPLB)
- **High throughput per GPU**, by applying SGLang's existing optimizations for the DeepSeek V3 architecture, which Kimi K2 shares, on H200

The next step is evaluating and optimizing long-context scenarios. Since K2 is designed for agentic tasks, the average input length in such scenarios has been reported to range from 30,000 to 50,000 tokens.

---

## 5️⃣ Conclusion: Trillion-Parameter Inference at Scale

By combining **OME**, **SGLang**, **PD Disaggregation**, and **Large-Scale Expert Parallelism**, we deployed Kimi K2 on **128 H200 GPUs**, achieving:

- **Cost-effective large-scale inference** (~$0.21 per 1M output tokens on H200) for short-context scenarios, with ongoing work to optimize long-context workloads
- **Simplified deployment workflows** with model-driven configuration

All components of this deployment are **fully open-source and reproducible**. We welcome the community to build on this work.

This deployment was made possible not only by open collaboration between Mooncake and the SGLang community, but also through generous infrastructure support from NVIDIA DGX Cloud. NVIDIA provided the SGLang team with access to 128 H200 GPUs via DGX Cloud, enabling us to take Kimi K2 from model release to production-grade inference very quickly. As a result, organizations can now leverage SGLang to serve Kimi K2 at scale, unlocking advanced reasoning capabilities with state-of-the-art performance.

---

### Acknowledgments

We would like to express our heartfelt gratitude to the following teams and collaborators:

- **Mooncake Team:** Boxin Zhang, Shangming Cai, Mingxing Zhang, and colleagues.
- **SGLang Team and community:** Simo Lin, Jingyi Chen, Qiaolin Yu, Yanbo Yang, Yineng Zhang, and many others.

We extend our thanks to the **MoonshotAI Team**—including Shaowei Liu, Zhengtao Wang, Weiran He, Xinran Xu, and others—for their support in tuning the big beautiful model K2.

---

## Further Reading

- [Deploying DeepSeek R1 with PD Disaggregation and Large-Scale EP](https://lmsys.org/blog/2025-05-05-large-scale-ep/)
- [OME: Model-Driven LLM Deployment](https://lmsys.org/blog/2025-07-08-ome/)
- [Kimi K2 Official Release](https://moonshotai.github.io/Kimi-K2/)
- [SGLang GitHub Repository](https://github.com/sgl-project/sglang)