Kimi K2 Thinking is an advanced large language model created by Moonshot AI.
This recipe shows how to run Kimi K2 Thinking with reasoning capabilities on your Kubernetes cluster or any cloud. It includes two modes:
- Low Latency (TP8): Best for interactive applications requiring quick responses
- High Throughput (TP8+DCP8): Best for batch processing and high-volume serving scenarios
- Check that you have installed SkyPilot (docs).
- Check that `sky check` shows clouds or Kubernetes is enabled.
- Note: This model requires 8x H200 or H20 GPUs.
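If SkyPilot is not installed yet, here is a minimal setup sketch, assuming a pip-based install (pick the extras that match your infra):

```bash
# Install SkyPilot with the Kubernetes backend (swap in [aws], [gcp], etc. as needed)
pip install -U "skypilot[kubernetes]"

# Confirm that at least one cloud or Kubernetes context is enabled
sky check
```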
For low-latency scenarios, use tensor parallelism:
```bash
sky launch kimi-k2-thinking.sky.yaml -c kimi-k2-thinking
```

`kimi-k2-thinking.sky.yaml` uses tensor parallelism across 8 GPUs for optimal low-latency performance.
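For reference, the low-latency YAML is expected to look roughly like the sketch below: the same `vllm serve` invocation as the high-throughput variant shown later, minus the DCP flag. The `envs` and `resources` values here are assumptions; the `kimi-k2-thinking.sky.yaml` shipped with this recipe is authoritative.

```yaml
# Sketch only: field values assumed, not copied from the recipe's YAML.
envs:
  MODEL_NAME: moonshotai/Kimi-K2-Thinking

resources:
  accelerators: {H200:8, H20:8}  # 8x H200 or H20, per the prerequisites
  ports: 8081

run: |
  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code
```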
🎉 Congratulations! 🎉 You have now launched the Kimi K2 Thinking LLM with reasoning capabilities on your infra.
For high-throughput scenarios, use Decode Context Parallel (DCP) for 43% faster token generation and 26% higher throughput:
```bash
sky launch kimi-k2-thinking-high-throughput.sky.yaml -c kimi-k2-thinking-ht
```

`kimi-k2-thinking-high-throughput.sky.yaml` adds `--decode-context-parallel-size 8` to enable DCP:
```yaml
run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (High Throughput Mode with DCP)...'
  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --decode-context-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code
```

From vLLM's benchmark:
| Metric | TP8 (Low Latency) | TP8+DCP8 (High Throughput) | Improvement |
|---|---|---|---|
| Request Throughput (req/s) | 1.25 | 1.57 | +25.6% |
| Output Token Throughput (tok/s) | 485.78 | 695.13 | +43.1% |
| Mean TTFT (sec) | 271.2 | 227.8 | +16.0% |
| KV Cache Size (tokens) | 715,072 | 5,721,088 | 8x |
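To sanity-check these numbers on your own deployment, one option is vLLM's serving benchmark script. The invocation below is a sketch: the script path and flags follow upstream vLLM's `benchmarks/benchmark_serving.py` and may differ across versions, and it is not the exact command used to produce the table above.

```bash
# Sketch: run a serving benchmark against the high-throughput cluster.
# Flags/paths are assumptions based on upstream vLLM and may vary by version.
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking-ht)

python benchmarks/benchmark_serving.py \
  --backend vllm \
  --base-url http://$ENDPOINT \
  --model moonshotai/Kimi-K2-Thinking \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 100
```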
To curl `/v1/chat/completions`:

```bash
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking)

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant with deep reasoning capabilities."
      },
      {
        "role": "user",
        "content": "Explain how to solve the traveling salesman problem for 10 cities."
      }
    ]
  }' | jq .
```

The model will provide its reasoning process in the response, showing its chain-of-thought approach.
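Because the server is launched with `--reasoning-parser kimi_k2`, vLLM typically separates the reasoning trace from the final answer in the response. The sketch below inspects both; the `reasoning_content` field name follows vLLM's reasoning-parser convention and may differ across versions:

```bash
RESPONSE=$(curl -s http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}]
  }')

# Chain-of-thought trace (field name assumed from vLLM's reasoning-parser output)
echo "$RESPONSE" | jq -r '.choices[0].message.reasoning_content'

# Final answer
echo "$RESPONSE" | jq -r '.choices[0].message.content'
```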
To shut down all resources:
```bash
sky down kimi-k2-thinking
```

With no change to the YAML, launch a fully managed service with autoscaling replicas and load-balancing on your infra:
```bash
sky serve up kimi-k2-thinking.sky.yaml -n kimi-k2-thinking
```

Wait until the service is ready:
```bash
watch -n10 sky serve status kimi-k2-thinking
```

Get a single endpoint that load-balances across replicas:
```bash
ENDPOINT=$(sky serve status --endpoint kimi-k2-thinking)
```

Tip: SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.
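"With no change to the YAML" implies the recipe's YAML already carries a `service` section that SkyServe reads. If you want to tune autoscaling yourself, that section typically looks like the sketch below; the replica counts and probe settings here are illustrative assumptions, not the recipe's actual values:

```yaml
# Sketch of a SkyServe service section; values are illustrative assumptions.
service:
  readiness_probe:
    path: /v1/models               # probe the OpenAI-compatible server
    initial_delay_seconds: 1800    # large models can take a while to load
  replica_policy:
    min_replicas: 1
    max_replicas: 3
    target_qps_per_replica: 1
```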
To curl the endpoint:
```bash
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant with deep reasoning capabilities."
      },
      {
        "role": "user",
        "content": "Design a distributed system for real-time analytics."
      }
    ]
  }' | jq .
```

To shut down all resources:
```bash
sky serve down kimi-k2-thinking
```

See more details in SkyServe docs.