Kimi K2 Thinking is an advanced large language model created by Moonshot AI.
This recipe shows how to run Kimi K2 Thinking with reasoning capabilities on your Kubernetes cluster or any cloud. It includes two modes:
- Low Latency (TP8): Best for interactive applications requiring quick responses
- High Throughput (TP8+DCP8): Best for batch processing and high-volume serving scenarios
- Check that you have installed SkyPilot (docs).
- Check that `sky check` shows clouds or Kubernetes is enabled.
- Note: This model requires 8x H200 or H20 GPUs.
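If SkyPilot is not installed yet, here is a minimal setup sketch, assuming a pip-based install (pick the extras that match your infra):

```bash
# Install SkyPilot with the Kubernetes backend (swap in [aws], [gcp], etc. as needed)
pip install -U "skypilot[kubernetes]"

# Confirm that at least one cloud or Kubernetes context is enabled
sky check
```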
For low-latency scenarios, use tensor parallelism:
```bash
sky launch kimi-k2-thinking.sky.yaml -c kimi-k2-thinking
```

`kimi-k2-thinking.sky.yaml` uses tensor parallelism across 8 GPUs for optimal low-latency performance.
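For reference, the low-latency YAML is expected to look roughly like the sketch below: the same `vllm serve` invocation as the high-throughput variant shown later, minus the DCP flag. The `envs` and `resources` values here are assumptions; the `kimi-k2-thinking.sky.yaml` shipped with this recipe is authoritative.

```yaml
# Sketch only: field values assumed, not copied from the recipe's YAML.
envs:
  MODEL_NAME: moonshotai/Kimi-K2-Thinking

resources:
  accelerators: {H200:8, H20:8}  # 8x H200 or H20, per the prerequisites
  ports: 8081

run: |
  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code
```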
🎉 Congratulations! 🎉 You have now launched the Kimi K2 Thinking LLM with reasoning capabilities on your infra.
For high-throughput scenarios, use Decode Context Parallel (DCP) for 43% faster token generation and 26% higher throughput:
```bash
sky launch kimi-k2-thinking-high-throughput.sky.yaml -c kimi-k2-thinking-ht
```

`kimi-k2-thinking-high-throughput.sky.yaml` adds `--decode-context-parallel-size 8` to enable DCP:
```yaml
run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (High Throughput Mode with DCP)...'
  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --decode-context-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code
```

From vLLM's benchmark:
| Metric | TP8 (Low Latency) | TP8+DCP8 (High Throughput) | Improvement |
|---|---|---|---|
| Request Throughput (req/s) | 1.25 | 1.57 | +25.6% |
| Output Token Throughput (tok/s) | 485.78 | 695.13 | +43.1% |
| Mean TTFT (sec) | 271.2 | 227.8 | +16.0% |
| KV Cache Size (tokens) | 715,072 | 5,721,088 | 8x |
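To sanity-check these numbers on your own deployment, one option is vLLM's serving benchmark script. The invocation below is a sketch: the script path and flags follow upstream vLLM's `benchmarks/benchmark_serving.py` and may differ across versions, and it is not the exact command used to produce the table above.

```bash
# Sketch: run a serving benchmark against the high-throughput cluster.
# Flags/paths are assumptions based on upstream vLLM and may vary by version.
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking-ht)

python benchmarks/benchmark_serving.py \
  --backend vllm \
  --base-url http://$ENDPOINT \
  --model moonshotai/Kimi-K2-Thinking \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 100
```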
To curl `/v1/chat/completions`:

```bash
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking)

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant with deep reasoning capabilities."
      },
      {
        "role": "user",
        "content": "Explain how to solve the traveling salesman problem for 10 cities."
      }
    ]
  }' | jq .
```

The model will provide its reasoning process in the response, showing its chain-of-thought approach.
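Because the server is launched with `--reasoning-parser kimi_k2`, vLLM typically separates the reasoning trace from the final answer in the response. The sketch below inspects both; the `reasoning_content` field name follows vLLM's reasoning-parser convention and may differ across versions:

```bash
RESPONSE=$(curl -s http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}]
  }')

# Chain-of-thought trace (field name assumed from vLLM's reasoning-parser output)
echo "$RESPONSE" | jq -r '.choices[0].message.reasoning_content'

# Final answer
echo "$RESPONSE" | jq -r '.choices[0].message.content'
```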
To shut down all resources:
```bash
sky down kimi-k2-thinking
```

With no change to the YAML, launch a fully managed service with autoscaling replicas and load-balancing on your infra:
```bash
sky serve up kimi-k2-thinking.sky.yaml -n kimi-k2-thinking
```

Wait until the service is ready:
```bash
watch -n10 sky serve status kimi-k2-thinking
```

Get a single endpoint that load-balances across replicas:
```bash
ENDPOINT=$(sky serve status --endpoint kimi-k2-thinking)
```

Tip: SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.
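"With no change to the YAML" implies the recipe's YAML already carries a `service` section that SkyServe reads. If you want to tune autoscaling yourself, that section typically looks like the sketch below; the replica counts and probe settings here are illustrative assumptions, not the recipe's actual values:

```yaml
# Sketch of a SkyServe service section; values are illustrative assumptions.
service:
  readiness_probe:
    path: /v1/models               # probe the OpenAI-compatible server
    initial_delay_seconds: 1800    # large models can take a while to load
  replica_policy:
    min_replicas: 1
    max_replicas: 3
    target_qps_per_replica: 1
```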
To curl the endpoint:
```bash
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant with deep reasoning capabilities."
      },
      {
        "role": "user",
        "content": "Design a distributed system for real-time analytics."
      }
    ]
  }' | jq .
```

To shut down all resources:
```bash
sky serve down kimi-k2-thinking
```

See more details in SkyServe docs.