This guide describes how to run Kimi-K2 with native FP8.
Note: This guide is adapted in part from the official Kimi-K2-Instruct Deployment Guidance provided by Moonshot AI. We would like to express our gratitude to the original authors.
Install vLLM in a fresh environment:

```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
```

The smallest deployment unit for Kimi-K2 FP8 weights with 128k sequence length on the mainstream H800 platform is a cluster of 16 GPUs, using either Tensor Parallelism (TP) or "data parallel + expert parallel" (DP+EP). Launch commands for both configurations are provided below. You may scale up to more nodes and increase expert parallelism to enlarge the inference batch size and overall throughput.
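The 16-GPU TP launch below spans two nodes, so a Ray cluster must be running across both before starting vLLM. A minimal sketch, assuming node 0 is the Ray head reachable at `$MASTER_IP` (the port is illustrative):

```bash
# node 0 (head):
ray start --head --port=6379

# node 1 (worker, joins the head):
ray start --address=$MASTER_IP:6379
```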
A sample launch command is:
```bash
# with ray running on node 0 and node 1, launch from node 0:
vllm serve moonshotai/Kimi-K2-Instruct --trust-remote-code --tokenizer-mode auto --tensor-parallel-size 8 --pipeline-parallel-size 2 --dtype bfloat16 --quantization fp8 --max-model-len 2048 --max-num-seqs 1 --max-num-batched-tokens 1024 --enable-chunked-prefill --disable-log-requests --kv-cache-dtype fp8 -dcp 8
```

Key parameter notes:
- `--enable-auto-tool-choice`: Required when enabling tool usage.
- `--tool-call-parser kimi_k2`: Required when enabling tool usage.
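With a server launched with both flags (the DP+EP commands below include them, along with `--served-model-name kimi-k2`), a tool-enabled request against the OpenAI-compatible endpoint looks roughly like this; `get_weather` is a hypothetical tool definition used only for illustration:

```bash
curl http://$MASTER_IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2",
    "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```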
For the DP+EP deployment, you can install libraries like DeepEP and DeepGEMM as needed, then run the following commands (example on H800):
```bash
# node 0
vllm serve moonshotai/Kimi-K2-Instruct --port 8000 --served-model-name kimi-k2 --trust-remote-code --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address $MASTER_IP --data-parallel-rpc-port $PORT --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 --gpu-memory-utilization 0.85 --enable-auto-tool-choice --tool-call-parser kimi_k2

# node 1
vllm serve moonshotai/Kimi-K2-Instruct --headless --data-parallel-start-rank 8 --port 8000 --served-model-name kimi-k2 --trust-remote-code --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address $MASTER_IP --data-parallel-rpc-port $PORT --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 --gpu-memory-utilization 0.85 --enable-auto-tool-choice --tool-call-parser kimi_k2
```

Additional flags:
- You can set `--max-model-len` to preserve memory; `--max-model-len=65536` is usually good for most scenarios.
- You can set `--max-num-batched-tokens` to balance throughput and latency: higher values mean higher throughput but also higher latency. `--max-num-batched-tokens=32768` is usually good for prompt-heavy workloads, but you can reduce it to 16k or 8k to lower activation memory usage and decrease latency.
- vLLM conservatively uses 90% of GPU memory by default; you can set `--gpu-memory-utilization=0.95` to maximize the KV cache.
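Once the server is up, a quick sanity check before benchmarking can save time. A minimal sketch, assuming the DP+EP launch above (`--served-model-name kimi-k2`, port 8000 on the master node):

```bash
# Confirm the model is registered with the server
curl http://$MASTER_IP:8000/v1/models

# Minimal chat completion round-trip
curl http://$MASTER_IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "kimi-k2", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'
```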
```bash
vllm bench serve \
  --model moonshotai/Kimi-K2-Instruct \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 512 \
  --request-rate 1.0 \
  --num-prompts 8 \
  --ignore-eos \
  --trust-remote-code
```

Test different batch sizes by changing `--num-prompts` (a sweep script is sketched below the list):
- Batch sizes: 1, 16, 32, 64, 128, 256, 512
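A simple way to run the sweep is a shell loop over `--num-prompts` (a sketch; collecting and parsing the results is left out):

```bash
# Sweep batch sizes by varying --num-prompts
for n in 1 16 32 64 128 256 512; do
  echo "=== num-prompts: $n ==="
  vllm bench serve \
    --model moonshotai/Kimi-K2-Instruct \
    --dataset-name random \
    --random-input-len 1000 \
    --random-output-len 512 \
    --request-rate 1.0 \
    --num-prompts "$n" \
    --ignore-eos \
    --trust-remote-code
done
```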
Sample result with `--num-prompts 8`:

```text
============ Serving Benchmark Result ============
Successful requests: 8
Request rate configured (RPS): 1.00
Benchmark duration (s): 132.79
Total input tokens: 8000
Total generated tokens: 4096
Request throughput (req/s): 0.06
Output token throughput (tok/s): 30.84
Total Token throughput (tok/s): 91.09
---------------Time to First Token----------------
Mean TTFT (ms): 58282.92
Median TTFT (ms): 57827.30
P99 TTFT (ms): 110831.45
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 30.78
Median TPOT (ms): 31.49
P99 TPOT (ms): 33.76
---------------Inter-token Latency----------------
Mean ITL (ms): 30.78
Median ITL (ms): 22.37
P99 ITL (ms): 322.81
==================================================
```

A second run with longer inputs (8k-token prompts, 1k-token outputs, and a request rate high enough to be effectively unthrottled):

```bash
vllm bench serve \
  --model moonshotai/Kimi-K2-Instruct \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --request-rate 10000 \
  --num-prompts 16 \
  --ignore-eos \
  --trust-remote-code
```

```text
============ Serving Benchmark Result ============
Successful requests: 16
Benchmark duration (s): 62.75
Total input tokens: 128000
Total generated tokens: 16000
Request throughput (req/s): 0.25
Output token throughput (tok/s): 254.99
Total Token throughput (tok/s): 2294.88
---------------Time to First Token----------------
Mean TTFT (ms): 4278.46
Median TTFT (ms): 4285.54
P99 TTFT (ms): 7685.31
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 58.15
Median TPOT (ms): 58.16
P99 TPOT (ms): 61.35
---------------Inter-token Latency----------------
Mean ITL (ms): 58.15
Median ITL (ms): 54.59
P99 ITL (ms): 91.18
==================================================
```

After adding `-dcp 8` (decode context parallel) to the launch command:

```text
============ Serving Benchmark Result ============
Successful requests: 16
Request rate configured (RPS): 10000.00
Benchmark duration (s): 47.14
Total input tokens: 128000
Total generated tokens: 16000
Request throughput (req/s): 0.34
Output token throughput (tok/s): 339.38
Peak output token throughput (tok/s): 384.00
Peak concurrent requests: 16.00
Total Token throughput (tok/s): 3054.46
---------------Time to First Token----------------
Mean TTFT (ms): 2007.87
Median TTFT (ms): 1932.03
P99 TTFT (ms): 4680.76
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.01
Median TPOT (ms): 45.10
P99 TPOT (ms): 46.51
---------------Inter-token Latency----------------
Mean ITL (ms): 45.01
Median ITL (ms): 42.01
P99 ITL (ms): 52.01
==================================================
```