Commit 5e70dcd

[Doc] Fix CPU doc format (vllm-project#21316)
Signed-off-by: jiang1.li <[email protected]>
1 parent 25d585a commit 5e70dcd

File tree

  • docs/getting_started/installation

1 file changed: +10 −9 lines changed

docs/getting_started/installation/cpu.md

Lines changed: 10 additions & 9 deletions
@@ -168,17 +168,18 @@ Note, it is recommended to manually reserve 1 CPU for vLLM front-end process whe
 
 ### How to do performance tuning for vLLM CPU?
 
-- First of all, please make sure the thread-binding and KV cache space are properly set and take effect. You can check the thread-binding by running a vLLM benchmark and observing CPU cores usage via `htop`.
+First of all, please make sure the thread-binding and KV cache space are properly set and take effect. You can check the thread-binding by running a vLLM benchmark and observing CPU cores usage via `htop`.
 
-- Inference batch size is a important parameter for the performance. Larger batch usually provides higher throughput, smaller batch provides lower latency. Tuning max batch size starts from default value to balance throughput and latency is an effective way to improve vLLM CPU performance on specific platforms. There are two important related parameters in vLLM:
-    - `--max-num-batched-tokens`, defines the limit of token numbers in a single batch, has more impacts on the first token performance. The default value is set as:
-        - Offline Inference: `4096 * world_size`
-        - Online Serving: `2048 * world_size`
-    - `--max-num-seqs`, defines the limit of sequence numbers in a single batch, has more impacts on the output token performance.
-        - Offline Inference: `256 * world_size`
-        - Online Serving: `128 * world_size`
+Inference batch size is a important parameter for the performance. Larger batch usually provides higher throughput, smaller batch provides lower latency. Tuning max batch size starts from default value to balance throughput and latency is an effective way to improve vLLM CPU performance on specific platforms. There are two important related parameters in vLLM:
 
-- vLLM CPU supports tensor parallel (TP) and pipeline parallel (PP) to leverage multiple CPU sockets and memory nodes. For more detials of tuning TP and PP, please refer to [Optimization and Tuning](../../configuration/optimization.md). For vLLM CPU, it is recommend to use TP and PP togther if there are enough CPU sockets and memory nodes.
+- `--max-num-batched-tokens`, defines the limit of token numbers in a single batch, has more impacts on the first token performance. The default value is set as:
+    - Offline Inference: `4096 * world_size`
+    - Online Serving: `2048 * world_size`
+- `--max-num-seqs`, defines the limit of sequence numbers in a single batch, has more impacts on the output token performance.
+    - Offline Inference: `256 * world_size`
+    - Online Serving: `128 * world_size`
+
+vLLM CPU supports tensor parallel (TP) and pipeline parallel (PP) to leverage multiple CPU sockets and memory nodes. For more detials of tuning TP and PP, please refer to [Optimization and Tuning](../../configuration/optimization.md). For vLLM CPU, it is recommend to use TP and PP togther if there are enough CPU sockets and memory nodes.
 
 ### Which quantization configs does vLLM CPU support?
 
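The tuning knobs discussed in the reformatted section map to vLLM engine arguments. Below is a minimal sketch (not part of this commit) of how they could be set for offline inference on CPU; the model name, the parallel sizes, and the concrete batch limits are illustrative assumptions, not values recommended by the doc.

```python
# Minimal sketch: applying the CPU tuning knobs from the section above.
# Assumptions: the model name and all numeric values are placeholders;
# tune them for your own CPU sockets, memory nodes, and workload.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",     # assumed small model for a smoke test
    max_num_batched_tokens=4096,   # token limit per batch; mainly affects first-token latency
    max_num_seqs=256,              # sequence limit per batch; mainly affects output-token throughput
    tensor_parallel_size=2,        # assumed: two CPU sockets / memory nodes available
    pipeline_parallel_size=1,      # combine with TP when more sockets are available
)

outputs = llm.generate(
    ["The quickest way to tune vLLM on CPU is"],
    SamplingParams(max_tokens=32),
)
for output in outputs:
    print(output.outputs[0].text)
```

For online serving, the same limits are exposed as CLI flags, e.g. `vllm serve <model> --max-num-batched-tokens 2048 --max-num-seqs 128 --tensor-parallel-size 2`; starting from the defaults listed in the diff and adjusting toward your throughput or latency target is the tuning loop the doc describes.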