diff --git a/docs/serving/parallelism_scaling.md b/docs/serving/parallelism_scaling.md
index 14cd3b057791..a32840ea73b9 100644
--- a/docs/serving/parallelism_scaling.md
+++ b/docs/serving/parallelism_scaling.md
@@ -118,14 +118,16 @@ The common practice is to set the tensor parallel size to the number of GPUs in
 ```bash
 vllm serve /path/to/the/model/in/the/container \
     --tensor-parallel-size 8 \
-    --pipeline-parallel-size 2
+    --pipeline-parallel-size 2 \
+    --distributed-executor-backend ray
 ```
 
 Alternatively, you can set `tensor_parallel_size` to the total number of GPUs in the cluster:
 
 ```bash
 vllm serve /path/to/the/model/in/the/container \
-    --tensor-parallel-size 16
+    --tensor-parallel-size 16 \
+    --distributed-executor-backend ray
 ```
 
 ## Optimizing network communication for tensor parallelism