
Commit c051bfe

[doc][distributed] add more doc for setting up multi-node environment (#6529)
1 parent 9e0b558 commit c051bfe

2 files changed, 103 insertions(+), 11 deletions(-)

docs/source/serving/distributed_serving.rst

Lines changed: 54 additions & 11 deletions
@@ -1,7 +1,10 @@
 .. _distributed_serving:
 
+Distributed Inference and Serving
+=================================
+
 How to decide the distributed inference strategy?
-=================================================
+-------------------------------------------------
 
 Before going into the details of distributed inference and serving, let's first clarify when to use distributed inference and what strategies are available. The common practice is:
 
@@ -16,8 +19,8 @@ After adding enough GPUs and nodes to hold the model, you can run vLLM first, …
 .. note::
     There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
 
-Distributed Inference and Serving
-=================================
+Details for Distributed Inference and Serving
+----------------------------------------------
 
 vLLM supports distributed tensor-parallel inference and serving. Currently, we support `Megatron-LM's tensor parallel algorithm <https://arxiv.org/pdf/1909.08053.pdf>`_. We also support pipeline parallel as a beta feature for online serving. We manage the distributed runtime with either `Ray <https://github.com/ray-project/ray>`_ or python native multiprocessing. Multiprocessing can be used when deploying on a single node; multi-node inferencing currently requires Ray.
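
As a concrete illustration of the uneven-split note above (an editor's sketch, assuming a hypothetical single node with 3 GPUs and a locally downloaded LLaMA-style model), you would keep the tensor parallel size at 1 and split the model across layers instead:

.. code-block:: console

    $ vllm serve /path/to/the/model \
    $     --tensor-parallel-size 1 \
    $     --pipeline-parallel-size 3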

@@ -49,19 +52,59 @@ You can also additionally specify :code:`--pipeline-parallel-size` to enable …
 .. note::
     Pipeline parallel is a beta feature. It is only supported for online serving as well as LLaMa, GPT2, and Mixtral style models.
 
-To scale vLLM beyond a single machine, install and start a `Ray runtime <https://docs.ray.io/en/latest/ray-core/starting-ray.html>`_ via CLI before running vLLM:
+Multi-Node Inference and Serving
+--------------------------------
+
+If a single node does not have enough GPUs to hold the model, you can run the model using multiple nodes. It is important to make sure the execution environment is the same on all nodes, including the model path and the Python environment. The recommended way is to use docker images to ensure the same environment, and to hide the heterogeneity of the host machines by mapping them into the same docker configuration.
+
+The first step is to start containers and organize them into a cluster. We have provided a helper `script <https://github.com/vllm-project/vllm/tree/main/examples/run_cluster.sh>`_ to start the cluster.
+
+Pick a node as the head node, and run the following command:
+
+.. code-block:: console
+
+    $ bash run_cluster.sh \
+    $     vllm/vllm-openai \
+    $     ip_of_head_node \
+    $     --head \
+    $     /path/to/the/huggingface/home/in/this/node
+
+On the rest of the worker nodes, run the following command:
 
 .. code-block:: console
 
-    $ pip install ray
+    $ bash run_cluster.sh \
+    $     vllm/vllm-openai \
+    $     ip_of_head_node \
+    $     --worker \
+    $     /path/to/the/huggingface/home/in/this/node
 
-    $ # On head node
-    $ ray start --head
+Then you get a Ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster; any shell disconnect will terminate the cluster.
 
-    $ # On worker nodes
-    $ ray start --address=<ray-head-address>
+Then, on any node, use ``docker exec -it node /bin/bash`` to enter the container, and execute ``ray status`` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
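
For example (a minimal check, assuming the container is named ``node``, which is the name hard-coded in ``run_cluster.sh``):

.. code-block:: console

    $ docker exec -it node /bin/bash
    $ ray status  # run inside the container; lists the nodes and GPUs in the cluster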

-After that, you can run inference and serving on multiple machines by launching the vLLM process on the head node by setting :code:`tensor_parallel_size` multiplied by :code:`pipeline_parallel_size` to the number of GPUs to be the total number of GPUs across all machines.
+After that, on any node, you can use vLLM as usual, just as if you had all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
+
+.. code-block:: console
+
+    $ vllm serve /path/to/the/model/in/the/container \
+    $     --tensor-parallel-size 8 \
+    $     --pipeline-parallel-size 2
+
+You can also use tensor parallel without pipeline parallel; just set the tensor parallel size to the total number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:
+
+.. code-block:: console
+
+    $ vllm serve /path/to/the/model/in/the/container \
+    $     --tensor-parallel-size 16
+
+To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. by using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like ``--privileged -e NCCL_IB_HCA=mlx5`` to the ``run_cluster.sh`` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm whether Infiniband is working is to run vLLM with the ``NCCL_DEBUG=TRACE`` environment variable set, e.g. ``NCCL_DEBUG=TRACE vllm serve ...``, and check the logs for the NCCL version and the network used. If you find ``[send] via NET/Socket`` in the logs, it means NCCL uses a raw TCP socket, which is not efficient for cross-node tensor parallel. If you find ``[send] via NET/IB/GDRDMA`` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.
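
As a sketch of what that can look like in practice (an editor's example; the ``mlx5`` HCA name is hardware-dependent, so check with your administrator), the Infiniband flags are appended when starting each node, and the debug variable is set when launching vLLM:

.. code-block:: console

    $ # on the head node; workers are analogous, with --worker
    $ bash run_cluster.sh \
    $     vllm/vllm-openai \
    $     ip_of_head_node \
    $     --head \
    $     /path/to/the/huggingface/home/in/this/node \
    $     --privileged -e NCCL_IB_HCA=mlx5
    $
    $ # inside the container, check which transport NCCL picks
    $ NCCL_DEBUG=TRACE vllm serve /path/to/the/model/in/the/container \
    $     --tensor-parallel-size 16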

+.. warning::
+    After you start the Ray cluster, please also check the GPU-GPU communication between nodes, which can be non-trivial to set up. Please refer to the `sanity check script <https://docs.vllm.ai/en/latest/getting_started/debugging.html>`_ for more information.
 
 .. warning::
-    Please make sure you downloaded the model to all the nodes, or the model is downloaded to some distributed file system that is accessible by all nodes.
+
+    Please make sure you have downloaded the model to all the nodes (with the same path), or that the model is downloaded to some distributed file system that is accessible by all nodes.
+
+    When you use a huggingface repo id to refer to the model, you should append your huggingface token to the ``run_cluster.sh`` script, e.g. ``-e HF_TOKEN=``. The recommended way is to download the model first, and then use the path to refer to the model.
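
For instance (an editor's sketch; ``<your_hf_token>`` is a placeholder for your own token), the token is passed as just another argument, which ``run_cluster.sh`` forwards to ``docker run``:

.. code-block:: console

    $ bash run_cluster.sh \
    $     vllm/vllm-openai \
    $     ip_of_head_node \
    $     --head \
    $     /path/to/the/huggingface/home/in/this/node \
    $     -e HF_TOKEN=<your_hf_token>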

examples/run_cluster.sh

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
#!/bin/bash

# Check for minimum number of required arguments
if [ $# -lt 4 ]; then
    echo "Usage: $0 docker_image head_node_address --head|--worker path_to_hf_home [additional_args...]"
    exit 1
fi

# Assign the first four arguments and shift them away
DOCKER_IMAGE="$1"
HEAD_NODE_ADDRESS="$2"
NODE_TYPE="$3" # Should be --head or --worker
PATH_TO_HF_HOME="$4"
shift 4

# Additional arguments are passed directly to the Docker command
ADDITIONAL_ARGS="$@"

# Validate node type
if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi

# Define a function to clean up the container on EXIT signal
cleanup() {
    docker stop node
    docker rm node
}
trap cleanup EXIT

# Command setup for head or worker node; --block keeps Ray in the
# foreground so the container (and hence the cluster) stays alive
RAY_START_CMD="ray start --block"
if [ "${NODE_TYPE}" == "--head" ]; then
    RAY_START_CMD+=" --head --port=6379"
else
    RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379"
fi

# Run the docker command with the user-specified parameters and additional arguments
docker run \
    --entrypoint /bin/bash \
    --network host \
    --name node \
    --shm-size 10.24g \
    --gpus all \
    -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \
    ${ADDITIONAL_ARGS} \
    "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"

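For orientation, on a worker node the script above effectively expands to a ``docker run`` invocation like the following (an editor's sketch with no additional arguments; ``ip_of_head_node`` and the HF path are placeholders):

.. code-block:: console

    $ docker run \
    $     --entrypoint /bin/bash \
    $     --network host \
    $     --name node \
    $     --shm-size 10.24g \
    $     --gpus all \
    $     -v "/path/to/the/huggingface/home/in/this/node:/root/.cache/huggingface" \
    $     vllm/vllm-openai -c "ray start --block --address=ip_of_head_node:6379"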