Commit 856a442

[Docs] Minor update of Clusters and Distributed tasks sections to reflect MPI new syntax (#2741)
1 parent: 2cf4394

6 files changed, +61 -71 lines changed

docs/docs/concepts/tasks.md

Lines changed: 10 additions & 1 deletion
@@ -139,6 +139,15 @@ Nodes can communicate using their private IP addresses.
 Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other
 [System environment variables](#system-environment-variables) for inter-node communication.
 
+`dstack` is easy to use with `accelerate`, `torchrun`, Ray, Spark, and any other distributed frameworks.
+
+
+!!! info "MPI"
+    If you want to use MPI, you can set `startup_order` to `workers-first` and `stop_criteria` to `master-done`, and use `DSTACK_MPI_HOSTFILE`.
+    See the [NCCL](../../examples/clusters/nccl-tests/index.md) or [RCCL](../../examples/clusters/rccl-tests/index.md) examples.
+
+> For detailed examples, see [distributed training](../../examples.md#distributed-training).
+
 ??? info "Network interface"
     Distributed frameworks usually detect the correct network interface automatically,
     but sometimes you need to specify it explicitly.
@@ -170,7 +179,7 @@ Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other
 recommended to create them via a fleet configuration
 to ensure the highest level of inter-node connectivity.
 
-`dstack` is easy to use with `accelerate`, `torchrun`, Ray, Spark, and any other distributed frameworks.
+> See the [Clusters](../guides/clusters.md) guide for more details on how to use `dstack` on clusters.
 
 ### Resources
 
docs/docs/guides/clusters.md

Lines changed: 11 additions & 8 deletions
@@ -38,25 +38,28 @@ For cloud fleets, fast interconnect is currently supported only on the `aws`, `g
 > To request fast interconnect support for other backends,
 file an [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues){:target="_blank"}.
 
-## NCCL/RCCL tests
-
-To test the interconnect of a created fleet, ensure you run [NCCL](../../examples/clusters/nccl-tests/index.md)
-(for NVIDIA) or [RCCL](../../examples/clusters/rccl-tests/index.md) (for AMD) tests.
-
 ## Distributed tasks
 
 A distributed task is a task with `nodes` set to a value greater than `1`. In this case, `dstack` first ensures a
-suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up,
-`dstack` starts the rest of the nodes and runs the task container on each of them.
+suitable fleet is available, then selects the master node (to obtain its IP) and finally runs jobs on each node.
 
 Within the task's `commands`, it's possible to use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other
 [system environment variables](../concepts/tasks.md#system-environment-variables) for inter-node communication.
 
-Refer to [distributed tasks](../concepts/tasks.md#distributed-tasks) for an example.
+??? info "MPI"
+    If you want to use MPI, you can set `startup_order` to `workers-first` and `stop_criteria` to `master-done`, and use `DSTACK_MPI_HOSTFILE`.
+    See the [NCCL](../../examples/clusters/nccl-tests/index.md) or [RCCL](../../examples/clusters/rccl-tests/index.md) examples.
 
 !!! info "Retry policy"
     By default, if any of the nodes fails, `dstack` terminates the entire run. Configure a [retry policy](../concepts/tasks.md#retry-policy) to restart the run if any node fails.
 
+Refer to [distributed tasks](../concepts/tasks.md#distributed-tasks) for an example.
+
+## NCCL/RCCL tests
+
+To test the interconnect of a created fleet, run the [NCCL](../../examples/clusters/nccl-tests/index.md)
+(for NVIDIA) or [RCCL](../../examples/clusters/rccl-tests/index.md) (for AMD) tests using MPI.
+
 ## Volumes
 
 ### Network volumes
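As a rough sketch of how the prose above fits together (not part of this commit): a `torchrun`-based distributed task that uses the system environment variables and a retry policy. `train.py`, the `pip install` step, and the `retry` values are assumptions for illustration only.

```yaml
type: task
name: train-dist             # placeholder name
nodes: 2

commands:
  - pip install torch        # assumption: the image in use does not ship PyTorch
  # torchrun takes its rendezvous settings from dstack's system environment variables
  - torchrun
      --nproc_per_node=$DSTACK_GPUS_PER_NODE
      --nnodes=$DSTACK_NODES_NUM
      --node_rank=$DSTACK_NODE_RANK
      --master_addr=$DSTACK_MASTER_NODE_IP
      --master_port=29500
      train.py

# Restart the run if any node fails (see the retry policy note above)
retry:
  on_events: [error]
  duration: 1h

resources:
  gpu: nvidia:4:16GB
```
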

examples/clusters/nccl-tests/.dstack.yml

Lines changed: 8 additions & 6 deletions
@@ -5,17 +5,19 @@ nodes: 2
 startup_order: workers-first
 stop_criteria: master-done
 
+# This image comes with MPI and NCCL tests pre-built
 image: dstackai/efa
 env:
   - NCCL_DEBUG=INFO
 commands:
+  - cd /root/nccl-tests/build
   - |
-    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
-      cd /root/nccl-tests/build
-      MPIRUN="mpirun --allow-run-as-root --hostfile $DSTACK_MPI_HOSTFILE"
-      # Run NCCL Tests
-      ${MPIRUN} \
-        -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
+    if [ $DSTACK_NODE_RANK -eq 0 ]; then
+      mpirun \
+        --allow-run-as-root \
+        --hostfile $DSTACK_MPI_HOSTFILE \
+        -n $DSTACK_GPUS_NUM \
+        -N $DSTACK_GPUS_PER_NODE \
         --mca btl_tcp_if_exclude lo,docker0 \
         --bind-to none \
         ./all_reduce_perf -b 8 -e 8G -f 2 -g 1

examples/clusters/nccl-tests/README.md

Lines changed: 9 additions & 7 deletions
@@ -20,13 +20,14 @@ image: dstackai/efa
 env:
   - NCCL_DEBUG=INFO
 commands:
+  - cd /root/nccl-tests/build
   - |
-    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
-      cd /root/nccl-tests/build
-      MPIRUN="mpirun --allow-run-as-root --hostfile $DSTACK_MPI_HOSTFILE"
-      # Run NCCL Tests
-      ${MPIRUN} \
-        -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
+    if [ $DSTACK_NODE_RANK -eq 0 ]; then
+      mpirun \
+        --allow-run-as-root \
+        --hostfile $DSTACK_MPI_HOSTFILE \
+        -n $DSTACK_GPUS_NUM \
+        -N $DSTACK_GPUS_PER_NODE \
         --mca btl_tcp_if_exclude lo,docker0 \
         --bind-to none \
         ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
@@ -37,11 +38,12 @@ commands:
 resources:
   gpu: nvidia:4:16GB
   shm_size: 16GB
-
 ```
 
 </div>
 
+<!-- TODO: Need to stop using our EFA image - either make our default image cluster-friendly, or recommend using NGC or other images -->
+
 !!! info "Docker image"
     The `dstackai/efa` image used in the example comes with MPI and NCCL tests pre-installed. While it is optimized for
     [AWS EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}, it can also

examples/clusters/rccl-tests/.dstack.yml

Lines changed: 14 additions & 27 deletions
@@ -2,10 +2,12 @@ type: task
 name: rccl-tests
 
 nodes: 2
+startup_order: workers-first
+stop_criteria: master-done
 
-# Uncomment to mount the system libraries folder from the host
-#volumes:
-#  - /usr/local/lib:/mnt/lib
+# Mount the system libraries folder from the host
+volumes:
+  - /usr/local/lib:/mnt/lib
 
 image: rocm/dev-ubuntu-22.04:6.4-complete
 env:
@@ -16,41 +18,26 @@
   - apt-get install -y git libopenmpi-dev openmpi-bin
   - git clone https://github.com/ROCm/rccl-tests.git
   - cd rccl-tests
-  - make MPI=1 MPI_HOME=${OPEN_MPI_HOME}
+  - make MPI=1 MPI_HOME=$OPEN_MPI_HOME
 
-  # Uncomment to preload the RoCE driver library from the host (for Broadcom driver compatibility)
-  #- export LD_PRELOAD=/mnt/lib/libbnxt_re-rdmav34.so
+  # Preload the RoCE driver library from the host (for Broadcom driver compatibility)
+  - export LD_PRELOAD=/mnt/lib/libbnxt_re-rdmav34.so
 
   # Run RCCL tests via MPI
   - |
-    FIFO=/tmp/${DSTACK_RUN_NAME}
-    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
-      sleep 10
-      echo "$DSTACK_NODES_IPS" | tr ' ' '\n' > hostfile
-      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
-      # Wait for other nodes
-      while true; do
-        if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
-          break
-        fi
-        echo 'Waiting for other nodes...'
-        sleep 5
-      done
-      # Run NCCL Tests
-      ${MPIRUN} \
-        -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
+    if [ $DSTACK_NODE_RANK -eq 0 ]; then
+      mpirun --allow-run-as-root \
+        --hostfile $DSTACK_MPI_HOSTFILE \
+        -n $DSTACK_GPUS_NUM \
+        -N $DSTACK_GPUS_PER_NODE \
         --mca btl_tcp_if_include ens41np0 \
         -x LD_PRELOAD \
         -x NCCL_IB_HCA=mlx5_0/1,bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 \
         -x NCCL_IB_GID_INDEX=3 \
         -x NCCL_IB_DISABLE=0 \
         ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 20 -c 0;
-      # Notify other nodes the MPI run is finished
-      ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
     else
-      mkfifo ${FIFO}
-      # Wait for a message from the master node
-      cat ${FIFO}
+      sleep infinity
     fi
 
 resources:

examples/clusters/rccl-tests/README.md

Lines changed: 9 additions & 22 deletions
@@ -13,6 +13,8 @@ type: task
 name: rccl-tests
 
 nodes: 2
+startup_order: workers-first
+stop_criteria: master-done
 
 # Mount the system libraries folder from the host
 volumes:
@@ -27,41 +29,26 @@ commands:
   - apt-get install -y git libopenmpi-dev openmpi-bin
   - git clone https://github.com/ROCm/rccl-tests.git
   - cd rccl-tests
-  - make MPI=1 MPI_HOME=${OPEN_MPI_HOME}
+  - make MPI=1 MPI_HOME=$OPEN_MPI_HOME
 
   # Preload the RoCE driver library from the host (for Broadcom driver compatibility)
   - export LD_PRELOAD=/mnt/lib/libbnxt_re-rdmav34.so
 
   # Run RCCL tests via MPI
   - |
-    FIFO=/tmp/${DSTACK_RUN_NAME}
-    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
-      sleep 10
-      echo "$DSTACK_NODES_IPS" | tr ' ' '\n' > hostfile
-      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
-      # Wait for other nodes
-      while true; do
-        if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
-          break
-        fi
-        echo 'Waiting for other nodes...'
-        sleep 5
-      done
-      # Run NCCL Tests
-      ${MPIRUN} \
-        -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
+    if [ $DSTACK_NODE_RANK -eq 0 ]; then
+      mpirun --allow-run-as-root \
+        --hostfile $DSTACK_MPI_HOSTFILE \
+        -n $DSTACK_GPUS_NUM \
+        -N $DSTACK_GPUS_PER_NODE \
         --mca btl_tcp_if_include ens41np0 \
         -x LD_PRELOAD \
         -x NCCL_IB_HCA=mlx5_0/1,bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 \
         -x NCCL_IB_GID_INDEX=3 \
         -x NCCL_IB_DISABLE=0 \
        ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 20 -c 0;
-      # Notify other nodes the MPI run is finished
-      ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
     else
-      mkfifo ${FIFO}
-      # Wait for a message from the master node
-      cat ${FIFO}
+      sleep infinity
    fi
 
 resources:
