Commit 91a5cf8

Merge branch 'master' of https://github.com/dstackai/dstack
2 parents 33be4e7 + ca326ba commit 91a5cf8

File tree

15 files changed

+165
-258
lines changed

docs/blog/posts/gpu-health-checks.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -51,7 +51,7 @@ A healthy instance is ready for workloads. A warning means you should monitor it
5151

5252
This release focuses on passive checks using DCGM background health checks. These run continuously and do not interrupt workloads.
5353

54-
For active checks today, you can run [NCCL tests](../../examples/clusters/nccl-tests/index.md) as a [distributed task](../../docs/concepts/tasks.md#distributed-tasks) to verify GPU-to-GPU communication and bandwidth across a fleet. Active tests like these can reveal network or interconnect issues that passive monitoring might miss. More built-in support for active diagnostics is planned.
54+
For active checks today, you can run [NCCL/RCCL tests](../../examples/clusters/nccl-rccl-tests/index.md) as a [distributed task](../../docs/concepts/tasks.md#distributed-tasks) to verify GPU-to-GPU communication and bandwidth across a fleet. Active tests like these can reveal network or interconnect issues that passive monitoring might miss. More built-in support for active diagnostics is planned.
5555

5656
## Supported backends
5757

docs/blog/posts/mpi.md

Lines changed: 1 addition & 1 deletion
@@ -100,5 +100,5 @@ as well as use MPI for other tasks.
100100
101101
!!! info "What's next?"
102102
1. Learn more about [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md)
103-
2. Check the [NCCL tests](../../examples/clusters/nccl-tests/index.md) example
103+
2. Check the [NCCL/RCCL tests](../../examples/clusters/nccl-rccl-tests/index.md) example
104104
3. Join [Discord](https://discord.gg/u8SmfwPpMd)

docs/docs/concepts/tasks.md

Lines changed: 1 addition & 1 deletion
@@ -144,7 +144,7 @@ Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other
144144

145145
!!! info "MPI"
146146
If you want to use MPI, you can set `startup_order` to `workers-first` and `stop_criteria` to `master-done`, and use `DSTACK_MPI_HOSTFILE`.
147-
See the [NCCL](../../examples/clusters/nccl-tests/index.md) or [RCCL](../../examples/clusters/rccl-tests/index.md) examples.
147+
See the [NCCL/RCCL tests](../../examples/clusters/nccl-rccl-tests/index.md) example.
148148

149149
> For detailed examples, see [distributed training](../../examples.md#distributed-training) examples.
150150

docs/docs/guides/clusters.md

Lines changed: 2 additions & 3 deletions
@@ -50,7 +50,7 @@ Within the task's `commands`, it's possible to use `DSTACK_MASTER_NODE_IP`, `DST
5050

5151
??? info "MPI"
5252
If you want to use MPI, you can set `startup_order` to `workers-first` and `stop_criteria` to `master-done`, and use `DSTACK_MPI_HOSTFILE`.
53-
See the [NCCL](../../examples/clusters/nccl-tests/index.md) or [RCCL](../../examples/clusters/rccl-tests/index.md) examples.
53+
See the [NCCL/RCCL tests](../../examples/clusters/nccl-rccl-tests/index.md) example.
5454

5555
!!! info "Retry policy"
5656
By default, if any of the nodes fails, `dstack` terminates the entire run. Configure a [retry policy](../concepts/tasks.md#retry-policy) to restart the run if any node fails.
@@ -59,8 +59,7 @@ Refer to [distributed tasks](../concepts/tasks.md#distributed-tasks) for an exam
5959

6060
## NCCL/RCCL tests
6161

62-
To test the interconnect of a created fleet, ensure you run [NCCL](../../examples/clusters/nccl-tests/index.md)
63-
(for NVIDIA) or [RCCL](../../examples/clusters/rccl-tests/index.md) (for AMD) tests using MPI.
62+
To test the interconnect of a created fleet, run the [NCCL/RCCL tests](../../examples/clusters/nccl-rccl-tests/index.md) using MPI.
6463

6564
## Volumes
6665

docs/examples.md

Lines changed: 10 additions & 20 deletions
@@ -80,26 +80,6 @@ hide:
8080
## Clusters
8181

8282
<div class="tx-landing__highlights_grid">
83-
<a href="/examples/clusters/nccl-tests"
84-
class="feature-cell sky">
85-
<h3>
86-
NCCL tests
87-
</h3>
88-
89-
<p>
90-
Run multi-node NCCL tests with MPI
91-
</p>
92-
</a>
93-
<a href="/examples/clusters/rccl-tests"
94-
class="feature-cell sky">
95-
<h3>
96-
RCCL tests
97-
</h3>
98-
99-
<p>
100-
Run multi-node RCCL tests with MPI
101-
</p>
102-
</a>
10383
<a href="/examples/clusters/gcp"
10484
class="feature-cell sky">
10585
<h3>
@@ -130,6 +110,16 @@ hide:
130110
Set up Crusoe clusters with optimized networking
131111
</p>
132112
</a>
113+
<a href="/examples/clusters/nccl-rccl-tests"
114+
class="feature-cell sky">
115+
<h3>
116+
NCCL/RCCL tests
117+
</h3>
118+
119+
<p>
120+
Run multi-node NCCL or RCCL tests with MPI
121+
</p>
122+
</a>
133123
</div>
134124

135125
## Inference
File renamed without changes.

docs/examples/clusters/rccl-tests/index.md

Whitespace-only changes.
Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
1+
# NCCL/RCCL tests
2+
3+
This example shows how to run [NCCL](https://github.com/NVIDIA/nccl-tests) or [RCCL](https://github.com/ROCm/rccl-tests) tests on a cluster using [distributed tasks](https://dstack.ai/docs/concepts/tasks#distributed-tasks).
4+
5+
!!! info "Prerequisites"
6+
Before running a distributed task, make sure to create a fleet with `placement` set to `cluster` (can be a [managed fleet](https://dstack.ai/docs/concepts/fleets#backend-placement) or an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh-placement)).
7+
8+
## Running as a task
9+
10+
Here's an example of a task that runs an AllReduce test on 2 nodes, each with 4 GPUs (8 processes in total).
11+
12+
=== "NCCL tests"
13+
14+
<div editor-title="examples/clusters/nccl-rccl-tests/nccl-tests.dstack.yml">
15+
16+
```yaml
type: task
name: nccl-tests

nodes: 2

startup_order: workers-first
stop_criteria: master-done

env:
  - NCCL_DEBUG=INFO
commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi

# Uncomment if the `kubernetes` backend requires it for `/dev/infiniband` access
#privileged: true

resources:
  gpu: nvidia:1..8
  shm_size: 16GB
```
48+
49+
</div>
50+
51+
!!! info "Default image"
52+
If you don't specify `image`, `dstack` uses its [base](https://github.com/dstackai/dstack/tree/master/docker/base) Docker image pre-configured with
53+
`uv`, `python`, `pip`, essential CUDA drivers, `mpirun`, and NCCL tests (under `/opt/nccl-tests/build`).
54+
55+
=== "RCCL tests"
56+
57+
<div editor-title="examples/clusters/nccl-rccl-tests/rccl-tests.dstack.yml">
58+
59+
```yaml
type: task
name: rccl-tests

nodes: 2
startup_order: workers-first
stop_criteria: master-done

# Mount the system libraries folder from the host
volumes:
  - /usr/local/lib:/mnt/lib

image: rocm/dev-ubuntu-22.04:6.4-complete
env:
  - NCCL_DEBUG=INFO
  - OPEN_MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
commands:
  # Setup MPI and build RCCL tests
  - apt-get install -y git libopenmpi-dev openmpi-bin
  - git clone https://github.com/ROCm/rccl-tests.git
  - cd rccl-tests
  - make MPI=1 MPI_HOME=$OPEN_MPI_HOME

  # Preload the RoCE driver library from the host (for Broadcom driver compatibility)
  - export LD_PRELOAD=/mnt/lib/libbnxt_re-rdmav34.so

  # Run RCCL tests via MPI
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --mca btl_tcp_if_include ens41np0 \
        -x LD_PRELOAD \
        -x NCCL_IB_HCA=mlx5_0/1,bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 \
        -x NCCL_IB_GID_INDEX=3 \
        -x NCCL_IB_DISABLE=0 \
        ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 20 -c 0;
    else
      sleep infinity
    fi

resources:
  gpu: MI300X:8
```
105+
106+
</div>
107+
108+
!!! info "RoCE library"
109+
Broadcom RoCE drivers require the `libbnxt_re` userspace library inside the container to be compatible with the host’s Broadcom
110+
kernel driver `bnxt_re`. To ensure this compatibility, we mount `libbnxt_re-rdmav34.so` from the host and preload it
111+
using `LD_PRELOAD` when running MPI.
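As a hedged sketch of the preload step, the snippet below checks that the mounted library exists before exporting `LD_PRELOAD`, so a missing host mount shows up as a warning rather than as an obscure loader message later. The helper name `find_preload_lib` is hypothetical and not part of `dstack`; the path is the one used in the RCCL task.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of dstack): print the library path if the
# file exists, otherwise warn on stderr and return a non-zero status.
find_preload_lib() {
  local lib=$1
  if [ -f "$lib" ]; then
    echo "$lib"
  else
    echo "warning: $lib not found; check the host volume mount" >&2
    return 1
  fi
}

# Path taken from the RCCL task; LD_PRELOAD is exported only when the file exists.
if lib=$(find_preload_lib /mnt/lib/libbnxt_re-rdmav34.so); then
  export LD_PRELOAD="$lib"
fi
```

Inside the task, such a guard would replace the plain `export LD_PRELOAD=...` command; whether to continue or abort when the library is missing is a design choice left to the task author.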
112+
113+
114+
!!! info "Privileged"
115+
In some cases, the backend (e.g., `kubernetes`) may require `privileged: true` to access the high-speed interconnect (e.g., InfiniBand).
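To make the `mpirun` and `all_reduce_perf` arguments in the tasks above more concrete, here is a small sketch of the arithmetic behind them. It assumes 2 nodes with 4 GPUs each, as in the description, and assumes that `8G` is parsed as 8 × 1024³ bytes. `NODES` and `sweep_sizes` are hypothetical names used only for illustration; in a real run, `dstack` sets `DSTACK_GPUS_NUM` and `DSTACK_GPUS_PER_NODE` itself.

```shell
#!/usr/bin/env bash
# Illustrative values only; NODES is a hypothetical variable, not set by dstack.
NODES=2
GPUS_PER_NODE=4                        # corresponds to -N (processes per node)
GPUS_NUM=$((NODES * GPUS_PER_NODE))    # corresponds to -n (total processes)

# Print the message sizes visited by a sweep that starts at a minimum size
# and multiplies by a factor until it exceeds the maximum (mirrors -b, -e, -f).
sweep_sizes() {
  local size=$1 max=$2 factor=$3
  while [ "$size" -le "$max" ]; do
    echo "$size"
    size=$((size * factor))
  done
}

MAX_BYTES=$((8 * 1024 * 1024 * 1024))  # assumption: 8G means 8 * 1024^3 bytes
SIZE_COUNT=$(sweep_sizes 8 "$MAX_BYTES" 2 | wc -l)

echo "total processes (-n): $GPUS_NUM, per node (-N): $GPUS_PER_NODE"
echo "message sizes in the -b 8 -e 8G -f 2 sweep: $((SIZE_COUNT))"
```

With `-g 1`, each MPI process drives one GPU, which is why the total process count equals the total GPU count.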
116+
117+
### Apply a configuration
118+
119+
To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply/) command.
120+
121+
<div class="termy">
122+
123+
```shell
$ dstack apply -f examples/clusters/nccl-rccl-tests/nccl-tests.dstack.yml

 #  BACKEND  REGION     INSTANCE       RESOURCES                                   SPOT  PRICE
 1  aws      us-east-1  g4dn.12xlarge  48xCPU, 192GB, 4xT4 (16GB), 100.0GB (disk)  no    $3.912
 2  aws      us-west-2  g4dn.12xlarge  48xCPU, 192GB, 4xT4 (16GB), 100.0GB (disk)  no    $3.912
 3  aws      us-east-2  g4dn.12xlarge  48xCPU, 192GB, 4xT4 (16GB), 100.0GB (disk)  no    $3.912

Submit the run nccl-tests? [y/n]: y
```
133+
134+
</div>
135+
136+
## Source code
137+
138+
The source code of this example can be found in
139+
[`examples/clusters/nccl-rccl-tests`](https://github.com/dstackai/dstack/blob/master/examples/clusters/nccl-rccl-tests).
140+
141+
## What's next?
142+
143+
1. Check [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks),
144+
[services](https://dstack.ai/docs/concepts/services), and [fleets](https://dstack.ai/docs/concepts/fleets).
File renamed without changes.
File renamed without changes.
