Commit 20de4c7

[Docs] Added Clusters guide (#2646)

Co-authored-by: jvstme <[email protected]>
1 parent 50a0529 · commit 20de4c7
12 files changed: 132 additions, 40 deletions


docs/blog/posts/mpi.md (3 additions, 3 deletions)

@@ -86,10 +86,10 @@ resources:

  </div>

- The first worker node (`DSTACK_NODE_RANK=0`) generates a `hostfile` listing all node IPs and waits until all nodes are
+ The master node (`DSTACK_NODE_RANK=0`) generates a `hostfile` listing all node IPs and waits until all nodes are
  reachable via MPI. Once confirmed, it launches the `/root/nccl-tests/build/all_reduce_perf` benchmark across all available GPUs in the cluster.

- The other worker nodes remain blocked until they receive a termination signal from the master node via a FIFO pipe.
+ Non-master nodes remain blocked until they receive a termination signal from the master node via a FIFO pipe.

  With this, now you can use such a task to run both NCCL or RCCL tests on both cloud and SSH fleets,
  as well as use MPI for other tasks.

@@ -102,4 +102,4 @@ as well as use MPI for other tasks.
  !!! info "What's next?"
      1. Learn more about [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md)
      2. Check the [NCCL tests](../../examples/clusters/nccl-tests/index.md) example
-     2. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}
+     3. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}
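
For readers skimming the diff, the hostfile/FIFO pattern described above can be sketched as a distributed task roughly as follows. This is a minimal sketch, not the post's exact configuration: the task name, `mpirun` flags, benchmark path, and GPU spec are illustrative, and it assumes `DSTACK_NODES_IPS` lists one IP per line.

```yaml
type: task
name: nccl-tests-sketch       # illustrative name
nodes: 2

commands:
  - FIFO=/tmp/dstack_job_done
  - |
    if [ "$DSTACK_NODE_RANK" = "0" ]; then
      # Master node: build a hostfile from the node IPs and wait until all nodes answer via MPI
      echo "$DSTACK_NODES_IPS" > hostfile
      until mpirun --allow-run-as-root --hostfile hostfile -n "$DSTACK_NODES_NUM" -N 1 true >/dev/null 2>&1; do
        echo 'Waiting for other nodes...'; sleep 5
      done
      # Run the benchmark across all GPUs in the cluster
      mpirun --allow-run-as-root --hostfile hostfile -n "$DSTACK_GPUS_NUM" \
        /root/nccl-tests/build/all_reduce_perf -b 8M -e 8G -f 2 -g 1
      # Signal completion so the non-master nodes can exit
      mpirun --allow-run-as-root --hostfile hostfile -n "$DSTACK_NODES_NUM" -N 1 sh -c "echo done > $FIFO"
    else
      # Non-master nodes block on the FIFO until the master signals completion
      mkfifo "$FIFO"
      cat "$FIFO"
    fi

resources:
  gpu: A100:8                 # illustrative
```
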

docs/blog/posts/nebius.md (1 addition, 1 deletion)

@@ -104,7 +104,7 @@ $ dstack apply -f .dstack.yml
  The new `nebius` backend supports CPU and GPU instances, [fleets](../../docs/concepts/fleets.md),
  [distributed tasks](../../docs/concepts/tasks.md#distributed-tasks), and more.

- > Support for [network volumes](../../docs/concepts/volumes.md#network-volumes) and accelerated cluster
+ > Support for [network volumes](../../docs/concepts/volumes.md#network) and accelerated cluster
  interconnects is coming soon.

  !!! info "What's next?"

docs/docs/concepts/fleets.md (29 additions, 18 deletions)

@@ -63,25 +63,34 @@ Once the status of instances changes to `idle`, they can be used by dev environm

  To ensure instances are interconnected (e.g., for
  [distributed tasks](tasks.md#distributed-tasks)), set `placement` to `cluster`.
- This ensures all instances are provisioned in the same backend and region with optimal inter-node connectivity
+ This ensures all instances are provisioned with optimal inter-node connectivity.

  ??? info "AWS"
-     `dstack` automatically enables the Elastic Fabric Adapter for all
-     [EFA-capable instance types :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types){:target="_blank"}.
-     If the `aws` backend config has `public_ips: false` set, `dstack` enables the maximum number of interfaces supported by the instance.
-     Otherwise, if instances have public IPs, only one EFA interface is enabled per instance due to AWS limitations.
+     When you create a cloud fleet with AWS, [Elastic Fabric Adapter networking :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} is automatically configured if it's supported for the corresponding instance type.
+     Note that EFA requires `public_ips` to be set to `false` in the `aws` backend configuration.
+     Otherwise, instances are only connected by the default VPC subnet.
+
+     Refer to the [EFA](../../blog/posts/efa.md) example for more details.
+
+ ??? info "GCP"
+     When you create a cloud fleet with GCP, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured for the A3 Mega and A3 High instance types.
+
+     !!! info "Backend configuration"
+         Note that GPUDirect-TCPXO and GPUDirect-TCPX require `extra_vpcs` to be configured in the `gcp` backend configuration.
+         Refer to the [A3 Mega](../../examples/clusters/a3mega/index.md) and
+         [A3 High](../../examples/clusters/a3high/index.md) examples for more details.

  ??? info "Nebius"
-     `dstack` automatically creates an [InfiniBand cluster](https://docs.nebius.com/compute/clusters/gpu)
-     if all instances in the fleet support it.
+     When you create a cloud fleet with Nebius, [InfiniBand networking :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} is automatically configured if it's supported for the corresponding instance type.
      Otherwise, instances are only connected by the default VPC subnet.

-     An InfiniBand fabric for the cluster is selected automatically.
-     If you prefer to use some specific fabrics, configure them in the
+     An InfiniBand fabric for the cluster is selected automatically. If you prefer to use specific fabrics, configure them in the
      [backend settings](../reference/server/config.yml.md#nebius).

- > The `cluster` placement is supported only for `aws`, `azure`, `gcp`, `nebius`, `oci`, and `vultr`
- > backends.
+ The `cluster` placement is supported for the `aws`, `azure`, `gcp`, `nebius`, `oci`, and `vultr`
+ backends.
+
+ > For more details on optimal inter-node connectivity, read the [Clusters](../guides/clusters.md) guide.

  #### Resources
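
For context, a minimal cloud fleet configuration requesting `cluster` placement might look like the sketch below; the fleet name, node count, and GPU spec are illustrative, not part of the diff.

```yaml
type: fleet
name: my-cluster-fleet    # illustrative name

# Provision interconnected instances in the same backend and region
nodes: 4
placement: cluster

resources:
  gpu: H100:8             # illustrative GPU spec
```

Applying such a file with `dstack apply` provisions the instances, and the interconnect (EFA, GPUDirect, or InfiniBand) is configured automatically where the backend and instance type support it, as described above.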

@@ -304,13 +313,14 @@ Once the status of instances changes to `idle`, they can be used by dev environm
  If the hosts are interconnected (i.e. share the same network), set `placement` to `cluster`.
  This is required if you'd like to use the fleet for [distributed tasks](tasks.md#distributed-tasks).

- ##### Network
-
- By default, `dstack` automatically detects the network shared by the hosts.
- However, it's possible to configure it explicitly via
- the [`network`](../reference/dstack.yml/fleet.md#network) property.
+ ??? info "Network"
+     By default, `dstack` automatically detects the network shared by the hosts.
+     However, it's possible to configure it explicitly via
+     the [`network`](../reference/dstack.yml/fleet.md#network) property.
+
+     [//]: # (TODO: Provide an example and more detail)

- [//]: # (TODO: Provide an example and more detail)
+ > For more details on optimal inter-node connectivity, read the [Clusters](../guides/clusters.md) guide.

  #### Blocks { #ssh-blocks }
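
For the SSH case, a fleet over pre-provisioned hosts with `cluster` placement might be sketched as follows; the host IPs, user, and key path are illustrative, and the `network` property mentioned above is omitted since `dstack` auto-detects the shared network by default.

```yaml
type: fleet
name: my-ssh-fleet        # illustrative name

placement: cluster

ssh_config:
  user: ubuntu            # illustrative SSH user
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 10.0.0.10           # illustrative host IPs
    - 10.0.0.11
```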

@@ -463,5 +473,6 @@ Alternatively, you can delete a fleet by passing the fleet name to `dstack flee
  To terminate and delete specific instances from a fleet, pass `-i INSTANCE_NUM`.

  !!! info "What's next?"
-     1. Read about [dev environments](dev-environments.md), [tasks](tasks.md), and
+     1. Check [dev environments](dev-environments.md), [tasks](tasks.md), and
         [services](services.md)
+     2. Read the [Clusters](../guides/clusters.md) guide

docs/docs/concepts/tasks.md (1 addition, 2 deletions)

@@ -137,8 +137,7 @@ resources:

  Nodes can communicate using their private IP addresses.
  Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other
- [System environment variables](#system-environment-variables)
- to discover IP addresses and other details.
+ [System environment variables](#system-environment-variables) for inter-node communication.

  ??? info "Network interface"
      Distributed frameworks usually detect the correct network interface automatically,
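
To illustrate how these variables are typically consumed, a distributed task might pass them to `torchrun` roughly like the sketch below; the task name, entrypoint script, port, and resource spec are illustrative and assume the configured image provides PyTorch.

```yaml
type: task
name: train-distrib       # illustrative name
nodes: 2

commands:
  - |
    torchrun \
      --nproc_per_node=$DSTACK_GPUS_PER_NODE \
      --nnodes=$DSTACK_NODES_NUM \
      --node_rank=$DSTACK_NODE_RANK \
      --master_addr=$DSTACK_MASTER_NODE_IP \
      --master_port=29500 \
      train.py              # illustrative entrypoint

resources:
  gpu: 24GB:2               # illustrative
```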

docs/docs/concepts/volumes.md (5 additions, 4 deletions)

@@ -4,12 +4,12 @@ Volumes enable data persistence between runs of dev environments, tasks, and ser

  `dstack` supports two kinds of volumes:

- * [Network volumes](#network-volumes) &mdash; provisioned via backends and mounted to specific container directories.
+ * [Network volumes](#network) &mdash; provisioned via backends and mounted to specific container directories.
    Ideal for persistent storage.
- * [Instance volumes](#instance-volumes) &mdash; bind directories on the host instance to container directories.
+ * [Instance volumes](#instance) &mdash; bind directories on the host instance to container directories.
    Useful as a cache for cloud fleets or for persistent storage with SSH fleets.

- ## Network volumes
+ ## Network volumes { #network }

  Network volumes are currently supported for the `aws`, `gcp`, and `runpod` backends.

@@ -130,6 +130,7 @@ and its contents will persist across runs.

  `dstack` will attach one of the volumes based on the region and backend of the run.

+ <span id="distributed-tasks"></span>
  ??? info "Distributed tasks"
      When using single-attach volumes such as AWS EBS with distributed tasks,
      you can attach different volumes to different nodes using `dstack` variable interpolation:
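
A hedged sketch of what such interpolation might look like in practice; the volume names and mount path are illustrative:

```yaml
type: task
nodes: 2

commands:
  - python train.py                          # illustrative entrypoint

volumes:
  # Each node attaches its own single-attach volume,
  # e.g. data-volume-0 on node 0 and data-volume-1 on node 1
  - name: data-volume-${{ dstack.node_rank }}
    path: /volume_data
```
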
@@ -221,7 +222,7 @@ If you've registered an existing volume, it will be de-registered with `dstack`
  ??? info "Can I attach network volumes to multiple runs or instances?"
      You can mount a volume in multiple runs. This feature is currently supported only by the `runpod` backend.

- ## Instance volumes
+ ## Instance volumes { #instance }

  Instance volumes allow mapping any directory on the instance where the run is executed to any path inside the container.
  This means that the data in instance volumes is persisted only if the run is executed on the same instance.

docs/docs/guides/clusters.md (new file, 80 additions)

@@ -0,0 +1,80 @@
# Clusters

A cluster is a fleet with its `placement` set to `cluster`. This configuration ensures that the instances within the fleet are interconnected, enabling fast inter-node communication—crucial for tasks such as efficient distributed training.

## Fleets

Ensure a fleet is created before you run any distributed task. This can be either an SSH fleet or a cloud fleet.

### SSH fleets

SSH fleets can be used to create a fleet out of existing bare-metal servers or VMs, e.g. if they are already pre-provisioned or set up on-premises.

> For SSH fleets, fast interconnect is supported provided that the hosts are pre-configured with the appropriate interconnect drivers.

### Cloud fleets

Cloud fleets allow provisioning interconnected clusters across supported backends.
For cloud fleets, fast interconnect is currently supported only on the `aws`, `gcp`, and `nebius` backends.

=== "AWS"
    When you create a cloud fleet with AWS, [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} networking is automatically configured if it's supported for the corresponding instance type.

    !!! info "Backend configuration"
        Note that EFA requires `public_ips` to be set to `false` in the `aws` backend configuration.
        Refer to the [EFA](../../blog/posts/efa.md) example for more details.

=== "GCP"
    When you create a cloud fleet with GCP, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured for the A3 Mega and A3 High instance types.

    !!! info "Backend configuration"
        Note that GPUDirect-TCPXO and GPUDirect-TCPX require `extra_vpcs` to be configured in the `gcp` backend configuration.
        Refer to the [A3 Mega](../../examples/clusters/a3mega/index.md) and
        [A3 High](../../examples/clusters/a3high/index.md) examples for more details.

=== "Nebius"
    When you create a cloud fleet with Nebius, [InfiniBand :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} networking is automatically configured if it's supported for the corresponding instance type.

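
As an aside, the backend settings these tabs refer to live in the `dstack` server's `config.yml`. A rough, hedged sketch is shown below; the project name, credentials, and VPC names are illustrative, and the exact `extra_vpcs` layout should be taken from the linked A3 examples.

```yaml
projects:
  - name: main
    backends:
      - type: aws
        creds:
          type: default
        # Required for EFA-enabled cluster fleets
        public_ips: false
      - type: gcp
        project_id: my-gcp-project        # illustrative
        creds:
          type: default
        # Required for GPUDirect-TCPXO / GPUDirect-TCPX
        extra_vpcs:
          - my-extra-vpc-1                # illustrative VPC names
          - my-extra-vpc-2
```
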
> To request fast interconnect support for other backends,
file an [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues){:target="_blank"}.

## NCCL/RCCL tests

To test the interconnect of a created fleet, run the [NCCL](../../examples/clusters/nccl-tests/index.md)
(for NVIDIA) or [RCCL](../../examples/clusters/rccl-tests/index.md) (for AMD) tests.

## Distributed tasks

A distributed task is a task with `nodes` set to `2` or more. In this case, `dstack` first ensures a
suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up,
`dstack` starts the rest of the nodes and runs the task container on each of them.

Within the task's `commands`, it's possible to use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other
[system environment variables](../concepts/tasks.md#system-environment-variables) for inter-node communication.

Refer to [distributed tasks](../concepts/tasks.md#distributed-tasks) for an example.

!!! info "Retry policy"
    By default, if any of the nodes fails, `dstack` terminates the entire run. Configure a [retry policy](../concepts/tasks.md#retry-policy) to restart the run if any node fails, as sketched below.

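A minimal sketch of such a retry policy on a distributed task; the task name, entrypoint, duration, and node count are illustrative:

```yaml
type: task
name: train-distrib       # illustrative name
nodes: 4

commands:
  - python train.py       # illustrative entrypoint

retry:
  # Restart the run if any node fails with an error, retrying for up to 1 hour
  on_events: [error]
  duration: 1h
```
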
## Volumes

### Network volumes

Currently, no backend supports multi-attach network volumes for distributed tasks. However, single-attach volumes can be used by leveraging volume name [interpolation syntax](../concepts/volumes.md#distributed-tasks). This approach mounts a separate single-attach volume to each node.

### Instance volumes

Instance volumes enable mounting any folder from the host into the container, allowing data persistence during distributed tasks.

Instance volumes can be used to mount:

* Regular folders (data persists only while the fleet exists)
* Mount points of shared filesystems (e.g., an NFS share mounted manually on the hosts)

Refer to [instance volumes](../concepts/volumes.md#instance) for an example.

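For illustration, mounting a host folder (for example, a manually mounted NFS share) into each node's container might look like the following sketch; the paths are illustrative:

```yaml
type: task
nodes: 2

commands:
  - python train.py                          # illustrative entrypoint

volumes:
  # Instance volume short syntax: <path on the host>:<path in the container>
  - /mnt/nfs/checkpoints:/checkpoints
```
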
!!! info "What's next?"
    1. Read about [distributed tasks](../concepts/tasks.md#distributed-tasks), [fleets](../concepts/fleets.md), and [volumes](../concepts/volumes.md)
    2. Browse the [Clusters](../../examples.md#clusters) examples

docs/docs/guides/protips.md (2 additions, 2 deletions)

@@ -36,9 +36,9 @@ unlimited).
  ## Volumes

  To persist data across runs, it is recommended to use volumes.
- `dstack` supports two types of volumes: [network](../concepts/volumes.md#network-volumes)
+ `dstack` supports two types of volumes: [network](../concepts/volumes.md#network)
  (for persisting data even if the instance is interrupted)
- and [instance](../concepts/volumes.md#instance-volumes) (useful for persisting cached data across runs while the instance remains active).
+ and [instance](../concepts/volumes.md#instance) (useful for persisting cached data across runs while the instance remains active).

  > If you use [SSH fleets](../concepts/fleets.md#ssh), you can mount network storage (e.g., NFS or SMB) to the hosts and access it in runs via instance volumes.

docs/docs/guides/troubleshooting.md (2 additions, 2 deletions)

@@ -87,7 +87,7 @@ Examples: `gpu: amd` (one AMD GPU), `gpu: A10:4..8` (4 to 8 A10 GPUs),

  #### Cause 6: Network volumes

- If your run configuration uses [network volumes](../concepts/volumes.md#network-volumes),
+ If your run configuration uses [network volumes](../concepts/volumes.md#network),
  `dstack` will only select instances from the same backend and region as the volumes.
  For AWS, the availability zone of the volume and the instance should also match.

@@ -97,7 +97,7 @@ Some `dstack` features are not supported by all backends. If your configuration
  one of these features, `dstack` will only select offers from the backends that support it.

  - [Cloud fleet](../concepts/fleets.md#cloud) configurations,
-   [Instance volumes](../concepts/volumes.md#instance-volumes),
+   [Instance volumes](../concepts/volumes.md#instance),
    and [Privileged containers](../reference/dstack.yml/dev-environment.md#privileged)
    are supported by all backends except `runpod`, `vastai`, and `kubernetes`.
  - [Clusters](../concepts/fleets.md#cloud-placement)

examples/clusters/nccl-tests/README.md (2 additions, 2 deletions)

@@ -63,10 +63,10 @@ resources:

  !!! info "MPI"
      NCCL tests rely on MPI to run on multiple processes. The master node (`DSTACK_NODE_RANK=0`) generates `hostfile` (using `DSTACK_NODES_IPS`)
-     and waits until worker nodes are accessible via MPI.
+     and waits until other nodes are accessible via MPI.
      Then, it executes `/nccl-tests/build/all_reduce_perf` across all GPUs.

-     Worker nodes use a `FIFO` pipe to wait for until the MPI run is finished.
+     Non-master nodes use a `FIFO` pipe to wait until the MPI run is finished.

      There is an open [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/2467){:target="_blank"} to simplify the use of MPI with distributed tasks.

examples/clusters/rccl-tests/.dstack.yml (2 additions, 2 deletions)

@@ -33,7 +33,7 @@ commands:
      if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
        break
      fi
-     echo 'Waiting for worker nodes...'
+     echo 'Waiting for other nodes...'
      sleep 5
    done
    # Run NCCL Tests

@@ -45,7 +45,7 @@ commands:
      -x NCCL_IB_GID_INDEX=3 \
      -x NCCL_IB_DISABLE=0 \
      ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 20 -c 0;
-   # Notify worker nodes the MPI run is finished
+   # Notify other nodes the MPI run is finished
    ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
  else
    mkfifo ${FIFO}
