docs/docs/guides/clusters.md
+11 -8 (11 additions & 8 deletions)
@@ -38,25 +38,28 @@ For cloud fleets, fast interconnect is currently supported only on the `aws`, `g
 > To request fast interconnect support for other backends,
 file an [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues){:target="_blank"}.
 
-## NCCL/RCCL tests
-
-To test the interconnect of a created fleet, ensure you run [NCCL](../../examples/clusters/nccl-tests/index.md)
-(for NVIDIA) or [RCCL](../../examples/clusters/rccl-tests/index.md) (for AMD) tests.
-
 ## Distributed tasks
 
 A distributed task is a task with `nodes` set to a value greater than `1`. In this case, `dstack` first ensures a
-suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up,
-`dstack` starts the rest of the nodes and runs the task container on each of them.
+suitable fleet is available, then selects the master node (to obtain its IP) and finally runs jobs on each node.
 
 Within the task's `commands`, it's possible to use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other
 [system environment variables](../concepts/tasks.md#system-environment-variables) for inter-node communication.
 
-Refer to [distributed tasks](../concepts/tasks.md#distributed-tasks) for an example.
+??? info "MPI"
+    If you want to use MPI, you can set `startup_order` to `workers-first` and `stop_criteria` to `master-done`, and use `DSTACK_MPI_HOSTFILE`.
+    See the [NCCL](../../examples/clusters/nccl-tests/index.md) or [RCCL](../../examples/clusters/rccl-tests/index.md) examples.
 
 !!! info "Retry policy"
     By default, if any of the nodes fails, `dstack` terminates the entire run. Configure a [retry policy](../concepts/tasks.md#retry-policy) to restart the run if any node fails.
 
+Refer to [distributed tasks](../concepts/tasks.md#distributed-tasks) for an example.
+
+## NCCL/RCCL tests
+
+To test the interconnect of a created fleet, ensure you run [NCCL](../../examples/clusters/nccl-tests/index.md)
+(for NVIDIA) or [RCCL](../../examples/clusters/rccl-tests/index.md) (for AMD) tests using MPI.
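
Note on the "Distributed tasks" wording in this change: the paragraph names `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, and `DSTACK_NODE_RANK` but does not show how a job might read them. The sketch below is one possible way to do that from Python. It is not taken from the `dstack` codebase. It assumes that `DSTACK_NODES_IPS` is a whitespace- or newline-separated list of IPs and that the master has rank `0`. The function name `rendezvous_settings` and the port `29500` are hypothetical choices for illustration only.

```python
import os
from typing import Mapping


def rendezvous_settings(env: Mapping[str, str] = os.environ, port: int = 29500) -> dict:
    """Collect the node topology exposed to a job through environment variables."""
    # Assumption: one IP per line (or space-separated); split() handles both.
    node_ips = env["DSTACK_NODES_IPS"].split()
    rank = int(env["DSTACK_NODE_RANK"])
    master_ip = env["DSTACK_MASTER_NODE_IP"]
    return {
        "node_rank": rank,
        "num_nodes": len(node_ips),
        # Assumption: the master node is the one with rank 0.
        "is_master": rank == 0,
        # The port is a hypothetical value; every node must agree on it.
        "master_endpoint": f"{master_ip}:{port}",
    }


if __name__ == "__main__":
    # Simulated values for a two-node run; in a real job these are set for you.
    demo_env = {
        "DSTACK_MASTER_NODE_IP": "10.0.0.1",
        "DSTACK_NODES_IPS": "10.0.0.1\n10.0.0.2",
        "DSTACK_NODE_RANK": "1",
    }
    print(rendezvous_settings(demo_env))
```

Each node would call this once at startup and pass the result to whatever launcher it uses, so that all nodes connect to the same master endpoint.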
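
Note on the new "MPI" admonition: it mentions `DSTACK_MPI_HOSTFILE` together with `startup_order: workers-first` and `stop_criteria: master-done`, which suggests that only the master launches the MPI program while the workers stay up. The sketch below illustrates that split under stated assumptions: the master has rank `0`, the launcher is Open MPI's `mpirun` with its `--hostfile` option, and the helper name `mpirun_argv` and the program path `./all_reduce_perf` are hypothetical.

```python
import os
from typing import List, Mapping, Optional, Sequence


def mpirun_argv(
    program: Sequence[str], env: Mapping[str, str] = os.environ
) -> Optional[List[str]]:
    """Return the mpirun command line on the master, or None on a worker."""
    # Assumption: rank 0 is the master; workers do not launch mpirun themselves.
    if int(env["DSTACK_NODE_RANK"]) != 0:
        return None
    # The hostfile path comes from the environment and lists the nodes to run on.
    return ["mpirun", "--hostfile", env["DSTACK_MPI_HOSTFILE"], *program]


if __name__ == "__main__":
    # Simulated environment; the hostfile path here is a made-up example.
    demo_env = {"DSTACK_NODE_RANK": "0", "DSTACK_MPI_HOSTFILE": "/tmp/hostfile"}
    print(mpirun_argv(["./all_reduce_perf"], demo_env))
```

On a worker the function returns `None`, so the job only needs to stay alive (for example by sleeping) until the master finishes, which is presumably what `stop_criteria: master-done` relies on.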