|
1173 | 1173 | </li> |
1174 | 1174 |
|
1175 | 1175 | <li class="md-nav__item"> |
1176 | | - <a href="#ncclrccl-tests" class="md-nav__link"> |
| 1176 | + <a href="#distributed-tasks" class="md-nav__link"> |
1177 | 1177 | <span class="md-ellipsis"> |
1178 | 1178 |
|
1179 | 1179 | <span class="md-typeset"> |
1180 | | - NCCL/RCCL tests |
| 1180 | + Distributed tasks |
1181 | 1181 | </span> |
1182 | 1182 |
|
1183 | 1183 | </span> |
|
1186 | 1186 | </li> |
1187 | 1187 |
|
1188 | 1188 | <li class="md-nav__item"> |
1189 | | - <a href="#distributed-tasks" class="md-nav__link"> |
| 1189 | + <a href="#ncclrccl-tests" class="md-nav__link"> |
1190 | 1190 | <span class="md-ellipsis"> |
1191 | 1191 |
|
1192 | 1192 | <span class="md-typeset"> |
1193 | | - Distributed tasks |
| 1193 | + NCCL/RCCL tests |
1194 | 1194 | </span> |
1195 | 1195 |
|
1196 | 1196 | </span> |
|
1213 | 1213 | <ul class="md-nav__list"> |
1214 | 1214 |
|
1215 | 1215 | <li class="md-nav__item"> |
1216 | | - <a href="#network-volumes" class="md-nav__link"> |
| 1216 | + <a href="#instance-volumes" class="md-nav__link"> |
1217 | 1217 | <span class="md-ellipsis"> |
1218 | 1218 |
|
1219 | 1219 | <span class="md-typeset"> |
1220 | | - Network volumes |
| 1220 | + Instance volumes |
1221 | 1221 | </span> |
1222 | 1222 |
|
1223 | 1223 | </span> |
|
1226 | 1226 | </li> |
1227 | 1227 |
|
1228 | 1228 | <li class="md-nav__item"> |
1229 | | - <a href="#instance-volumes" class="md-nav__link"> |
| 1229 | + <a href="#network-volumes" class="md-nav__link"> |
1230 | 1230 | <span class="md-ellipsis"> |
1231 | 1231 |
|
1232 | 1232 | <span class="md-typeset"> |
1233 | | - Instance volumes |
| 1233 | + Network volumes |
1234 | 1234 | </span> |
1235 | 1235 |
|
1236 | 1236 | </span> |
|
4064 | 4064 | </li> |
4065 | 4065 |
|
4066 | 4066 | <li class="md-nav__item"> |
4067 | | - <a href="#ncclrccl-tests" class="md-nav__link"> |
| 4067 | + <a href="#distributed-tasks" class="md-nav__link"> |
4068 | 4068 | <span class="md-ellipsis"> |
4069 | 4069 |
|
4070 | 4070 | <span class="md-typeset"> |
4071 | | - NCCL/RCCL tests |
| 4071 | + Distributed tasks |
4072 | 4072 | </span> |
4073 | 4073 |
|
4074 | 4074 | </span> |
|
4077 | 4077 | </li> |
4078 | 4078 |
|
4079 | 4079 | <li class="md-nav__item"> |
4080 | | - <a href="#distributed-tasks" class="md-nav__link"> |
| 4080 | + <a href="#ncclrccl-tests" class="md-nav__link"> |
4081 | 4081 | <span class="md-ellipsis"> |
4082 | 4082 |
|
4083 | 4083 | <span class="md-typeset"> |
4084 | | - Distributed tasks |
| 4084 | + NCCL/RCCL tests |
4085 | 4085 | </span> |
4086 | 4086 |
|
4087 | 4087 | </span> |
|
4104 | 4104 | <ul class="md-nav__list"> |
4105 | 4105 |
|
4106 | 4106 | <li class="md-nav__item"> |
4107 | | - <a href="#network-volumes" class="md-nav__link"> |
| 4107 | + <a href="#instance-volumes" class="md-nav__link"> |
4108 | 4108 | <span class="md-ellipsis"> |
4109 | 4109 |
|
4110 | 4110 | <span class="md-typeset"> |
4111 | | - Network volumes |
| 4111 | + Instance volumes |
4112 | 4112 | </span> |
4113 | 4113 |
|
4114 | 4114 | </span> |
|
4117 | 4117 | </li> |
4118 | 4118 |
|
4119 | 4119 | <li class="md-nav__item"> |
4120 | | - <a href="#instance-volumes" class="md-nav__link"> |
| 4120 | + <a href="#network-volumes" class="md-nav__link"> |
4121 | 4121 | <span class="md-ellipsis"> |
4122 | 4122 |
|
4123 | 4123 | <span class="md-typeset"> |
4124 | | - Instance volumes |
| 4124 | + Network volumes |
4125 | 4125 | </span> |
4126 | 4126 |
|
4127 | 4127 | </span> |
|
4217 | 4217 |
|
4218 | 4218 |
|
4219 | 4219 | <h1 id="clusters">Clusters<a class="headerlink" href="#clusters" title="Permanent link">¶</a></h1> |
4220 | | -<p>A cluster is a fleet with its <code>placement</code> set to <code>cluster</code>. This configuration ensures that the instances within the fleet are interconnected, enabling fast inter-node communication—crucial for tasks such as efficient distributed training.</p> |
| 4220 | +<p>A cluster is a <a href="../../concepts/fleets/">fleet</a> with its <code>placement</code> set to <code>cluster</code>. This configuration ensures that the instances within the fleet are interconnected, enabling fast inter-node communication—crucial for tasks such as efficient distributed training.</p> |
4221 | 4221 | <h2 id="fleets">Fleets<a class="headerlink" href="#fleets" title="Permanent link">¶</a></h2> |
4222 | 4222 | <p>Ensure a fleet is created before you run any distributed task. This can be either an SSH fleet or a cloud fleet.</p> |
4223 | 4223 | <h3 id="ssh-fleets">SSH fleets<a class="headerlink" href="#ssh-fleets" title="Permanent link">¶</a></h3> |
4224 | | -<p>SSH fleets can be used to create a fleet out of existing baremetals or VMs, e.g. if they are already pre-provisioned, or set up on-premises.</p> |
| 4224 | +<p><a href="../../concepts/fleets/#ssh">SSH fleets</a> can be used to create a fleet out of existing baremetals or VMs, e.g. if they are already pre-provisioned, or set up on-premises.</p> |
4225 | 4225 | <blockquote> |
4226 | 4226 | <p>For SSH fleets, fast interconnect is supported provided that the hosts are pre-configured with the appropriate interconnect drivers.</p> |
4227 | 4227 | </blockquote> |
4228 | 4228 | <h3 id="cloud-fleets">Cloud fleets<a class="headerlink" href="#cloud-fleets" title="Permanent link">¶</a></h3> |
4229 | | -<p>Cloud fleets allow to provision interconnected clusters across supported backends. |
| 4229 | +<p><a href="../../concepts/fleets/#cloud">Cloud fleets</a> allow you to provision interconnected clusters across supported backends. |
4230 | 4230 | For cloud fleets, fast interconnect is currently supported only on the <code>aws</code>, <code>gcp</code>, and <code>nebius</code> backends.</p> |
4231 | 4231 | <div class="tabbed-set tabbed-alternate" data-tabs="1:3"><input checked="checked" id="aws" name="__tabbed_1" type="radio" /><input id="gcp" name="__tabbed_1" type="radio" /><input id="nebius" name="__tabbed_1" type="radio" /><div class="tabbed-labels"><label for="aws">AWS</label><label for="gcp">GCP</label><label for="nebius">Nebius</label></div> |
4232 | 4232 | <div class="tabbed-content"> |
@@ -4256,31 +4256,34 @@ <h3 id="cloud-fleets">Cloud fleets<a class="headerlink" href="#cloud-fleets" tit |
4256 | 4256 | <p>To request fast interconnect support for other backends, |
4257 | 4257 | file an <a href="https://github.com/dstackai/dstack/issues" target="_blank">issue <span class="twemoji external"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="m11.93 5 2.83 2.83L5 17.59 6.42 19l9.76-9.75L19 12.07V5z"/></svg></span></a>. </p> |
4258 | 4258 | </blockquote> |
4259 | | -<h2 id="ncclrccl-tests">NCCL/RCCL tests<a class="headerlink" href="#ncclrccl-tests" title="Permanent link">¶</a></h2> |
4260 | | -<p>To test the interconnect of a created fleet, ensure you run <a href="../../../examples/clusters/nccl-tests/">NCCL</a> |
4261 | | -(for NVIDIA) or <a href="../../../examples/clusters/rccl-tests/">RCCL</a> (for AMD) tests.</p> |
4262 | 4259 | <h2 id="distributed-tasks">Distributed tasks<a class="headerlink" href="#distributed-tasks" title="Permanent link">¶</a></h2> |
4263 | 4260 | <p>A distributed task is a task with <code>nodes</code> set to <code>2</code> or more. In this case, <code>dstack</code> first ensures a |
4264 | | -suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up, |
4265 | | -<code>dstack</code> starts the rest of the nodes and runs the task container on each of them.</p> |
| 4261 | +suitable fleet is available, then selects the master node (to obtain its IP) and finally runs jobs on each node.</p> |
4266 | 4262 | <p>Within the task's <code>commands</code>, it's possible to use <code>DSTACK_MASTER_NODE_IP</code>, <code>DSTACK_NODES_IPS</code>, <code>DSTACK_NODE_RANK</code>, and other |
4267 | 4263 | <a href="../../concepts/tasks/#system-environment-variables">system environment variables</a> for inter-node communication.</p> |
4268 | | -<p>Refer to <a href="../../concepts/tasks/#distributed-tasks">distributed tasks</a> for an example.</p> |
| 4264 | +<details class="info"> |
| 4265 | +<summary>MPI</summary> |
| 4266 | +<p>If you want to use MPI, set <code>startup_order</code> to <code>workers-first</code> and <code>stop_criteria</code> to <code>master-done</code>, and use <code>DSTACK_MPI_HOSTFILE</code>. |
| 4267 | +See the <a href="../../../examples/clusters/nccl-tests/">NCCL</a> or <a href="../../../examples/clusters/rccl-tests/">RCCL</a> examples.</p> |
| 4268 | +</details> |
4269 | 4269 | <div class="admonition info"> |
4270 | 4270 | <p class="admonition-title">Retry policy</p> |
4271 | 4271 | <p>By default, if any of the nodes fails, <code>dstack</code> terminates the entire run. Configure a <a href="../../concepts/tasks/#retry-policy">retry policy</a> to restart the run if any node fails.</p> |
4272 | 4272 | </div> |
| 4273 | +<p>Refer to <a href="../../concepts/tasks/#distributed-tasks">distributed tasks</a> for an example.</p> |
| 4274 | +<h2 id="ncclrccl-tests">NCCL/RCCL tests<a class="headerlink" href="#ncclrccl-tests" title="Permanent link">¶</a></h2> |
| 4275 | +<p>To test the interconnect of a created fleet, ensure you run <a href="../../../examples/clusters/nccl-tests/">NCCL</a> |
| 4276 | +(for NVIDIA) or <a href="../../../examples/clusters/rccl-tests/">RCCL</a> (for AMD) tests using MPI.</p> |
4273 | 4277 | <h2 id="volumes">Volumes<a class="headerlink" href="#volumes" title="Permanent link">¶</a></h2> |
4274 | | -<h3 id="network-volumes">Network volumes<a class="headerlink" href="#network-volumes" title="Permanent link">¶</a></h3> |
4275 | | -<p>Currently, no backend supports multi-attach network volumes for distributed tasks. However, single-attach volumes can be used by leveraging volume name <a href="../../concepts/volumes/#distributed-tasks">interpolation syntax</a>. This approach mounts a separate single-attach volume to each node.</p> |
4276 | 4278 | <h3 id="instance-volumes">Instance volumes<a class="headerlink" href="#instance-volumes" title="Permanent link">¶</a></h3> |
4277 | | -<p>Instance volumes enable mounting any folder from the host into the container, allowing data persistence during distributed tasks.</p> |
| 4279 | +<p><a href="../../concepts/volumes/#instance">Instance volumes</a> enable mounting any folder from the host into the container, allowing data persistence during distributed tasks.</p> |
4278 | 4280 | <p>Instance volumes can be used to mount:</p> |
4279 | 4281 | <ul> |
4280 | 4282 | <li>Regular folders (data persists only while the fleet exists)</li> |
4281 | 4283 | <li>Folders that are mounts of shared filesystems (e.g., manually mounted shared filesystems).</li> |
4282 | 4284 | </ul> |
4283 | | -<p>Refer to <a href="../../concepts/volumes/#instance">instance volumes</a> for an example.</p> |
| 4285 | +<h3 id="network-volumes">Network volumes<a class="headerlink" href="#network-volumes" title="Permanent link">¶</a></h3> |
| 4286 | +<p>Currently, no backend supports multi-attach <a href="../../concepts/volumes/#network">network volumes</a> for distributed tasks. However, single-attach volumes can be used by leveraging volume name <a href="../../concepts/volumes/#distributed-tasks">interpolation syntax</a>. This approach mounts a separate single-attach volume to each node.</p> |
4284 | 4287 | <div class="admonition info"> |
4285 | 4288 | <p class="admonition-title">What's next?</p> |
4286 | 4289 | <ol> |
|