Commit 3684467

Deploying to gh-pages from @ dstackai/dstack@9565045 🚀
1 parent c3a9a65 commit 3684467

19 files changed

+864
-473
lines changed

assets/images/social/examples.png

285 Bytes

assets/images/social/partners.png

92 Bytes

blog/efa/index.html

Lines changed: 1 addition & 1 deletion
@@ -4082,7 +4082,7 @@ <h2 id="submit-the-task">Submit the task<a class="headerlink" href="#submit-the-
40824082
<span class="c1"># The size of the cluster</span>
40834083
<span class="nt">nodes</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">2</span>
40844084

4085-
<span class="nt">python</span><span class="p">:</span><span class="w"> </span><span class="s">&quot;3.12&quot;</span>
4085+
<span class="nt">python</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">3.12</span>
40864086

40874087
<span class="c1"># Commands to run on each node</span>
40884088
<span class="nt">commands</span><span class="p">:</span>

docs/concepts/dev-environments/index.html

Lines changed: 139 additions & 91 deletions
Large diffs are not rendered by default.

docs/concepts/repos/index.html

Lines changed: 3 additions & 1 deletion
@@ -4314,7 +4314,9 @@ <h3 id="git-credentials">Git credentials<a class="headerlink" href="#git-credent
43144314
<h3 id="gitignore-and-folder-size">.gitignore and folder size<a class="headerlink" href="#gitignore-and-folder-size" title="Permanent link">&para;</a></h3>
43154315
<p>If the directory is a cloned Git repo, <a href="../../reference/cli/dstack/apply/"><code>dstack apply</code></a> uploads only local changes to the <code>dstack</code> server.
43164316
If the directory is not a cloned Git repo, it uploads the entire directory.</p>
4317-
<p>Uploads are limited to 2MB. Use <code>.gitignore</code> to exclude unnecessary files from being uploaded.</p>
4317+
<p>Uploads are limited to 2MB. Use <code>.gitignore</code> to exclude unnecessary files from being uploaded.
4318+
You can set the <code>DSTACK_SERVER_CODE_UPLOAD_LIMIT</code> environment variable to increase the default server limit.
4319+
Increasing the limit is recommended only if you <a href="../../guides/server-deployment/">configure an object storage</a>.</p>
43184320
<h3 id="initialize-as-a-local-directory">Initialize as a local directory<a class="headerlink" href="#initialize-as-a-local-directory" title="Permanent link">&para;</a></h3>
43194321
<p>If the directory is a cloned Git repo but you want to initialize it as a regular local directory,
43204322
use <code>--local</code> with <a href="../../reference/cli/dstack/init/"><code>dstack init</code></a>.</p>
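
The hunk above says uploads are limited to 2MB and that a directory which is not a cloned Git repo is uploaded in full. A minimal Python sketch for checking such a directory locally before running `dstack apply` follows. It assumes that "2MB" means 2 * 1024 * 1024 bytes and that the plain sum of file sizes approximates the upload size; neither is confirmed by the docs. The helper names `directory_size` and `fits_upload_limit` are hypothetical, and the sketch does not apply `.gitignore` rules.

```python
import os

# Assumption: "2MB" means 2 * 1024 * 1024 bytes; the docs do not spell out the unit.
ASSUMED_LIMIT_BYTES = 2 * 1024 * 1024


def directory_size(root: str) -> int:
    """Sum the sizes of all regular files under root (ignores .gitignore rules)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a link to a large file elsewhere is not counted.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_upload_limit(root: str, limit: int = ASSUMED_LIMIT_BYTES) -> bool:
    """Return True if the directory's total file size is within the limit."""
    return directory_size(root) <= limit
```

If the check fails, the documented options are to exclude files via `.gitignore` or to raise the server limit with `DSTACK_SERVER_CODE_UPLOAD_LIMIT`.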

docs/concepts/services/index.html

Lines changed: 81 additions & 27 deletions
Large diffs are not rendered by default.

docs/concepts/tasks/index.html

Lines changed: 113 additions & 55 deletions
Large diffs are not rendered by default.

docs/guides/clusters/index.html

Lines changed: 32 additions & 29 deletions
@@ -1173,11 +1173,11 @@
11731173
</li>
11741174

11751175
<li class="md-nav__item">
1176-
<a href="#ncclrccl-tests" class="md-nav__link">
1176+
<a href="#distributed-tasks" class="md-nav__link">
11771177
<span class="md-ellipsis">
11781178

11791179
<span class="md-typeset">
1180-
NCCL/RCCL tests
1180+
Distributed tasks
11811181
</span>
11821182

11831183
</span>
@@ -1186,11 +1186,11 @@
11861186
</li>
11871187

11881188
<li class="md-nav__item">
1189-
<a href="#distributed-tasks" class="md-nav__link">
1189+
<a href="#ncclrccl-tests" class="md-nav__link">
11901190
<span class="md-ellipsis">
11911191

11921192
<span class="md-typeset">
1193-
Distributed tasks
1193+
NCCL/RCCL tests
11941194
</span>
11951195

11961196
</span>
@@ -1213,11 +1213,11 @@
12131213
<ul class="md-nav__list">
12141214

12151215
<li class="md-nav__item">
1216-
<a href="#network-volumes" class="md-nav__link">
1216+
<a href="#instance-volumes" class="md-nav__link">
12171217
<span class="md-ellipsis">
12181218

12191219
<span class="md-typeset">
1220-
Network volumes
1220+
Instance volumes
12211221
</span>
12221222

12231223
</span>
@@ -1226,11 +1226,11 @@
12261226
</li>
12271227

12281228
<li class="md-nav__item">
1229-
<a href="#instance-volumes" class="md-nav__link">
1229+
<a href="#network-volumes" class="md-nav__link">
12301230
<span class="md-ellipsis">
12311231

12321232
<span class="md-typeset">
1233-
Instance volumes
1233+
Network volumes
12341234
</span>
12351235

12361236
</span>
@@ -4064,11 +4064,11 @@
40644064
</li>
40654065

40664066
<li class="md-nav__item">
4067-
<a href="#ncclrccl-tests" class="md-nav__link">
4067+
<a href="#distributed-tasks" class="md-nav__link">
40684068
<span class="md-ellipsis">
40694069

40704070
<span class="md-typeset">
4071-
NCCL/RCCL tests
4071+
Distributed tasks
40724072
</span>
40734073

40744074
</span>
@@ -4077,11 +4077,11 @@
40774077
</li>
40784078

40794079
<li class="md-nav__item">
4080-
<a href="#distributed-tasks" class="md-nav__link">
4080+
<a href="#ncclrccl-tests" class="md-nav__link">
40814081
<span class="md-ellipsis">
40824082

40834083
<span class="md-typeset">
4084-
Distributed tasks
4084+
NCCL/RCCL tests
40854085
</span>
40864086

40874087
</span>
@@ -4104,11 +4104,11 @@
41044104
<ul class="md-nav__list">
41054105

41064106
<li class="md-nav__item">
4107-
<a href="#network-volumes" class="md-nav__link">
4107+
<a href="#instance-volumes" class="md-nav__link">
41084108
<span class="md-ellipsis">
41094109

41104110
<span class="md-typeset">
4111-
Network volumes
4111+
Instance volumes
41124112
</span>
41134113

41144114
</span>
@@ -4117,11 +4117,11 @@
41174117
</li>
41184118

41194119
<li class="md-nav__item">
4120-
<a href="#instance-volumes" class="md-nav__link">
4120+
<a href="#network-volumes" class="md-nav__link">
41214121
<span class="md-ellipsis">
41224122

41234123
<span class="md-typeset">
4124-
Instance volumes
4124+
Network volumes
41254125
</span>
41264126

41274127
</span>
@@ -4217,16 +4217,16 @@
42174217

42184218

42194219
<h1 id="clusters">Clusters<a class="headerlink" href="#clusters" title="Permanent link">&para;</a></h1>
4220-
<p>A cluster is a fleet with its <code>placement</code> set to <code>cluster</code>. This configuration ensures that the instances within the fleet are interconnected, enabling fast inter-node communication—crucial for tasks such as efficient distributed training.</p>
4220+
<p>A cluster is a <a href="../../concepts/fleets/">fleet</a> with its <code>placement</code> set to <code>cluster</code>. This configuration ensures that the instances within the fleet are interconnected, enabling fast inter-node communication—crucial for tasks such as efficient distributed training.</p>
42214221
<h2 id="fleets">Fleets<a class="headerlink" href="#fleets" title="Permanent link">&para;</a></h2>
42224222
<p>Ensure a fleet is created before you run any distributed task. This can be either an SSH fleet or a cloud fleet.</p>
42234223
<h3 id="ssh-fleets">SSH fleets<a class="headerlink" href="#ssh-fleets" title="Permanent link">&para;</a></h3>
4224-
<p>SSH fleets can be used to create a fleet out of existing baremetals or VMs, e.g. if they are already pre-provisioned, or set up on-premises.</p>
4224+
<p><a href="../../concepts/fleets/#ssh">SSH fleets</a> can be used to create a fleet out of existing baremetals or VMs, e.g. if they are already pre-provisioned, or set up on-premises.</p>
42254225
<blockquote>
42264226
<p>For SSH fleets, fast interconnect is supported provided that the hosts are pre-configured with the appropriate interconnect drivers.</p>
42274227
</blockquote>
42284228
<h3 id="cloud-fleets">Cloud fleets<a class="headerlink" href="#cloud-fleets" title="Permanent link">&para;</a></h3>
4229-
<p>Cloud fleets allow to provision interconnected clusters across supported backends.
4229+
<p><a href="../../concepts/fleets/#cloud">Cloud fleets</a> allow to provision interconnected clusters across supported backends.
42304230
For cloud fleets, fast interconnect is currently supported only on the <code>aws</code>, <code>gcp</code>, and <code>nebius</code> backends.</p>
42314231
<div class="tabbed-set tabbed-alternate" data-tabs="1:3"><input checked="checked" id="aws" name="__tabbed_1" type="radio" /><input id="gcp" name="__tabbed_1" type="radio" /><input id="nebius" name="__tabbed_1" type="radio" /><div class="tabbed-labels"><label for="aws">AWS</label><label for="gcp">GCP</label><label for="nebius">Nebius</label></div>
42324232
<div class="tabbed-content">
@@ -4256,31 +4256,34 @@ <h3 id="cloud-fleets">Cloud fleets<a class="headerlink" href="#cloud-fleets" tit
42564256
<p>To request fast interconnect support for other backends,
42574257
file an <a href="https://github.com/dstackai/dstack/issues" target="_ blank">issue <span class="twemoji external"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="m11.93 5 2.83 2.83L5 17.59 6.42 19l9.76-9.75L19 12.07V5z"/></svg></span></a>. </p>
42584258
</blockquote>
4259-
<h2 id="ncclrccl-tests">NCCL/RCCL tests<a class="headerlink" href="#ncclrccl-tests" title="Permanent link">&para;</a></h2>
4260-
<p>To test the interconnect of a created fleet, ensure you run <a href="../../../examples/clusters/nccl-tests/">NCCL</a>
4261-
(for NVIDIA) or <a href="../../../examples/clusters/rccl-tests/">RCCL</a> (for AMD) tests.</p>
42624259
<h2 id="distributed-tasks">Distributed tasks<a class="headerlink" href="#distributed-tasks" title="Permanent link">&para;</a></h2>
42634260
<p>A distributed task is a task with <code>nodes</code> set to a value greater than <code>1</code>. In this case, <code>dstack</code> first ensures a
4264-
suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up,
4265-
<code>dstack</code> starts the rest of the nodes and runs the task container on each of them.</p>
4261+
suitable fleet is available, then selects the master node (to obtain its IP) and finally runs jobs on each node.</p>
42664262
<p>Within the task's <code>commands</code>, it's possible to use <code>DSTACK_MASTER_NODE_IP</code>, <code>DSTACK_NODES_IPS</code>, <code>DSTACK_NODE_RANK</code>, and other
42674263
<a href="../../concepts/tasks/#system-environment-variables">system environment variables</a> for inter-node communication.</p>
4268-
<p>Refer to <a href="../../concepts/tasks/#distributed-tasks">distributed tasks</a> for an example.</p>
4264+
<details class="info">
4265+
<summary>MPI</summary>
4266+
<p>If you want to use MPI, you can set <code>startup_order</code> to <code>workers-first</code> and <code>stop_criteria</code> to <code>master-done</code>, and use <code>DSTACK_MPI_HOSTFILE</code>.
4267+
See the <a href="../../../examples/clusters/nccl-tests/">NCCL</a> or <a href="../../../examples/clusters/rccl-tests/">RCCL</a> examples.</p>
4268+
</details>
42694269
<div class="admonition info">
42704270
<p class="admonition-title">Retry policy</p>
42714271
<p>By default, if any of the nodes fails, <code>dstack</code> terminates the entire run. Configure a <a href="../../concepts/tasks/#retry-policy">retry policy</a> to restart the run if any node fails.</p>
42724272
</div>
4273+
<p>Refer to <a href="../../concepts/tasks/#distributed-tasks">distributed tasks</a> for an example.</p>
4274+
<h2 id="ncclrccl-tests">NCCL/RCCL tests<a class="headerlink" href="#ncclrccl-tests" title="Permanent link">&para;</a></h2>
4275+
<p>To test the interconnect of a created fleet, ensure you run <a href="../../../examples/clusters/nccl-tests/">NCCL</a>
4276+
(for NVIDIA) or <a href="../../../examples/clusters/rccl-tests/">RCCL</a> (for AMD) tests using MPI.</p>
42734277
<h2 id="volumes">Volumes<a class="headerlink" href="#volumes" title="Permanent link">&para;</a></h2>
4274-
<h3 id="network-volumes">Network volumes<a class="headerlink" href="#network-volumes" title="Permanent link">&para;</a></h3>
4275-
<p>Currently, no backend supports multi-attach network volumes for distributed tasks. However, single-attach volumes can be used by leveraging volume name <a href="../../concepts/volumes/#distributed-tasks">interpolation syntax</a>. This approach mounts a separate single-attach volume to each node.</p>
42764278
<h3 id="instance-volumes">Instance volumes<a class="headerlink" href="#instance-volumes" title="Permanent link">&para;</a></h3>
4277-
<p>Instance volumes enable mounting any folder from the host into the container, allowing data persistence during distributed tasks.</p>
4279+
<p><a href="../../concepts/volumes/#instance">Instance volumes</a> enable mounting any folder from the host into the container, allowing data persistence during distributed tasks.</p>
42784280
<p>Instance volumes can be used to mount:</p>
42794281
<ul>
42804282
<li>Regular folders (data persists only while the fleet exists)</li>
42814283
<li>Folders that are mounts of shared filesystems (e.g., manually mounted shared filesystems).</li>
42824284
</ul>
4283-
<p>Refer to <a href="../../concepts/volumes/#instance">instance volumes</a> for an example.</p>
4285+
<h3 id="network-volumes">Network volumes<a class="headerlink" href="#network-volumes" title="Permanent link">&para;</a></h3>
4286+
<p>Currently, no backend supports multi-attach <a href="../../concepts/volumes/#network">network volumes</a> for distributed tasks. However, single-attach volumes can be used by leveraging volume name <a href="../../concepts/volumes/#distributed-tasks">interpolation syntax</a>. This approach mounts a separate single-attach volume to each node.</p>
42844287
<div class="admonition info">
42854288
<p class="admonition-title">What's next?</p>
42864289
<ol>
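
The "Distributed tasks" paragraph in the `docs/guides/clusters/index.html` hunk above names `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, and `DSTACK_NODE_RANK` as variables available inside a task's `commands`. A minimal Python sketch of how a script started from `commands` might read them follows. It assumes that `DSTACK_NODES_IPS` is a whitespace- or newline-separated list of addresses and that rank `0` denotes the master node; the diff confirms neither. The `NodeInfo` class and `read_node_info` function are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class NodeInfo:
    master_ip: str
    node_ips: List[str]
    rank: int

    @property
    def is_master(self) -> bool:
        # Assumption: the master node is the one with rank 0.
        return self.rank == 0


def read_node_info(env: Mapping[str, str] = os.environ) -> NodeInfo:
    """Collect the inter-node settings that dstack exposes to the task."""
    # Assumption: addresses are separated by whitespace or newlines.
    node_ips = env["DSTACK_NODES_IPS"].split()
    return NodeInfo(
        master_ip=env["DSTACK_MASTER_NODE_IP"],
        node_ips=node_ips,
        rank=int(env["DSTACK_NODE_RANK"]),
    )
```

Passing the environment as a parameter keeps the function testable outside a running task, for example with a plain dictionary in place of `os.environ`.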
