Commit 4a57c85

Deploying to gh-pages from @ dstackai/dstack@f148a3e 🚀
1 parent a11a89e commit 4a57c85

File tree

10 files changed: +158 -135 lines changed
assets/images/social/community.png

assets/images/social/examples.png

docs/concepts/fleets/index.html

Lines changed: 10 additions & 13 deletions
@@ -926,7 +926,7 @@
 <ul class="md-nav__list">

 <li class="md-nav__item">
-<a href="#placement" class="md-nav__link">
+<a href="#cloud-placement" class="md-nav__link">
 <span class="md-ellipsis">

 <span class="md-typeset">
@@ -1042,7 +1042,7 @@
 <ul class="md-nav__list">

 <li class="md-nav__item">
-<a href="#placement_1" class="md-nav__link">
+<a href="#ssh-placement" class="md-nav__link">
 <span class="md-ellipsis">

 <span class="md-typeset">
@@ -3563,7 +3563,7 @@
 <ul class="md-nav__list">

 <li class="md-nav__item">
-<a href="#placement" class="md-nav__link">
+<a href="#cloud-placement" class="md-nav__link">
 <span class="md-ellipsis">

 <span class="md-typeset">
@@ -3679,7 +3679,7 @@
 <ul class="md-nav__list">

 <li class="md-nav__item">
-<a href="#placement_1" class="md-nav__link">
+<a href="#ssh-placement" class="md-nav__link">
 <span class="md-ellipsis">

 <span class="md-typeset">
@@ -3867,22 +3867,19 @@ <h3 id="define-a-configuration">Define a configuration<a class="headerlink" href

 </div>

-<h4 id="placement">Placement<a class="headerlink" href="#placement" title="Permanent link">&para;</a></h4>
+<h4 id="cloud-placement">Placement<a class="headerlink" href="#cloud-placement" title="Permanent link">&para;</a></h4>
 <p>To ensure instances are interconnected (e.g., for
 <a href="../tasks/#distributed-tasks">distributed tasks</a>), set <code>placement</code> to <code>cluster</code>.
 This ensures all instances are provisioned in the same backend and region with optimal inter-node connectivity.</p>
 <details class="info">
 <summary>AWS</summary>
-<p><code>dstack</code> automatically enables <a href="https://aws.amazon.com/hpc/efa/" target="_blank">Elastic Fabric Adapter <span class="twemoji external"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="m11.93 5 2.83 2.83L5 17.59 6.42 19l9.76-9.75L19 12.07V5z"/></svg></span></a>
-for the instance types that support it:
-<code>p5.48xlarge</code>, <code>p4d.24xlarge</code>, <code>g4dn.12xlarge</code>, <code>g4dn.16xlarge</code>, <code>g4dn.8xlarge</code>, <code>g4dn.metal</code>,
-<code>g5.12xlarge</code>, <code>g5.16xlarge</code>, <code>g5.24xlarge</code>, <code>g5.48xlarge</code>, <code>g5.8xlarge</code>, <code>g6.12xlarge</code>,
-<code>g6.16xlarge</code>, <code>g6.24xlarge</code>, <code>g6.48xlarge</code>, <code>g6.8xlarge</code>, and <code>gr6.8xlarge</code>.</p>
-<p>Currently, only one EFA interface is enabled per instance, regardless of its maximum capacity.
+<p><code>dstack</code> automatically enables the Elastic Fabric Adapter for all
+<a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types" target="_blank">EFA-capable instance types <span class="twemoji external"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="m11.93 5 2.83 2.83L5 17.59 6.42 19l9.76-9.75L19 12.07V5z"/></svg></span></a>.
+Currently, only one EFA interface is enabled per instance, regardless of its maximum capacity.
 This will change once <a href="https://github.com/dstackai/dstack/issues/1804" target="_blank">this issue <span class="twemoji external"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="m11.93 5 2.83 2.83L5 17.59 6.42 19l9.76-9.75L19 12.07V5z"/></svg></span></a> is resolved.</p>
 </details>
 <blockquote>
-<p>The <code>cluster</code> placement is supported only for <code>aws</code>, <code>azure</code>, <code>gcp</code>, and <code>oci</code>
+<p>The <code>cluster</code> placement is supported only for <code>aws</code>, <code>azure</code>, <code>gcp</code>, <code>oci</code>, and <code>vultr</code>
 backends.</p>
 </blockquote>
 <h4 id="resources">Resources<a class="headerlink" href="#resources" title="Permanent link">&para;</a></h4>
@@ -4052,7 +4049,7 @@ <h3 id="define-a-configuration_1">Define a configuration<a class="headerlink" hr
 </div>
 <p>3.&nbsp;The user specified should have passwordless <code>sudo</code> access.</p>
 </details>
-<h4 id="placement_1">Placement<a class="headerlink" href="#placement_1" title="Permanent link">&para;</a></h4>
+<h4 id="ssh-placement">Placement<a class="headerlink" href="#ssh-placement" title="Permanent link">&para;</a></h4>
 <p>If the hosts are interconnected (i.e. share the same network), set <code>placement</code> to <code>cluster</code>.
 This is required if you'd like to use the fleet for <a href="../tasks/#distributed-tasks">distributed tasks</a>.</p>
 <h5 id="network">Network<a class="headerlink" href="#network" title="Permanent link">&para;</a></h5>
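
For reference, a minimal cloud fleet with the cluster placement these sections describe might look like the following sketch. It is illustrative only, assuming the fleet schema the docs above refer to; the name, node count, backend, and GPU size are invented, not taken from this commit:

    type: fleet
    name: my-cluster          # illustrative name
    # Provision interconnected instances in the same backend and region
    nodes: 4
    placement: cluster
    backends: [aws]
    resources:
      gpu: 24GB

An SSH fleet would instead list machines under ssh_config and, per the SSH placement section, set placement: cluster when the hosts share the same network.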

docs/concepts/tasks/index.html

Lines changed: 39 additions & 15 deletions
@@ -3821,7 +3821,7 @@ <h3 id="ports">Ports<a class="headerlink" href="#ports" title="Permanent link">&
 <h3 id="distributed-tasks">Distributed tasks<a class="headerlink" href="#distributed-tasks" title="Permanent link">&para;</a></h3>
 <p>By default, a task runs on a single node.
 However, you can run it on a cluster of nodes by specifying <code>nodes</code>.</p>
-<div editor-title="examples/fine-tuning/train.dstack.yml">
+<div editor-title="train.dstack.yml">

 <div class="highlight"><pre><span></span><code><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">task</span>
 <span class="c1"># The name is optional, if not specified, generated randomly</span>
@@ -3830,33 +3830,57 @@ <h3 id="distributed-tasks">Distributed tasks<a class="headerlink" href="#distrib
 <span class="c1"># The size of the cluster</span>
 <span class="nt">nodes</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">2</span>

-<span class="nt">python</span><span class="p">:</span><span class="w"> </span><span class="s">&quot;3.10&quot;</span>
+<span class="nt">python</span><span class="p">:</span><span class="w"> </span><span class="s">&quot;3.12&quot;</span>

-<span class="c1"># Commands of the task</span>
+<span class="c1"># Commands to run on each node</span>
 <span class="nt">commands</span><span class="p">:</span>
+<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">git clone https://github.com/pytorch/examples.git</span>
+<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">cd examples/distributed/ddp-tutorial-series</span>
 <span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">pip install -r requirements.txt</span>
 <span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">torchrun</span>
-<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--nproc_per_node=$DSTACK_GPUS_PER_NODE</span>
-<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--node_rank=$DSTACK_NODE_RANK</span>
+<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--nproc-per-node=$DSTACK_GPUS_PER_NODE</span>
+<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--node-rank=$DSTACK_NODE_RANK</span>
 <span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--nnodes=$DSTACK_NODES_NUM</span>
-<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--master_addr=$DSTACK_MASTER_NODE_IP</span>
-<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--master_port=8008 resnet_ddp.py</span>
-<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--num_epochs 20</span>
+<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--master-addr=$DSTACK_MASTER_NODE_IP</span>
+<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--master-port=12345</span>
+<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">multinode.py 50 10</span>

 <span class="nt">resources</span><span class="p">:</span>
 <span class="w"> </span><span class="nt">gpu</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">24GB</span>
+<span class="w"> </span><span class="c1"># Uncomment if using multiple GPUs</span>
+<span class="w"> </span><span class="c1">#shm_size: 24GB</span>
 </code></pre></div>

 </div>

-<p>All you need to do is pass the corresponding environment variables such as
-<code>DSTACK_GPUS_PER_NODE</code>, <code>DSTACK_NODE_RANK</code>, <code>DSTACK_NODES_NUM</code>,
-<code>DSTACK_MASTER_NODE_IP</code>, and <code>DSTACK_GPUS_NUM</code> (see <a href="#system-environment-variables">System environment variables</a>).</p>
+<p>Nodes can communicate using their private IP addresses.
+Use <code>DSTACK_MASTER_NODE_IP</code>, <code>DSTACK_NODE_RANK</code>, and other
+<a href="#system-environment-variables">System environment variables</a>
+to discover IP addresses and other details.</p>
+<details class="info">
+<summary>Network interface</summary>
+<p>Distributed frameworks usually detect the correct network interface automatically,
+but sometimes you need to specify it explicitly.</p>
+<p>For example, with PyTorch and the NCCL backend, you may need
+to add these commands to tell NCCL to use the private interface:</p>
+<div class="highlight"><pre><span></span><code><span class="nt">commands</span><span class="p">:</span>
+<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">apt-get install -y iproute2</span>
+<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="p p-Indicator">&gt;</span>
+<span class="w"> </span><span class="no">if [[ $DSTACK_NODE_RANK == 0 ]]; then</span>
+<span class="w"> </span><span class="no">export NCCL_SOCKET_IFNAME=$(ip -4 -o addr show | fgrep $DSTACK_MASTER_NODE_IP | awk &#39;{print $2}&#39;)</span>
+<span class="w"> </span><span class="no">else</span>
+<span class="w"> </span><span class="no">export NCCL_SOCKET_IFNAME=$(ip route get $DSTACK_MASTER_NODE_IP | sed -E &#39;s/.*?dev (\S+) .*/\1/;t;d&#39;)</span>
+<span class="w"> </span><span class="no">fi</span>
+<span class="w"> </span><span class="c1"># ... The rest of the commands</span>
+</code></pre></div>
+</details>
 <div class="admonition info">
 <p class="admonition-title">Fleets</p>
-<p>To ensure all nodes are provisioned into a cluster placement group and to enable the highest level of inter-node
-connectivity (incl. support for <a href="https://aws.amazon.com/hpc/efa/" target="_blank">EFA <span class="twemoji external"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="m11.93 5 2.83 2.83L5 17.59 6.42 19l9.76-9.75L19 12.07V5z"/></svg></span></a>),
-create a <a href="../fleets/">fleet</a> via a configuration before running a disstributed task.</p>
+<p>Distributed tasks can only run on fleets with
+<a href="../fleets/#cloud-placement">cluster placement</a>.
+While <code>dstack</code> can provision such fleets automatically, it is
+recommended to create them via a fleet configuration
+to ensure the highest level of inter-node connectivity.</p>
 </div>
 <p><code>dstack</code> is easy to use with <code>accelerate</code>, <code>torchrun</code>, Ray, Spark, and any other distributed frameworks.</p>
 <h3 id="resources">Resources<a class="headerlink" href="#resources" title="Permanent link">&para;</a></h3>
@@ -4061,7 +4085,7 @@ <h3 id="environment-variables">Environment variables<a class="headerlink" href="
 </tr>
 <tr>
 <td><code>DSTACK_MASTER_NODE_IP</code></td>
-<td>The internal IP address the master node</td>
+<td>The internal IP address of the master node</td>
 </tr>
 <tr>
 <td><code>DSTACK_NODES_IPS</code></td>
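
For reference, the same system environment variables shown in this diff plug into the other launchers the page mentions, such as accelerate. A hedged sketch (illustrative only: train.py, the task name, and the port are invented; the flags are standard accelerate launch options):

    type: task
    name: train-distrib-accelerate   # illustrative name
    nodes: 2

    python: "3.12"
    commands:
      - pip install accelerate
      - accelerate launch
        --num_machines=$DSTACK_NODES_NUM
        --machine_rank=$DSTACK_NODE_RANK
        --num_processes=$DSTACK_GPUS_NUM
        --main_process_ip=$DSTACK_MASTER_NODE_IP
        --main_process_port=12345
        train.py

    resources:
      gpu: 24GB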

docs/reference/api/rest/openapi.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/reference/environment-variables/index.html

Lines changed: 23 additions & 21 deletions
@@ -3398,28 +3398,30 @@ <h2 id="dstackyml">.dstack.yml<a class="headerlink" href="#dstackyml" title="Per
 <li><code id="DSTACK_GPUS_PER_NODE">DSTACK_GPUS_PER_NODE</code> – The number of GPUs per node</li>
 <li><code id="DSTACK_NODE_RANK">DSTACK_NODE_RANK</code> – The rank of the node</li>
 <li>
-<p><code id="DSTACK_NODE_RANK">DSTACK_NODE_RANK</code> – The internal IP address the master node.</p>
-<p>Below is an example of using <code>DSTACK_NODES_NUM</code>, <code>DSTACK_GPUS_PER_NODE</code>, <code>DSTACK_NODE_RANK</code>, and <code>DSTACK_NODE_RANK</code>
+<p><code id="DSTACK_MASTER_NODE_IP">DSTACK_MASTER_NODE_IP</code> – The internal IP address of the master node.</p>
+<p>Below is an example of using <code>DSTACK_NODES_NUM</code>, <code>DSTACK_GPUS_PER_NODE</code>, <code>DSTACK_NODE_RANK</code>, and <code>DSTACK_MASTER_NODE_IP</code>
 for distributed training:</p>
-<div class="highlight"><pre><span></span><code><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">task</span>
-<span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">train-distrib</span>
-
-<span class="c1"># The number of instances in the cluster</span>
-<span class="nt">nodes</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">2</span>
-
-<span class="nt">python</span><span class="p">:</span><span class="w"> </span><span class="s">&quot;3.10&quot;</span>
-<span class="nt">commands</span><span class="p">:</span>
-<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">pip install -r requirements.txt</span>
-<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">torchrun</span>
-<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--nproc_per_node=$DSTACK_GPUS_PER_NODE</span>
-<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--node_rank=$DSTACK_NODE_RANK</span>
-<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--nnodes=$DSTACK_NODES_NUM</span>
-<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--master_addr=$DSTACK_MASTER_NODE_IP</span>
-<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--master_port=8008</span><span class="w"> </span>
-<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">resnet_ddp.py --num_epochs 20</span>
-
-<span class="nt">resources</span><span class="p">:</span>
-<span class="w"> </span><span class="nt">gpu</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">24GB</span>
+<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">task</span>
+<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">train-distrib</span>
+
+<span class="w"> </span><span class="nt">nodes</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">2</span>
+<span class="w"> </span><span class="nt">python</span><span class="p">:</span><span class="w"> </span><span class="s">&quot;3.12&quot;</span>
+
+<span class="w"> </span><span class="nt">commands</span><span class="p">:</span>
+<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">git clone https://github.com/pytorch/examples.git</span>
+<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">cd examples/distributed/ddp-tutorial-series</span>
+<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">pip install -r requirements.txt</span>
+<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">torchrun</span>
+<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--nproc-per-node=$DSTACK_GPUS_PER_NODE</span>
+<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--node-rank=$DSTACK_NODE_RANK</span>
+<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--nnodes=$DSTACK_NODES_NUM</span>
+<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--master-addr=$DSTACK_MASTER_NODE_IP</span>
+<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--master-port=12345</span>
+<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">multinode.py 50 10</span>
+
+<span class="w"> </span><span class="nt">resources</span><span class="p">:</span>
+<span class="w"> </span><span class="nt">gpu</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">24GB</span>
+<span class="w"> </span><span class="nt">shm_size</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">24GB</span>
 </code></pre></div>
 </li>
 <li>
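
For reference, the neighboring DSTACK_NODES_IPS variable can feed launchers that expect a hostfile. A hedged sketch, assuming DSTACK_NODES_IPS holds one internal IP per line and that OpenMPI and inter-node SSH are available (the hostfile path is invented):

    commands:
      # Write one IP per line for launchers that expect a hostfile
      - echo "$DSTACK_NODES_IPS" > /tmp/hostfile
      # One process per listed host; per-GPU slots would need to be set explicitly
      - mpirun --hostfile /tmp/hostfile -np $DSTACK_NODES_NUM hostname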

search/search_index.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.
