Skip to content

Commit f29a1f8

Browse files
[Blog] Using SSH fleets with TensorWave's private AMD cloud (#2391)
1 parent 4341282 commit f29a1f8

File tree

6 files changed

+407
-4732
lines changed

6 files changed

+407
-4732
lines changed

docs/assets/images/hotaisle-logo.svg

Lines changed: 7 additions & 4669 deletions
Loading
Lines changed: 10 additions & 0 deletions
Loading

docs/assets/stylesheets/landing.css

Lines changed: 68 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -386,7 +386,7 @@
386386
}
387387

388388
.providers.tx-landing__highlights_grid {
389-
grid-gap: 28px !important;
389+
grid-gap: 20px !important;
390390
}
391391

392392
#typed {
@@ -397,15 +397,22 @@
397397

398398
.providers.tx-landing__highlights_grid .feature-cell h3 {
399399
align-content: center;
400-
font-size: 1.1em;
400+
font-size: 1em;
401401
font-weight: 600;
402402
padding-bottom: 0.05em;
403403
line-height: 25px;
404404
}
405405

406406
.providers.tx-landing__highlights_grid .feature-cell {
407-
row-gap: 18px;
408-
padding: 26px 39px;
407+
row-gap: 22px;
408+
padding: 25px 30px;
409+
aspect-ratio: 1.05;
410+
411+
@media screen and (min-width: 76.1875em) {
412+
&:nth-child(1) {
413+
border-top-left-radius: 3px;
414+
}
415+
}
409416
}
410417

411418
.tx-landing__highlights_grid .feature-cell {
@@ -418,6 +425,54 @@
418425
flex-direction: column;
419426
}
420427

428+
@media screen and (min-width: 76.1875em) {
429+
.providers.tx-landing__highlights_grid .feature-cell {
430+
border-radius: 0;
431+
border-left: none;
432+
border-bottom: none;
433+
}
434+
435+
.nvidia.providers.tx-landing__highlights_grid .feature-cell {
436+
&:nth-child(1), &:nth-child(6), &:nth-child(11) {
437+
border-left: 0.5px dotted rgba(0, 0, 0, 0.75);
438+
}
439+
440+
&:nth-child(n+7) {
441+
border-bottom: 0.5px dotted rgba(0, 0, 0, 0.75);
442+
}
443+
444+
&:nth-child(5) {
445+
border-top-right-radius: 3px;
446+
}
447+
448+
&:nth-child(5), &:nth-child(11) {
449+
border-bottom-right-radius: 3px;
450+
}
451+
452+
&:nth-child(11) {
453+
border-bottom-left-radius: 3px;
454+
}
455+
456+
&:nth-child(10) {
457+
border-bottom-right-radius: 3px;
458+
}
459+
}
460+
}
461+
462+
:is(.amd).providers.tx-landing__highlights_grid .feature-cell {
463+
&:nth-child(1) {
464+
border-left: 0.5px dotted rgba(0, 0, 0, 0.75);
465+
border-bottom-left-radius: 3px;
466+
}
467+
468+
border-bottom: 0.5px dotted rgba(0, 0, 0, 0.75);
469+
470+
&:nth-child(3) {
471+
border-top-right-radius: 3px;
472+
border-bottom-right-radius: 3px;
473+
}
474+
}
475+
421476
.providers.tx-landing__highlights_grid.other .feature-cell {
422477
column-gap: 15px;
423478
flex-direction: row;
@@ -436,6 +491,13 @@
436491
grid-template-columns: repeat(4, 1fr) !important;
437492
}
438493

494+
.providers.tx-landing__highlights_grid {
495+
grid-gap: 0px !important;
496+
border: none;
497+
498+
grid-template-columns: repeat(5, 1fr) !important;
499+
}
500+
439501
.tx-landing__highlights_grid .feature-cell {
440502
}
441503
}
@@ -444,9 +506,9 @@
444506
background: -webkit-linear-gradient(45deg, rgba(0, 42, 255, 0.005), rgba(0, 42, 255, 0.005), rgba(225, 101, 254, 0.01));
445507
}
446508

447-
.tx-landing__highlights_grid .feature-cell:hover {
509+
/*.tx-landing__highlights_grid .feature-cell:hover {
448510
background: -webkit-linear-gradient(45deg, rgba(0, 42, 255, 0.03), rgba(0, 42, 255, 0.03), rgba(225, 101, 254, 0.05));
449-
}
511+
}*/
450512

451513
.tx-landing__highlights_grid .feature-cell strong {
452514
font-weight: 500;
Lines changed: 243 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,243 @@
1+
---
2+
title: Using SSH fleets with TensorWave's private AMD cloud
3+
date: 2025-03-11
4+
description: "This tutorial walks you through how dstack can be used with TensorWave's private AMD cloud using SSH fleets."
5+
slug: amd-on-tensorwave
6+
image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-tensorwave-v2.png?raw=true
7+
categories:
8+
- Fleets
9+
- AMD
10+
- Private clouds
11+
---
12+
13+
# Using SSH fleets with TensorWave's private AMD cloud
14+
15+
Since last month, when we introduced support for private clouds and data centers, it has become easier to use `dstack`
16+
to orchestrate AI containers with any AI cloud vendor, whether they provide on-demand compute or reserved clusters.
17+
18+
In this tutorial, we’ll walk you through how `dstack` can be used with
19+
[TensorWave :material-arrow-top-right-thin:{ .external }](https://tensorwave.com/){:target="_blank"} using
20+
[SSH fleets](../../docs/concepts/fleets.md#ssh).
21+
22+
<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-tensorwave-v2.png?raw=true" width="630"/>
23+
24+
<!-- more -->
25+
26+
TensorWave is a cloud provider specializing in large-scale AMD GPU clusters for both
27+
training and inference.
28+
29+
Before following this tutorial, ensure you have access to a cluster. You’ll see the cluster and its nodes in your
30+
TensorWave dashboard.
31+
32+
<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-tensorwave-ui.png?raw=true" width="750"/>
33+
34+
## Creating a fleet
35+
36+
??? info "Prerequisites"
37+
Once `dstack` is [installed](https://dstack.ai/docs/installation), create a project repo folder and run `dstack init`.
38+
39+
<div class="termy">
40+
41+
```shell
42+
$ mkdir tensorwave-demo && cd tensorwave-demo
43+
$ dstack init
44+
```
45+
46+
</div>
47+
48+
Now, define an SSH fleet configuration by listing the IP addresses of each node in the cluster,
49+
along with the SSH user and SSH key configured for each host.
50+
51+
<div editor-title="fleet.dstack.yml">
52+
53+
```yaml
54+
type: fleet
55+
name: my-tensorwave-fleet
56+
57+
placement: cluster
58+
59+
ssh_config:
60+
user: dstack
61+
identity_file: ~/.ssh/id_rsa
62+
hosts:
63+
- hostname: 64.139.222.107
64+
blocks: auto
65+
- hostname: 64.139.222.108
66+
blocks: auto
67+
```
68+
69+
</div>
70+
71+
You can set `blocks` to `auto` if you want to run concurrent workloads on each instance.
72+
Otherwise, you can omit this property.
73+
74+
Once the configuration is ready, apply it using `dstack apply`:
75+
76+
<div class="termy">
77+
78+
```shell
79+
$ dstack apply -f fleet.dstack.yml
80+
81+
Provisioning...
82+
---> 100%
83+
84+
FLEET INSTANCE RESOURCES STATUS CREATED
85+
my-tensorwave-fleet 0 8xMI300X (192GB) 0/8 busy 3 mins ago
86+
1 8xMI300X (192GB) 0/8 busy 3 mins ago
87+
88+
```
89+
90+
</div>
91+
92+
`dstack` will automatically connect to each host, detect the hardware, install dependencies, and make them ready for
93+
workloads.
94+
95+
## Running workloads
96+
97+
Once the fleet is created, you can use `dstack` to run workloads.
98+
99+
### Dev environments
100+
101+
A dev environment lets you access an instance through your desktop IDE.
102+
103+
<div editor-title=".dstack.yml">
104+
105+
```yaml
106+
type: dev-environment
107+
name: vscode
108+
109+
image: rocm/pytorch:rocm6.3.3_ubuntu22.04_py3.10_pytorch_release_2.4.0
110+
ide: vscode
111+
112+
resources:
113+
gpu: MI300X:8
114+
```
115+
116+
</div>
117+
118+
Apply the configuration via [`dstack apply`](../../docs/reference/cli/dstack/apply.md):
119+
120+
<div class="termy">
121+
122+
```shell
123+
$ dstack apply -f .dstack.yml
124+
125+
Submit the run `vscode`? [y/n]: y
126+
127+
Launching `vscode`...
128+
---> 100%
129+
130+
To open in VS Code Desktop, use this link:
131+
vscode://vscode-remote/ssh-remote+vscode/workflow
132+
```
133+
134+
</div>
135+
136+
Open the link to access the dev environment using your desktop IDE.
137+
138+
### Tasks
139+
140+
A task allows you to schedule a job or run a web app. Tasks can be distributed and support port forwarding.
141+
142+
Below is a distributed training task configuration:
143+
144+
<div editor-title="train.dstack.yml">
145+
146+
```yaml
147+
type: task
148+
name: train-distrib
149+
150+
nodes: 2
151+
152+
image: rocm/pytorch:rocm6.3.3_ubuntu22.04_py3.10_pytorch_release_2.4.0
153+
commands:
154+
- pip install torch
155+
- export NCCL_IB_GID_INDEX=3
156+
- export NCCL_NET_GDR_LEVEL=0
157+
- torchrun --nproc_per_node=8 --nnodes=2 --node_rank=$DSTACK_NODE_RANK --master_port=29600 --master_addr=$DSTACK_MASTER_NODE_IP test/tensorwave/multinode.py 5000 50
158+
159+
resources:
160+
gpu: MI300X:8
161+
```
162+
163+
</div>
164+
165+
Run the configuration via [`dstack apply`](../../docs/reference/cli/dstack/apply.md):
166+
167+
<div class="termy">
168+
169+
```shell
170+
$ dstack apply -f train.dstack.yml
171+
172+
Submit the run `streamlit`? [y/n]: y
173+
174+
Provisioning `train-distrib`...
175+
---> 100%
176+
```
177+
178+
</div>
179+
180+
`dstack` automatically runs the container on each node while passing
181+
[system environment variables](../../docs/concepts/tasks.md#system-environment-variables)
182+
which you can use with `torchrun`, `accelerate`, or other distributed frameworks.
183+
184+
### Services
185+
186+
A service allows you to deploy a model or any web app as a scalable and secure endpoint.
187+
188+
Create the following configuration file inside the repo:
189+
190+
<div editor-title="deepseek.dstack.yml">
191+
192+
```yaml
193+
type: service
194+
name: deepseek-r1-sglang
195+
196+
image: rocm/sglang-staging:20250212
197+
env:
198+
- MODEL_ID=deepseek-ai/DeepSeek-R1
199+
- HSA_NO_SCRATCH_RECLAIM=1
200+
commands:
201+
- python3 -m sglang.launch_server --model-path $MODEL_ID --port 8000 --tp 8 --trust-remote-code
202+
port: 8000
203+
model: deepseek-ai/DeepSeek-R1
204+
205+
resources:
206+
gpu: mi300x:8
207+
208+
volumes:
209+
- /root/.cache/huggingface:/root/.cache/huggingface
210+
```
211+
212+
</div>
213+
214+
Run the configuration via [`dstack apply`](../../docs/reference/cli/dstack/apply.md):
215+
216+
<div class="termy">
217+
218+
```shell
219+
$ dstack apply -f deepseek.dstack.yml
220+
221+
Submit the run `deepseek-r1-sglang`? [y/n]: y
222+
223+
Provisioning `deepseek-r1-sglang`...
224+
---> 100%
225+
226+
Service is published at:
227+
http://localhost:3000/proxy/services/main/deepseek-r1-sglang/
228+
Model deepseek-ai/DeepSeek-R1 is published at:
229+
http://localhost:3000/proxy/models/main/
230+
```
231+
232+
</div>
233+
234+
## See it in action
235+
236+
Want to see how it works? Check out the video below:
237+
238+
<iframe width="750" height="520" src="https://www.youtube.com/embed/b1vAgm5fCfE?si=qw2gYHkMjERohdad&rel=0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
239+
240+
!!! info "What's next?"
241+
1. See [SSH fleets](../../docs/concepts/fleets.md#ssh)
242+
2. Read about [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), and [services](../../docs/concepts/services.md)
243+
3. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd)

docs/overrides/main.html

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -132,7 +132,7 @@
132132
<a href="https://discord.gg/u8SmfwPpMd" target="_blank" class="tx-footer__section-link external">Discord</a>
133133
<a href="https://github.com/dstackai/dstack/" target="_blank" class="tx-footer__section-link external">GitHub</a>
134134
<a href="https://github.com/dstackai/dstack/blob/master/CONTRIBUTING.md" target="_blank" class="tx-footer__section-link external">Contributing</a>
135-
<a href="/developers#ambassador-program" class="tx-footer__section-link">Ambassador program</a>
135+
<!--<a href="/developers#ambassador-program" class="tx-footer__section-link">Ambassador program</a>-->
136136
</div>
137137

138138
<div class="tx-footer__section">

0 commit comments

Comments
 (0)