---
title: Using SSH fleets with TensorWave's private AMD cloud
date: 2025-03-11
description: "This tutorial walks you through how dstack can be used with TensorWave's private AMD cloud using SSH fleets."
slug: amd-on-tensorwave
image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-tensorwave-v2.png?raw=true
categories:
  - Fleets
  - AMD
  - Private clouds
---

# Using SSH fleets with TensorWave's private AMD cloud

Since last month, when we introduced support for private clouds and data centers, it has become easier to use `dstack`
to orchestrate AI containers with any AI cloud vendor, whether they provide on-demand compute or reserved clusters.

In this tutorial, we’ll walk you through how `dstack` can be used with
[TensorWave :material-arrow-top-right-thin:{ .external }](https://tensorwave.com/){:target="_blank"} using
[SSH fleets](../../docs/concepts/fleets.md#ssh).

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-tensorwave-v2.png?raw=true" width="630"/>

<!-- more -->

TensorWave is a cloud provider specializing in large-scale AMD GPU clusters for both
training and inference.

Before following this tutorial, ensure you have access to a cluster. You’ll see the cluster and its nodes in your
TensorWave dashboard.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-tensorwave-ui.png?raw=true" width="750"/>

## Creating a fleet

??? info "Prerequisites"
    Once `dstack` is [installed](https://dstack.ai/docs/installation), create a project repo folder and run `dstack init`.

    <div class="termy">

    ```shell
    $ mkdir tensorwave-demo && cd tensorwave-demo
    $ dstack init
    ```

    </div>

Now, define an SSH fleet configuration by listing the IP addresses of each node in the cluster,
along with the SSH user and SSH key configured for each host.

<div editor-title="fleet.dstack.yml">

```yaml
type: fleet
name: my-tensorwave-fleet

placement: cluster

ssh_config:
  user: dstack
  identity_file: ~/.ssh/id_rsa
  hosts:
    - hostname: 64.139.222.107
      blocks: auto
    - hostname: 64.139.222.108
      blocks: auto
```

</div>

You can set `blocks` to `auto` if you want to run concurrent workloads on each instance.
Otherwise, you can omit this property.
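
If you'd rather cap concurrency explicitly, `blocks` can also take a fixed number instead of `auto` (a hedged sketch; the split shown here is illustrative):

```yaml
ssh_config:
  user: dstack
  identity_file: ~/.ssh/id_rsa
  hosts:
    # Splitting an 8-GPU host into 4 blocks of 2 GPUs each,
    # so up to 4 workloads can share one instance
    - hostname: 64.139.222.107
      blocks: 4
```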

Once the configuration is ready, apply it using `dstack apply`:

<div class="termy">

```shell
$ dstack apply -f fleet.dstack.yml

Provisioning...
---> 100%

 FLEET                INSTANCE  RESOURCES         STATUS    CREATED
 my-tensorwave-fleet  0         8xMI300X (192GB)  0/8 busy  3 mins ago
                      1         8xMI300X (192GB)  0/8 busy  3 mins ago
```

</div>

`dstack` will automatically connect to each host, detect its hardware, install dependencies, and make it ready for
workloads.

## Running workloads

Once the fleet is created, you can use `dstack` to run workloads.

### Dev environments

A dev environment lets you access an instance through your desktop IDE.

<div editor-title=".dstack.yml">

```yaml
type: dev-environment
name: vscode

image: rocm/pytorch:rocm6.3.3_ubuntu22.04_py3.10_pytorch_release_2.4.0
ide: vscode

resources:
  gpu: MI300X:8
```

</div>

Apply the configuration via [`dstack apply`](../../docs/reference/cli/dstack/apply.md):

<div class="termy">

```shell
$ dstack apply -f .dstack.yml

Submit the run `vscode`? [y/n]: y

Launching `vscode`...
---> 100%

To open in VS Code Desktop, use this link:
  vscode://vscode-remote/ssh-remote+vscode/workflow
```

</div>

Open the link to access the dev environment using your desktop IDE.
### Tasks

A task allows you to schedule a job or run a web app. Tasks can be distributed and support port forwarding.
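
For a web app, you can list the ports to forward in the task configuration (a hedged sketch; the app, command, and port are illustrative):

```yaml
type: task
name: streamlit-app

image: rocm/pytorch:rocm6.3.3_ubuntu22.04_py3.10_pytorch_release_2.4.0
commands:
  - pip install streamlit
  - streamlit run app.py --server.port 8501
# Forwarded to localhost while the run is attached
ports:
  - 8501
```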

Below is a distributed training task configuration:

<div editor-title="train.dstack.yml">

```yaml
type: task
name: train-distrib

nodes: 2

image: rocm/pytorch:rocm6.3.3_ubuntu22.04_py3.10_pytorch_release_2.4.0
commands:
  - pip install torch
  - export NCCL_IB_GID_INDEX=3
  - export NCCL_NET_GDR_LEVEL=0
  - torchrun --nproc_per_node=8 --nnodes=2 --node_rank=$DSTACK_NODE_RANK --master_port=29600 --master_addr=$DSTACK_MASTER_NODE_IP test/tensorwave/multinode.py 5000 50

resources:
  gpu: MI300X:8
```

</div>

Run the configuration via [`dstack apply`](../../docs/reference/cli/dstack/apply.md):

<div class="termy">

```shell
$ dstack apply -f train.dstack.yml

Submit the run `train-distrib`? [y/n]: y

Provisioning `train-distrib`...
---> 100%
```

</div>

`dstack` automatically runs the container on each node while passing
[system environment variables](../../docs/concepts/tasks.md#system-environment-variables)
which you can use with `torchrun`, `accelerate`, or other distributed frameworks.
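
To make the wiring concrete, here is a small sketch of how the injected variables map onto `torchrun` arguments. `DSTACK_NODE_RANK` and `DSTACK_MASTER_NODE_IP` are set by `dstack` on each node of a multi-node task; the fallback defaults are only for trying the snippet locally:

```python
import os

def torchrun_args(script: str, nproc_per_node: int = 8, nnodes: int = 2) -> list[str]:
    """Build a torchrun command line from the environment dstack injects."""
    # Provided by dstack on every node of a multi-node task;
    # the defaults here only matter when running outside dstack.
    node_rank = os.environ.get("DSTACK_NODE_RANK", "0")
    master_addr = os.environ.get("DSTACK_MASTER_NODE_IP", "127.0.0.1")
    return [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        "--master_port=29600",
        f"--master_addr={master_addr}",
        script,
    ]

print(" ".join(torchrun_args("test/tensorwave/multinode.py")))
```

The same environment variables can feed `accelerate launch --machine_rank ... --main_process_ip ...` in exactly the same way.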
### Services

A service allows you to deploy a model or any web app as a scalable and secure endpoint.

Create the following configuration file inside the repo:
<div editor-title="deepseek.dstack.yml">

```yaml
type: service
name: deepseek-r1-sglang

image: rocm/sglang-staging:20250212
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1
  - HSA_NO_SCRATCH_RECLAIM=1
commands:
  - python3 -m sglang.launch_server --model-path $MODEL_ID --port 8000 --tp 8 --trust-remote-code
port: 8000
model: deepseek-ai/DeepSeek-R1

resources:
  gpu: MI300X:8

volumes:
  - /root/.cache/huggingface:/root/.cache/huggingface
```

</div>

Run the configuration via [`dstack apply`](../../docs/reference/cli/dstack/apply.md):

<div class="termy">

```shell
$ dstack apply -f deepseek.dstack.yml

Submit the run `deepseek-r1-sglang`? [y/n]: y

Provisioning `deepseek-r1-sglang`...
---> 100%

Service is published at:
  http://localhost:3000/proxy/services/main/deepseek-r1-sglang/
Model deepseek-ai/DeepSeek-R1 is published at:
  http://localhost:3000/proxy/models/main/
```

</div>
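
Since the published model endpoint speaks the OpenAI-compatible chat format, you can assemble a request against it like this (a sketch using only the standard library; the base URL comes from the `dstack apply` output above, and the prompt is illustrative):

```python
import json
from urllib.request import Request

# Local proxy address printed by `dstack apply`
BASE_URL = "http://localhost:3000/proxy/models/main"

def chat_request(model: str, prompt: str) -> Request:
    """Build an OpenAI-style chat completion request for the dstack model proxy."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("deepseek-ai/DeepSeek-R1", "What is MI300X?")
print(req.full_url)
```

With the service running, `urllib.request.urlopen(req)` sends it; an OpenAI SDK client pointed at the same `base_url` works just as well.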

## See it in action

Want to see how it works? Check out the video below:

<iframe width="750" height="520" src="https://www.youtube.com/embed/b1vAgm5fCfE?si=qw2gYHkMjERohdad&rel=0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

!!! info "What's next?"
    1. See [SSH fleets](../../docs/concepts/fleets.md#ssh)
    2. Read about [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), and [services](../../docs/concepts/services.md)
    3. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd)