# A3 Mega GCP clusters

This example shows how to set up an A3 Mega GCP cluster with [GPUDirect-TCPXO](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot)
optimized NCCL communication and run [NCCL Tests](https://github.com/NVIDIA/nccl-tests) on it using `dstack`.

## Overview

GCP's A3 Mega instances are 8xH100 VMs with a maximum network bandwidth of 1,800 Gbps,
the highest among H100 instances on GCP. To get that network performance, you need
to set up GPUDirect-TCPXO, the GCP technology for direct GPU-to-GPU communication over TCP. This involves:

* Setting up eight extra data NICs on every node, each in a separate VPC.
* Building a VM image with GPUDirect-TCPXO support.
* Launching an RXDM service container.
* Installing the GPUDirect-TCPXO NCCL plugin.

`dstack` hides most of this setup complexity and provides optimized A3 Mega GCP clusters out of the box.

## Configure GCP backend

First, configure the `gcp` backend with GPUDirect-TCPXO support.
You need to specify eight `extra_vpcs` to use for the data NICs,
as well as a `vm_service_account` that's authorized to pull GPUDirect-related Docker images:

<div editor-title="~/.dstack/server/config.yml">

```yaml
projects:
  - name: main
    backends:
      - type: gcp
        project_id: $MYPROJECT # Replace $MYPROJECT
        extra_vpcs:
          - dstack-gpu-data-net-1
          - dstack-gpu-data-net-2
          - dstack-gpu-data-net-3
          - dstack-gpu-data-net-4
          - dstack-gpu-data-net-5
          - dstack-gpu-data-net-6
          - dstack-gpu-data-net-7
          - dstack-gpu-data-net-8
        regions: [europe-west4]
        vm_service_account: a3mega-sa@$MYPROJECT.iam.gserviceaccount.com # Replace $MYPROJECT
        creds:
          type: default
```

</div>
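
If the `dstack` server is already running, restart it so that the updated backend configuration is picked up on startup:

<div class="termy">

```shell
$ dstack server
```

</div>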

??? info "Create extra VPCs"
    Create the eight VPC networks for GPUDirect-TCPXO in your project, each with a subnet and a firewall rule that allows internal traffic:

    ```shell
    # Specify the region where you intend to deploy the cluster
    REGION="europe-west4"

    for N in $(seq 1 8); do
      gcloud compute networks create dstack-gpu-data-net-$N \
          --subnet-mode=custom \
          --mtu=8244

      gcloud compute networks subnets create dstack-gpu-data-sub-$N \
          --network=dstack-gpu-data-net-$N \
          --region=$REGION \
          --range=192.168.$N.0/24

      gcloud compute firewall-rules create dstack-gpu-data-internal-$N \
          --network=dstack-gpu-data-net-$N \
          --action=ALLOW \
          --rules=tcp:0-65535,udp:0-65535,icmp \
          --source-ranges=192.168.0.0/16
    done
    ```
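
    To double-check the result, you can list the created networks and subnets (an optional sanity check):

    ```shell
    gcloud compute networks list --filter="name~^dstack-gpu-data-net"
    gcloud compute networks subnets list --regions=$REGION --filter="name~^dstack-gpu-data-sub"
    ```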

??? info "Create service account"
    Create a VM service account that allows the VMs to pull Docker images from the `pkg.dev` Artifact Registry:

    ```shell
    PROJECT_ID=$(gcloud config get-value project)

    gcloud iam service-accounts create a3mega-sa \
        --display-name="Service account for pulling GPUDirect-related images"

    gcloud projects add-iam-policy-binding $PROJECT_ID \
        --member="serviceAccount:a3mega-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
        --role="roles/artifactregistry.reader"
    ```
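
    To confirm that the role was granted (an optional check):

    ```shell
    gcloud projects get-iam-policy $PROJECT_ID \
        --flatten="bindings[].members" \
        --filter="bindings.members:a3mega-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
        --format="table(bindings.role)"
    ```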

## Create A3 Mega fleet

Once you've configured the `gcp` backend, define the fleet configuration:

<div editor-title="fleet.dstack.yml">

```yaml
type: fleet
name: a3mega-cluster
nodes: 2
placement: cluster
instance_types:
  - a3-megagpu-8g
spot_policy: auto
```

</div>

and apply the configuration:

<div class="termy">

```shell
$ dstack apply -f examples/misc/a3mega-clusters/fleet.dstack.yml
 Project        main
 User           admin
 Configuration  examples/misc/a3mega-clusters/fleet.dstack.yml
 Type           fleet
 Fleet type     cloud
 Nodes          2
 Placement      cluster
 Resources      2..xCPU, 8GB.., 100GB.. (disk)
 Spot policy    auto

 #  BACKEND  REGION        INSTANCE       RESOURCES         SPOT  PRICE
 1  gcp      europe-west4  a3-megagpu-8g  208xCPU, 1872GB,  yes   $22.1525
                                          8xH100 (80GB),
                                          100.0GB (disk)
 2  gcp      europe-west4  a3-megagpu-8g  208xCPU, 1872GB,  no    $64.2718
                                          8xH100 (80GB),
                                          100.0GB (disk)

Fleet a3mega-cluster does not exist yet.
Create the fleet? [y/n]: y

Provisioning...
---> 100%
```

</div>

`dstack` will provision two A3 Mega nodes with GPUDirect-TCPXO configured.
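
To inspect the fleet and its instances at any time, use the `dstack fleet` command:

<div class="termy">

```shell
$ dstack fleet
```

</div>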

## Run NCCL Tests with GPUDirect-TCPXO support

Once the nodes are provisioned, let's test the network by running NCCL Tests
using the `nccl-tests.dstack.yml` configuration from this example's source code.
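
For reference, below is a minimal sketch of what such a distributed task can look like. This is an illustration rather than the actual file from the repo: the `/opt/nccl-tests/build` path, the choice of `all_gather_perf`, and the availability of the `DSTACK_NODE_RANK` and `DSTACK_MPI_HOSTFILE` environment variables in the run environment are assumptions here:

```yaml
type: task
name: nccl-tests
nodes: 2
commands:
  - |
    # Load the tuned TCPXO NCCL settings (see the next section for details)
    source /var/lib/tcpxo/lib64/nccl-env-profile.sh
    if [ "$DSTACK_NODE_RANK" = "0" ]; then
      # The first node drives the benchmark over MPI across all 16 GPUs;
      # the flags mirror the parameters shown in the output below
      mpirun --allow-run-as-root \
        --hostfile "$DSTACK_MPI_HOSTFILE" \
        -N 8 \
        /opt/nccl-tests/build/all_gather_perf -b 8M -e 8G -f 2 -n 200 -w 5 -c 0
    else
      # The remaining nodes only need to stay alive for the duration of the test
      sleep infinity
    fi
resources:
  shm_size: 16GB
```

Apply the configuration and watch the benchmark output: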

<div class="termy">

```shell
$ dstack apply -f examples/misc/a3mega-clusters/nccl-tests.dstack.yml

nccl-tests provisioning completed (running)
nThread 1 nGpus 1 minBytes 8388608 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 200 agg iters: 1 validation: 0 graph: 0

                                                            out-of-place                       in-place
          size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
           (B)    (elements)                                (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
       8388608        131072     float    none      -1    394.2   21.28   19.95     N/A    392.7   21.36   20.03     N/A
      16777216        262144     float    none      -1    437.8   38.32   35.92     N/A    434.1   38.65   36.24     N/A
      33554432        524288     float    none      -1    479.5   69.98   65.61     N/A    479.9   69.92   65.55     N/A
      67108864       1048576     float    none      -1    755.8   88.79   83.24     N/A    771.9   86.94   81.51     N/A
     134217728       2097152     float    none      -1   1125.3  119.27  111.81     N/A   1121.8  119.64  112.16     N/A
     268435456       4194304     float    none      -1   1741.3  154.16  144.53     N/A   1742.2  154.08  144.45     N/A
     536870912       8388608     float    none      -1   2854.9  188.05  176.30     N/A   2869.8  187.08  175.38     N/A
    1073741824      16777216     float    none      -1   5536.1  193.95  181.83     N/A   5528.8  194.21  182.07     N/A
    2147483648      33554432     float    none      -1    10853  197.88  185.51     N/A    10830  198.29  185.90     N/A
    4294967296      67108864     float    none      -1    21491  199.85  187.36     N/A    21466  200.09  187.58     N/A
    8589934592     134217728     float    none      -1    42770  200.84  188.29     N/A    42752  200.93  188.37     N/A
Out of bounds values : 0 OK
Avg bus bandwidth    : 125.436

Done
```

</div>

The reported bus bandwidth should be close to the maximum network bandwidth GCP supports for A3 Mega instances.
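For reference, 1,800 Gbps corresponds to 225 GB/s, so the ~188 GB/s bus bandwidth at the largest message sizes is about 84% of the theoretical line rate.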

## Run NCCL workloads with GPUDirect-TCPXO support

To take full advantage of GPUDirect-TCPXO in your own workloads, you need to properly set up the [NCCL environment variables](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot#environment-variables-nccl).
This can be done with the following commands in your run configuration:

<div editor-title="task.dstack.yml">

```yaml
type: task
nodes: 2
commands:
  - |
    # The TCPXO NCCL plugin and its tuned settings live here
    NCCL_LIB_DIR="/var/lib/tcpxo/lib64"
    source ${NCCL_LIB_DIR}/nccl-env-profile.sh
    # The control NIC used for NCCL control-plane traffic
    export NCCL_FASTRAK_CTRL_DEV=enp0s12
    # The eight data NICs used for GPU-to-GPU traffic
    export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
    export NCCL_SOCKET_IFNAME=enp0s12
    export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY="/dev/aperture_devices"
    # Make sure the TCPXO-enabled NCCL is picked up first
    export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"
    # Run your NCCL workload here
resources:
  # Allocate shared memory for NCCL
  shm_size: 16GB
```

</div>
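
As with any run configuration, submit the task with `dstack apply`; it will be scheduled on the A3 Mega fleet created earlier:

<div class="termy">

```shell
$ dstack apply -f task.dstack.yml
```

</div>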

## Source code

The source code for this example can be found in
[`examples/misc/a3mega-clusters` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/misc/a3mega-clusters).