
Commit 8a25f09

Add example on A3 Mega GCP clusters
1 parent 27864d9 commit 8a25f09

File tree

4 files changed: +262 −0 lines changed

docs/examples/misc/a3mega-clusters/index.md

Whitespace-only changes.
Lines changed: 206 additions & 0 deletions
@@ -0,0 +1,206 @@
# A3 Mega GCP clusters

This example shows how to set up an A3 Mega GCP cluster with [GPUDirect-TCPXO](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot)-optimized NCCL communication and run [NCCL Tests](https://github.com/NVIDIA/nccl-tests) on it using `dstack`.

## Overview

GCP's A3 Mega instances are 8xH100 VMs with a maximum network bandwidth of 1,800 Gbps, the highest among GCP's H100 instances. To get that network performance, you need to set up GPUDirect-TCPXO, the GCP technology for GPU RDMA over TCP. This involves:

* Setting up eight extra data NICs on every node, each NIC in a separate VPC.
* Building a VM image with GPUDirect-TCPXO support.
* Launching an RXDM service container.
* Installing the GPUDirect-TCPXO NCCL plugin.

`dstack` hides most of this setup complexity and provides optimized A3 Mega GCP clusters out of the box.

## Configure GCP backend

First, configure the `gcp` backend with GPUDirect-TCPXO support.
You need to specify eight `extra_vpcs` to use for the data NICs.
You also need to specify a `vm_service_account` that's authorized to pull GPUDirect-related Docker images:

<div editor-title="~/.dstack/server/config.yml">

```yaml
projects:
- name: main
  backends:
  - type: gcp
    project_id: $MYPROJECT # Replace $MYPROJECT
    extra_vpcs:
    - dstack-gpu-data-net-1
    - dstack-gpu-data-net-2
    - dstack-gpu-data-net-3
    - dstack-gpu-data-net-4
    - dstack-gpu-data-net-5
    - dstack-gpu-data-net-6
    - dstack-gpu-data-net-7
    - dstack-gpu-data-net-8
    regions: [europe-west4]
    vm_service_account: a3mega-sa@$MYPROJECT.iam.gserviceaccount.com # Replace $MYPROJECT
    creds:
      type: default
```

</div>

??? info "Create extra VPCs"
    Create eight VPC networks for GPUDirect-TCPXO in your project, each with a subnet and a firewall rule:

    ```shell
    # Specify the region where you intend to deploy the cluster
    REGION="europe-west4"

    for N in $(seq 1 8); do
      gcloud compute networks create dstack-gpu-data-net-$N \
        --subnet-mode=custom \
        --mtu=8244

      gcloud compute networks subnets create dstack-gpu-data-sub-$N \
        --network=dstack-gpu-data-net-$N \
        --region=$REGION \
        --range=192.168.$N.0/24

      gcloud compute firewall-rules create dstack-gpu-data-internal-$N \
        --network=dstack-gpu-data-net-$N \
        --action=ALLOW \
        --rules=tcp:0-65535,udp:0-65535,icmp \
        --source-ranges=192.168.0.0/16
    done
    ```
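
    Optionally, you can double-check that the networks were created (a quick sanity check; the `--filter` expression simply matches the names used above):

    ```shell
    # List the eight data VPCs created above
    gcloud compute networks list --filter="name~dstack-gpu-data-net"
    ```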

??? info "Create Service Account"
    Create a VM service account that allows VMs to access the `pkg.dev` registry:

    ```shell
    PROJECT_ID=$(gcloud config get-value project)

    gcloud iam service-accounts create a3mega-sa \
      --display-name "Service Account for pulling GCR images"

    gcloud projects add-iam-policy-binding $PROJECT_ID \
      --member="serviceAccount:a3mega-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
      --role="roles/artifactregistry.reader"
    ```
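
If you manage the `dstack` server yourself, restart it after updating `~/.dstack/server/config.yml` so the new `gcp` backend settings take effect (a minimal sketch, assuming a locally running open-source server):

```shell
# Start (or restart) the dstack server so it picks up the updated backend configuration
dstack server
```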

## Create A3 Mega fleet

Once you've configured the `gcp` backend, create the fleet configuration:

<div editor-title="fleet.dstack.yml">

```yaml
type: fleet
name: a3mega-cluster
nodes: 2
placement: cluster
instance_types:
- a3-megagpu-8g
spot_policy: auto
```

</div>

and apply the configuration:

<div class="termy">

```shell
$ dstack apply -f examples/misc/a3mega-clusters/fleet.dstack.yml
 Project        main
 User           admin
 Configuration  examples/misc/a3mega-clusters/fleet.dstack.yml
 Type           fleet
 Fleet type     cloud
 Nodes          2
 Placement      cluster
 Resources      2..xCPU, 8GB.., 100GB.. (disk)
 Spot policy    auto

 #  BACKEND  REGION        INSTANCE       RESOURCES         SPOT  PRICE
 1  gcp      europe-west4  a3-megagpu-8g  208xCPU, 1872GB,  yes   $22.1525
                                          8xH100 (80GB),
                                          100.0GB (disk)
 2  gcp      europe-west4  a3-megagpu-8g  208xCPU, 1872GB,  no    $64.2718
                                          8xH100 (80GB),
                                          100.0GB (disk)

Fleet a3mega-cluster does not exist yet.
Create the fleet? [y/n]: y

Provisioning...
---> 100%
```

</div>

`dstack` will provision two A3 Mega nodes with GPUDirect-TCPXO configured.
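
At any point, you can check the fleet and its instances with the `fleet` CLI (assuming a `dstack` version that includes fleet support, which this example already relies on):

```shell
# List fleets and their instances
dstack fleet
```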

## Run NCCL Tests with GPUDirect-TCPXO support

Once the nodes are provisioned, let's test the network by running NCCL Tests:

<div class="termy">

```shell
$ dstack apply -f examples/misc/a3mega-clusters/nccl-tests.dstack.yml

nccl-tests provisioning completed (running)
nThread 1 nGpus 1 minBytes 8388608 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 200 agg iters: 1 validation: 0 graph: 0
                                                     out-of-place                       in-place
       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
    8388608        131072     float    none      -1    394.2   21.28   19.95     N/A    392.7   21.36   20.03     N/A
   16777216        262144     float    none      -1    437.8   38.32   35.92     N/A    434.1   38.65   36.24     N/A
   33554432        524288     float    none      -1    479.5   69.98   65.61     N/A    479.9   69.92   65.55     N/A
   67108864       1048576     float    none      -1    755.8   88.79   83.24     N/A    771.9   86.94   81.51     N/A
  134217728       2097152     float    none      -1   1125.3  119.27  111.81     N/A   1121.8  119.64  112.16     N/A
  268435456       4194304     float    none      -1   1741.3  154.16  144.53     N/A   1742.2  154.08  144.45     N/A
  536870912       8388608     float    none      -1   2854.9  188.05  176.30     N/A   2869.8  187.08  175.38     N/A
 1073741824      16777216     float    none      -1   5536.1  193.95  181.83     N/A   5528.8  194.21  182.07     N/A
 2147483648      33554432     float    none      -1    10853  197.88  185.51     N/A    10830  198.29  185.90     N/A
 4294967296      67108864     float    none      -1    21491  199.85  187.36     N/A    21466  200.09  187.58     N/A
 8589934592     134217728     float    none      -1    42770  200.84  188.29     N/A    42752  200.93  188.37     N/A
Out of bounds values : 0 OK
Avg bus bandwidth    : 125.436

Done
```

</div>

The measured network bandwidth should be close to the maximum bandwidth supported by GCP.

## Run NCCL workloads with GPUDirect-TCPXO support

To take full advantage of GPUDirect-TCPXO in your workloads, you need to properly set up the [NCCL environment variables](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot#environment-variables-nccl).
This can be done with the following commands in your run configuration:

<div editor-title="task.dstack.yml">

```yaml
type: task
nodes: 2
commands:
  - |
    NCCL_LIB_DIR="/var/lib/tcpxo/lib64"
    source ${NCCL_LIB_DIR}/nccl-env-profile.sh
    export NCCL_FASTRAK_CTRL_DEV=enp0s12
    export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
    export NCCL_SOCKET_IFNAME=enp0s12
    export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY="/dev/aperture_devices"
    export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"
    # run NCCL
resources:
  # Allocate some shared memory for NCCL
  shm_size: 16GB
```

</div>
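
For example, the `# run NCCL` placeholder above could be replaced with your distributed launch command. Below is a minimal sketch using `torchrun`; the file name, image, `train.py` script, and master port are illustrative assumptions, and `entrypoint: "bash -c"` is set because `nccl-env-profile.sh` is sourced with `bash` rather than the default `dash`, as in the NCCL Tests configuration:

<div editor-title="train.dstack.yml">

```yaml
type: task
name: train
nodes: 2
image: nvcr.io/nvidia/pytorch:24.04-py3
entrypoint: "bash -c" # `source` requires bash instead of the default dash
commands:
  - |
    # Set up the TCPXO NCCL env variables as shown above
    NCCL_LIB_DIR="/var/lib/tcpxo/lib64"
    source ${NCCL_LIB_DIR}/nccl-env-profile.sh
    export NCCL_FASTRAK_CTRL_DEV=enp0s12
    export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
    export NCCL_SOCKET_IFNAME=enp0s12
    export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY="/dev/aperture_devices"
    export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"
    # Use the first node's IP as the rendezvous address
    # (DSTACK_NODES_IPS lists the node IPs one per line, as used for the mpirun hostfile)
    MASTER_ADDR=$(echo "${DSTACK_NODES_IPS}" | head -n 1)
    # Launch the (hypothetical) training script on every node
    torchrun \
      --nnodes=${DSTACK_NODES_NUM} \
      --nproc_per_node=${DSTACK_GPUS_PER_NODE} \
      --node_rank=${DSTACK_NODE_RANK} \
      --master_addr=${MASTER_ADDR} \
      --master_port=29500 \
      train.py
resources:
  # Allocate some shared memory for NCCL
  shm_size: 16GB
```

</div>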

## Source code

The source code for this example can be found in
[`examples/misc/a3mega-clusters` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/misc/a3mega-clusters).

examples/misc/a3mega-clusters/fleet.dstack.yml

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
type: fleet
name: a3mega-cluster
nodes: 2
placement: cluster
instance_types:
- a3-megagpu-8g
spot_policy: auto

examples/misc/a3mega-clusters/nccl-tests.dstack.yml

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
type: task
name: nccl-tests
nodes: 2
image: nvcr.io/nvidia/pytorch:24.04-py3
entrypoint: "bash -c" # Need to use bash instead of the default dash for nccl-env-profile.sh
commands:
  - |
    # Set up TCPXO NCCL env variables
    NCCL_LIB_DIR="/var/lib/tcpxo/lib64"
    source ${NCCL_LIB_DIR}/nccl-env-profile.sh
    export NCCL_FASTRAK_CTRL_DEV=enp0s12
    export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
    export NCCL_SOCKET_IFNAME=enp0s12
    export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY="/dev/aperture_devices"
    export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"
    # Build NCCL Tests
    git clone https://github.com/NVIDIA/nccl-tests.git
    cd nccl-tests
    MPI=1 CC=mpicc CXX=mpicxx make -j
    cd build
    # We use a FIFO for inter-node coordination
    FIFO=/tmp/dstack_job
    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
      sleep 10
      echo "${DSTACK_NODES_IPS}" > hostfile
      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
      # Wait for the other nodes
      while true; do
        if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
          break
        fi
        echo 'Waiting for nodes...'
        sleep 5
      done
      # Run NCCL Tests
      ${MPIRUN} \
        -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
        --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 \
        $(env | awk -F= '{print "-x", $1}' | xargs) \
        ./all_gather_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 200 -c 0;
      # Notify the other nodes that the job is done
      ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
    else
      mkfifo ${FIFO}
      # Wait for the completion message from the first node
      cat ${FIFO}
    fi
resources:
  shm_size: 16GB
