|
| 1 | +# NCCL/RCCL tests |
| 2 | + |
| 3 | +This example shows how to run [NCCL](https://github.com/NVIDIA/nccl-tests) or [RCCL](https://github.com/ROCm/rccl-tests) tests on a cluster using [distributed tasks](https://dstack.ai/docs/concepts/tasks#distributed-tasks). |
| 4 | + |
| 5 | +!!! info "Prerequisites" |
| 6 | + Before running a distributed task, make sure to create a fleet with `placement` set to `cluster` (can be a [managed fleet](https://dstack.ai/docs/concepts/fleets#backend-placement) or an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh-placement)). |
| 7 | + |
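As a rough illustration of the prerequisite above, a managed fleet with cluster placement might be declared as follows. This is a minimal sketch, not a definitive configuration: the fleet name `my-cluster-fleet` and the GPU spec are assumptions chosen to match the 2-node, 4-GPU example, and should be adjusted to your backend.

```yaml
type: fleet
# Hypothetical name, used here for illustration only
name: my-cluster-fleet

# Two instances, matching `nodes: 2` in the task examples
nodes: 2
# Ensures the instances are interconnected, as distributed tasks require
placement: cluster

resources:
  # Assumed GPU spec; change it to what your backend offers
  gpu: nvidia:4
```

Saved to a file of your choosing, the fleet would be created with `dstack apply -f <file>` before submitting the task.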
| 8 | +## Running as a task |
| 9 | + |
| 10 | +Here's an example of a task that runs the AllReduce test on 2 nodes, each with 4 GPUs (8 processes in total). |
| 11 | + |
| 12 | +=== "NCCL tests" |
| 13 | + |
| 14 | + <div editor-title="examples/clusters/nccl-rccl-tests/nccl-tests.dstack.yml"> |
| 15 | + |
| 16 | + ```yaml |
| 17 | + type: task |
| 18 | + name: nccl-tests |
| 19 | + |
| 20 | + nodes: 2 |
| 21 | + |
| 22 | + startup_order: workers-first |
| 23 | + stop_criteria: master-done |
| 24 | + |
| 25 | + env: |
| 26 | + - NCCL_DEBUG=INFO |
| 27 | + commands: |
| 28 | + - | |
| 29 | + if [ $DSTACK_NODE_RANK -eq 0 ]; then |
| 30 | + mpirun \ |
| 31 | + --allow-run-as-root \ |
| 32 | + --hostfile $DSTACK_MPI_HOSTFILE \ |
| 33 | + -n $DSTACK_GPUS_NUM \ |
| 34 | + -N $DSTACK_GPUS_PER_NODE \ |
| 35 | + --bind-to none \ |
| 36 | + /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 |
| 37 | + else |
| 38 | + sleep infinity |
| 39 | + fi |
| 40 | + |
| 41 | + # Uncomment if the `kubernetes` backend requires it for `/dev/infiniband` access |
| 42 | + #privileged: true |
| 43 | + |
| 44 | + resources: |
| 45 | + gpu: nvidia:1..8 |
| 46 | + shm_size: 16GB |
| 47 | + ``` |
| 48 | + |
| 49 | + </div> |
| 50 | + |
| 51 | + !!! info "Default image" |
| 52 | + If you don't specify `image`, `dstack` uses its [base](https://github.com/dstackai/dstack/tree/master/docker/base) Docker image pre-configured with |
| 53 | + `uv`, `python`, `pip`, essential CUDA drivers, `mpirun`, and NCCL tests (under `/opt/nccl-tests/build`). |
| 54 | + |
| 55 | +=== "RCCL tests" |
| 56 | + |
| 57 | + <div editor-title="examples/clusters/nccl-rccl-tests/rccl-tests.dstack.yml"> |
| 58 | + |
| 59 | + ```yaml |
| 60 | + type: task |
| 61 | + name: rccl-tests |
| 62 | + |
| 63 | + nodes: 2 |
| 64 | + startup_order: workers-first |
| 65 | + stop_criteria: master-done |
| 66 | + |
| 67 | + # Mount the system libraries folder from the host |
| 68 | + volumes: |
| 69 | + - /usr/local/lib:/mnt/lib |
| 70 | + |
| 71 | + image: rocm/dev-ubuntu-22.04:6.4-complete |
| 72 | + env: |
| 73 | + - NCCL_DEBUG=INFO |
| 74 | + - OPEN_MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi |
| 75 | + commands: |
| 76 | + # Setup MPI and build RCCL tests |
| 77 | + - apt-get install -y git libopenmpi-dev openmpi-bin |
| 78 | + - git clone https://github.com/ROCm/rccl-tests.git |
| 79 | + - cd rccl-tests |
| 80 | + - make MPI=1 MPI_HOME=$OPEN_MPI_HOME |
| 81 | + |
| 82 | + # Preload the RoCE driver library from the host (for Broadcom driver compatibility) |
| 83 | + - export LD_PRELOAD=/mnt/lib/libbnxt_re-rdmav34.so |
| 84 | + |
| 85 | + # Run RCCL tests via MPI |
| 86 | + - | |
| 87 | + if [ $DSTACK_NODE_RANK -eq 0 ]; then |
| 88 | + mpirun --allow-run-as-root \ |
| 89 | + --hostfile $DSTACK_MPI_HOSTFILE \ |
| 90 | + -n $DSTACK_GPUS_NUM \ |
| 91 | + -N $DSTACK_GPUS_PER_NODE \ |
| 92 | + --mca btl_tcp_if_include ens41np0 \ |
| 93 | + -x LD_PRELOAD \ |
| 94 | + -x NCCL_IB_HCA=mlx5_0/1,bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 \ |
| 95 | + -x NCCL_IB_GID_INDEX=3 \ |
| 96 | + -x NCCL_IB_DISABLE=0 \ |
| 97 | + ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 20 -c 0; |
| 98 | + else |
| 99 | + sleep infinity |
| 100 | + fi |
| 101 | + |
| 102 | + resources: |
| 103 | + gpu: MI300X:8 |
| 104 | + ``` |
| 105 | + |
| 106 | + </div> |
| 107 | + |
| 108 | + !!! info "RoCE library" |
| 109 | + Broadcom RoCE drivers require the `libbnxt_re` userspace library inside the container to be compatible with the host’s Broadcom |
| 110 | + kernel driver `bnxt_re`. To ensure this compatibility, we mount `libbnxt_re-rdmav34.so` from the host and preload it |
| 111 | + using `LD_PRELOAD` when running MPI. |
| 112 | + |
| 113 | + |
| 114 | +!!! info "Privileged" |
| 115 | + In some cases, the backend (e.g., `kubernetes`) may require `privileged: true` to access the high-speed interconnect (e.g., InfiniBand). |
| 116 | + |
| 117 | +### Apply a configuration |
| 118 | + |
| 119 | +To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply/) command. |
| 120 | + |
| 121 | +<div class="termy"> |
| 122 | + |
| 123 | +```shell |
| 124 | +$ dstack apply -f examples/clusters/nccl-rccl-tests/nccl-tests.dstack.yml |
| 125 | + |
| 126 | + # BACKEND REGION INSTANCE RESOURCES SPOT PRICE |
| 127 | + 1 aws us-east-1 g4dn.12xlarge 48xCPU, 192GB, 4xT4 (16GB), 100.0GB (disk) no $3.912 |
| 128 | + 2 aws us-west-2 g4dn.12xlarge 48xCPU, 192GB, 4xT4 (16GB), 100.0GB (disk) no $3.912 |
| 129 | + 3 aws us-east-2 g4dn.12xlarge 48xCPU, 192GB, 4xT4 (16GB), 100.0GB (disk) no $3.912 |
| 130 | + |
| 131 | +Submit the run nccl-tests? [y/n]: y |
| 132 | +``` |
| 133 | + |
| 134 | +</div> |
| 135 | + |
| 136 | +## Source code |
| 137 | + |
| 138 | +The source code for this example can be found in |
| 139 | +[`examples/clusters/nccl-rccl-tests`](https://github.com/dstackai/dstack/blob/master/examples/clusters/nccl-rccl-tests). |
| 140 | + |
| 141 | +## What's next? |
| 142 | + |
| 143 | +1. Check [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), |
| 144 | + [services](https://dstack.ai/docs/concepts/services), and [fleets](https://dstack.ai/docs/concepts/fleets). |