# Distributed Training with NCCL2 and RDMA

When doing distributed multi-GPU training, network bandwidth often becomes the
bottleneck. This document introduces how to run such training jobs with NCCL2 to
achieve the best performance.

## Prepare Hardware with RDMA and Multiple GPUs

I'm using two Linux servers, each with 8 GPUs and
one 100Gb RDMA card installed.
The base environment is:

* OS: CentOS 7.4
* RDMA device: "Mellanox Technologies MT27700 Family [ConnectX-4]"
* Kernel version: `4.4.88-1.el7.elrepo.x86_64`
* Docker version: `1.12.6`
* Docker storage driver: `overlay2`
* IP addresses: 192.168.16.30, 192.168.16.34

In general, the steps include:

1. Install GPU drivers
1. Install RDMA drivers
1. Install "InfiniBand Support"
1. Use docker to run tests and make sure GPUs and RDMA work inside
   the container.

I'll omit the "Install GPU drivers" section because guides for it are easy
to find elsewhere.

### Install RDMA Drivers

In my case, the two machines have the device
"Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS is
CentOS 7.4, and I updated the kernel to version 4.4 so that docker can
use the latest overlay2 storage driver.

***NOTE: before you start, make sure you have a way to get a console
on the server other than ssh, because we may need to re-configure the
network device.***

1. Go to http://www.mellanox.com/page/products_dyn?product_family=26,
   download the `MLNX_OFED` software at the bottom of the page, and upload it
   to the server.
1. Run `./mlnxofedinstall --add-kernel-support` in the software package.
1. Run `/etc/init.d/openibd restart` to make everything take effect. Note that
   this operation may bring the network down if you are using this
   RDMA device as the default network device and use ssh to log in to the server.
1. Re-configure the network interface, for example:
   `ifconfig eth2 192.168.16.30/20 up`, then add routes if needed:
   `ip route add default via 192.168.16.1 dev eth2`.
1. Do the same on the other node.
1. Use `ping` to test that the two nodes have basic ICMP connectivity.
1. Use either `udaddy` or `ib_write_bw` to verify that the RDMA connection
   works and delivers the desired bandwidth.

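The last two verification steps can be sketched as follows. `ib_write_bw` runs as a server/client pair; the IP addresses are the ones from my setup and will differ on your machines:

```shell
# On node 1 (192.168.16.30): start the ib_write_bw server side and wait.
ib_write_bw

# On node 2 (192.168.16.34): check basic ICMP connectivity first,
# then run the RDMA bandwidth test against node 1.
ping -c 3 192.168.16.30
ib_write_bw 192.168.16.30
```

If the reported bandwidth is far below the card's rated speed (here 100Gb), re-check the driver installation before moving on to docker.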
### Prepare Docker Image to Run RDMA Programs

1. Build a docker image from a cuda base image like
   `nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04` and install the paddlepaddle whl
   package in it.
1. Start a docker container and mount the GPU driver libs into it (you can
   skip this step if you are using nvidia-docker).
1. Mount the RDMA drivers and libs into the container (see the section below),
   plus `udaddy` and `ib_write_bw` if needed.
1. Mount the GPU devices and RDMA devices into the container using `--device`,
   or just use privileged mode `--privileged`.
1. Start the container using host network mode: `--net=host`.

### RDMA Library Files Needed

Usually, `MLNX_OFED` installs the latest supported libs under
`/usr/lib64/mlnx_ofed/valgrind`. The other libs needed to run RDMA programs
are listed below. All of these libs must be mounted into the docker container.

* Libs under `/usr/lib64/mlnx_ofed/valgrind`
  * libibcm.so
  * libibverbs.so
  * libmlx4.so
  * libmlx5.so
  * libmlx5-rdmav2.so
  * librdmacm.so
* Other libs:
  * libnl-3.so.200
  * libnl-route-3.so.200
  * libnuma.so.1

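Putting the container steps and library mounts together, the start command might look like the sketch below. The image name `paddle:rdma` is a placeholder for whatever you built; `--privileged` stands in for the individual `--device` mounts, and the lib paths match the list above:

```shell
# Start a privileged container with host networking and the RDMA libs
# bind-mounted from the host. Adjust image name and paths for your setup.
docker run -it --net=host --privileged \
    -v /usr/lib64/mlnx_ofed/valgrind:/usr/lib64/mlnx_ofed/valgrind \
    -v /usr/lib64/libnl-3.so.200:/usr/lib64/libnl-3.so.200 \
    -v /usr/lib64/libnl-route-3.so.200:/usr/lib64/libnl-route-3.so.200 \
    -v /usr/lib64/libnuma.so.1:/usr/lib64/libnuma.so.1 \
    paddle:rdma /bin/bash
```

Inside the container, run `ib_write_bw` again to confirm RDMA still works before starting the training job.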
## Start to Run the Training Job

Set the NCCL environment variables to turn NCCL switches on and off:

| Env Name | Description |
| --- | --- |
| NCCL_SOCKET_IFNAME | The RDMA device, e.g. eth2 |
| NCCL_P2P_DISABLE | Set to 1 to disable P2P transfer between GPUs |
| NCCL_IB_DISABLE | Set to 1 to disable using RDMA |
| NCCL_IB_CUDA_SUPPORT | Set to 1 to enable GPU Direct if supported |
| NCCL_DEBUG | Set the debug level: VERSION, WARN, INFO |

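For example, to run over RDMA on interface `eth2` with GPU Direct enabled and verbose logging, the variables from the table could be set like this (the interface name is from my setup):

```shell
# Use the RDMA-capable interface and keep the InfiniBand transport enabled.
export NCCL_SOCKET_IFNAME=eth2
export NCCL_IB_DISABLE=0
export NCCL_IB_CUDA_SUPPORT=1
# INFO prints per-rank transport selection, useful to confirm RDMA is used.
export NCCL_DEBUG=INFO
```

With `NCCL_DEBUG=INFO`, the training log should show which transport NCCL selected between the two nodes.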
My two servers are `192.168.16.30` and `192.168.16.34`. On node 1, run:

```bash
PADDLE_TRAINER_ID=0 PADDLE_PORT=48372 PADDLE_WORKERS=192.168.16.30,192.168.16.34 POD_IP=192.168.16.30 stdbuf -oL python vgg16.py
```

On node 2, run:

```bash
PADDLE_TRAINER_ID=1 PADDLE_PORT=48372 PADDLE_WORKERS=192.168.16.30,192.168.16.34 POD_IP=192.168.16.34 stdbuf -oL python vgg16.py
```