Commit 535245c
Add NCCL2 rdma train doc (#10561)
* add rdma train doc * update by comment * fix table
1 parent 705e734

1 file changed: 110 additions, 0 deletions
# Distributed Training with NCCL2 and RDMA

When doing distributed multi-GPU training, network bandwidth often becomes the
bottleneck. We introduce a way to use NCCL2 for such training jobs to
achieve the best performance.

## Prepare Hardware with RDMA and Multiple GPUs

I'm using two Linux servers, each installed with 8 GPUs and
one 100Gb RDMA card.
The base environment is:

* OS: CentOS 7.4
* RDMA device: "Mellanox Technologies MT27700 Family [ConnectX-4]"
* Kernel version: `4.4.88-1.el7.elrepo.x86_64`
* Docker version: `1.12.6`
* Docker storage driver: `overlay2`
* IP addresses: 192.168.16.30, 192.168.16.34

In general, the steps include:

1. Install GPU drivers
1. Install RDMA drivers
1. Install "InfiniBand Support"
1. Use docker to run tests and make sure GPUs and RDMA work inside
the container.

I'll omit the "Install GPU drivers" section because instructions are easy to
find elsewhere.

### Install RDMA drivers

In my case, I've got two machines with the device
"Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS was
CentOS 7.4, and I updated the kernel to version 4.4 so that docker can
work with the latest overlay2 filesystem.

***NOTE: before you start, make sure you have a way to get a console
on the server other than ssh, because we may need to re-configure the
network device.***

1. Go to http://www.mellanox.com/page/products_dyn?product_family=26,
download the `MLNX_OFED` software at the bottom of the page, and upload it
to the server.
1. Run `./mlnxofedinstall --add-kernel-support` from the software package.
1. Run `/etc/init.d/openibd restart` to apply the changes. Note that
this operation may bring the network down if you are using this
RDMA device as the default network device and are logged in over ssh.
1. Re-configure the network interface, for example:
`ifconfig eth2 192.168.16.30/20 up`, then add routes if needed:
`ip route add default via 192.168.16.1 dev eth2`.
1. Do the same on the other node.
1. Use `ping` to check that the two nodes have basic ICMP connectivity.
1. Use either `udaddy` or `ib_write_bw` to verify that the RDMA connection
works and delivers the desired bandwidth.
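
The bandwidth check in the last step can be sketched like this. The device
name `mlx5_0` is an assumption for ConnectX-4 cards; list the real names on
your host with `ibv_devices`:

```bash
# Sketch of the RDMA bandwidth check. Start the listener on the server node
# first (`ib_write_bw -d mlx5_0`), then connect from the other node.
if command -v ib_write_bw >/dev/null 2>&1; then
  # Client side: connect to the server node and measure write bandwidth.
  ib_write_bw -d mlx5_0 192.168.16.30
  status="ran ib_write_bw against 192.168.16.30"
else
  status="ib_write_bw not found; it ships with the MLNX_OFED perftest tools"
fi
echo "${status}"
```

On a healthy 100Gb link the reported bandwidth should be close to line rate.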

### Prepare Docker Image to Run RDMA Programs

1. Build a docker image from a CUDA base image like `nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04` and install the paddlepaddle whl
package in it.
1. Start a docker container and mount the GPU driver libs into it (you can
skip this step if you are using nvidia-docker).
1. Mount RDMA drivers and libs into the container (see the section below),
and also `udaddy` and `ib_write_bw` if needed.
1. Mount GPU devices and RDMA devices into the container using `--device`,
or just use privileged mode `--privileged`.
1. Start the container in host network mode: `--net=host`
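
Putting the steps together, the launch command looks roughly like this. The
image name `paddle-rdma:dev` is a placeholder for whatever you built in
step 1, and the sketch only prints the command so it can be reviewed before
running (with `--privileged`, the individual `--device` flags are not needed):

```bash
# Compose and print the container launch command described above.
RDMA_LIB_DIR=/usr/lib64/mlnx_ofed/valgrind
run_cmd="docker run -it --net=host --privileged \
-v ${RDMA_LIB_DIR}:${RDMA_LIB_DIR}:ro \
paddle-rdma:dev /bin/bash"
echo "${run_cmd}"
```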

### RDMA Library Files Needed

Usually, `MLNX_OFED` installs the latest supported libs under
`/usr/lib64/mlnx_ofed/valgrind`. Other libs needed to run RDMA programs
are listed below. All of these libs must be mounted into the docker container.

* Libs under `/usr/lib64/mlnx_ofed/valgrind`:
  * libibcm.so
  * libibverbs.so
  * libmlx4.so
  * libmlx5.so
  * libmlx5-rdmav2.so
  * librdmacm.so
* Other libs:
  * libnl-3.so.200
  * libnl-route-3.so.200
  * libnuma.so.1
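
The list above can be turned into `-v` mount flags with a small loop. The
`/usr/lib64` location for the libnl and libnuma libs matches CentOS 7;
adjust it if your distribution keeps them elsewhere:

```bash
# Build the -v flags that mount every required RDMA lib into the container.
mlnx_dir=/usr/lib64/mlnx_ofed/valgrind
mounts=""
for lib in libibcm.so libibverbs.so libmlx4.so libmlx5.so libmlx5-rdmav2.so librdmacm.so; do
  mounts="${mounts} -v ${mlnx_dir}/${lib}:${mlnx_dir}/${lib}:ro"
done
for lib in libnl-3.so.200 libnl-route-3.so.200 libnuma.so.1; do
  mounts="${mounts} -v /usr/lib64/${lib}:/usr/lib64/${lib}:ro"
done
# Print the flags for inclusion in the docker run command.
echo "${mounts}"
```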

## Start to Run the Training Job

Set the following NCCL environment variables to turn NCCL features on and off:

| Env Name | Description |
| --- | --- |
| NCCL_SOCKET_IFNAME | The RDMA device, e.g. eth2 |
| NCCL_P2P_DISABLE | Set to 1 to disable P2P transfer between GPUs |
| NCCL_IB_DISABLE | Set to 1 to disable using RDMA |
| NCCL_IB_CUDA_SUPPORT | Set to 1 to enable GPU Direct if supported |
| NCCL_DEBUG | Set the debug level: VERSION, WARN, INFO |
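
For example, a launcher for the cluster above might set the variables like
this before starting the trainer (values follow the table: `eth2` is the
interface configured in the RDMA driver section):

```bash
# Use the RDMA-configured interface and keep the InfiniBand transport on.
export NCCL_SOCKET_IFNAME=eth2
export NCCL_IB_DISABLE=0         # 0 = RDMA enabled
export NCCL_IB_CUDA_SUPPORT=1    # enable GPU Direct if supported
export NCCL_DEBUG=INFO           # log transport selection at startup
# Show the resulting NCCL settings.
env | grep '^NCCL_' | sort
```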

My two servers are `192.168.16.30` and `192.168.16.34`. On node 1, run:

```bash
PADDLE_TRAINER_ID=0 PADDLE_PORT=48372 PADDLE_WORKERS=192.168.16.30,192.168.16.34 POD_IP=192.168.16.30 stdbuf -oL python vgg16.py
```

On node 2, run:

```bash
PADDLE_TRAINER_ID=1 PADDLE_PORT=48372 PADDLE_WORKERS=192.168.16.30,192.168.16.34 POD_IP=192.168.16.34 stdbuf -oL python vgg16.py
```
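
The only fields that differ between the two commands are `PADDLE_TRAINER_ID`
and `POD_IP`. A small sketch can derive the trainer id from the position of
the host's address in the worker list (`POD_IP` is hard-coded here for
illustration; in practice it would be this host's own address):

```bash
# Derive PADDLE_TRAINER_ID from this host's position in PADDLE_WORKERS.
PADDLE_WORKERS=192.168.16.30,192.168.16.34
POD_IP=192.168.16.30   # illustration only; use the real local address
id=0
for ip in $(echo "${PADDLE_WORKERS}" | tr ',' ' '); do
  if [ "${ip}" = "${POD_IP}" ]; then
    PADDLE_TRAINER_ID=${id}
  fi
  id=$((id + 1))
done
echo "PADDLE_TRAINER_ID=${PADDLE_TRAINER_ID} POD_IP=${POD_IP}"
```

With this, the same launch script can be copied to every node unchanged.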
