# Distributed Training with NCCL2 and RDMA

When doing distributed multi-GPU training, network bandwidth often becomes the
bottleneck. We introduce a way to use NCCL2 to do such a training job to
achieve the best performance.

## Prepare Hardware with RDMA and Multiple GPUs

I'm using two Linux servers, each of them installed with 8 GPUs and
one 100Gb RDMA card.
Base environment is:

In general, the steps include:

1. Use docker to run tests and make sure GPUs and RDMA can work inside
   the container.
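Once the container is up, a quick sanity check could look like the sketch below. This assumes `nvidia-smi` and `ibv_devinfo` (from the libibverbs utilities) are available inside the container; both come from the driver installs described in this article.

```shell
# Hypothetical in-container checks; requires the GPU and RDMA
# drivers/libs to be installed or mounted into the container.
nvidia-smi     # lists the GPUs the container can see
ibv_devinfo    # lists the RDMA devices and their port state
```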

I'll omit the section "Install GPU drivers" because we can find it easily
somewhere else.

### Install RDMA drivers

For my case, I've got two machines with the device
"Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS was
"CentOS 7.4" and I updated the kernel to version 4.4 so that docker can
work with the latest overlay2 filesystem.

***NOTE: before you start, make sure you have a way to get a console
of the server other than ssh, because we may need to re-configure the
network device.***

1. Run `./mlnxofedinstall --add-kernel-support` in the software package.
1. Run `/etc/init.d/openibd restart` to make everything work. Note that
   this operation may cause the network to go down if you are using this
   RDMA device as the default network device and ssh to log in to the server.
1. Re-configure the network interface, for example:
   `ifconfig eth2 192.168.16.30/20 up`, then add routes if needed:
   `ip route add default via 192.168.16.1 dev eth2`.
1. Do the same thing on the other node.
1. Use `ping` to test whether the two nodes have a typical ICMP connection.
1. Use either `udaddy` or `ib_write_bw` to test that the network connection is
   ready and has the desired bandwidth.
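The steps above can be sketched as one session. The interface name `eth2` and the addresses are taken from the example; the peer address `192.168.16.31` is an assumption for illustration.

```shell
# On each node, inside the extracted MLNX_OFED package directory:
./mlnxofedinstall --add-kernel-support   # rebuild drivers for the running kernel
/etc/init.d/openibd restart              # reload the RDMA stack; the link may go down here

# Bring the interface back up and restore routing:
ifconfig eth2 192.168.16.30/20 up
ip route add default via 192.168.16.1 dev eth2

# Verify connectivity and bandwidth between the two nodes:
ping -c 3 192.168.16.31                  # basic ICMP reachability to the peer
ib_write_bw                              # start the server side on one node
# then on the peer: ib_write_bw 192.168.16.30
```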

### Prepare Docker Image to Run RDMA Programs

1. Build a docker image using a cuda base image like `nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04` and install the paddlepaddle whl
   package in it.
1. Start a docker container and mount GPU driver libs into it (you can
   skip this step if you are using nvidia-docker).
1. Mount RDMA drivers and libs into the docker image (see the section below),
   also `udaddy` and `ib_write_bw` if needed.
1. Mount GPU devices and RDMA devices into the container using `--device`
   or just use privileged mode `--privileged`.
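Putting the last two steps together, a container start command might look like the sketch below. The image name `my_paddle_image` and the host library paths are assumptions that depend on your installation; `--privileged` could replace all of the `--device` flags.

```shell
# Hypothetical example; device nodes and library paths vary by host.
docker run -it \
  --device=/dev/nvidia0 --device=/dev/nvidiactl --device=/dev/nvidia-uvm \
  --device=/dev/infiniband/uverbs0 --device=/dev/infiniband/rdma_cm \
  -v /usr/local/nvidia:/usr/local/nvidia:ro \
  -v /usr/lib64/libibverbs.so.1:/usr/lib64/libibverbs.so.1:ro \
  my_paddle_image /bin/bash
```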