
Commit d82453f

fix typo (#12896)
1 parent 7570e5e commit d82453f

2 files changed: +16 -16 lines

Lines changed: 6 additions & 6 deletions
@@ -1,22 +1,22 @@
 # Distributed Training with NCCL2
 
 We design a pattern that can enable training with `ParallelExecutor` and
-using [NCCL2](https://developer.nvidia.com/nccl) as it's collective
+use [NCCL2](https://developer.nvidia.com/nccl) as it's collective
 communication library.
 
 In `ParallelExecutor` we can use `AllReduce` or `Reduce` and `Broadcast`
 to do multi GPU training. And if we initialize NCCL2 communicators as
 ranks in a distributed environment, we can simply run the `ParallelExecutor`
 as a distributed program! The only thing that may be different than in
 the single node version is that we need to broadcast the NCCL unique ID
-to all the nodes, and initialize communicators using that ID, so NCCL2
-will know each other as ranks.
+to all the nodes and initialize communicators using that ID, so NCCL2
+can know each other as ranks.
 
 To achieve this feature, we introduce a new operator: `gen_nccl_id` op,
 so we are ***not*** "bind to" running NCCL2 with MPI, we can run it in
-what ever platform you like.
+whatever platform you like.
 
-It have two running modes:
+It has two running modes:
 
 1. Generate and broadcast mode, which should be used on trainer 0;
 1. Listen and fetch mode, which should be used on trainers other than 0.
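The hunk above describes `gen_nccl_id` only in prose, so here is a minimal, hedged C++ sketch of the underlying idea rather than Paddle's actual operator: the process on trainer 0 generates an NCCL unique ID and serves it over a plain TCP socket (the "generate and broadcast" mode), every other trainer connects and fetches the same bytes (the "listen and fetch" mode), and all processes then join one communicator with `ncclCommInitRank`. The port number, command-line arguments, and helper names are invented for illustration, and retries/error handling are omitted.

```cpp
// Minimal sketch (NOT Paddle's implementation) of the gen_nccl_id idea:
// trainer 0 generates an ncclUniqueId and serves it over TCP; the other
// trainers fetch it; then every process joins one NCCL communicator.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdio>
#include <cstdlib>

#include <cuda_runtime.h>
#include <nccl.h>

static const int kPort = 6175;  // arbitrary port chosen for the ID exchange

// "Generate and broadcast" side: accept `peers` connections, send the raw ID.
void ServeUniqueId(const ncclUniqueId& id, int peers) {
  int srv = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = INADDR_ANY;
  addr.sin_port = htons(kPort);
  bind(srv, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(srv, peers);
  for (int i = 0; i < peers; ++i) {
    int conn = accept(srv, nullptr, nullptr);
    send(conn, &id, sizeof(id), 0);
    close(conn);
  }
  close(srv);
}

// "Listen and fetch" side: connect to trainer 0 and read the same ID bytes.
void FetchUniqueId(const char* trainer0_ip, ncclUniqueId* id) {
  int conn = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(kPort);
  inet_pton(AF_INET, trainer0_ip, &addr.sin_addr);
  connect(conn, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  recv(conn, id, sizeof(*id), MSG_WAITALL);
  close(conn);
}

int main(int argc, char** argv) {
  // Hypothetical usage: ./nccl_id_demo <rank> <nranks> <trainer0_ip>
  if (argc != 4) return 1;
  int rank = std::atoi(argv[1]);
  int nranks = std::atoi(argv[2]);

  ncclUniqueId id;
  if (rank == 0) {
    ncclGetUniqueId(&id);           // generate ...
    ServeUniqueId(id, nranks - 1);  // ... and broadcast
  } else {
    FetchUniqueId(argv[3], &id);    // listen and fetch
  }

  cudaSetDevice(0);  // one GPU per process in this toy setup
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);  // same ID, distinct ranks
  std::printf("rank %d/%d joined the NCCL communicator\n", rank, nranks);
  ncclCommDestroy(comm);
  return 0;
}
```

Build with something like `nvcc nccl_id_demo.cc -lnccl` (paths assumed) and start the rank-0 process first so the other ranks have a listener to connect to.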
@@ -29,7 +29,7 @@ initialize NCCL communicator objects.
 <img src="src/ncc2_design.png">
 
 The above figure indicates the general process when training with NCCL2
-distributed. Each trainer have the number of communicators equal to the
+distributed. Each trainer has the number of communicators equal to the
 number of GPUs, but the ranks should match the global ranks number: here
 we have total 8 GPUs, so `nranks==8`, for each trainer, the ranks should
 be from 0 ~ 3 on trainer 0 and 4 ~ 7 on trainer 1.
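To make the rank layout in that figure concrete, here is a trivial helper (an assumed mapping for illustration, not code from the Paddle codebase) computing the global NCCL rank of a GPU when every trainer drives the same number of devices:

```cpp
// Assumed mapping, illustration only: with 2 trainers x 4 GPUs (nranks == 8),
// trainer 0 owns global ranks 0~3 and trainer 1 owns global ranks 4~7.
int GlobalRank(int trainer_id, int gpus_per_trainer, int local_gpu) {
  return trainer_id * gpus_per_trainer + local_gpu;
}
// e.g. GlobalRank(1, 4, 2) == 6
```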

doc/fluid/howto/cluster/nccl2_rdma_training.md

Lines changed: 10 additions & 10 deletions
@@ -1,12 +1,12 @@
 # Distributed Training with NCCL2 and RDMA
 
-When doing distributed multi-GPU training, network bandwith often becomes the
-bottle neck. We introduce a way to use NCCL2 to do such training job to
-achieve best performace.
+When doing distributed multi-GPU training, network bandwidth often becomes the
+bottleneck. We introduce a way to use NCCL2 to do such training job to
+achieve best performance.
 
-## Prepare Hardwares with RDMA and Multiple GPUs
+## Prepare Hardware with RDMA and Multiple GPUs
 
-I'm using two Linux servers each of them is installed with 8 GPUs and
+I'm using two Linux servers each of them installed with 8 GPUs and
 one 100Gb RDMA card.
 Base environment is:
 
@@ -25,15 +25,15 @@ In general, the steps including:
 1. Use docker to run tests and make sure GPUs and RDMA can work inside
    the container.
 
-I'll ommit section "Install GPU drivers" because we can find it easily
+I'll omit the section "Install GPU drivers" because we can find it easily
 somewhere else.
 
 ### Install RDMA drivers
 
 For my case, I've got two machines with device
 "Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS was
 "CentOS 7.4" and I updated the kernel to version 4.4 so that docker can
-work with latest overlay2 filesystem.
+work with the latest overlay2 filesystem.
 
 ***NOTE: before you start, make sure you have a way to get a console
 of the server other than ssh because we may need to re-configure the
@@ -45,22 +45,22 @@ network device.***
 1. Run `./mlnxofedinstall --add-kernel-support` in the software package.
 1. Run `/etc/init.d/openibd restart` to make everything work, note that
    this operation may cause the network goes down if you are using this
-   RDMA device as default network device and use ssh to login the server.
+   RDMA device as default network device and use ssh to log in the server.
 1. Re-configure the network interface, for example:
    `ifconfig eth2 192.168.16.30/20 up`, then add routes if needed:
    `ip route add default via 192.168.16.1 dev eth2`.
 1. Do the same thing on the other node.
 1. Use `ping` to test if the two nodes have typical ICMP connection.
 1. Use either `udaddy` or `ib_write_bw` to test the network connection is
-   ready and have the desired bandwith.
+   ready and have the desired bandwidth.
 
 ### Prepare Docker Image to Run RDMA Programs
 
 1. Build a docker image using cuda base image like: `nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04` and install paddlepaddle whl
    package in it.
 1. Start a docker container and mount GPU driver libs into it (you can
    skip this step if you are using nvidia-docker).
-1. Mount RDMA dirvers and libs into the docker image (see below section),
+1. Mount RDMA drivers and libs into the docker image (see below section),
    also `udaddy` and `ib_write_bw` if needed.
 1. Mount GPU devices and RDMA devices into the container using `--device`
    or just use privileged mode `--privileged`.
