
Commit a0fefc2

Add NCCL2 dist train design doc (#11885)
* add_nccl2_dist_design
* update
* update by comments
1 parent 78790ed commit a0fefc2

File tree

3 files changed: +35 -0 lines changed

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# Distributed Training with NCCL2

We design a pattern that enables training with `ParallelExecutor` using
[NCCL2](https://developer.nvidia.com/nccl) as its collective
communication library.

In `ParallelExecutor` we can use `AllReduce`, or `Reduce` and `Broadcast`,
to do multi-GPU training. If we initialize the NCCL2 communicators as
ranks in a distributed environment, we can simply run the `ParallelExecutor`
as a distributed program! The only difference from the single-node version
is that we need to broadcast the NCCL unique ID to all nodes and
initialize the communicators using that ID, so that the NCCL2 ranks know
about each other.
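
The core idea can be pictured with a minimal sketch of the NCCL2
bootstrap, written here against CuPy's `cupy.cuda.nccl` bindings as a
stand-in (the actual `ParallelExecutor` uses the C++ NCCL API directly;
the helper name and the shapes below are illustrative):

```python
# Sketch only: every process creates one communicator per local GPU,
# passing the *global* nranks, its *global* rank, and the shared unique ID.
import cupy
from cupy.cuda import nccl

def init_comm(nranks, rank, gpu_id, unique_id):
    # Blocks until all `nranks` processes have called this with the same ID.
    cupy.cuda.Device(gpu_id).use()
    return nccl.NcclCommunicator(nranks, unique_id, rank)

# Rank 0 generates the unique ID; every other rank must receive the very
# same value before it can initialize its communicator.
uid = nccl.get_unique_id()
comm = init_comm(nranks=8, rank=0, gpu_id=0, unique_id=uid)

# Once all 8 ranks have joined, a gradient AllReduce spans both machines.
grad = cupy.ones(1024, dtype=cupy.float32)
comm.allReduce(grad.data.ptr, grad.data.ptr, grad.size,
               nccl.NCCL_FLOAT, nccl.NCCL_SUM,
               cupy.cuda.Stream.null.ptr)
```

Distributing the unique ID to all trainers is exactly the job of the
`gen_nccl_id` op described below.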

To achieve this, we introduce a new operator, the `gen_nccl_id` op, so
that we are ***not*** bound to running NCCL2 with MPI: we can run it on
whatever platform you like.

It has two running modes:

1. Generate-and-broadcast mode, which should be used on trainer 0;
1. Listen-and-fetch mode, which should be used on trainers other than 0.

In both modes, the op saves the NCCL ID into the current scope as a
persistable variable. We can then insert this op at the end of the
"startup program" of fluid, so that all workers get the same ID to
initialize their NCCL communicator objects.
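
The handshake performed by the op can be sketched with plain Python
sockets (illustrative only: the helper names, port handling, and raw TCP
stand in for whatever transport the real op uses, and the real op writes
the ID into the fluid scope instead of returning it):

```python
# Sketch of the two gen_nccl_id running modes, using stdlib sockets.
import socket

NCCL_UNIQUE_ID_BYTES = 128  # size of ncclUniqueId in nccl.h

def generate_and_broadcast(nccl_id, trainer_endpoints):
    """Trainer 0: push the locally generated unique ID to every other trainer."""
    for host, port in trainer_endpoints:
        with socket.create_connection((host, port)) as conn:
            conn.sendall(nccl_id)

def listen_and_fetch(bind_port):
    """Trainers other than 0: wait for trainer 0 to push the unique ID."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("0.0.0.0", bind_port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            nccl_id = b""
            while len(nccl_id) < NCCL_UNIQUE_ID_BYTES:
                chunk = conn.recv(NCCL_UNIQUE_ID_BYTES - len(nccl_id))
                if not chunk:
                    break
                nccl_id += chunk
            return nccl_id
```

Because the op sits at the end of the startup program, every worker runs
exactly one of these two paths once, before the main training program
starts.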

<img src="src/ncc2_design.png">

The figure above shows the general process of distributed training with
NCCL2. Each trainer has as many communicators as it has GPUs, but their
ranks must match the global rank numbers: here we have 8 GPUs in total,
so `nranks==8`; the ranks are 0 ~ 3 on trainer 0 and 4 ~ 7 on trainer 1.
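
The rank assignment is plain arithmetic; here is a short sketch of the
mapping for this 2-trainer, 4-GPUs-per-trainer example (constant and
function names are illustrative):

```python
# Global rank layout for 2 trainers with 4 GPUs each (nranks == 8).
NUM_TRAINERS = 2
GPUS_PER_TRAINER = 4

def global_rank(trainer_id, gpu_id):
    """Rank used to initialize the NCCL communicator of one device."""
    return trainer_id * GPUS_PER_TRAINER + gpu_id

nranks = NUM_TRAINERS * GPUS_PER_TRAINER  # == 8
assert [global_rank(0, g) for g in range(4)] == [0, 1, 2, 3]  # trainer 0
assert [global_rank(1, g) for g in range(4)] == [4, 5, 6, 7]  # trainer 1
```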
Binary file not shown (91.6 KB)

0 commit comments
