# Distributed Training with NCCL2

We design a pattern that enables training with `ParallelExecutor`,
using [NCCL2](https://developer.nvidia.com/nccl) as its collective
communication library.

In `ParallelExecutor` we can use `AllReduce` or `Reduce` and `Broadcast`
to do multi-GPU training. If we initialize the NCCL2 communicators as
ranks in a distributed environment, we can simply run the `ParallelExecutor`
as a distributed program! The only difference from the single-node
version is that we need to broadcast the NCCL unique ID to all the
nodes and initialize the communicators with that ID, so that the NCCL2
ranks will know about each other.
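
Roughly, each communicator is then created with `ncclCommInitRank`,
passing the same `ncclUniqueId`, the global `nranks`, and that
communicator's global rank. Below is a minimal C++/NCCL sketch of only
this step (error checking omitted; how the id travels between nodes is
left out, which is exactly the job of the operator introduced next):

```cpp
#include <cuda_runtime.h>
#include <nccl.h>

// Rank 0 creates the unique id once; every other rank must receive a
// byte-exact copy of it before joining the clique below.
ncclUniqueId CreateId() {
  ncclUniqueId id;
  ncclGetUniqueId(&id);
  return id;
}

// Called once per communicator with the shared id, the total number of
// ranks, this communicator's global rank and the local GPU it owns.
ncclComm_t InitRank(ncclUniqueId id, int nranks, int rank, int local_gpu) {
  ncclComm_t comm;
  cudaSetDevice(local_gpu);
  ncclCommInitRank(&comm, nranks, id, rank);
  return comm;
}
```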

To achieve this, we introduce a new operator: the `gen_nccl_id` op.
This way we are ***not*** bound to launching NCCL2 with MPI; we can run
it on whatever platform you like.

It has two running modes (both sketched below):

1. Generate and broadcast mode, which should be used on trainer 0;
1. Listen and fetch mode, which should be used on trainers other than 0.
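
This document does not fix the transport used to ship the id between
trainers; purely as an illustration, the sketch below uses plain TCP
sockets (hypothetical `GenerateAndBroadcast` and `ListenAndFetch`
helpers, error handling omitted) to make the two modes concrete:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <string>
#include <vector>

#include <nccl.h>

// Mode 1 (trainer 0): create the id, then push its raw bytes to every
// other trainer.
ncclUniqueId GenerateAndBroadcast(const std::vector<std::string>& peer_ips,
                                  int port) {
  ncclUniqueId id;
  ncclGetUniqueId(&id);
  for (const std::string& ip : peer_ips) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip.c_str(), &addr.sin_addr);
    connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    send(fd, &id, sizeof(id), 0);  // ship sizeof(ncclUniqueId) raw bytes
    close(fd);
  }
  return id;
}

// Mode 2 (all other trainers): block until trainer 0 delivers the id.
ncclUniqueId ListenAndFetch(int port) {
  int server = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = INADDR_ANY;
  addr.sin_port = htons(port);
  bind(server, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(server, 1);
  int conn = accept(server, nullptr, nullptr);
  ncclUniqueId id;
  recv(conn, &id, sizeof(id), MSG_WAITALL);
  close(conn);
  close(server);
  return id;
}
```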

In both modes, the op saves the NCCL ID into the current scope as a
persistable variable. We can then insert this op at the end of the
"startup program" of fluid, so that all workers get the same ID to
initialize their NCCL communicator objects.

<img src="src/ncc2_design.png">

The above figure shows the general process of distributed training with
NCCL2. Each trainer has as many communicators as it has GPUs, but the
ranks must follow the global rank numbering: here we have 8 GPUs in
total, so `nranks==8`; the ranks are 0 ~ 3 on trainer 0 and 4 ~ 7 on
trainer 1.
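
As a quick sanity check of this numbering (plain arithmetic implied by
the figure, not any specific API), the global rank of the communicator
created for a local GPU is `trainer_id * gpus_per_trainer + gpu_id`:

```cpp
#include <cstdio>

int main() {
  const int num_trainers = 2;      // trainers shown in the figure
  const int gpus_per_trainer = 4;  // GPUs (and communicators) per trainer
  const int nranks = num_trainers * gpus_per_trainer;  // == 8

  for (int trainer_id = 0; trainer_id < num_trainers; ++trainer_id) {
    for (int gpu_id = 0; gpu_id < gpus_per_trainer; ++gpu_id) {
      int rank = trainer_id * gpus_per_trainer + gpu_id;
      std::printf("trainer %d, GPU %d -> rank %d of %d\n",
                  trainer_id, gpu_id, rank, nranks);
    }
  }
  return 0;
}
```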