# Parallelism Methods

This section explores the prevalent methods for implementing distributed
training systems, discussing their design goals and examining each
parallelism approach in detail.

## Classification of Methods

Distributed training amalgamates multiple single-node training systems
into a parallel structure to expedite the training process without
sacrificing model accuracy. A single-node training system, depicted in
Figure :numref:`ch010/ch10-single-node`, processes training datasets
split into small batches, termed mini-batches. Here, a mini-batch of
*data* is input into the model, guided by a training *program*, which
generates gradients to improve model accuracy. Typically, this program
executes a deep neural network. To illustrate the execution of a neural
network, we employ a computational graph composed of connected
operators. Each operator executes a layer of the neural network and
stores parameters that are updated during training.

:label:`ch010/ch10-single-node`

The execution of a computational graph involves two phases: *forward*
and *backward* computation. In the forward phase, data is fed into the
first operator, which computes and produces the data required by the
downstream operator. This process continues sequentially until the last
operator completes its computation. The backward phase starts from the
last operator, computing gradients and updating local parameters
accordingly, and ends at the first operator. Once these two phases are
complete for a given mini-batch, the system loads the next mini-batch to
update the model.

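
To make the two phases concrete, the sketch below shows a minimal
single-node training loop in PyTorch. It is an illustrative example
rather than code from any particular system; the model architecture,
synthetic dataset, and hyperparameters are arbitrary choices.

```python
# A minimal single-node training loop illustrating the two phases:
# the forward phase produces a loss, the backward phase produces
# gradients, and the optimizer updates the operators' parameters.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Synthetic dataset, split into mini-batches by the DataLoader.
dataset = TensorDataset(torch.randn(512, 32), torch.randint(0, 10, (512,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for inputs, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)  # forward phase
    loss.backward()                        # backward phase: compute gradients
    optimizer.step()                       # update parameters
```
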
For a model training job, partitioning the *data* and the *program*
enables parallel acceleration. Table
:numref:`ch010/ch10-parallel-methods` summarizes the possible
partitioning methods. A single-node training system follows a "single
program, single data" paradigm. To compute in parallel across multiple
devices, the data can be partitioned and the program replicated for
simultaneous execution, creating a "single program, multiple data"
paradigm, i.e., *data parallelism*. Another approach is to partition the
program, distributing its operators across devices, which is termed
"multiple programs, single data", i.e., *model parallelism*. When
training exceptionally large AI models, both the data and the program
are partitioned to maximize the degree of parallelism (DOP), yielding a
"multiple programs, multiple data" paradigm, i.e., *hybrid parallelism*.

:Parallelism methods
| Classification    | Single Data           | Multiple Data      |
|-------------------|-----------------------|--------------------|
| Single program    | single-node execution | data parallelism   |
| Multiple programs | model parallelism     | hybrid parallelism |
:label:`ch010/ch10-parallel-methods`

## Data Parallelism

Data parallelism is used when a single node cannot provide sufficient
computing power. It is the most common parallelism approach adopted by
AI frameworks, with implementations including TensorFlow
DistributionStrategy, PyTorch Distributed, and Horovod
DistributedOptimizer. Given a data-parallel system, assume that the
training batch size is $N$ and that there are $M$ devices available for
parallel acceleration. To achieve data parallelism, the batch is
partitioned into $M$ partitions, with each device receiving $N/M$
training samples. Sharing a replica of the training program, each device
(indexed by $i$) executes it independently and calculates a gradient
$G_i$ over its own data partition. To keep the training program
parameters consistent across devices, the local gradients $G_i$ are
aggregated to calculate an average gradient $(\sum_{i=1}^{M} G_i) / M$.
To complete the training on this mini-batch, the training program
updates the model parameters based on the average gradient.

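
The gradient-averaging step can be written down directly with a
collective communication operation. The sketch below is a hand-rolled
illustration of the $(\sum_{i=1}^{M} G_i)/M$ computation using PyTorch's
`torch.distributed` package, assuming one process per device and that
the process group has already been initialized elsewhere (e.g., via
`dist.init_process_group` in a `torchrun`-launched script); it is not
production data-parallel code.

```python
# Hand-rolled data parallelism: every process runs the same program on
# its own shard of the mini-batch, then averages gradients with AllReduce.
import torch
import torch.distributed as dist
from torch import nn

def train_step(model: nn.Module, optimizer, loss_fn, inputs, labels):
    world_size = dist.get_world_size()          # M devices
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)       # forward on the local shard
    loss.backward()                             # backward: local gradients G_i
    for p in model.parameters():
        if p.grad is not None:
            # AllReduce sums the corresponding gradients across devices;
            # dividing by M yields the average gradient.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    optimizer.step()                            # identical update on every replica
    return loss.detach()
```
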
Figure :numref:`ch010/ch10-data-parallel` shows a data-parallel training
system composed of two devices. For a batch size of 64, each device is
assigned 32 training samples and holds a replica of the same neural
network parameters (a program replica). The local training samples are
passed through the operators of the program replica in sequence for
forward and backward computation. During backward computation, the
program replicas generate local gradients. Corresponding local gradients
on different devices (e.g., gradient 1 on device 1 and gradient 1 on
device 2) are aggregated, typically by AllReduce, a collective
communication operation, to calculate an average gradient.

:label:`ch010/ch10-data-parallel`

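
In practice, this replicate-and-average pattern is packaged by the
frameworks mentioned above. For instance, PyTorch Distributed provides
the `DistributedDataParallel` wrapper, which averages gradients with
AllReduce during `loss.backward()`. A minimal usage sketch, assuming the
script is launched with one process per device (e.g., via `torchrun`):

```python
# Wrapping a model with DistributedDataParallel (DDP): DDP registers hooks
# that AllReduce-average gradients during backward computation, so the
# training loop itself looks identical to the single-node version.
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo")   # "nccl" on GPU clusters
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
ddp_model = DDP(model)                    # each process holds one replica
```
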
## Model Parallelism

Model parallelism is useful when memory constraints make it impossible
to train a model on a single device. For example, the memory on a single
device will be insufficient for a model that contains a very large
operator (such as the compute-intensive fully connected layer used for
classification). In such cases, we can partition this large operator for
parallel execution. Assume that the operator has $P$ parameters and the
system consists of $N$ devices. To minimize the memory footprint on each
device, we can evenly assign the parameters across the devices, so that
each device holds only $P/N$ parameters. This partitioning method is
called *intra-operator parallelism*, which is a typical application of
model parallelism.

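
To illustrate how a single large operator can be split, the sketch below
partitions the weight matrix of a fully connected layer column-wise into
two shards, computes each shard's partial output independently (as the
two devices would), and concatenates the results. It is a single-process
simulation of intra-operator parallelism, not the API of any particular
framework; the communication steps (broadcasting the input, gathering
the outputs) are implicit here.

```python
# Intra-operator parallelism on a fully connected layer Y = X @ W + b:
# split W (and b) along the output dimension so that each of the two
# shards holds half of the operator's parameters. The input X is
# "broadcast" to both shards; their partial outputs are concatenated.
import torch

torch.manual_seed(0)
in_features, out_features, batch = 8, 6, 4
X = torch.randn(batch, in_features)
W = torch.randn(in_features, out_features)
b = torch.randn(out_features)

# Shard the parameters as device 1 and device 2 would hold them.
W1, W2 = W.chunk(2, dim=1)            # each: (in_features, out_features / 2)
b1, b2 = b.chunk(2, dim=0)

Y1 = X @ W1 + b1                      # computed on device 1
Y2 = X @ W2 + b2                      # computed on device 2
Y = torch.cat([Y1, Y2], dim=1)        # aggregate the partial outputs

assert torch.allclose(Y, X @ W + b)   # matches the unpartitioned operator
```
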
Figure :numref:`ch010/ch10-model-parallel-intra-op` shows an example of
intra-operator parallelism implemented by two devices. The neural
network in this example consists of two operators. To complete forward
and backward computation, operator 1 and operator 2 require 16 GB and
1 GB of memory, respectively, whereas the maximum amount of memory a
single device can provide in this example is only 10 GB. To train this
network, parallelism is applied to operator 1: its parameters are evenly
partitioned into two partitions, with device 1 running program
partition 1 and device 2 running program partition 2. Training starts by
feeding a mini-batch of training data to operator 1. Because the
parameters of operator 1 are split between the two devices, the data is
broadcast to both of them, and each device completes forward computation
based on its local partition of the parameters. The local computation
results are then aggregated before being sent to the downstream
operator 2. In backward computation, the data of operator 2 is likewise
broadcast to device 1 and device 2, each device completes backward
computation based on its local partition of operator 1, and the local
results are aggregated and returned to complete the backward computation
process.

:label:`ch010/ch10-model-parallel-intra-op`

In some cases, the overall model, rather than any individual operator,
requires more memory than a single device can provide. Given $N$
operators and $M$ devices, we can evenly assign the operators across the
$M$ devices, so that each device runs the forward and backward
computation of only $N/M$ operators, thereby reducing its memory
overhead. This application of model parallelism is called
*inter-operator parallelism*.

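
The placement itself can be expressed with ordinary device assignment:
each operator lives on its own device, and the activation (and, during
backward computation, the gradient) crosses the device boundary. Below
is a minimal PyTorch sketch under the assumption of two accelerators,
falling back to the CPU so the example still runs on a single machine.

```python
# Inter-operator parallelism: operator 1 lives on device 1, operator 2 on
# device 2. Forward computation sends operator 1's output to device 2;
# autograd routes the gradient back across the same boundary during
# loss.backward().
import torch
from torch import nn

use_two_gpus = torch.cuda.device_count() >= 2
dev1 = torch.device("cuda:0" if use_two_gpus else "cpu")
dev2 = torch.device("cuda:1" if use_two_gpus else "cpu")

op1 = nn.Linear(32, 64).to(dev1)          # operator 1 on device 1
op2 = nn.Linear(64, 10).to(dev2)          # operator 2 on device 2
params = list(op1.parameters()) + list(op2.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(16, 32, device=dev1)
target = torch.randn(16, 10, device=dev2)

optimizer.zero_grad()
h = op1(x)                                # forward on device 1
y = op2(h.to(dev2))                       # activation crosses to device 2
loss_fn(y, target).backward()             # backward flows device 2 -> device 1
optimizer.step()
```
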
Figure :numref:`ch010/ch10-model-parallel-inter-op` shows an example of
inter-operator parallelism implemented by two devices. The neural
network in this example has two operators, each requiring 10 GB of
memory for computation (20 GB in total). Because the maximum memory a
single device can provide in this example is 10 GB, we can place
operator 1 on device 1 and operator 2 on device 2. In forward
computation, the output of operator 1 is sent to device 2, which uses
this output as input to complete forward computation of operator 2. In
backward computation, device 2 sends the backward computation result of
operator 2 to device 1 for backward computation of operator 1,
completing the training on a mini-batch.

:label:`ch010/ch10-model-parallel-inter-op`

## Hybrid Parallelism

In training large AI models, computing power and memory constraints
often go hand in hand. The solution to overcoming both is to adopt a
hybrid of data parallelism and model parallelism, that is, hybrid
parallelism. Figure :numref:`ch010/ch10-hybrid-parallel` shows an
example of hybrid parallelism implemented by four devices. In this
example, inter-operator parallelism is adopted to reduce memory
overhead by allocating operator 1 to device 1 and operator 2 to
device 2. Device 3 and device 4 are then added to the system to achieve
data parallelism, thereby increasing the computing power of the system.
Specifically, the training data is partitioned into data partitions 1
and 2, and the model (consisting of operators 1 and 2) is replicated on
devices 3 and 4, respectively, so that the program replicas can run in
parallel. During forward computation, devices 1 and 3 run the replicas
of operator 1 simultaneously and send their respective results to
devices 2 and 4, which compute the replicas of operator 2. During
backward computation, devices 2 and 4 compute gradients simultaneously,
and the corresponding local gradients are averaged by using the
AllReduce operation. The backward computation then continues to the
replicas of operator 1 on devices 1 and 3, whose local gradients are
averaged in the same way, and the backward computation process ends.

:label:`ch010/ch10-hybrid-parallel`
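
Expressed in terms of the building blocks above, hybrid parallelism
amounts to placing the pipeline stages as in the inter-operator example
and averaging gradients only within each stage's data-parallel group.
The sketch below, assuming one process per device (four in total, e.g.,
launched with `torchrun`) and the same two-operator model, shows how
such groups could be set up with `torch.distributed`; it is illustrative
only and omits scheduling details such as activation transfer and
micro-batching.

```python
# Hybrid parallelism with 4 processes (ranks 0..3), mirroring the figure:
#   ranks 0 and 2 hold replicas of operator 1 (pipeline stage 1),
#   ranks 1 and 3 hold replicas of operator 2 (pipeline stage 2).
# Gradients are AllReduce-averaged only inside each stage's
# data-parallel group; activations flow between pipeline stages.
import torch
import torch.distributed as dist
from torch import nn

dist.init_process_group(backend="gloo")      # rendezvous set up by torchrun
rank = dist.get_rank()

# Every rank must create both groups; each rank only *uses* its own.
stage1_group = dist.new_group(ranks=[0, 2])  # data-parallel group for operator 1
stage2_group = dist.new_group(ranks=[1, 3])  # data-parallel group for operator 2

if rank in (0, 2):                           # pipeline stage 1
    op, group = nn.Linear(32, 64), stage1_group
else:                                        # pipeline stage 2
    op, group = nn.Linear(64, 10), stage2_group

def average_gradients(module: nn.Module, group) -> None:
    """AllReduce-average gradients within the module's data-parallel group."""
    world = dist.get_world_size(group=group)
    for p in module.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=group)
            p.grad /= world
```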