# Parallelism Methods

This section explores the prevalent methods for implementing distributed training systems, discussing their design goals and examining each parallelism approach in detail.

## Classification of Methods

Distributed training combines multiple single-node training systems into a parallel structure to expedite the training process without sacrificing model accuracy. A single-node training system, depicted in Figure :numref:`ch010/ch10-single-node`, processes training datasets split into small batches, termed mini-batches. Here, a mini-batch of *data* is input into the model, guided by a training *program*, which generates gradients to improve model accuracy. Typically, this program executes a deep neural network. To illustrate the execution of a neural network, we employ a computational graph comprising connected operators. Each operator executes a layer of the neural network and stores parameters that are updated during training.

![Single-node training system](../img/ch10/ch10-single-node.png)
:label:`ch010/ch10-single-node`

The execution of a computational graph involves two phases: *forward* and *backward* computation. In the forward phase, data is fed into the initial operator, which computes and generates the data required by the downstream operator. This process continues sequentially through all operators until the last one concludes its computation. The backward phase starts from the last operator, computes gradients, and updates local parameters accordingly, culminating at the first operator. Upon completion of these two phases for a given mini-batch, the system loads the next mini-batch to update the model.
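
The following is a minimal single-node sketch of these two phases, using PyTorch as an illustration; `model`, `loss_fn`, and `loader` are hypothetical placeholders for the training program and the mini-batch stream, not names from the text.

```python
import torch
import torch.nn as nn

# A toy "program": a computational graph of connected operators.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# A stand-in for a training dataset split into mini-batches.
loader = [(torch.randn(32, 784), torch.randint(0, 10, (32,))) for _ in range(4)]

for data, target in loader:          # load one mini-batch at a time
    optimizer.zero_grad()
    output = model(data)             # forward phase: first operator to last
    loss = loss_fn(output, target)
    loss.backward()                  # backward phase: last operator to first
    optimizer.step()                 # update the locally stored parameters
```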

Considering a model training job, partitioning the *data* and *program* can facilitate parallel acceleration. Table `ch010/ch10-parallel-methods` summarizes the partitioning methods. Single-node training systems follow a "single program, single data" paradigm. For parallel computing across multiple devices, data is partitioned and the program is replicated for simultaneous execution, creating a "single program, multiple data" or *data parallelism* paradigm. Another approach involves partitioning the program and distributing its operators across devices, termed "multiple programs, single data" or *model parallelism*. When training exceptionally large AI models, both data and program are partitioned to optimize the degree of parallelism (DOP), yielding a "multiple programs, multiple data" or *hybrid parallelism* paradigm.

:Parallelism methods

| Classification   | Single Data           | Multiple Data      |
|------------------|-----------------------|--------------------|
| Single program   | single-node execution | data parallelism   |
| Multiple program | model parallelism     | hybrid parallelism |

:label:`ch010/ch10-parallel-methods`

## Data Parallelism

Data parallelism is used when a single node cannot provide sufficient computing power. It is the most common parallelism approach adopted by AI frameworks; specific implementations include TensorFlow DistributedStrategy, PyTorch Distributed, and Horovod DistributedOptimizer. Given a data-parallel system, assume that the training batch size is $N$ and that there are $M$ devices available for parallel acceleration. To achieve data parallelism, the batch is partitioned into $M$ partitions, with each device receiving $N/M$ training samples. Sharing a replica of the training program, each device executes it and calculates a gradient separately over its own data partition. Each device (indexed $i$) calculates a local gradient $G_i$ based on its training samples. To keep the training program parameters coherent across devices, the local gradients $G_i$ are aggregated to calculate an average gradient $(\sum_{i=1}^{M} G_i) / M$. To complete the training on this mini-batch, the training program updates the model parameters based on the average gradient.
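
The sketch below illustrates this gradient-averaging step with `torch.distributed`. It assumes the process group has already been initialized (for example, via `torchrun`), and `model`, `loss_fn`, `data`, and `target` are hypothetical placeholders rather than names from the text.

```python
import torch.distributed as dist

def train_step(model, loss_fn, optimizer, data, target):
    world_size = dist.get_world_size()   # M devices
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)  # forward on the local N/M samples
    loss.backward()                      # backward produces local gradients G_i
    for param in model.parameters():
        # Sum the local gradients across all devices, then divide by M,
        # so every replica applies the same averaged update.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size
    optimizer.step()                     # update with the averaged gradient
```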

Figure :numref:`ch010/ch10-data-parallel` shows a data-parallel training system composed of two devices. For a batch size of 64, each device is assigned 32 training samples and shares the same neural network parameters (or program replicas). The local training samples are passed through the operators in the program replica in sequence for forward and backward computation. During backward computation, the program replicas generate local gradients. Corresponding local gradients on different devices (e.g., gradient 1 on device 1 and gradient 1 on device 2) are aggregated (typically by AllReduce, a collective communication operation) to calculate an average gradient.

![Data-parallel system](../img/ch10/ch10-data-parallel.png)
:label:`ch010/ch10-data-parallel`
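
In practice, frameworks wrap this aggregation for the user. As a hedged sketch, PyTorch's `DistributedDataParallel` performs the AllReduce averaging automatically during the backward pass; the model and launch setup below are illustrative assumptions.

```python
import os
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                  # assumes launch via torchrun
local_rank = int(os.environ["LOCAL_RANK"])
model = nn.Linear(512, 10).to(local_rank)        # placeholder program replica
ddp_model = DDP(model, device_ids=[local_rank])  # gradients averaged via AllReduce in backward()
```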

## Model Parallelism

Model parallelism is useful when memory constraints make it impossible to train a model on a single device. For example, the memory on a single device will be insufficient for a model that contains a large operator (such as a compute-intensive fully connected layer used for classification). In such cases, we can partition this large operator for parallel execution. Assume that the operator has $P$ parameters and the system consists of $N$ devices. To minimize the workload on each device given the limited memory capacity, we can evenly assign the parameters across the devices, giving each device $P/N$ parameters. This partitioning method is called *intra-operator parallelism*, a typical application of model parallelism.
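
The following single-process sketch illustrates the idea by splitting the weight matrix of one large fully connected operator into two shards that stand in for two devices; the names `W` and `x` and the tensor sizes are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
W = torch.randn(1024, 512)          # the full operator's parameters
x = torch.randn(32, 512)            # a mini-batch "broadcast" to both shards

# Evenly partition the parameters: each "device" holds P/2 of them.
W_dev1, W_dev2 = W.chunk(2, dim=0)  # 512 output features per shard

# Each device computes a partial forward result from its local parameters.
y_dev1 = x @ W_dev1.t()
y_dev2 = x @ W_dev2.t()

# Aggregating the local results reproduces the unpartitioned operator's output.
y_full = torch.cat([y_dev1, y_dev2], dim=1)
assert torch.allclose(y_full, x @ W.t(), atol=1e-5)
```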

Figure :numref:`ch010/ch10-model-parallel-intra-op` shows an example of intra-operator parallelism implemented by two devices. The neural network in this example consists of two operators. To complete forward and backward computation, operator 1 and operator 2 require 16 GB and 1 GB of memory, respectively. However, the maximum amount of memory a single device can provide in this example is only 10 GB. To train this network, parallelism is applied to operator 1. Specifically, the parameters of operator 1 are evenly partitioned between device 1 and device 2, meaning that device 1 runs program partition 1 while device 2 runs program partition 2. The network training process starts with feeding a mini-batch of training data to operator 1. Because the parameters of operator 1 are split across the two devices, the data is broadcast to both devices. Each device completes forward computation based on its local partition of parameters. The local computation results on the devices are aggregated before being sent to the downstream operator 2. In backward computation, the data of operator 2 is broadcast to device 1 and device 2, so that each device completes backward computation based on its local partition of operator 1. The local computation results on the devices are aggregated and returned to complete the backward computation process.

![Model-parallel system: intra-operator parallelism](../img/ch10/ch10-model-parallel-intra-op.png)
:label:`ch010/ch10-model-parallel-intra-op`

In some cases, the overall model, rather than any single operator, requires more memory than a single device can provide. Given $N$ operators and $M$ devices, we can evenly assign the operators across the $M$ devices. As such, each device needs to run the forward and backward computation of only $N/M$ operators, thereby reducing its memory overhead. This application of model parallelism is called *inter-operator parallelism*.
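
A minimal PyTorch sketch of this placement is shown below: one operator lives on each of two devices, and the activation is moved between them. The device names, layer sizes, and the assumption of two available GPUs are illustrative.

```python
import torch
import torch.nn as nn

dev1, dev2 = torch.device("cuda:0"), torch.device("cuda:1")

op1 = nn.Linear(512, 2048).to(dev1)   # operator 1 placed on device 1
op2 = nn.Linear(2048, 10).to(dev2)    # operator 2 placed on device 2

def forward(x):
    h = op1(x.to(dev1))               # forward computation on device 1
    return op2(h.to(dev2))            # send the output to device 2 and continue

loss = forward(torch.randn(32, 512)).sum()
loss.backward()                       # autograd routes gradients from device 2 back to device 1
```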

Figure :numref:`ch010/ch10-model-parallel-inter-op` shows an example of inter-operator parallelism implemented by two devices. The neural network in this example has two operators, each requiring 10 GB of memory for computation (20 GB in total). Because the maximum memory a single device can provide in this example is 10 GB, we can place operator 1 on device 1 and operator 2 on device 2. In forward computation, the output of operator 1 is sent to device 2, which uses this output as input to complete forward computation of operator 2. In backward computation, device 2 sends the backward computation result of operator 2 to device 1 for backward computation of operator 1, completing the training on a mini-batch.

![Model-parallel system: inter-operator parallelism](../img/ch10/ch10-model-parallel-inter-op.png)
:label:`ch010/ch10-model-parallel-inter-op`

## Hybrid Parallelism

In training large AI models, constraints on computing power and memory often go hand in hand. The solution to overcoming both is to adopt a hybrid of data parallelism and model parallelism, that is, hybrid parallelism. Figure :numref:`ch010/ch10-hybrid-parallel` shows an example of hybrid parallelism implemented by four devices. In this example, inter-operator parallelism is adopted to reduce memory overheads by allocating operator 1 to device 1 and operator 2 to device 2. Device 3 and device 4 are added to the system to achieve data parallelism, thereby increasing the computing power of the system. Specifically, the training data is partitioned into data partitions 1 and 2, and the model (consisting of operators 1 and 2) is replicated, with the replica of operator 1 placed on device 3 and the replica of operator 2 on device 4. This makes it possible for the program replicas to run in parallel. During forward computation, devices 1 and 3 run the replicas of operator 1 simultaneously and send their respective computation results to devices 2 and 4 to compute the replicas of operator 2. During backward computation, devices 2 and 4 compute gradients simultaneously, and the local gradients are averaged using the AllReduce operation. The averaged gradient is back-propagated to the replicas of operator 1 on devices 1 and 3, and the backward computation process ends.

![Hybrid-parallel system](../img/ch10/ch10-hybrid-parallel.png)
:label:`ch010/ch10-hybrid-parallel`
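
The short sketch below spells out how the four devices in this example could be organized into two inter-operator stages replicated across two data-parallel groups; the mapping is an illustration of the figure, not a framework API.

```python
NUM_STAGES = 2        # operator 1 and operator 2 (inter-operator parallelism)
NUM_REPLICAS = 2      # two program replicas (data parallelism)

def placement(device_id):
    stage = device_id % NUM_STAGES     # which operator this device runs
    replica = device_id // NUM_STAGES  # which data partition it consumes
    return stage, replica

for dev in range(NUM_STAGES * NUM_REPLICAS):
    stage, replica = placement(dev)
    print(f"device {dev + 1}: operator {stage + 1}, data partition {replica + 1}")

# Corresponding replicas average their gradients within each stage's
# data-parallel group, e.g. devices 2 and 4 AllReduce the gradients of operator 2.
```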
