# Memory Allocation

Memory is a crucial level of the conventional computer memory hierarchy, acting as a link between the cache and disk storage: it provides more capacity than the cache and faster access than disk. With the progress of deep learning, fitting large deep neural networks into the memory of hardware accelerators or AI processors has become increasingly challenging. To overcome this obstacle, various solutions have been developed, including memory reuse, contiguous memory allocation, and in-place memory allocation. Properly implemented, contiguous memory allocation and in-place memory allocation can improve operator execution efficiency and further optimize performance.

## Device Memory

In a deep learning architecture, the memory closest to the hardware accelerator (such as the GPU or AI processor) is usually referred to as the device memory, and that closest to the CPU is referred to as the host memory. As shown in Figure :numref:`ch07/ch07-compiler-backend-memory-01`, the CPU can directly access the host memory but not the device memory. Similarly, the AI processor can directly access the device memory but not the host memory. In a typical network training process, data needs to be loaded from disk storage to the host memory, where it is then processed. After that, the data is copied from the host memory to the device memory, so that the device can directly access the data. When the computation is finished, the user can obtain the training result once the result data is copied from the device memory back to the host memory.

![Host memory and device memory](../img/ch07/host-device-memory.png)
:label:`ch07/ch07-compiler-backend-memory-01`

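To make this data path concrete, here is a minimal sketch of the host-device copies described above, assuming CuPy as the device-side array library; the computation itself is only a placeholder.

```python
import numpy as np
import cupy as cp  # assumption: CuPy provides the device-side (GPU) arrays


def train_step(host_batch: np.ndarray) -> np.ndarray:
    # Data loaded from disk lives in host memory as a NumPy array.
    # Copy it from host memory to device memory.
    device_batch = cp.asarray(host_batch)

    # The accelerator computes directly on device memory.
    device_result = device_batch * 2.0  # placeholder for the real computation

    # Copy the result from device memory back to host memory.
    return device_result.get()


host_batch = np.random.rand(32, 3, 224, 224).astype(np.float32)
host_result = train_step(host_batch)
```
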
## Process of Memory Allocation

The memory allocation module allocates device memory to the input and output of each operator in a graph. The compiler frontend interprets the user script into an IR, based on which the compiler backend performs operator selection and optimization to determine the shape, data type, and format of each operator's input and output tensors. With this information, the size of each tensor can be calculated using Equation :eqref:`ch05/equation-04`:

$$
\text{size}=\prod_{i=0}^{\text{dimension}}\text{shape}_i \times \text{sizeof}\left ( \text{datatype} \right )
$$
:eqlabel:`equation:ch05/equation-04`

Unaligned memory access can be time-consuming, because the transfer of data to and from memory is most efficient in chunks of 4, 8, or 16 bytes. When the size of the data to be transferred is not a multiple of any of these sizes, one or more padding bytes are added to align the data in memory.

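As an illustration, the following helper (hypothetical, not taken from any particular framework) computes a tensor's buffer size from the equation above and rounds it up to an alignment boundary.

```python
import math
from functools import reduce

# Bytes per element for a few common data types (illustrative subset).
DTYPE_SIZES = {"float32": 4, "float16": 2, "int8": 1}


def tensor_size(shape, dtype, alignment=16):
    """size = prod(shape_i) * sizeof(dtype), rounded up to `alignment` bytes."""
    raw = reduce(lambda a, b: a * b, shape, 1) * DTYPE_SIZES[dtype]
    # Pad with empty bytes so the buffer ends on an aligned boundary.
    return math.ceil(raw / alignment) * alignment


print(tensor_size((32, 3, 224, 224), "float32"))  # 19267584, already aligned
print(tensor_size((5, 5), "float16"))             # 50 bytes, padded to 64
```
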
Figure :numref:`ch07/ch07-compiler-backend-memory-02` illustrates an example of memory allocation.

![Memory allocation example](../img/ch07/memory_allocate.png)
:label:`ch07/ch07-compiler-backend-memory-02`

In this example, memory addresses are assigned to the input tensor, Conv2D's weight, and Conv2D's output. Subsequently, a memory address is allocated to the input of BatchNorm. Since the input of BatchNorm is the same as the output of Conv2D, which already has an allocated memory address, the output address of Conv2D can be shared with the input of BatchNorm. This approach avoids redundant memory allocation and unnecessary memory copies. The training process in this example involves allocating memory for three types of data, distinguished by lifetime: the initial input of the graph, the weights or attributes of operators, and the output tensor of the final operator.

Frequent allocations and deallocations of memory blocks of various sizes using functions like `malloc` can significantly degrade performance. To mitigate this issue, memory pools can be employed. Memory pools involve pre-allocating a specific amount of memory, allowing memory blocks to be dynamically allocated from the pool as needed and returned for reuse.

Memory pools are widely utilized in AI frameworks to manage frequent allocations of device memory and ensure consistent memory lifetime for tensors. Different AI frameworks adopt similar memory pool designs. Figure :numref:`ch07/ch07-compiler-backend-memory-03` presents an example of memory allocation in an AI framework. In this case, each tensor's memory is allocated from a pre-allocated device memory space using double pointers to offset the start and end addresses. Weight tensors of operators are allocated memory by offsetting from the start address (with a lifetime lasting throughout the training process). The output tensor of each operator is allocated memory by offsetting from the end address (with a shorter lifetime that terminates when the tensor is no longer needed in the computation process). This approach allows operator memory to be allocated using offset pointers from pre-allocated device memory, significantly reducing the time required compared to direct memory allocations from the device.

![Memory allocation using double offset pointers](../img/ch07/device_malloc.png)
:label:`ch07/ch07-compiler-backend-memory-03`

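A minimal sketch of this double-pointer scheme is shown below; it assumes a single pre-allocated region addressed by byte offsets, and the class and method names are illustrative.

```python
class DoubleEndedPool:
    """Pre-allocated device memory region managed with two offset pointers.

    Long-lived weight tensors are carved out from the start address;
    shorter-lived operator outputs are carved out from the end address.
    """

    def __init__(self, total_size: int):
        self.head = 0            # next free offset from the start address
        self.tail = total_size   # next free offset from the end address

    def alloc_weight(self, size: int) -> int:
        if self.head + size > self.tail:
            raise MemoryError("memory pool exhausted")
        offset, self.head = self.head, self.head + size
        return offset

    def alloc_output(self, size: int) -> int:
        if self.tail - size < self.head:
            raise MemoryError("memory pool exhausted")
        self.tail -= size
        return self.tail


pool = DoubleEndedPool(total_size=1 << 30)    # 1 GiB pre-allocated region
w_off = pool.alloc_weight(4 * 1024)           # Conv2D weight: lives all training
o_off = pool.alloc_output(2 * 1024 * 1024)    # Conv2D output: shorter lifetime
```

Because both allocations are plain pointer arithmetic over memory obtained once from the device, they are far cheaper than issuing a device allocation call for every tensor.
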
## Memory Reuse

In a machine learning system, memory reuse is achieved by analyzing the lifetime of a tensor and, once it reaches the end of its lifetime, releasing its device memory back to the memory pool for future reuse by other tensors. The objective of memory reuse is to enhance memory utilization and enable the accommodation of larger models within the constraints of limited device memory. By reusing memory instead of continuously allocating new memory for tensors, the system mitigates the memory limitations inherent in deep learning computations.

Figure :numref:`ch07/ch07-compiler-backend-memory-02` provides an example, where output 1 becomes unused once the computation of the BatchNorm operator is complete. In this case, the device memory of output 1 can be reclaimed and reused for output 3 (if output 3 does not require a larger memory size than output 1).

Figure :numref:`ch07/ch07-compiler-backend-memory-04` depicts memory lifetime using coordinate charts. The horizontal axes represent the tensor lifetime, and the vertical axes represent the memory sizes. During its lifetime, a tensor occupies a specific amount of device memory. The objective of memory allocation is to find an optimal solution that accommodates the maximum number of non-conflicting rectangular blocks (each denoting a tensor's lifetime and memory size) in the same memory. In Figure :numref:`ch07/ch07-compiler-backend-memory-04`, the memory can accommodate only four rectangular blocks (i.e., tensors T0, T1, T2, and T3) when no memory reuse policy is applied, as shown in the left chart.

![Memory lifetime charts](../img/ch07/combine_memory_resue_and_no_reuse_cn.png)
:label:`ch07/ch07-compiler-backend-memory-04`

Determining an appropriate memory reuse policy is an NP-complete problem. AI frameworks therefore often employ greedy algorithms, such as best-fit, which allocate memory one tensor at a time by searching for the smallest available block in the memory pool. However, this approach yields only a locally optimal solution rather than a globally optimal one. To approximate a globally optimal solution, a method called Safe Optimized Memory Allocation Solver (SOMAS) can be considered.

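To make the greedy baseline concrete, the following sketch (illustrative names, simplified lifetimes) plans offsets with a best-fit policy: when a tensor becomes live, it reuses the smallest freed block that is large enough; otherwise it grows the pool.

```python
from dataclasses import dataclass


@dataclass
class Block:
    offset: int
    size: int


@dataclass
class Tensor:
    name: str
    size: int
    start: int  # first step that uses this tensor
    end: int    # last step that uses this tensor


def best_fit_plan(tensors):
    """Greedy best-fit: reuse the smallest freed block that fits each tensor."""
    free, live, plan, pool_size = [], [], {}, 0
    for t in sorted(tensors, key=lambda t: t.start):
        # Blocks whose tensors died before this tensor starts become reusable.
        for end, blk in list(live):
            if end < t.start:
                free.append(blk)
                live.remove((end, blk))
        # Best fit: smallest free block that is large enough, else grow the pool.
        fits = [b for b in free if b.size >= t.size]
        if fits:
            blk = min(fits, key=lambda b: b.size)
            free.remove(blk)
        else:
            blk = Block(pool_size, t.size)
            pool_size += t.size
        plan[t.name] = blk.offset
        live.append((t.end, blk))
    return plan, pool_size  # per-tensor offsets and total memory required


tensors = [Tensor("out1", 1024, 0, 1), Tensor("out2", 1024, 1, 2),
           Tensor("out3", 512, 2, 3)]
plan, total = best_fit_plan(tensors)  # out3 reuses out1's block; total is 2048
```
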
SOMAS analyzes the computational graph by jointly considering parallel streams and data dependencies. This analysis reveals the ancestor-descendant relationships between operators. By generating a global set of mutually exclusive constraints on the lifetime of each tensor, SOMAS combines multiple heuristic algorithms to approach an optimal solution for static memory planning. Through SOMAS, an optimized memory reuse outcome is obtained, resulting in more reusable memory.

As shown in the right chart of Figure :numref:`ch07/ch07-compiler-backend-memory-04`, with the SOMAS algorithm, the number of tensors allowed in the same memory is increased to seven.

## Optimization Techniques for Memory Allocation

In the following, we describe the typical optimization techniques for memory allocation.

### Memory Fusion

Commonly used memory allocation methods operate at the tensor level, often resulting in discontinuous device addresses across tensors. However, certain specialized operators, like AllReduce for communication, require contiguous memory allocation. Executing a communication operator involves waiting for communication as well as data transfer and computation, and communication is a significant performance bottleneck in large-scale distributed systems. To minimize communication time, we can fuse multiple communication operators into a composite operator, which allows contiguous memory to be allocated for the operator's inputs, as depicted in Figure :numref:`ch07/ch07-compiler-backend-memory-06`.

Additionally, the time spent in communication can be reduced during the weight initialization task in distributed neural network training. This task involves broadcasting the initialized weights from one process to all processes. If a network contains multiple weights (which is often the case), these broadcasts are repeated for each weight. To minimize communication time in this scenario, a typical approach is to allocate contiguous memory addresses to all of the network's weights and then perform a single broadcast operation.

![Memory fusion of communication operators](../img/ch07/memory_fusion.png)
:label:`ch07/ch07-compiler-backend-memory-06`

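The following sketch illustrates the idea for weight broadcasting; `broadcast` stands in for a collective provided by the communication library (e.g., over NCCL or HCCL), and NumPy arrays stand in for device tensors.

```python
import numpy as np


def broadcast(buffer: np.ndarray) -> None:
    """Stand-in for a collective broadcast on device memory."""
    ...


# Initialized weights of three operators on the broadcasting process.
weights = [np.ones(4), np.ones(16), np.ones(8)]

# Without fusion: one broadcast, and one communication wait, per weight tensor.
for w in weights:
    broadcast(w)

# With fusion: pack the weights into one contiguous buffer, broadcast once,
# then let each weight become a view into its slice of the fused buffer.
fused = np.concatenate([w.ravel() for w in weights])
broadcast(fused)

offsets = np.cumsum([0] + [w.size for w in weights])[:-1]
weights = [fused[o:o + w.size].reshape(w.shape)
           for o, w in zip(offsets, weights)]
```
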
### In-place Operators

In the memory allocation process depicted in Figure :numref:`ch07/ch07-compiler-backend-memory-02`, the input and output of each operator are assigned different memory addresses. However, this approach can lead to memory waste and performance degradation for certain operators. Examples include optimizer operators used to update neural network weights, Python's `+=` or `*=` operators that modify variable values, and the `a[0]=b` operator that updates the value of `a[0]` with `b`. These operators share a common purpose: updating the input value in place. The concept of in-place computation can be illustrated using the `a[0]=b` operator.

In the original implementation shown on the left of Figure :numref:`ch07/ch07-compiler-backend-memory-08`, the operator involves three steps: copying tensor `a` to tensor `a'`, assigning tensor `b` to tensor `a'`, and then copying tensor `a'` back to tensor `a`. However, by performing the operation in-place, as depicted on the right of Figure :numref:`ch07/ch07-compiler-backend-memory-08`, this process is simplified to a single step: copying tensor `b` to the position corresponding to tensor `a`. This reduces data copy time by eliminating two copies and removes the need to allocate memory for tensor `a'`.

![Memory allocation of an in-place operator](../img/ch07/inplace-op.png)
:label:`ch07/ch07-compiler-backend-memory-08`

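The difference can be sketched with NumPy arrays standing in for device tensors.

```python
import numpy as np

a = np.zeros((4, 3), dtype=np.float32)
b = np.ones(3, dtype=np.float32)

# Out-of-place: materialize a copy a', write b into it, then copy a' back to a.
a_prime = a.copy()   # extra memory allocated for a'
a_prime[0] = b       # assign b inside the copy
a[...] = a_prime     # copy a' back into a

# In-place: write b directly into a's memory; no extra tensor, no extra copies.
a[0] = b
```
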
## Data Compression

Modern deep neural network (DNN) training relies heavily on GPUs to effectively train intricate networks with hundreds of layers. A prominent challenge faced by both researchers and industry professionals is the constraint imposed by the available GPU main memory as networks become deeper, which restricts the size of networks that can be trained. To address this issue, researchers have recognized the value of employing DNN-layer-specific encoding schemes and have directed their attention towards storing encoded representations of the intermediate layer outputs (feature maps) that are required for the backward pass. These encoded representations are stored during the temporal gap between their uses and are decoded only when needed for the backward pass. The full-fidelity feature maps are promptly discarded after use, resulting in a noteworthy reduction in memory consumption.

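A minimal sketch of this idea is shown below, using lossless `zlib` as a stand-in encoder; a real system would use a DNN-layer-specific scheme and operate on device memory.

```python
import zlib

import numpy as np


def encode_feature_map(fmap: np.ndarray) -> bytes:
    # Stand-in encoder: lossless zlib over the raw bytes. A DNN-layer-specific
    # scheme (e.g., one exploiting ReLU-induced sparsity) would do far better.
    return zlib.compress(fmap.tobytes())


def decode_feature_map(blob: bytes, shape, dtype) -> np.ndarray:
    return np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)


# Forward pass: keep only the encoded representation of the feature map.
fmap = np.maximum(np.random.randn(64, 128, 28, 28).astype(np.float32), 0)
stash = (encode_feature_map(fmap), fmap.shape, fmap.dtype)
del fmap  # the full-fidelity feature map is discarded right after use

# Backward pass: decode just before the gradient computation needs it.
blob, shape, dtype = stash
fmap_for_backward = decode_feature_map(blob, shape, dtype)
```
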
## Memory Swap

Machine learning frameworks often require users to carefully tune memory usage to guarantee that the DNN fits within the memory capacity of the GPU. This constraint keeps researchers from thoroughly investigating diverse machine learning algorithms, compelling them to make concessions either in terms of network architecture or by distributing the computational load across multiple GPUs. One feasible approach is to use DRAM to facilitate memory swapping: by transferring temporarily inactive data to DRAM, we can make better use of GPU memory. In recent studies, researchers allocate GPU memory cautiously, keeping only the data required for the immediate computational needs of a specific layer. This strategy effectively reduces both the maximum and average memory usage, enabling researchers to train larger networks. Specifically, feature maps are promptly released from GPU memory when there is no potential for reuse; if there is a possibility of future reuse but no immediate requirement, the feature maps are offloaded to CPU memory and subsequently prefetched back to GPU memory.

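A simplified sketch of the offload/prefetch pattern, assuming CuPy arrays stand in for tensors in GPU memory; the function names are illustrative, and a real implementation would overlap these copies with computation on separate streams.

```python
import cupy as cp  # assumption: CuPy arrays stand in for tensors in GPU memory

offloaded = {}  # layer name -> feature map parked in CPU (host) memory


def offload(name: str, fmap) -> None:
    # The feature map will be reused later (in the backward pass) but not soon:
    # park it in CPU memory so its GPU memory can be released.
    offloaded[name] = cp.asnumpy(fmap)


def prefetch(name: str):
    # Copy the feature map back to GPU memory shortly before it is needed,
    # ideally overlapped with computation on a separate stream.
    return cp.asarray(offloaded.pop(name))


# Forward: conv1's output is needed again only during its backward step.
conv1_out = cp.maximum(cp.random.randn(64, 64, 56, 56).astype(cp.float32), 0)
offload("conv1", conv1_out)
del conv1_out  # the GPU copy can now be freed

# ...later, just before conv1's backward computation:
conv1_out = prefetch("conv1")
```
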
The fundamental concept behind memory swapping is straightforward and intuitive. However, implementing it well remains challenging and requires expertise in the compiler frontend. One key challenge is maximizing the overlap between computation and data swapping. A precise cost model is essential for estimating both the time required for data movement and the compute time of each DNN layer. In addition, there are numerous strategies to explore in auto scheduling and auto tuning. Fortunately, there is an abundance of literature addressing these issues; for more information, please refer to the Further Readings section.
