# Memory Allocation

Memory occupies a crucial position in the conventional computer memory
hierarchy, acting as a link between cache and disk storage: it provides
more capacity than the cache and faster access than disk storage. With
the progress of deep learning, fitting large deep neural networks into
the memory of hardware accelerators or AI processors has become
increasingly challenging. To overcome this obstacle, various techniques
have been developed, including memory reuse, contiguous memory
allocation, and in-place memory allocation. When implemented properly,
contiguous memory allocation and in-place memory allocation can also
enhance operator execution efficiency and further optimize performance.

## Device Memory

In a deep learning architecture, the memory closest to the hardware
accelerator (such as the GPU or AI processor) is usually referred to as
the device memory, and that closest to the CPU is referred to as the
host memory. As shown in Figure
:numref:`ch07/ch07-compiler-backend-memory-01`, the CPU can
directly access the host memory but not the device memory. Similarly,
the AI processor can directly access the device memory but not the host
memory. In a typical network training process, data needs to be loaded
from disk storage to the host memory, where it is then processed. After
that, the data is copied from the host memory to the device memory, so
that the device can directly access the data. When the computation is
finished, the user can obtain the training result once the result data
is copied from the device memory back to the host memory.


:label:`ch07/ch07-compiler-backend-memory-01`
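
As a concrete illustration of these host-device copies, the sketch below
uses CuPy device arrays alongside NumPy host arrays; the file name and
the toy computation are placeholders for a real data pipeline and model.

```python
import numpy as np
import cupy as cp   # CuPy provides GPU (device) arrays alongside NumPy host arrays

# 1. Load data from disk into host memory and process it on the CPU.
host_batch = np.load("batch.npy")                  # placeholder file name
host_batch = (host_batch - host_batch.mean()) / host_batch.std()

# 2. Copy the data from host memory to device memory.
device_batch = cp.asarray(host_batch)              # host -> device copy

# 3. Compute on the device (a toy elementwise operation stands in for training).
device_result = cp.tanh(device_batch)

# 4. Copy the result from device memory back to host memory.
host_result = cp.asnumpy(device_result)            # device -> host copy
print(host_result.shape)
```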

## Process of Memory Allocation

The memory allocation module allocates device memory to the input and
output of each operator in a graph. The compiler frontend interprets the
user script into an IR, based on which the compiler backend performs
operator selection and optimization to determine information such as the
shape, data type, and format of each input/output tensor of each
operator. With this information, the size of each input/output tensor of
each operator can be calculated using Equation
:eqref:`ch05/equation-04`:

$$
\text{size}=\prod_{i=0}^{\text{dimension}-1}\text{shape}_i \times \text{sizeof}\left ( \text{datatype} \right )
$$
:eqlabel:`equation:ch05/equation-04`

Unaligned memory access can be time-consuming, because the transfer of
data to and from memory is most efficient in chunks of 4, 8, or 16
bytes. When the size of the data to be transferred is not a multiple of
one of these sizes, one or more padding bytes are added to align the
data in memory.
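
A minimal sketch of this size calculation is shown below. The dtype
table and the 16-byte alignment constant are assumed example values; the
actual alignment granularity depends on the target device.

```python
import math
from functools import reduce

DTYPE_BYTES = {"float32": 4, "float16": 2, "int8": 1}  # sizeof(datatype)

def tensor_size(shape, dtype):
    """Raw size in bytes: product of all dimensions times the element size."""
    return reduce(lambda a, b: a * b, shape, 1) * DTYPE_BYTES[dtype]

def aligned_size(shape, dtype, alignment=16):
    """Round the raw size up to a multiple of the assumed alignment."""
    raw = tensor_size(shape, dtype)
    return math.ceil(raw / alignment) * alignment

print(tensor_size((32, 3, 224, 224), "float32"))   # 19267584 bytes
print(aligned_size((3, 3), "float16"))             # 18 bytes padded up to 32
```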

Figure
:numref:`ch07/ch07-compiler-backend-memory-02` illustrates an
example of memory allocation.


:label:`ch07/ch07-compiler-backend-memory-02`

In this example, memory addresses are assigned to the input tensor,
Conv2D's weight, and Conv2D's output. Subsequently, a memory address is
allocated to the input of BatchNorm. Since the input of BatchNorm is the
same as the output of Conv2D, which already has an allocated memory
address, the output address of Conv2D can be shared with the input of
BatchNorm. This approach avoids redundant memory allocation and
unnecessary memory copies. Overall, the training process in this example
allocates memory for three types of data, distinguished by their
lifetimes: the initial input of the graph, the weights or attributes of
operators, and the output tensor of the final operator.

Frequent allocations and deallocations of memory blocks of various sizes
using functions like `malloc` can significantly degrade performance. To
mitigate this issue, memory pools can be employed. Memory pools involve
pre-allocating a specific amount of memory, allowing memory blocks to be
dynamically allocated from the pool as needed and returned for reuse.

Memory pools are widely utilized in AI frameworks to manage frequent
allocations of device memory and ensure consistent memory lifetime for
tensors. Different AI frameworks adopt similar memory pool designs.
Figure
:numref:`ch07/ch07-compiler-backend-memory-03` presents an
example of memory allocation in an AI framework. In this case, each
tensor's memory is allocated from a pre-allocated device memory space
using double pointers to offset the start and end addresses. Weight
tensors of operators are allocated memory by offsetting from the start
address (with a lifetime lasting throughout the training process). The
output tensor of each operator is allocated memory by offsetting from
the end address (with a shorter lifetime that terminates when the tensor
is no longer needed in the computation process). This approach allows
operator memory to be allocated using offset pointers from pre-allocated
device memory, significantly reducing the time required compared to
direct memory allocations from the device.


:label:`ch07/ch07-compiler-backend-memory-03`
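
A minimal sketch of this double-pointer scheme is given below, assuming
a single pre-allocated pool whose total size and alignment are arbitrary
example values: weights are carved out from the start offset, while
operator outputs are carved out from the end offset and released once
their lifetime ends.

```python
class TwoEndedPool:
    """Toy model of a pre-allocated device memory pool with two offsets.

    Long-lived tensors (weights) grow from the start of the pool;
    short-lived tensors (operator outputs) grow from the end.
    """

    def __init__(self, total_size, alignment=16):
        self.total_size = total_size
        self.alignment = alignment
        self.start_offset = 0           # next free byte from the start
        self.end_offset = total_size    # next free byte from the end

    def _align(self, size):
        return -(-size // self.alignment) * self.alignment

    def alloc_weight(self, size):
        size = self._align(size)
        if self.start_offset + size > self.end_offset:
            raise MemoryError("pool exhausted")
        addr = self.start_offset
        self.start_offset += size
        return addr                     # lives for the whole training run

    def alloc_output(self, size):
        size = self._align(size)
        if self.end_offset - size < self.start_offset:
            raise MemoryError("pool exhausted")
        self.end_offset -= size
        return self.end_offset          # released when no longer needed

    def free_output(self, size):
        # Simplification: outputs are released in LIFO order in this sketch.
        self.end_offset += self._align(size)


pool = TwoEndedPool(total_size=1 << 20)
w = pool.alloc_weight(4096)     # weight: offset from the start address
out = pool.alloc_output(8192)   # activation: offset from the end address
pool.free_output(8192)          # reclaimed once its lifetime ends
```

The LIFO release in `free_output` is a simplification; real frameworks
track each tensor's lifetime explicitly, as discussed in the next
section.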

## Memory Reuse

In a machine learning system, memory reuse is achieved by analyzing the
lifespan of a tensor and, once it reaches the end of its lifespan,
releasing its device memory back to the memory pool for future reuse by
other tensors. The objective of memory reuse is to enhance memory
utilization and enable the accommodation of larger models within the
constraints of limited device memory. By reusing memory instead of
continuously allocating new memory for tensors, the system can optimize
memory utilization and mitigate the memory limitations inherent in deep
learning computations.

Figure
:numref:`ch07/ch07-compiler-backend-memory-02` provides an
example, where output 1 becomes unused once the computation of the
BatchNorm operator is complete. In this case, the device memory of
output 1 can be reclaimed and reused for output 3 (if output 3 does not
require a larger memory size than output 1).

Figure
:numref:`ch07/ch07-compiler-backend-memory-04` depicts memory
lifetime using coordinate charts. The horizontal axes represent the
tensor lifetime, and the vertical axes represent the memory sizes.
During its lifetime, a tensor occupies a specific amount of device
memory. The objective of memory allocation is to find an optimal
solution that accommodates the maximum number of non-conflicting
rectangular blocks (each denoting a tensor's lifetime and memory size)
in the same memory. In Figure
:numref:`ch07/ch07-compiler-backend-memory-04`, the memory can
accommodate only four rectangular blocks (i.e., tensors T0, T1, T2, and
T3) when no memory reuse policy is applied, as shown in the left chart.


:label:`ch07/ch07-compiler-backend-memory-04`

Determining an optimal memory reuse policy is an NP-complete problem. AI
frameworks often employ greedy algorithms, such as best-fit, which
allocate memory for one tensor at a time by searching for the smallest
available block in the memory pool. However, this approach yields only a
locally optimal solution rather than a globally optimal one. To
approximate a globally optimal solution, a method called Safe Optimized
Memory Allocation Solver (SOMAS) can be considered.
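
To make the greedy approach concrete, the sketch below plans static
offsets with a best-fit-style policy over tensor lifetimes. It is a
simplified illustration rather than the SOMAS algorithm, and the tensor
list at the bottom is invented for the example.

```python
def plan_offsets(tensors):
    """Greedy static memory planning over tensor lifetimes.

    Each tensor is (name, size, start_step, end_step). Two tensors may
    share a memory region only if their lifetimes do not overlap. Each
    tensor is placed, best-fit style, into the smallest existing region
    that is large enough and lifetime-disjoint; otherwise a new region
    is appended at the end of the pool.
    """
    regions = []   # each region: {"offset", "size", "lifetimes"}
    result = {}
    for name, size, start, end in sorted(tensors, key=lambda t: -t[1]):
        candidates = [r for r in regions
                      if r["size"] >= size
                      and all(end < s or start > e for s, e in r["lifetimes"])]
        if candidates:
            best = min(candidates, key=lambda r: r["size"])   # best fit
            best["lifetimes"].append((start, end))
            result[name] = best["offset"]
        else:
            offset = sum(r["size"] for r in regions)          # grow the pool
            regions.append({"offset": offset, "size": size,
                            "lifetimes": [(start, end)]})
            result[name] = offset
    return result, sum(r["size"] for r in regions)


tensors = [("T0", 1024, 0, 2), ("T1", 512, 1, 3),
           ("T2", 1024, 3, 5), ("T3", 512, 4, 6)]
offsets, total = plan_offsets(tensors)
print(offsets)   # T2 shares T0's region and T3 shares T1's region
print(total)     # 1536 bytes instead of 3072 without reuse
```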

SOMAS analyzes the computational graph by jointly considering parallel
streams and data dependencies. This analysis reveals the
ancestor-descendant relationships between operators. By building a
global set of mutually exclusive constraints on the lifetime of each
tensor, SOMAS combines multiple heuristic algorithms to achieve an
optimal solution for static memory planning. Through SOMAS, an optimized
memory reuse outcome is obtained, resulting in increased reusable
memory.

As shown in the right chart of Figure
:numref:`ch07/ch07-compiler-backend-memory-04`, with the SOMAS
algorithm, the number of tensors allowed in the same memory is increased
to seven.

## Optimization Techniques for Memory Allocation

In the following, we describe the typical optimization techniques for
memory allocation.

### Memory Fusion

Commonly used memory allocation methods operate at the tensor level,
often resulting in discontinuous device addresses across tensors.
However, certain specialized operators, such as the AllReduce
communication operator, require contiguous memory. Executing a
communication operator involves both data transfer and computation, and
the time spent waiting for communication is a significant performance
bottleneck in large-scale distributed systems. To minimize communication
time, we can fuse multiple communication operators into a single
composite operator. This allows contiguous memory to be allocated for
the operator's inputs, as depicted in Figure
:numref:`ch07/ch07-compiler-backend-memory-06`.

Additionally, the time spent in communication can be reduced during the
weight initialization task in distributed neural network training. This
task involves broadcasting the initialized weight from one process to
all processes. If a network contains multiple weights (which is often
the case), these broadcasts are repeated. To minimize communication time
in this scenario, a typical approach is to allocate contiguous memory
addresses to all weights on the network and then perform a single
broadcast operation.


:label:`ch07/ch07-compiler-backend-memory-06`
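
The sketch below illustrates the single-broadcast idea with NumPy host
buffers and a hypothetical `broadcast` collective. In a real framework,
the flat buffer would live in device memory and the collective would be
provided by the communication library.

```python
import numpy as np

def flatten_weights(weights):
    """Pack all weight tensors into one contiguous buffer.

    Returns the flat buffer plus views that alias it, so a single
    collective call on the buffer updates every weight at once.
    """
    total = sum(w.size for w in weights)
    flat = np.empty(total, dtype=weights[0].dtype)
    views, offset = [], 0
    for w in weights:
        view = flat[offset:offset + w.size].reshape(w.shape)
        view[...] = w                  # copy the initial value into the buffer
        views.append(view)
        offset += w.size
    return flat, views

weights = [np.random.randn(128, 64).astype(np.float32),
           np.random.randn(64).astype(np.float32)]
flat, weight_views = flatten_weights(weights)
# broadcast(flat, root=0)  # hypothetical collective: one call replaces a
#                          # separate broadcast for every weight tensor
```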

### In-place Operators

In the memory allocation process depicted in
Figure :numref:`ch07/ch07-compiler-backend-memory-02`, the input and
output of each operator are assigned different memory addresses.
However, this approach can lead to memory waste and performance
degradation for certain operators. Examples include optimizer operators
used to update neural network weights, Python's `+=` or `*=` operators
that modify variable values, and the `a[0]=b` operator that updates the
value of `a[0]` with `b`. These operators share a common purpose:
updating the input value. The concept of in-place computation can be
illustrated using the `a[0]=b` operator.

In the original implementation shown on the left of Figure
:numref:`ch07/ch07-compiler-backend-memory-08`, the operator
involves three steps: copying tensor `a` to tensor `a'`, assigning
tensor `b` to tensor `a'`, and then copying tensor `a'` back to tensor
`a`. However, by performing the operation in-place, as depicted on the
right of Figure
:numref:`ch07/ch07-compiler-backend-memory-08`, this process is
simplified to a single step: copying tensor `b` to the position
corresponding to tensor `a`. This eliminates two data copies and removes
the need to allocate memory for tensor `a'`.


:label:`ch07/ch07-compiler-backend-memory-08`
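
A small NumPy sketch of the difference is given below; `a_prime` plays
the role of the temporary tensor `a'` on the left of the figure.

```python
import numpy as np

a = np.zeros((4, 3), dtype=np.float32)
b = np.ones(3, dtype=np.float32)

# Out-of-place: three steps and a temporary buffer, as on the left of the figure.
a_prime = a.copy()    # step 1: copy a to a'
a_prime[0] = b        # step 2: assign b to a'
a[...] = a_prime      # step 3: copy a' back to a

# In-place: a single step that writes b directly into a's existing storage.
a[0] = b              # no temporary tensor, no extra copies

# The same idea applies to optimizer updates such as `w -= lr * grad`,
# which modify the weight buffer in place instead of allocating a new one.
```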

## Data Compression

Modern training of deep neural networks (DNNs) relies heavily on GPUs to
train intricate networks with hundreds of layers effectively. A
prominent challenge faced by both researchers and industry professionals
is the constraint imposed by the available GPU main memory as networks
become deeper. This limitation restricts the size of networks that can
be trained. To address this issue, researchers have recognized the value
of employing DNN-layer-specific encoding schemes. Consequently, they
have directed their attention towards storing encoded representations of
the intermediate layer outputs (feature maps) that are required for the
backward pass. These encoded representations are stored during the
temporal gap between their uses and are decoded only when needed for the
backward pass. The full-fidelity feature maps are promptly discarded
after use, resulting in a noteworthy reduction in memory consumption.
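
As a simple illustration of the idea, the sketch below stores a feature
map in compressed form between the forward and backward passes, using
the general-purpose `zlib` codec; real systems use DNN-layer-specific
encodings instead, and the layer name is an invented example.

```python
import zlib
import numpy as np

class CompressedStash:
    """Keep feature maps in encoded form between forward and backward."""

    def __init__(self):
        self._store = {}

    def save(self, key, feature_map):
        # Encode the full-fidelity feature map and keep only the encoding.
        self._store[key] = (zlib.compress(feature_map.tobytes()),
                            feature_map.shape, feature_map.dtype)

    def load(self, key):
        # Decode only when the backward pass actually needs the tensor.
        blob, shape, dtype = self._store.pop(key)
        return np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)

stash = CompressedStash()
fmap = np.random.rand(64, 128, 28, 28).astype(np.float32)
stash.save("conv3_out", fmap)          # stored compressed during the temporal gap
restored = stash.load("conv3_out")     # decoded for the backward pass
assert np.array_equal(restored, fmap)  # lossless with this codec
```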

## Memory Swap

Machine learning frameworks frequently require users to tune their
memory usage to guarantee that the DNN fits within the memory capacity
of the GPU. This constraint restricts researchers from thoroughly
investigating diverse machine learning algorithms, compelling them to
make concessions either in terms of network architecture or by
distributing the computational load across multiple GPUs. One feasible
approach is to incorporate host DRAM to facilitate memory swapping. By
transferring temporarily inactive data to DRAM, we can optimize GPU
memory utilization. In recent studies, researchers have adopted a
cautious approach that allocates GPU memory only for the immediate
computational needs of a specific layer. This strategy effectively
reduces both the maximum and average memory usage, enabling researchers
to train more extensive networks. To elaborate further, the researchers
promptly release feature maps from GPU memory when there is no potential
for reuse. Alternatively, if there is a possibility of future reuse but
no immediate requirement, the feature maps are offloaded to CPU memory
and subsequently prefetched back to GPU memory.
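
The sketch below mimics this offload-and-prefetch policy with CuPy
device arrays and NumPy host buffers; the layer name and the choice of
which feature maps to offload are invented for illustration.

```python
import cupy as cp   # CuPy arrays live in GPU (device) memory

host_cache = {}     # feature maps parked in host (CPU) memory

def offload(name, device_tensor):
    """Copy a feature map that will be reused later into host memory."""
    host_cache[name] = cp.asnumpy(device_tensor)   # device -> host copy

def prefetch(name):
    """Bring a previously offloaded feature map back to device memory."""
    return cp.asarray(host_cache.pop(name))        # host -> device copy

# Forward pass: conv1's output is needed again in the backward pass, but
# not immediately, so it is offloaded and its device memory is released.
conv1_out = cp.random.rand(64, 64, 56, 56).astype(cp.float32)
offload("conv1_out", conv1_out)
conv1_out = None    # drop the device reference so the block can be reused

# ... later, shortly before the corresponding backward step ...
conv1_out = prefetch("conv1_out")
```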

The fundamental concept behind memory swapping is straightforward and
intuitive. However, implementing it well remains challenging and
requires expert knowledge in the compiler frontend. One key task is to
maximize the overlap between computation time and data swapping time. A
precise cost model is essential for estimating the time required for
data movement and the time cost associated with each layer of the DNN.
Additionally, there are numerous strategies to explore in auto
scheduling and auto tuning. Fortunately, there is an abundance of
literature available that addresses these issues. For additional
information, please refer to the Further Readings section.