# Operator Compiler {#sec:operator-compiler}

Operator compilers are used to compile and optimize operators, which
may be part of a neural network or come from code implemented in a
domain-specific language (DSL). Compilation is the process of
*transforming* source code from one *representation* into another.

The objective of an operator compiler is to improve the *execution
performance* of operators. An operator compiler accepts tensor
computation logic described in *dynamic languages* (e.g., Python) as
input and outputs executable files for *specific AI processors*.

## Scheduling Strategy

An operator compiler abstracts the execution of statements in an
operator implementation into "scheduling strategies". Since an operator
typically consists of multiple statements, the focus lies in
determining the scheduling strategy for the statements within the
operator. This strategy encompasses considerations such as the
calculation order, data block movement, and other relevant factors.

If we ignore the specific processor architecture, achieving the best
performance only requires loading all input tensors into the
computation core according to the *computational logic* of the operator
and retrieving the result from the core for storage. *Computational
logic* refers to basic arithmetic operations (e.g., addition,
subtraction, multiplication, and division) and other function
expressions (e.g., convolution, transposition, and loss functions).

The modern computer memory hierarchy resembles a pyramid, as shown in
Figure :numref:`ch05/ch05-memory_architecture`. As we move up the
pyramid, the storage elements have a higher cost but a faster access
time.

:label:`ch05/ch05-memory_architecture`

Such hardware design leads to two basic types of locality:

\(1\) Temporal locality: the tendency to access the same memory
location several times in quick succession. As such, accessing the same
location in the L1 cache several times is more efficient than accessing
different locations in the L1 cache several times.

\(2\) Spatial locality: the tendency to access nearby memory locations
in quick succession. As such, accessing nearby locations in the L1
cache several times is more efficient than moving back and forth
between the L1 cache and the main memory.

Both types of locality help improve system performance. Specifically,
in order to improve the data access speed, data to be repeatedly
processed can be placed in fixed, nearby memory locations when
possible.
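
The effect of locality can be observed even from Python. The following
sketch, which is our own illustration rather than part of the original
example, times row-wise versus column-wise traversal of a matrix stored
in row-major order with NumPy; the exact numbers depend on the machine
and its cache sizes.

```
import time

import numpy as np

a = np.random.rand(4096, 4096)  # row-major (C-order) layout by default

# Row-wise traversal: consecutive elements are adjacent in memory, so each
# fetched cache line is fully used (good spatial locality).
t0 = time.perf_counter()
row_sum = sum(a[i, :].sum() for i in range(a.shape[0]))
row_time = time.perf_counter() - t0

# Column-wise traversal: consecutive accesses are one full row apart, so most
# of each fetched cache line is wasted (poor spatial locality).
t0 = time.perf_counter()
col_sum = sum(a[:, j].sum() for j in range(a.shape[1]))
col_time = time.perf_counter() - t0

print(f"row-wise: {row_time:.3f}s, column-wise: {col_time:.3f}s")
```

On typical hardware the column-wise loop is noticeably slower, even
though both loops read exactly the same data.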

For a serial computational task, it is also possible to decouple the
data part from the logic part and generate a range of independent
groups of data that can be executed in parallel, as shown in Figure
:numref:`ch05/ch05-parallel_computing`.

:label:`ch05/ch05-parallel_computing`

These specific data-oriented operations performed at program runtime
are referred to as *schedules*. A schedule defines the following
aspects:

\(1\) When and where should each value in a function be calculated?

\(2\) Where is data stored?

\(3\) How long are values cached and communicated between their
producers and consumers, and when are they recomputed independently by
each consumer?

Simply put, a scheduling strategy is a set of algorithms designed
during compilation, based on the characteristics of the target hardware
architecture, to improve locality and parallelism. The purpose of this
is to ensure that the resulting executable file delivers optimal
performance at runtime. These algorithms have no effect on the
computation result; they only adjust the computation process in order
to shorten the computation time.

## Combining Scheduling Strategies

In the realm of operator compilers, a common optimization technique is
to combine multiple abstracted scheduling strategies into a
comprehensive and efficient scheduling set through manual template
matching. However, this approach is labor-intensive and may not achieve
fine-grained optimization across different operators. To illustrate
this, let's consider an optimization algorithm implemented in the
Tensor Virtual Machine (TVM), which accelerates and optimizes a
multiply-accumulate code segment on the CPU by combining several
fundamental scheduling strategies.

In Code `lst:before_tvm`, the basic computational logic is as follows:
initialize tensor C, multiply tensor A by tensor B, and accumulate the
results into tensor C.

**lst:before_tvm**
```
for (m: int32, 0, 1024) {
  for (n: int32, 0, 1024) {
    C[((m*1024) + n)] = 0f32
    for (k: int32, 0, 1024) {
      let cse_var_2: int32 = (m*1024)
      let cse_var_1: int32 = (cse_var_2 + n)
      C[cse_var_1] = (C[cse_var_1] + (A[(cse_var_2 + k)]*B[((k*1024) + n)]))
    }
  }
}
```

Assuming that the data type is float and that tensors A, B, and C are
each of size 1024 $\times$ 1024, the total memory required by the
tensors is 1024 $\times$ 1024 $\times$ 3 $\times$ sizeof(float) = 12
MB. This far exceeds the capacity of common caches (e.g., a 32 KB L1
cache). Therefore, if we want to compute on tensors A, B, and C in a
single operation, we must store them in the main memory. However, the
main memory is distant from the compute core, so accessing it is far
less efficient than accessing the cache.
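
The estimate can be checked with a few lines of arithmetic (our own
snippet, not part of the original example):

```
num_tensors = 3                        # A, B, and C
elems_per_tensor = 1024 * 1024
bytes_per_elem = 4                     # sizeof(float)
total_bytes = num_tensors * elems_per_tensor * bytes_per_elem
print(total_bytes / 2**20)             # 12.0 MB, versus a 32 KB L1 cache
```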

Several scheduling strategies can help improve performance: tile,
reorder, and split. The size of the L1 cache is 32 KB. To ensure that
the data used in every computation step fits in the cache, tiling with
a factor of 32 is performed. In this way, only the tiny block formed by
`m.inner` $\times$ `n.inner` needs to be taken into account, and the
memory accesses of the innermost tiny block are independent of the
outer loops. A tiny block occupies only 32 $\times$ 32 $\times$ 3
$\times$ sizeof(float) = 12 KB in the cache. The optimized code is
shown in Code `lst:after_tvm`. Following this analysis, we tile loops m
and n with a factor of 32. Similarly, we split loop k with a factor of
4 and then reorder k.outer and k.inner so that they sit outside the
m.inner and n.inner loops.

**lst:after_tvm**
```
// Obtain an outer loop by tiling for (m: int32, 0, 1024) based on factor 32.
for (m.outer: int32, 0, 32) {
  // Obtain an outer loop by tiling for (n: int32, 0, 1024) based on factor 32.
  for (n.outer: int32, 0, 32) {
    // Obtain an inner loop by tiling for (m: int32, 0, 1024) based on factor 32.
    for (m.inner.init: int32, 0, 32) {
      // Obtain an inner loop by tiling for (n: int32, 0, 1024) based on factor 32.
      for (n.inner.init: int32, 0, 32) {
        // Obtain the corresponding factors.
        C[((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)) + n.inner.init)] = 0f32
      }
    }
    // Obtain an outer loop by splitting for (k: int32, 0, 1024) based on factor 4, with reorder.
    for (k.outer: int32, 0, 256) {
      // Obtain an inner loop by splitting for (k: int32, 0, 1024) based on factor 4, with reorder.
      for (k.inner: int32, 0, 4) {
        // Obtain an inner loop by tiling for (m: int32, 0, 1024) based on factor 32.
        for (m.inner: int32, 0, 32) {
          // Obtain an inner loop by tiling for (n: int32, 0, 1024) based on factor 32.
          for (n.inner: int32, 0, 32) {
            // Outer axis factor obtained by tiling along axis n.
            let cse_var_3: int32 = (n.outer*32)
            // Outer and inner axis factors obtained by tiling along axis m.
            let cse_var_2: int32 = ((m.outer*32768) + (m.inner*1024))
            // Outer and inner axis factors obtained by tiling along axes m and n.
            let cse_var_1: int32 = ((cse_var_2 + cse_var_3) + n.inner)
            // Split the computational logic into layers so that the data involved in every loop can be stored in the cache.
            C[cse_var_1] = (C[cse_var_1] + (A[((cse_var_2 + (k.outer*4)) + k.inner)] * B[((((k.outer*4096) + (k.inner*1024)) + cse_var_3) + n.inner)]))
          }
        }
      }
    }
  }
}
```
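
For reference, a schedule like this does not need to be written by
hand: it can be expressed with a few schedule primitives in TVM's
tensor expression (te) API. The sketch below is our own reconstruction
of the tile/split/reorder combination discussed above (variable names
are ours, and the exact API surface varies across TVM versions), so
treat it as illustrative rather than as the code that produced Code
`lst:after_tvm`.

```
import tvm
from tvm import te

M = N = K = 1024
k = te.reduce_axis((0, K), name="k")
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C")

s = te.create_schedule(C.op)
# Tile the spatial loops m and n with factor 32 so each tiny block fits in the L1 cache.
m_outer, n_outer, m_inner, n_inner = s[C].tile(C.op.axis[0], C.op.axis[1], 32, 32)
# Split the reduction loop k with factor 4.
k_outer, k_inner = s[C].split(C.op.reduce_axis[0], factor=4)
# Reorder so that k.outer and k.inner sit outside the m.inner/n.inner block.
s[C].reorder(m_outer, n_outer, k_outer, k_inner, m_inner, n_inner)

# Print the lowered IR; it should resemble Code lst:after_tvm.
print(tvm.lower(s, [A, B, C], simple_mode=True))
```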

## Finding Optimized Strategies with Polyhedral Models

Another optimization approach is to automatically select an operator
schedule from a schedule search space. A good example of this idea is
polyhedral compilation. Such approaches improve the generalization of
operator compilation at the expense of longer compile times.

Polyhedral compilation mainly optimizes the loops in user code by
abstracting each loop nest into a multidimensional space, each
computation instance into a point in that space, and each dependency
between instances into a line in that space. The main idea of this
algorithm is to model the memory access characteristics of the code and
adjust the execution order of the instances within each loop, so that
the loop code achieves better locality and parallelism under the new
schedule.

Code `lst:before_poly` is used as an example to describe the algorithm.

**lst:before_poly**
```
for (int i = 0; i < N; i++)
  for (int j = 1; j < N; j++)
    a[i+1][j] = a[i][j+1] - a[i][j] + a[i][j-1];
```

As shown in Figure :numref:`ch05/ch05-poly_test`, a memory access
structure is first modeled using the polyhedral model algorithm, and
then the dependencies (denoted by arrows) between instances (denoted by
nodes) are analyzed.

:label:`ch05/ch05-poly_test`

Complex dependency analysis and schedule transformation are then
performed to obtain an optimal solution that fits the memory model.
Using the polyhedral model algorithm, the code is optimized to that
shown in Code `lst:after_poly`.

**lst:after_poly**
```
for (int i_new = 0; i_new < N; i_new++)
  for (int j_new = i_new+1; j_new < i_new+N; j_new++)
    a[i_new+1][j_new-i_new] = a[i_new][j_new-i_new+1] - a[i_new][j_new-i_new] + a[i_new][j_new-i_new-1];
```
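
The transformation above is a loop skewing: iteration $(i, j)$ of the
original nest is executed as $(i_{new}, j_{new}) = (i, i + j)$. The
small script below (our own check, with hypothetical helper names)
verifies that the skewed nest performs exactly the same updates as the
original one.

```
import numpy as np

N = 64

def run_original(a):
    a = a.copy()
    for i in range(N):
        for j in range(1, N):
            a[i + 1][j] = a[i][j + 1] - a[i][j] + a[i][j - 1]
    return a

def run_skewed(a):
    a = a.copy()
    for i_new in range(N):
        # j_new ranges over i_new + 1 .. i_new + N - 1, i.e., j = j_new - i_new.
        for j_new in range(i_new + 1, i_new + N):
            a[i_new + 1][j_new - i_new] = (
                a[i_new][j_new - i_new + 1]
                - a[i_new][j_new - i_new]
                + a[i_new][j_new - i_new - 1]
            )
    return a

a0 = np.random.rand(N + 2, N + 2)
assert np.allclose(run_original(a0), run_skewed(a0))
```

The reordering itself does not change the result; its benefit, as
discussed next, is that independent instances line up in the skewed
iteration space and can be partitioned for parallel execution.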

The resulting code looks relatively complex. We can model the code (as
shown in Figure :numref:`ch05/ch05-poly`) to determine its performance
improvements. Through dependency analysis, we find that the loop
dependencies present in the source code are removed in the optimized
code, thereby increasing the opportunities for parallel computing.
Specifically, parallel computing becomes possible when the loop
iterations are partitioned along the dashed lines based on the green
blocks, as shown in Figure :numref:`ch05/ch05-poly`.

:label:`ch05/ch05-poly`

We have introduced only the polyhedral compilation technique in this
section. However, other optimization techniques are available, such as
Ansor, a heuristic search method with pruning.

## Adaptation to Instruction Sets

We have previously explored the optimization techniques of operator
compilers. In this section, we build on this foundation to examine how
operator compilers adapt to the instruction sets of different chips.
Typically, a general-purpose compiler is designed to be compatible with
as many backend architectures and instruction sets as possible.
However, this can present challenges when the compiler must handle
backends with different architectures and instruction sets.

Two common programming models adopted by AI processors are single
instruction, multiple data (SIMD) and single instruction, multiple
threads (SIMT). As shown in Figures :numref:`ch05/ch05-SIMD` and
:numref:`ch05/ch05-SIMT`, respectively, SIMD corresponds to chips with
vector instructions, while SIMT corresponds to chips that support
multiple threads. Recently, some chips have begun to combine both
programming models in order to support both multithreaded parallel
computing and vector instructions. When handling different programming
models, an operator compiler adopts different optimization strategies,
such as vectorization.

:label:`ch05/ch05-SIMD`

:label:`ch05/ch05-SIMT`
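
As a rough illustration (our own sketch, not tied to any particular
chip), the same vector-add operator can be lowered differently in TVM
depending on the programming model: the inner loop is vectorized for a
SIMD backend, whereas for a SIMT backend the loops would instead be
bound to block and thread indices.

```
import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

s = te.create_schedule(C.op)
outer, inner = s[C].split(C.op.axis[0], factor=8)

# SIMD-style mapping: lower the inner loop to vector loads, adds, and stores.
s[C].vectorize(inner)

# SIMT-style mapping (e.g., for a CUDA target) would instead bind the loops
# to block and thread indices, roughly:
#   s[C].bind(outer, te.thread_axis("blockIdx.x"))
#   s[C].bind(inner, te.thread_axis("threadIdx.x"))

print(tvm.lower(s, [A, B, C], simple_mode=True))
```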

Operator compilers place a strong emphasis on differentiated support in
the frontend, midend, and backend. In the frontend, support for
multiple backend instruction sets is added, allowing AI programmers to
focus on algorithm logic without having to worry about chip
differences. In the midend, the architectures of different chips are
identified, which allows specific optimization methods to be applied to
each chip. When generating backend code, the instruction sets of
different chips are further identified to ensure efficient execution on
the target chips.

## Expression Ability

The expression ability of an operator compiler is important because it
determines how well the frontend can express the input code in an IR
without losing syntax information. The frontend of an operator compiler
is often fed code written in flexible languages (e.g., PyTorch code
written in Python). Flexible expressions (e.g., indexing and view
syntax in Python) place high demands on the frontend expression ability
of operator compilers. From the model perspective, the code that
manages the inputs of an operator often contains many control flow
statements. Also, some models allow for dynamic-shape operators whose
shapes vary with control flow decisions across iterations.
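
To make this concrete, the following is a hypothetical operator written
in PyTorch (the function name and logic are ours, purely for
illustration): it mixes Python-level control flow with a data-dependent
output shape, which is exactly the kind of code that is hard for an
operator compiler frontend to express as a static tensor program.

```
import torch

def keep_rows_above_threshold(x: torch.Tensor, threshold: float) -> torch.Tensor:
    rows = []
    for i in range(x.shape[0]):        # Python-level loop over the rows of x
        if x[i].sum() > threshold:     # data-dependent branch
            rows.append(x[i])
    if not rows:                       # the output shape depends on the data
        return x.new_empty((0, x.shape[1]))
    return torch.stack(rows)
```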

Additionally, a large number of operators may not have optimized
implementations provided directly by accelerator libraries (e.g.,
cuDNN); such operators are referred to as *long-tail operators*.
Long-tail operators can have highly flexible syntax or abundant control
flow statements and sometimes involve dynamic shapes, making it
extremely difficult for the frontend of existing operator compilers to
express, optimize, or accelerate them. Consequently, such operators
have to be executed by the Python interpreter or a slow virtual
machine, leading to a performance bottleneck in network execution. This
is why it is imperative to improve the expression ability of the
operator compiler frontend.