# Operator Selection

Following graph optimization, the compiler backend generates a sequence
of operators that can be executed on hardware. This is achieved by
selecting the most suitable operators from a set of candidate operators
for each node in the IR. Since these candidate operators have diverse
specifications, their execution efficiency varies depending on the
scenario. Therefore, the primary objective of operator selection is to
choose the operators that are most appropriate for the target device
based on the information provided by the IR.

## Basic Concepts of Operator Selection

We can think of the nodes in a backend-optimized IR as units of
execution that are visible to the user, each representing a
hardware-agnostic operation in the user code. In essence, operator
selection attaches to each node the appropriate hardware-specific
information, referred to as operator information. Such information
defines the following:

1. The format of an operator, which is a determinant of the operator's
   performance on the target platform. Machine learning systems
   commonly use the NCHW and NHWC formats.

2. The data type (such as float32, float16, or int32) of an operator on
   the target platform. The operators selected are those whose data
   types are close to (or the same as) the user definitions. A minimal
   sketch of such an operator-information record follows this list.
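
As a rough illustration of what such a record might look like, here is a
minimal Python sketch. It is hypothetical: the class and field names are
ours, not those of any particular framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperatorInfo:
    """Hypothetical operator-information record (illustrative only)."""
    name: str    # operator name, e.g. "Conv2D"
    fmt: str     # data format, e.g. "NCHW" or "NHWC"
    dtype: str   # data type, e.g. "float32" or "float16"

# Candidate operators for one IR node might then be listed as:
candidates = [
    OperatorInfo("Conv2D", "NCHW", "float32"),
    OperatorInfo("Conv2D", "NHWC", "float16"),
]
```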

### Data Formats

In machine learning systems, many operations (e.g., convolution) are
converted into matrix multiplication for faster computation. Matrix
multiplication in the form of $\mathbf{A}\times \mathbf{B} = \mathbf{C}$
is essentially a row-by-column multiplication. Specifically, the entry
*ij* of **C** is obtained by multiplying the entries in the *i*th row of
**A** with the corresponding entries in the *j*th column of **B** and
then adding the results together. Consider the example shown in Figure
:numref:`ch07/ch07-compiler-backend-06`. Matrix data is stored in
row-major order by default, as shown at the top of the figure. However,
matrix **B** is read in column-major order during the multiplication, as
shown at the bottom.

:label:`ch07/ch07-compiler-backend-06`

Storing matrix **B** in the order in which it is read increases
computation efficiency, because access to contiguous blocks of memory is
faster. We can therefore see that data formats play an important role in
performance improvement.
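
This effect is easy to check with NumPy. The sketch below is
illustrative: it stores **B** transposed so that every inner product of
the multiplication scans contiguous memory.

```python
import numpy as np

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)

# NumPy stores matrices in row-major (C) order, so the columns of B
# read during A @ B are strided. Storing B in its reading order --
# i.e., transposed and made contiguous -- turns column reads into
# contiguous row reads.
Bt = np.ascontiguousarray(B.T)

# Entry (i, j) of C is the dot product of row i of A with
# column j of B, which is row j of Bt.
i, j = 3, 5
c_ij = A[i, :] @ Bt[j, :]
assert np.isclose(c_ij, (A @ B)[i, j])
```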

There are two major data formats in machine learning systems: NCHW and
NHWC. For an image input, N denotes the batch size, C denotes the number
of channels, and H and W denote the height and width, respectively.
Figure :numref:`ch07/ch07-compiler-backend-07` depicts the logical
diagram of an input with batch size 2, 16 channels, height 5, and
width 4.

:label:`ch07/ch07-compiler-backend-07`

A multidimensional matrix is flattened into a 1D layout before it is
written to memory. This involves indexing, which maps logical data to
physical memory.

Machine learning data is accessed axis by axis, from the last axis
forward. For instance, data in NCHW format is read in the axis order
W, H, C, and N. Equation :eqref:`ch05/equation-01` gives the mapping
between logical memory and physical memory for this format of data.

$$
\text{offset}_{\text{NCHW}}(n,c,h,w) = n \times C \times H \times W + c \times H \times W + h \times W + w
$$
:eqlabel:`equation:ch05/equation-01`

As shown in Figure :numref:`ch07/ch07-compiler-backend-08`, matrix
elements are flattened from the lowest dimension (i.e., the W axis)
forward, so neighboring elements of an axis reside next to each other in
memory. Accessing the element at the same location in the next image
requires a jump of the whole image size ($C \times H \times W$). Assume
we have a batch of eight RGB images of size $32\times 32$, that is, a
matrix with $N=8$, $C=3$, $H=32$, $W=32$. Memory storage of these images
begins at the first channel of the first image, flattening the matrix
along axis W and then arranging elements along axis H before the next
channel is processed. The same procedure is repeated until the last
channel of the last image is written. NCHW is the default format in
PyTorch and MindSpore.

:label:`ch07/ch07-compiler-backend-08`
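
Equation :eqref:`ch05/equation-01` can be checked directly in NumPy,
whose row-major layout for a 4D array is exactly the NCHW flattening
just described. A minimal sketch, using the eight-image example above:

```python
import numpy as np

def offset_nchw(n, c, h, w, C, H, W):
    """Physical offset of logical element (n, c, h, w) in NCHW layout."""
    return n * C * H * W + c * H * W + h * W + w

N, C, H, W = 8, 3, 32, 32                       # eight 32x32 RGB images
x = np.arange(N * C * H * W).reshape(N, C, H, W)

# Row-major flattening of a 4D array matches the NCHW offset formula.
flat = x.reshape(-1)
assert flat[offset_nchw(2, 1, 30, 7, C, H, W)] == x[2, 1, 30, 7]
```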

Access to data in NHWC format likewise begins at the lowest dimension
(i.e., the C axis). NHWC is the default format in TensorFlow (PyTorch
refers to it as the channels-last format). Equation
:eqref:`ch05/equation-02` gives the mapping from logical memory to
physical memory for this format of data.

$$
\text{offset}_{\text{NHWC}}(n,h,w,c) = n \times H \times W \times C + h \times W \times C + w \times C + c
$$
:eqlabel:`equation:ch05/equation-02`
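
The NHWC mapping can be sketched and checked the same way. Note that
converting between the two layouts is a transpose, i.e., a physical
movement of data, which is exactly the kind of cost discussed at the end
of this section.

```python
import numpy as np

def offset_nhwc(n, h, w, c, H, W, C):
    """Physical offset of logical element (n, h, w, c) in NHWC layout."""
    return n * H * W * C + h * W * C + w * C + c

N, H, W, C = 8, 32, 32, 3
x_nhwc = np.arange(N * H * W * C).reshape(N, H, W, C)
assert x_nhwc.reshape(-1)[offset_nhwc(2, 30, 7, 1, H, W, C)] == x_nhwc[2, 30, 7, 1]

# NHWC -> NCHW is a transpose; materializing it moves every element.
x_nchw = np.ascontiguousarray(x_nhwc.transpose(0, 3, 1, 2))
assert x_nchw.shape == (N, C, H, W)
```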

Figure :numref:`ch07/ch07-compiler-backend-nchwandnhwc` compares the
logical indexing of the NCHW and NHWC formats. The [x:1] marks refer to
the jumps from the innermost axis to the next. For example, [a:1]
indicates the jump from axis W to axis H, and [b:1] indicates the jump
from axis C (the innermost) to axis W.

:label:`ch07/ch07-compiler-backend-nchwandnhwc`

These two formats offer a high degree of flexibility and are therefore
used by many frameworks. However, further optimization is needed to
accelerate computing on hardware. In a machine learning system, if the
size of the user input exceeds what the compute component can process at
a time (which is often the case), the input is batched before
computation. For further acceleration, many frameworks introduce more
hardware-friendly blocked formats, such as the nChw16c and nChw8c
formats of the oneAPI Deep Neural Network Library (oneDNN) and the
NC1HWC0 format on the Ascend platform. By leveraging hardware
acceleration instructions to move and compute data, matrices can be
quickly transformed into vectors, increasing the utilization of the
on-chip cache.
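
As a rough sketch of how such a blocked layout is derived (assuming, for
simplicity, that C is divisible by the block size; real libraries pad
the channel axis when it is not), the channel axis is split into C1
outer blocks of C0 inner channels, and the inner block becomes the last,
contiguous axis, as in NC1HWC0:

```python
import numpy as np

def to_nc1hwc0(x, c0=16):
    """Reblock an NCHW tensor into NC1HWC0 (sketch; assumes C % c0 == 0)."""
    n, c, h, w = x.shape
    assert c % c0 == 0, "real libraries pad C up to a multiple of c0"
    c1 = c // c0
    # Split C into (C1, C0), then move C0 to the innermost axis so that
    # each block of c0 channels is contiguous in memory.
    return np.ascontiguousarray(
        x.reshape(n, c1, c0, h, w).transpose(0, 1, 3, 4, 2))

x = np.random.rand(2, 32, 5, 4)        # NCHW input
y = to_nc1hwc0(x)                      # NC1HWC0 output
assert y.shape == (2, 2, 5, 4, 16)     # blocks of 16 channels, contiguous
```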

### Data Types

Single precision (float32), which occupies 32 bits in memory, is the
most commonly used data type in machine learning systems. In
applications where higher precision is not essential, the half-precision
(float16) data type may be used instead, occupying 16 bits in memory. On
suitable hardware, float16 offers up to 7 times more arithmetic
throughput than single precision with a smaller memory footprint,
allowing larger batch sizes and consequently shorter training times.
Next, we look at the differences between half-precision and
single-precision floating-point numbers.

In Figure :numref:`ch07/ch07-float32andfloat16`, *Sig* refers to the
sign bit that indicates the sign of a number, *Exponent* refers to the
exponent bits, and *Mantissa* refers to the mantissa bits.

:label:`ch07/ch07-float32andfloat16`

Applying Equation :eqref:`ch05/equation-03` converts a float16 number in
binary scientific notation to decimal format.

$$
(-1)^{\text{Sig}}\times 2^{\text{Exponent}-15}\times \left(\frac{\text{Mantissa}}{1024}+1\right)
$$
:eqlabel:`equation:ch05/equation-03`

If the exponent bits and mantissa bits are all 0s, the number is 0. If
the exponent bits are all 0s but the mantissa bits are not, the number
is subnormal, i.e., very close to zero. If the exponent bits are all 1s
and the mantissa bits are all 0s, the number is an infinity, positive or
negative depending on the sign bit. Not a Number (NaN) is denoted by the
exponent bits being all 1s while the mantissa bits are not all 0s.
bfloat16 is a special data type developed by Google for machine learning
on its tensor processing units (TPUs). Although bfloat16 is not an
IEEE-standard 16-bit floating-point type, it has the same exponent width
as float32, meaning that it can be easily converted to and from float32.
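
The decoding rules above can be collected into a short sketch that
applies Equation :eqref:`ch05/equation-03` to a raw 16-bit pattern.
Subnormals use a fixed exponent of $-14$ and no implicit leading 1, a
detail we add here for completeness.

```python
import numpy as np

def decode_float16(bits):
    """Decode a raw 16-bit pattern as a half-precision float."""
    sig = (bits >> 15) & 0x1             # 1 sign bit
    exponent = (bits >> 10) & 0x1F       # 5 exponent bits
    mantissa = bits & 0x3FF              # 10 mantissa bits
    sign = (-1) ** sig
    if exponent == 0x1F:                 # all 1s: infinity or NaN
        return float("nan") if mantissa else sign * float("inf")
    if exponent == 0:                    # all 0s: zero or subnormal
        return sign * 2.0 ** -14 * (mantissa / 1024)
    return sign * 2.0 ** (exponent - 15) * (mantissa / 1024 + 1)

# Cross-check against NumPy's native float16 bit patterns.
for value in (1.0, -2.5, 2.0 ** -10, 65504.0):
    bits = int(np.array([value], dtype=np.float16).view(np.uint16)[0])
    assert decode_float16(bits) == value
```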

### Operator Information Library

Hardware devices support different operators depending on their data
format and data type requirements. Each device maintains an operator
information library that lists all the operators supported by that
device. During operator selection, the most suitable operators are
chosen from this library; it serves as the reference for determining
which operators are compatible with, and can be executed efficiently on,
a particular hardware device.
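
A toy model of such a library is sketched below. The structure and the
entries are hypothetical, chosen only to make the lookup concrete; no
real framework exposes exactly this table.

```python
# Hypothetical per-device operator information library (illustrative
# only): each entry maps an operator name to the (format, dtype) pairs
# the device supports for it.
OP_INFO_LIBRARY = {
    "Ascend": {
        "Conv2D": [("NC1HWC0", "float16")],
        "Add":    [("NCHW", "float32"), ("NCHW", "float16")],
    },
    "CPU": {
        "Conv2D": [("NCHW", "float32"), ("NHWC", "float32")],
        "Add":    [("NCHW", "float32")],
    },
}

def supported_variants(device, op_name):
    """Return the (format, dtype) pairs a device supports for an operator."""
    return OP_INFO_LIBRARY.get(device, {}).get(op_name, [])

print(supported_variants("Ascend", "Conv2D"))   # [('NC1HWC0', 'float16')]
```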

## Process of Operator Selection

Operator selection chooses the most appropriate operator for each
operation node in an IR. Operator information covers the supported
device type, data type, and data format. After the compiler frontend
completes type inference and static analysis, the data type of the user
code can be derived from the IR.

Figure :numref:`ch07/ch07-compiler-backend-select` shows the operator
selection process. First, the target hardware is selected (or this step
is skipped in order to keep the default hardware selection defined in
the compiler backend). The implementation, supported data types, and
execution efficiency of a given operator vary with the target hardware.
The compiler backend then selects an operator based on the data type and
data format derived from the IR.

:label:`ch07/ch07-compiler-backend-select`

The result of the operator selection process might not be as expected,
owing to software or hardware constraints. Sometimes, we may need to
adjust the precision of a particular node in order to find an operator
with a suitable data type. For example, the Conv2D operator supported by
Ascend (i.e., the backend of MindSpore) allows only the float16 data
type. When used in a float32 network on Ascend, the Conv2D operator is
executable only after its input precision is reduced from float32 to
float16.
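
The Conv2D case can be mirrored in a small selection sketch. The scoring
below is hypothetical (data type consistency is weighted above format
consistency), but it shows how a float32 request can fall back to a
float16 kernel:

```python
# Toy candidate list for Conv2D on Ascend (illustrative only).
CANDIDATES = [("NC1HWC0", "float16")]

REDUCE = {"float32": "float16"}         # permitted precision reductions

def select_operator(want_fmt, want_dtype, candidates):
    """Pick the candidate with the lowest mismatch score, or None."""
    def score(fmt, dtype):
        s = 0
        if dtype != want_dtype:
            if REDUCE.get(want_dtype) != dtype:
                return None             # no acceptable data type
            s += 2                      # precision reduction needed
        if fmt != want_fmt:
            s += 1                      # format conversion needed
        return s
    scored = [(score(f, d), (f, d)) for f, d in candidates]
    scored = [sc for sc in scored if sc[0] is not None]
    return min(scored)[1] if scored else None

# A float32 NCHW node lands on the float16 NC1HWC0 kernel.
print(select_operator("NCHW", "float32", CANDIDATES))  # ('NC1HWC0', 'float16')
```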

Converting data from one format to another is time-consuming and incurs
memory-movement overhead. To avoid this, data should be transferred
between operators of the same format whenever possible. In addition,
data type inconsistency may lead to reduced precision, potentially
slowing down or even preventing network convergence. As such, thorough
operator analysis is needed to ensure that the right data type is
selected.

Simply put, an operator selection algorithm is considered optimal if it
keeps the data type as consistent as possible with user settings while
also minimizing data format conversion.