# Conversion to Inference Model and Model Optimization {#sec:ch-deploy/model-optimization}

## Model Conversion

As mentioned earlier, TensorFlow, PyTorch, MindSpore, MXNet, and CNTK
each define their own model data structures, so the inference system
must convert these structures into a unified format. Open Neural
Network Exchange (ONNX) is designed for this purpose: it supports an
extensive range of machine learning operators and can represent models
from various frameworks (e.g., TensorFlow and PyTorch) as ONNX models.
Because models are structured data, conversion is essentially a
transformation between data structures. It starts by analyzing the
similarities and differences between the two structures. If they are
identical, the data is copied directly; if they are similar but differ
slightly, the data is mapped; if they differ significantly, additional
semantic conversion may be required; and if they are entirely
incompatible, the conversion fails. ONNX has strong expressive power,
meaning that models from most mainstream frameworks can be converted
into compatible ONNX models. If a model is abstracted as a graph, its
data structure can be defined by the following two parts:

1. **Topological expression of the model:** The topological
    connections of a model are represented as edges in a graph. From
    the perspective of the model, these edges define its data flows
    and control flows. On this basis, the expression can be extended
    to subgraphs, model inputs and outputs, and control flow
    structures. For example, control flow in TensorFlow 1.x is
    expressed as a cyclic graph built from operators such as Enter,
    Exit, Switch, LoopCond, and NextIteration, whereas ONNX expresses
    control flow with structured operators such as Loop and If.
    Consequently, when converting a TensorFlow 1.x model that contains
    control flow into an ONNX model, the control flow subgraph of the
    TensorFlow model must be merged into a Loop or If operator in
    ONNX.

2. **Operator prototype definition:** Operators can be regarded as
    the data-processing or control flow nodes of a model, that is, as
    vertices in a graph. An operator prototype defines the type,
    inputs, outputs, and attributes of an operator. For instance,
    Slice has different semantics in Caffe and ONNX, so converting a
    Caffe model into an ONNX model requires mapping Slice in Caffe to
    Split in ONNX. Conversely, FusedBatchnorm in TensorFlow has no
    single counterpart in Caffe; instead, Batchnorm and Scale in Caffe
    must be combined to express the same semantics. In general, model
    conversion involves converting the topological relationships and
    mapping the operator prototypes between models, as the sketch
    below illustrates.
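
As a concrete illustration, the following minimal sketch exports a
PyTorch model to an ONNX model with the official `torch.onnx` exporter.
The choice of model, file name, and opset version are illustrative
assumptions rather than requirements of the conversion process.

```python
import torch
import torchvision

# Build a PyTorch model and switch it to inference mode.
model = torchvision.models.mobilenet_v2(weights=None).eval()

# A dummy input fixes the input shape used to trace the graph.
dummy_input = torch.randn(1, 3, 224, 224)

# The exporter walks the traced PyTorch graph and maps each operator
# to its ONNX counterpart, i.e., the prototype mapping described above.
torch.onnx.export(model, dummy_input, "mobilenet_v2.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=13)
```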

Following model conversion, a number of input-agnostic optimizations
are applied before model deployment, including constant folding,
operator fusion, operator replacement, and operator reordering, all of
which were discussed earlier in this book. For instance, constant
folding is usually performed by the compiler frontend during
compilation, whereas operator fusion and graph partition are often
performed once compilation is complete, depending on the backend
hardware support. However, some optimizations can be performed in
their entirety only during the deployment phase.
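
As one concrete example of such load-time optimization, ONNX Runtime
applies input-agnostic rewrites, including constant folding and
operator fusion, when an inference session is created. The sketch
below raises the optimization level and dumps the rewritten graph for
inspection; the file names are placeholders.

```python
import onnxruntime as ort

opts = ort.SessionOptions()
# Enable extended graph optimizations (constant folding, operator
# fusion, and other input-agnostic rewrites) at session creation.
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
# Save the optimized graph so the applied rewrites can be inspected.
opts.optimized_model_filepath = "mobilenet_v2.opt.onnx"

session = ort.InferenceSession("mobilenet_v2.onnx", sess_options=opts)
```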


## Operator Fusion {#sec:ch-deploy/kernel-fusion}

:label:`ch-deploy/fusion-storage`

Operator fusion combines multiple operators of a deep neural network
(DNN) model into a new operator according to certain rules. By
lowering the computation workload and the load/store overhead of
online inference, it reduces both inference latency and power
consumption.

Operator fusion brings two main performance benefits. First, it
maximizes the utilization of registers and caches. Second, because
operators are combined, the load/store traffic between the CPU and
memory is reduced. Figure :numref:`ch-deploy/fusion-storage` shows the
architecture of a computer's storage system: from the level-1 cache
(L1) down to the hard disk, storage capacity increases, but so does
the time needed to read data. After operator fusion is performed, an
intermediate result can be kept in a CPU register or cache, where the
next computation reads it directly, reducing the number of I/O
operations on memory. Furthermore, operator fusion allows some
computation to be completed ahead of time, eliminating redundant, and
even cyclically redundant, computation during forward inference.


:label:`ch-deploy/conv-bn-fusion`

To describe the principle of operator fusion, we use two operators,
Convolution and Batchnorm, as shown in Figure
:numref:`ch-deploy/conv-bn-fusion`. In the figure, solid-colored boxes
indicate the original operators, hatched boxes indicate the operators
produced by fusion, and the weights or constant tensors of the
operators are outlined in white. The fusion can be understood as the
simplification of an equation. The computation of Convolution is
expressed as Equation :eqref:`ch-deploy/conv-equation`:

$$\bf{Y_{\rm conv}}=\bf{W_{\rm conv}}\cdot\bf{X_{\rm conv}}+\bf{B_{\rm conv}}$$
:eqlabel:`equ:ch-deploy/conv-equation`

We do not need to interpret each variable here. It suffices to keep in
mind that Equation :eqref:`ch-deploy/conv-equation` expresses
$\bf{Y_{\rm conv}}$ as a function of $\bf{X_{\rm conv}}$, with all
other symbols being constants.

Equation :eqref:`ch-deploy/bn-equation` gives the computation of
Batchnorm:

$$\bf{Y_{\rm bn}}=\gamma\frac{\bf{X_{\rm bn}}-\mu_{\mathcal{B}}}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}+\beta$$
:eqlabel:`equ:ch-deploy/bn-equation`

Similarly, it expresses $\bf{Y_{\rm bn}}$ as a function of
$\bf{X_{\rm bn}}$, and the other symbols in the equation are
constants.

As shown in Figure :numref:`ch-deploy/conv-bn-fusion`, when the output
of Convolution is used as the input of Batchnorm, the Batchnorm
formula becomes a function of $\bf{Y_{\rm bn}}$ with respect to
$\bf{X_{\rm conv}}$. Substituting $\bf{Y_{\rm conv}}$ for
$\bf{X_{\rm bn}}$ and collecting the constant terms, we obtain
Equation :eqref:`ch-deploy/conv-bn-equation-3`:

$$\bf{Y_{\rm bn}}=\bf{A}\cdot\bf{X_{\rm conv}}+\bf{B}$$
:eqlabel:`equ:ch-deploy/conv-bn-equation-3`

Here, $\bf{A}$ and $\bf{B}$ are two constant matrices. Note that
Equation :eqref:`ch-deploy/conv-bn-equation-3` has exactly the form of
the Convolution formula. The preceding derivation shows that a
Convolution followed by a Batchnorm can be fused into a single,
equivalent Convolution operator. Such fusion is referred to as formula
fusion.
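
The derivation above translates directly into code. The following
sketch folds an eval-mode Batchnorm into the preceding Convolution by
rescaling its weights and bias. It is written with PyTorch purely for
illustration (the experiment below uses MindSpore Lite), and the
function name is our own.

```python
import torch
from torch import nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an eval-mode BatchNorm2d into the preceding Conv2d."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      kernel_size=conv.kernel_size, stride=conv.stride,
                      padding=conv.padding, dilation=conv.dilation,
                      groups=conv.groups, bias=True)
    # Per-channel scale: gamma / sqrt(running_var + eps).
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    # A = W_conv * scale, broadcast over each output channel.
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    # B = (B_conv - mean) * scale + beta.
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

# The fused operator matches Convolution + Batchnorm numerically.
conv, bn = nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8).eval()
x = torch.randn(1, 3, 16, 16)
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```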

The fusion of Convolution and Batchnorm eliminates an entire Batchnorm
operation, thereby reducing the number of parameters, the computation
workload, and, in turn, the load/store operations. In general, this
fusion not only improves the power consumption and performance of a
model during deployment, but also helps compress the model size.

The symbols treated as constants in the Convolution and Batchnorm
formulas during fusion are trainable parameters during training, so
performing the fusion during training would discard model parameters.
Because the fusion removes the Batchnorm operator and its parameters
from the network, it changes the algorithm of the DNN and would
degrade the accuracy to unacceptable levels. The fusion of Convolution
and Batchnorm is therefore an optimization typically applied during
deployment. To evaluate its effect, we constructed a sample network
containing Convolution and Batchnorm using MindSpore Lite, and ran
both this sample network and the mobilenet-v2 network for inference in
dual threads on a Huawei Mate 30 smartphone, comparing the total time
of 3,000 inference runs before and after fusion. As shown in Table
:numref:`ch09/ch09-conv-bn-fusion`, the inference performance of the
sample network and the mobilenet-v2 network improves considerably
after fusion, by 8.5% and 11.7% respectively. These improvements come
without side effects and without requiring additional hardware or
operator libraries.

:Convolution + Batchnorm inference performance before and after fusion (unit: ms)

| Fusion        | Sample | Mobilenet-v2 |
|---------------|--------|--------------|
| Before fusion | 0.035  | 15.415       |
| After fusion  | 0.031  | 13.606       |
:label:`ch09/ch09-conv-bn-fusion`

## Operator Replacement

The principle of operator replacement is to simplify an operator's
formula by combining like terms, extracting common factors, and
applying other mathematical transformations, and then to map the
simplified formula onto another operator that has the same
computational logic but is better suited to online deployment. In this
way, we can reduce the computation workload and compress the model.


:label:`ch-deploy/bn-replace`

Figure :numref:`ch-deploy/bn-replace` depicts the replacement of
Batchnorm with Scale, which we use as an example to describe the
principle of operator replacement. After decomposing Equation
:eqref:`ch-deploy/bn-equation` (the Batchnorm formula) and folding its
constants, Batchnorm can be rewritten as Equation
:eqref:`ch-deploy/replace-scale`:

$$\bf{Y_{\rm bn}}=\mathrm{scale}\cdot\bf{X_{\rm bn}}+\mathrm{offset}$$
:eqlabel:`equ:ch-deploy/replace-scale`

where $\mathrm{scale}$ and $\mathrm{offset}$ are scalars. This
simplified formula can be mapped to a Scale operator.
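
The constant folding behind this replacement is straightforward to
spell out. The following minimal NumPy sketch (the function name and
the eps default are our own assumptions) computes the Scale operator's
parameters from the Batchnorm constants:

```python
import numpy as np

def batchnorm_to_scale(gamma, beta, mean, var, eps=1e-5):
    """Fold Batchnorm constants into Scale parameters.

    y = gamma * (x - mean) / sqrt(var + eps) + beta
      = scale * x + offset
    """
    scale = gamma / np.sqrt(var + eps)
    offset = beta - mean * scale
    return scale, offset

# Check the folded form against the original Batchnorm formula.
gamma, beta, mean, var = 1.5, 0.1, 0.2, 4.0
scale, offset = batchnorm_to_scale(gamma, beta, mean, var)
x = 0.7
assert np.isclose(gamma * (x - mean) / np.sqrt(var + 1e-5) + beta,
                  scale * x + offset)
```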

Compared with the original Batchnorm formula, the simplified formula
has fewer parameters and requires less computation, which makes
operator replacement an effective way to optimize the power
consumption and performance of a model during deployment. Note that
the symbols treated as constants in Batchnorm during deployment are
not constants during training, so the replacement can be performed
only during deployment: applied during training, it would reduce the
number of parameters and change the structure of the model, weakening
its expressive power and lowering the accuracy the model converges to.

## Operator Reordering

Another way to reduce the computation workload of an inference model
is to adjust the topological order of its operators according to
certain rules, on the condition that the inference accuracy is not
degraded. Common operator reordering methods include moving cropping
operators (e.g., Slice, StrideSlice, and Crop) forward, and reordering
Reshape, Transpose, and BinaryOp operators.


:label:`ch-deploy/crop-reorder`

Crop cuts a part out of the input feature map and passes it on as the
output, so the feature map shrinks after Crop is executed. As shown in
Figure :numref:`ch-deploy/crop-reorder`, moving Crop forward so that
the feature map is cropped before the other operators run reduces the
computation workload of those subsequent operators, thereby improving
inference performance in the deployment phase; the size of the
improvement depends on the operator parameters. Note, however, that
Crop can be moved forward only across element-wise operators.
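
The legality of this rewrite is easy to verify for an element-wise
operator: cropping after the operator and cropping before it produce
identical results, while the latter touches far fewer elements. A
minimal NumPy sketch (the shapes and the choice of ReLU are
illustrative assumptions):

```python
import numpy as np

x = np.random.rand(1, 8, 224, 224).astype(np.float32)

# Original order: element-wise ReLU over the full map, then Crop.
out_original = np.maximum(x, 0)[:, :, :112, :112]

# Reordered: Crop first, then ReLU over a quarter of the elements.
out_reordered = np.maximum(x[:, :, :112, :112], 0)

# Identical outputs; the reordered version does ~4x less ReLU work.
assert np.array_equal(out_original, out_reordered)
```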

The experimental results above show that optimizing a model before
inference can significantly reduce its latency, power consumption, and
memory usage.