# Conversion to Inference Model and Model Optimization {#sec:ch-deploy/model-optimization}

## Model Conversion

As mentioned earlier, TensorFlow, PyTorch, MindSpore, MXNet, and CNTK define their own model data structures. This means that the inference system needs to convert these structures to a unified one. Open Neural Network Exchange (ONNX) is designed to implement such a conversion. It supports an extensive range of machine learning operators and converts models from various frameworks (e.g., TensorFlow and PyTorch) into ONNX models. Because models are structured data, the conversion process involves converting the data structure. It starts by analyzing the similarities and differences between the two data structures. If they are the same, data is transferred; if the structures are similar but with slight differences, data is mapped; if the structures differ significantly, extra semantic conversion might be required; and if they are totally incompatible, the conversion will fail. ONNX features strong expressive power, meaning that it can convert models from most frameworks in the industry into compatible ONNX models. If a model is abstracted as a graph, its data structure can be defined as follows:

1. **Topological expression of model:** The topological connections of a model are represented as edges in a graph. From the perspective of a model, these edges define the data flows and control flows in the model. Based on such definitions, we can extend to the expressions of subgraphs, model inputs and outputs, and control flow structures. For example, the control flow in TensorFlow 1.x is expressed as a cyclic graph. To prevent the formation of cycles, TensorFlow 1.x uses operators such as Enter, Exit, Switch, LoopCond, and NextIteration, whereas ONNX uses operators such as Loop and If. As such, when converting a TensorFlow 1.x control flow model into an ONNX model, the control flow graph structure in the TensorFlow model must be merged into a Loop or If operator on ONNX.

2. **Operator prototype definition:** Operators can be regarded as data processing or control flow nodes in a model, or as vertices in a graph. An operator prototype defines the type, inputs, outputs, and attributes of an operator. For instance, Slice has different semantics on Caffe and ONNX. To convert a Caffe model into an ONNX model, we need to map Slice on Caffe to Split on ONNX. FusedBatchnorm on TensorFlow does not have a mapping operator on Caffe. Rather, Batchnorm and Scale on Caffe need to be combined to express the same semantics as FusedBatchnorm on TensorFlow. Generally, the model conversion process involves converting the topological relationships and mapping the operator prototypes between models, as the sketch after this list illustrates for a simple PyTorch-to-ONNX export.

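Below is a minimal sketch of such a conversion: exporting a small PyTorch network to an ONNX model with `torch.onnx.export`. It is only an illustration; the toy network, output file name, and opset version are chosen for the example and are not taken from the text above.

```python
import torch
import torch.nn as nn

# A toy Convolution + Batchnorm network used only to illustrate the export.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
).eval()

# An example input fixes the shapes of the traced graph.
dummy_input = torch.randn(1, 3, 224, 224)

# torch.onnx.export traces the model and writes an ONNX file in which the
# PyTorch operators have been mapped to ONNX operator prototypes.
torch.onnx.export(
    model,
    dummy_input,
    "sample.onnx",              # hypothetical output path
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)
```
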
Following model conversion, some input-agnostic operations are conducted for optimization purposes prior to model deployment, including constant folding, operator fusion, operator replacement, and operator reordering --- optimization methods discussed earlier in this book. For instance, constant folding is usually performed during the compilation executed on the compiler frontend, whereas operator fusion and partition are often performed (depending on the backend hardware support) once the compilation is complete. However, some optimization operations can only be performed in their entirety during the deployment phase.

![Layered computer storage architecture](../img/ch08/ch09-storage.png)
:label:`ch-deploy/fusion-storage`

## Operator Fusion {#sec:ch-deploy/kernel-fusion}

Operator fusion involves combining multiple operators in a deep neural network (DNN) model into a new operator based on certain rules, reducing the inference latency and power consumption by lowering the computation workload and load/store overhead during online inference.

Operator fusion brings two main performance benefits. First, it maximizes the utilization of registers and caches. Second, because it combines operators, the load/store traffic between the CPU and memory is reduced. Figure :numref:`ch-deploy/fusion-storage` shows the architecture of a computer's storage system. While the storage capacity increases from the level-1 cache (L1) to the hard disk, so too does the time for reading data. After operator fusion is performed, the previous computation result can be temporarily stored in the CPU's register or cache, where the next computation can read it directly, reducing the number of I/O operations on the memory. Furthermore, operator fusion allows some computation to be completed in advance, eliminating redundant or even cyclically redundant computation during forward computation.

![Convolution + Batchnorm operator fusion](../img/ch08/ch09-conv-bn-fusion.png)
:label:`ch-deploy/conv-bn-fusion`

To describe the principle of operator fusion, we will use two operators, Convolution and Batchnorm, as shown in Figure :numref:`ch-deploy/conv-bn-fusion`. In the figure, the solid-colored boxes indicate operators, the operators resulting from fusion are represented by hatched boxes, and the weights or constant tensors of operators are outlined in white. The fusion can be understood as the simplification of an equation. The computation of Convolution is expressed as Equation :eqref:`ch-deploy/conv-equation`.

$$\bf{Y_{\rm conv}}=\bf{W_{\rm conv}}\cdot\bf{X_{\rm conv}}+\bf{B_{\rm conv}}$$
:eqlabel:`equ:ch-deploy/conv-equation`

Here, we do not need to understand what each variable means. Instead, we only need to keep in mind that Equation :eqref:`ch-deploy/conv-equation` is an equation for $\bf{Y_{\rm conv}}$ with respect to $\bf{X_{\rm conv}}$; the other symbols are constants.

Equation :eqref:`ch-deploy/bn-equation` describes the computation of Batchnorm:

$$\bf{Y_{\rm bn}}=\gamma\frac{\bf{X_{\rm bn}}-\mu_{\mathcal{B}}}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}+\beta$$
:eqlabel:`equ:ch-deploy/bn-equation`

Similarly, it is an equation for $\bf{Y_{\rm bn}}$ with respect to $\bf{X_{\rm bn}}$; the other symbols in the equation represent constants.

As shown in Figure :numref:`ch-deploy/conv-bn-fusion`, when the output of Convolution is used as the input of Batchnorm, the Batchnorm formula becomes a function for $\bf{Y_{\rm bn}}$ with respect to $\bf{X_{\rm conv}}$. After substituting $\bf{Y_{\rm conv}}$ for $\bf{X_{\rm bn}}$ and collecting and extracting the constants, we obtain Equation :eqref:`ch-deploy/conv-bn-equation-3`.

$$\bf{Y_{\rm bn}}=\bf{A}\cdot\bf{X_{\rm conv}}+\bf{B}$$
:eqlabel:`equ:ch-deploy/conv-bn-equation-3`

Here, $\bf{A}$ and $\bf{B}$ are two constant matrices, so Equation :eqref:`ch-deploy/conv-bn-equation-3` has the same form as the Convolution formula. The preceding example shows that the computation of Convolution and Batchnorm can be fused into an equivalent Convolution operator. Such fusion is referred to as formula fusion.

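To make the derivation concrete, the following is a small NumPy sketch (an illustration written in the matrix form of Equation :eqref:`ch-deploy/conv-equation`, not the MindSpore Lite implementation) that folds the Batchnorm constants into the Convolution weight and bias and checks that the fused operator produces the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
out_ch, in_dim = 4, 8

# Convolution written in matrix form: y = W_conv @ x + B_conv.
W_conv = rng.standard_normal((out_ch, in_dim))
B_conv = rng.standard_normal(out_ch)

# Batchnorm parameters, all constants at inference time.
gamma = rng.standard_normal(out_ch)
beta = rng.standard_normal(out_ch)
mu = rng.standard_normal(out_ch)
var = rng.random(out_ch) + 0.5
eps = 1e-5

# Fold Batchnorm into the Convolution: A = s * W_conv and
# B = s * (B_conv - mu) + beta, where s = gamma / sqrt(var + eps)
# is applied per output channel.
s = gamma / np.sqrt(var + eps)
A = s[:, None] * W_conv
B = s * (B_conv - mu) + beta

x = rng.standard_normal(in_dim)
y_separate = gamma * (W_conv @ x + B_conv - mu) / np.sqrt(var + eps) + beta
y_fused = A @ x + B
print(np.allclose(y_separate, y_fused))  # True: the fused Convolution is equivalent
```
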
The fusion of Convolution and Batchnorm eliminates a Batchnorm operation, thereby reducing the quantity of parameters, the computation workload, and the number of load/store operations. In general, this fusion not only optimizes the power consumption and performance during model deployment, but also brings certain benefits in compressing the model size.

Symbols that are treated as constants in the Convolution and Batchnorm formulas during fusion are trainable parameters during training. Performing the fusion during training would therefore remove model parameters: because the fusion eliminates the Batchnorm operator and its parameters from the network, the algorithm of the DNN is changed and the accuracy degrades to unacceptable levels. Therefore, the fusion of Convolution and Batchnorm is an optimization method typically used during deployment. To evaluate the optimization effect, we constructed a sample network with Convolution and Batchnorm using MindSpore Lite. We ran the sample network and the mobilenet-v2 network for inference in dual threads on a Huawei Mate 30 smartphone and compared the time of running 3,000 inference epochs before and after the fusion. As shown in Table :numref:`ch09/ch09-conv-bn-fusion`, the inference performance of the sample network and the mobilenet-v2 network is improved considerably after the fusion --- by 8.5% and 11.7%, respectively. Such improvements are achieved without side effects and without requiring additional hardware or operator libraries.

:Convolution + Batchnorm inference performance before and after fusion (unit: ms)

| Fusion        | Sample | Mobilenet-v2 |
|---------------|--------|--------------|
| Before fusion | 0.035  | 15.415       |
| After fusion  | 0.031  | 13.606       |
:label:`ch09/ch09-conv-bn-fusion`

## Operator Replacement

The principle of operator replacement is to simplify an operator formula by uniting like terms, extracting common factors, and employing other mathematical methods, and then to map the simplified formula to a type of operator that has the same computational logic but is more suitable for online deployment. In this way, we can reduce the computation workload and compress the model.

![Replacement of Batchnorm](../img/ch08/ch09-bn-replace.png)
:label:`ch-deploy/bn-replace`

Figure :numref:`ch-deploy/bn-replace` depicts the replacement of Batchnorm with Scale, which is used as an example to describe the principle of operator replacement. After decomposing Equation :eqref:`ch-deploy/bn-equation` (the Batchnorm formula) and folding the constants, Batchnorm is defined as Equation :eqref:`ch-deploy/replace-scale`:

$$\bf{Y_{\rm bn}}={\rm scale}\cdot\bf{X_{\rm bn}}+{\rm offset}$$
:eqlabel:`equ:ch-deploy/replace-scale`

where **scale** and **offset** are constants computed from the Batchnorm parameters. This simplified formula can be mapped to a Scale operator.

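As a small illustration (not a specific framework API; the numeric values are made up), the sketch below folds per-channel Batchnorm constants into the **scale** and **offset** of a Scale operator and checks that both forms give the same result.

```python
import numpy as np

# Per-channel Batchnorm constants (hypothetical values).
gamma = np.array([1.2, 0.8], dtype=np.float32)   # learned scale
beta = np.array([0.3, -0.1], dtype=np.float32)   # learned shift
mu = np.array([0.5, 1.5], dtype=np.float32)      # moving mean
var = np.array([2.0, 0.5], dtype=np.float32)     # moving variance
eps = 1e-5

# Fold the Batchnorm constants into the parameters of a Scale operator.
scale = gamma / np.sqrt(var + eps)
offset = beta - mu * scale

x = np.array([0.7, -0.2], dtype=np.float32)      # one value per channel
y_bn = gamma * (x - mu) / np.sqrt(var + eps) + beta
y_scale = scale * x + offset
print(np.allclose(y_bn, y_scale))  # True: Batchnorm reduces to Scale
```
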
Compared with the original Batchnorm formula, the simplified formula has fewer parameters and involves less computation. This indicates that operator replacement is an effective approach to optimizing the power consumption and performance of a model during deployment. Symbols that are treated as constants in Batchnorm during deployment are not constants during training, meaning that the replacement can be performed only during deployment: applied during training, it would reduce the quantity of parameters and change the structure of the model, weakening its expressive power and reducing the accuracy the model converges to.

## Operator Reordering

Another way of reducing the computation workload of an inference model is to adjust the topological order of its operators according to certain rules, on the condition that the inference accuracy is not degraded. Common methods of operator reordering include moving cropping operators (e.g., Slice, StrideSlice, and Crop) forward, and reordering Reshape, Transpose, and BinaryOp.

![Reordering of Crop](../img/ch08/ch09-crop-reorder.png)
:label:`ch-deploy/crop-reorder`

Crop is used to cut a part out of the input feature map as the output. After Crop is executed, the size of the feature map is reduced. As shown in Figure :numref:`ch-deploy/crop-reorder`, moving Crop forward so that the feature map is cut before other operators reduces the computation workload of subsequent operators, thereby improving the inference performance in the deployment phase. The magnitude of the improvement depends on the operator parameters, such as the size of the crop window. Note, however, that Crop can be moved forward only across element-wise operators.

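A tiny NumPy sketch (illustrative only; the crop window is hypothetical) shows why the reordering is safe for element-wise operators: cropping before or after ReLU yields the same result, but cropping first lets ReLU run on a smaller feature map.

```python
import numpy as np

x = np.random.default_rng(0).standard_normal((1, 8, 32, 32))   # NCHW feature map
crop = (slice(None), slice(None), slice(4, 28), slice(4, 28))  # hypothetical crop window

relu = lambda t: np.maximum(t, 0)  # an element-wise operator

y_original = relu(x)[crop]   # original order: ReLU on 32x32, then Crop
y_reordered = relu(x[crop])  # reordered: Crop first, ReLU on 24x24
print(np.array_equal(y_original, y_reordered))  # True: accuracy is unchanged
```
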
The experiment results above show that optimizing models before inference can significantly reduce latency, power consumption, and memory usage.
