
Commit 249876d

Upload chapter of model deployment
1 parent 69e13ee commit 249876d


44 files changed: +1649 -0 lines changed

.DS_Store (8 KB): binary file not shown.

ch_model_deployment/Advanced_Efficient_Techniques.md

Lines changed: 344 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 38 additions & 0 deletions
# Chapter Summary

1. Model deployment is restricted by factors including the model size, runtime memory usage, inference latency, and inference power consumption.

2. Models can be compressed using techniques such as quantization, pruning, and knowledge distillation in the offline phase. In addition, some model optimization techniques, such as operator fusion, can also reduce the model size, albeit to a lesser degree.

3. Runtime memory usage can be improved by optimizing the model size, the deployment framework size, and the runtime temporary memory usage. Methods for optimizing the model size have been summarized earlier. Making the framework code simpler and more modular helps optimize the deployment framework. Memory pooling can help implement memory overcommitment to optimize the runtime temporary memory usage.

4. Model inference latency can be optimized from two aspects. In the offline phase, the model computation workload can be reduced using model optimization and compression methods. Furthermore, improving the inference parallelism and optimizing operator implementations can help maximize the utilization of the available computing power. In addition to the computation workload and computing power, consideration should be given to the load/store overhead during inference.

5. Power consumption during inference can be reduced through offline model optimization and compression technologies. By reducing the computational workload, these technologies also reduce power consumption, an effect that coincides with the optimization methods for model inference latency.

6. In addition to optimizing the factors related to model deployment, this chapter also discussed technologies for deployment security, such as model obfuscation and model encryption. Secure deployment protects the model assets of enterprises and prevents hackers from attacking the deployment environment by tampering with models.
Lines changed: 240 additions & 0 deletions
# Conversion to Inference Model and Model Optimization {#sec:ch-deploy/model-optimization}

## Model Conversion

As mentioned earlier, TensorFlow, PyTorch, MindSpore, MXNet, and CNTK define their own model data structures. This means that the inference system needs to convert these structures into a unified one. Open Neural Network Exchange (ONNX) is designed to implement such conversion. It supports an extensive range of machine learning operators and converts models from various frameworks (e.g., TensorFlow and PyTorch) into ONNX models. Because models are structured data, the conversion process is essentially a conversion between data structures. It starts by analyzing the similarities and differences between the two structures. If they are the same, data is transferred directly; if the structures are similar but with slight differences, data is mapped; if the structures differ significantly, extra semantic conversion might be required; and if they are totally incompatible, the conversion fails. ONNX features strong expressive power, meaning that it can convert models from most frameworks in the industry into compatible ONNX models. If a model is abstracted as a graph, its data structure can be defined as follows:

1. **Topological expression of the model:** The topological connections of a model are represented as edges in a graph. From the perspective of the model, these edges define its data flows and control flows. Based on such definitions, we can extend to the expression of subgraphs, model inputs and outputs, and control flow structures. For example, control flow in TensorFlow 1.x is expressed as a cyclic graph built from operators such as Enter, Exit, Switch, LoopCond, and NextIteration, whereas ONNX avoids cycles by expressing control flow with operators such as Loop and If. As such, when converting a TensorFlow 1.x control flow model into an ONNX model, the control flow graph structure in the TensorFlow model must be merged into a Loop or If operator on ONNX.

2. **Operator prototype definition:** Operators can be regarded as the data processing or control flow nodes of a model, or as vertices in a graph. An operator prototype defines the type, inputs, outputs, and attributes of an operator. For instance, Slice has different semantics on Caffe and ONNX. To convert a Caffe model into an ONNX model, we need to map Slice on Caffe to Split on ONNX. FusedBatchnorm on TensorFlow has no corresponding operator on Caffe; rather, Batchnorm and Scale on Caffe need to be combined to express the same semantics as FusedBatchnorm on TensorFlow. Generally, the model conversion process involves converting the topological relationships and mapping the operator prototypes between models.
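In practice, the conversion step is usually a single call in each framework. The following is a minimal sketch, not taken from this chapter, that exports a small PyTorch model to ONNX and checks the result; the network definition, file name, and opset version are illustrative assumptions.

```python
# Minimal sketch: convert a PyTorch model to an ONNX inference model.
# The network, file name, and opset version below are illustrative assumptions.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

model = TinyNet().eval()                   # inference mode: BN uses stored statistics
dummy_input = torch.randn(1, 3, 224, 224)  # example input that fixes the graph shapes

# torch.onnx.export traces the model and maps each PyTorch operator
# to the corresponding ONNX operator prototype.
torch.onnx.export(
    model, dummy_input, "tiny_net.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=13,
)

# Optional sanity check with the onnx package.
import onnx
onnx_model = onnx.load("tiny_net.onnx")
onnx.checker.check_model(onnx_model)
```

The exported graph can then be consumed by an ONNX-compatible inference engine or converted further into a deployment-specific format.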
Following model conversion, some input-agnostic operations are conducted for optimization purposes prior to model deployment, including constant folding, operator fusion, operator replacement, and operator reordering --- optimization methods discussed earlier in this book. For instance, constant folding is usually performed during compilation on the compiler frontend, whereas operator fusion and partitioning are often performed (depending on the backend hardware support) once the compilation is complete. However, some optimization operations can only be performed in their entirety during the deployment phase.
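Since constant folding is only named here, a toy sketch may help recall what the pass does; the `Node` structure and `fold` function below are invented purely for illustration and do not correspond to any framework's intermediate representation.

```python
# Toy constant-folding pass over a tiny expression "graph".
# The node representation here is invented purely for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    op: str                          # "input", "const", "add", or "mul"
    value: Optional[float] = None    # set only for "const" nodes
    lhs: Optional["Node"] = None
    rhs: Optional["Node"] = None

def fold(node: Node) -> Node:
    """Recursively replace operators whose inputs are all constants."""
    if node.op in ("input", "const"):
        return node
    lhs, rhs = fold(node.lhs), fold(node.rhs)
    if lhs.op == "const" and rhs.op == "const":
        value = lhs.value + rhs.value if node.op == "add" else lhs.value * rhs.value
        return Node("const", value=value)    # computed offline, not during inference
    return Node(node.op, lhs=lhs, rhs=rhs)

# x * (2 + 3) is folded to x * 5: the (2 + 3) subtree never runs at inference time.
graph = Node("mul", lhs=Node("input"),
             rhs=Node("add", lhs=Node("const", value=2.0), rhs=Node("const", value=3.0)))
folded = fold(graph)
assert folded.rhs.op == "const" and folded.rhs.value == 5.0
```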
![Layered computer storage architecture](../img/ch08/ch09-storage.png){#fig:ch-deploy/fusion-storage}
## Operator Fusion {#sec:ch-deploy/kernel-fusion}

Operator fusion involves combining multiple operators in a deep neural network (DNN) model into a new operator based on certain rules, reducing the inference latency and power consumption by lowering the computation workload and load/store overhead during online inference.

The two main performance benefits of operator fusion are as follows: first, it maximizes the utilization of registers and caches; second, because it combines operators, the load/store traffic between the CPU and memory is reduced. Figure [1](#fig:ch-deploy/fusion-storage){reference-type="ref" reference="fig:ch-deploy/fusion-storage"} shows the architecture of a computer's storage system. While the storage capacity increases from the level-1 cache (L1) to the hard disk, so too does the time needed to read data. After operator fusion is performed, the previous computation result can be temporarily stored in a CPU register or cache, where the next computation can read it directly, reducing the number of I/O operations on the memory. Furthermore, operator fusion allows some computation to be completed in advance, eliminating redundant or even cyclically redundant computing during forward computation.

![Convolution + Batchnorm operator fusion](../img/ch08/ch09-conv-bn-fusion.png){#fig:ch-deploy/conv-bn-fusion}

To describe the principle of operator fusion, we will use two operators, Convolution and Batchnorm, as shown in Figure [2](#fig:ch-deploy/conv-bn-fusion){reference-type="ref" reference="fig:ch-deploy/conv-bn-fusion"}. In the figure, the solid-colored boxes indicate operators, the operators resulting from fusion are represented by hatched boxes, and the weights or constant tensors of operators are outlined in white. The fusion can be understood as the simplification of an equation. The computation of Convolution is expressed as Equation [\[equ:ch-deploy/conv-equation\]](#equ:ch-deploy/conv-equation){reference-type="ref" reference="equ:ch-deploy/conv-equation"}.

$$\label{equ:ch-deploy/conv-equation}
\bm{Y_{\rm conv}}=\bm{W_{\rm conv}}\cdot\bm{X_{\rm conv}}+\bm{B_{\rm conv}}$$

Here, we do not need to understand what each variable means. Instead, we only need to keep in mind that Equation [\[equ:ch-deploy/conv-equation\]](#equ:ch-deploy/conv-equation){reference-type="ref" reference="equ:ch-deploy/conv-equation"} is an equation for $\bm{Y_{\rm conv}}$ with respect to $\bm{X_{\rm conv}}$, and the other symbols are constants.

Equation [\[equ:ch-deploy/bn-equation\]](#equ:ch-deploy/bn-equation){reference-type="ref" reference="equ:ch-deploy/bn-equation"} describes the computation of Batchnorm:

$$\label{equ:ch-deploy/bn-equation}
\bm{Y_{\rm bn}}=\gamma\frac{\bm{X_{\rm bn}}-\mu_{\mathcal{B}}}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}+\beta$$

Similarly, it is an equation for $\bm{Y_{\rm bn}}$ with respect to $\bm{X_{\rm bn}}$; the other symbols in the equation represent constants.

As shown in Figure [2](#fig:ch-deploy/conv-bn-fusion){reference-type="ref" reference="fig:ch-deploy/conv-bn-fusion"}, when the output of Convolution is used as the input of Batchnorm, the formula of Batchnorm becomes a function for $\bm{Y_{\rm bn}}$ with respect to $\bm{X_{\rm conv}}$. After substituting $\bm{Y_{\rm conv}}$ for $\bm{X_{\rm bn}}$ and collecting and extracting the constants, we obtain Equation [\[equ:ch-deploy/conv-bn-equation-3\]](#equ:ch-deploy/conv-bn-equation-3){reference-type="ref" reference="equ:ch-deploy/conv-bn-equation-3"}.

$$\label{equ:ch-deploy/conv-bn-equation-3}
\bm{Y_{\rm bn}}=\bm{A}\cdot\bm{X_{\rm conv}}+\bm{B}$$

Here, $\bm{A}$ and $\bm{B}$ are two constant matrices. It can be noticed that Equation [\[equ:ch-deploy/conv-bn-equation-3\]](#equ:ch-deploy/conv-bn-equation-3){reference-type="ref" reference="equ:ch-deploy/conv-bn-equation-3"} has the same form as the Convolution computation. The preceding example shows that the computation of Convolution and Batchnorm can be fused into an equivalent Convolution operator. Such fusion is referred to as formula fusion.
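For completeness, carrying out the substitution explicitly (a step the text leaves implicit) shows what the fused constants are:

$$\bm{Y_{\rm bn}}=\gamma\frac{\left(\bm{W_{\rm conv}}\cdot\bm{X_{\rm conv}}+\bm{B_{\rm conv}}\right)-\mu_{\mathcal{B}}}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}+\beta$$

$$\bm{A}=\frac{\gamma}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}\,\bm{W_{\rm conv}},\qquad
\bm{B}=\frac{\gamma\left(\bm{B_{\rm conv}}-\mu_{\mathcal{B}}\right)}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}+\beta$$

In other words, the fused Convolution simply rescales the original weights and adjusts the bias; both $\bm{A}$ and $\bm{B}$ can be precomputed once offline, so no Batchnorm computation remains in the inference graph.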
The fusion of Convolution and Batchnorm eliminates a Batchnorm operation, thereby reducing the quantity of parameters and the computation workload, which in turn reduces the load/store operations. In general, this fusion not only optimizes the power consumption and performance during model deployment, but also brings certain benefits in compressing the model size.

Symbols that are treated as constants in the Convolution and Batchnorm formulas during fusion are trainable parameters during training. Performing the fusion during training would therefore remove model parameters: because the fusion eliminates a Batchnorm operator and its corresponding parameters from the network, the algorithm of the DNN is changed, degrading the accuracy to unacceptable levels. Therefore, the fusion of Convolution and Batchnorm is an optimization method typically used during deployment. To evaluate the optimization effect, we constructed a sample network with Convolution and Batchnorm using MindSpore Lite. We ran the sample network and the mobilenet-v2 network for inference in dual threads on a Huawei Mate 30 smartphone to compare the time of running 3,000 inference epochs before and after the fusion. As shown in Table [1](#tab:ch09/ch09-conv-bn-fusion){reference-type="ref" reference="tab:ch09/ch09-conv-bn-fusion"}, the inference performance of the sample network and the mobilenet-v2 network is improved considerably after the fusion --- by 8.5% and 11.7%, respectively. Such improvements are achieved without side effects and without requiring additional hardware or operator libraries.

::: {#tab:ch09/ch09-conv-bn-fusion}
  Fusion          Sample   Mobilenet-v2
  --------------- -------- --------------
  Before fusion   0.035    15.415
  After fusion    0.031    13.606

  : Convolution + Batchnorm inference performance before and after fusion (unit: ms)
:::
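In code, formula fusion amounts to rewriting the stored weights once, offline. Below is a minimal NumPy sketch (the shapes, names, and tolerance are assumptions for illustration; this is not MindSpore Lite code) that folds Batchnorm parameters into the preceding Convolution's weights and bias.

```python
# Minimal sketch: fold Batchnorm parameters into the preceding Convolution.
# Shapes and names are illustrative; this is not framework code.
import numpy as np

def fold_bn_into_conv(w_conv, b_conv, gamma, beta, mean, var, eps=1e-5):
    """Return fused (weight, bias) such that conv(x, w', b') == bn(conv(x, w, b)).

    w_conv: (out_channels, in_channels, kh, kw) convolution weights
    b_conv: (out_channels,) convolution bias
    gamma, beta, mean, var: (out_channels,) Batchnorm parameters/statistics
    """
    scale = gamma / np.sqrt(var + eps)                 # per-output-channel factor
    w_fused = w_conv * scale[:, None, None, None]      # A = scale * W_conv
    b_fused = (b_conv - mean) * scale + beta           # B = scale * (B_conv - mu) + beta
    return w_fused, b_fused

# Random parameters for a toy check.
out_c, in_c = 8, 3
w = np.random.randn(out_c, in_c, 3, 3)
b = np.random.randn(out_c)
gamma, beta = np.random.randn(out_c), np.random.randn(out_c)
mean, var = np.random.randn(out_c), np.abs(np.random.randn(out_c))
w_f, b_f = fold_bn_into_conv(w, b, gamma, beta, mean, var)

# Verify equivalence at a single spatial position (one 3x3 patch), where the
# convolution reduces to a matrix-vector product.
patch = np.random.randn(in_c, 3, 3).reshape(-1)
conv_out = w.reshape(out_c, -1) @ patch + b
bn_out = gamma * (conv_out - mean) / np.sqrt(var + 1e-5) + beta
fused_out = w_f.reshape(out_c, -1) @ patch + b_f
assert np.allclose(bn_out, fused_out)
```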
## Operator Replacement

The principle of operator replacement is to simplify an operator's formula by combining like terms, extracting common factors, and applying other mathematical methods, and then map the simplified formula to a type of operator that has the same computational logic but is better suited to online deployment. In this way, we can reduce the computation workload and compress the model.

![Replacement of Batchnorm](../img/ch08/ch09-bn-replace.png){#fig:ch-deploy/bn-replace}

Figure [3](#fig:ch-deploy/bn-replace){reference-type="ref" reference="fig:ch-deploy/bn-replace"} depicts the replacement of Batchnorm with Scale, which we use as an example to describe the principle of operator replacement. After decomposing Equation [\[equ:ch-deploy/bn-equation\]](#equ:ch-deploy/bn-equation){reference-type="ref" reference="equ:ch-deploy/bn-equation"} (the Batchnorm formula) and folding the constants, Batchnorm is rewritten as Equation [\[equ:ch-deploy/replace-scale\]](#equ:ch-deploy/replace-scale){reference-type="ref" reference="equ:ch-deploy/replace-scale"}

$$\label{equ:ch-deploy/replace-scale}
\bm{Y_{\rm bn}}=scale\cdot\bm{X_{\rm bn}}+offset$$

where **scale** and **offset** are scalars. This simplified formula can be mapped to a Scale operator.

Compared with the original Batchnorm formula, the simplified formula has fewer parameters and involves less computation. This indicates that operator replacement is an effective approach to optimizing the power consumption and performance of a model during deployment. Symbols that are treated as constants in Batchnorm during deployment are not constants during training, meaning that the replacement can be performed only during deployment. Because operator replacement reduces the quantity of parameters and changes the structure of the model, applying it during training would weaken the expressive power of the model and reduce the accuracy it converges to.
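A minimal sketch of the corresponding precomputation, assuming per-channel Batchnorm parameters and invented variable names, looks as follows:

```python
# Minimal sketch: replace an inference-time Batchnorm with a Scale operator
# by precomputing scale and offset from the trained BN parameters.
import numpy as np

def batchnorm_to_scale(gamma, beta, mean, var, eps=1e-5):
    """Precompute (scale, offset) so that scale * x + offset == batchnorm(x)."""
    scale = gamma / np.sqrt(var + eps)
    offset = beta - mean * scale
    return scale, offset

# Toy per-channel check.
gamma, beta = np.array([1.2, 0.8]), np.array([0.1, -0.3])
mean, var = np.array([0.5, -1.0]), np.array([2.0, 0.5])
x = np.random.randn(4, 2)                          # (batch, channels)
scale, offset = batchnorm_to_scale(gamma, beta, mean, var)
bn_out = gamma * (x - mean) / np.sqrt(var + 1e-5) + beta
assert np.allclose(scale * x + offset, bn_out)
```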
## Operator Reordering

Another way of reducing the computation workload of an inference model is to adjust the topological order of its operators according to certain rules, on the condition that the inference accuracy is not degraded. Common methods of operator reordering include moving cropping operators (e.g., Slice, StrideSlice, and Crop) forward, and reordering Reshape, Transpose, and BinaryOp operators.

![Reordering of Crop](../img/ch08/ch09-crop-reorder.png){#fig:ch-deploy/crop-reorder}

Crop cuts a part out of the input feature map and passes it on as the output, so the feature map shrinks after Crop is executed. As shown in Figure [4](#fig:ch-deploy/crop-reorder){reference-type="ref" reference="fig:ch-deploy/crop-reorder"}, moving Crop forward so that the feature map is cut before other operators run reduces the computation workload of those subsequent operators, thereby improving the inference performance in the deployment phase. The size of the improvement depends on the operator parameters. Note, however, that Crop can be moved forward only across element-wise operators.
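To see why the rule is restricted to element-wise operators, it helps to check the commutation numerically. The sketch below (NumPy only, with illustrative names) verifies that cropping after ReLU equals ReLU after cropping:

```python
# Minimal sketch: cropping commutes with element-wise operators such as ReLU,
# which is why Crop can be moved in front of them; names are illustrative only.
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def crop(x, h0, h1, w0, w1):
    """Cut a spatial window out of a (channels, height, width) feature map."""
    return x[:, h0:h1, w0:w1]

feature_map = np.random.randn(8, 32, 32)

# ReLU on the full 32x32 map, then crop ...
a = crop(relu(feature_map), 4, 20, 4, 20)
# ... equals crop first, then ReLU on the smaller 16x16 map (less computation).
b = relu(crop(feature_map, 4, 20, 4, 20))
assert np.allclose(a, b)
```

Operators that mix information across spatial positions, such as pooling or convolution, do not commute with Crop in general, which is why the reordering rule stops at element-wise operators.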
The experimental results above demonstrate that optimizing models before inference can significantly reduce the latency, power consumption, and memory usage.
Lines changed: 36 additions & 0 deletions
# Further Reading

1. A Distributed Graph-Theoretic Framework for Automatic Parallelization in Multi-Core Systems[^1]

2. SCOP: Scientific Control for Reliable Neural Network Pruning[^2]

3. Searching for Low-Bit Weights in Quantized Neural Networks[^3]

4. GhostNet: More Features from Cheap Operations[^4]

5. AdderNet: Do We Really Need Multiplications in Deep Learning?[^5]

6. Blockwise Parallel Decoding for Deep Autoregressive Models[^6]

7. Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads[^7]

8. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning[^8]

[^1]: <https://proceedings.mlsys.org/paper/2021/file/a5e00132373a7031000fd987a3c9f87b-Paper.pdf>

[^2]: <https://arxiv.org/abs/2010.10732>

[^3]: <https://arxiv.org/abs/2009.08695>

[^4]: <https://arxiv.org/abs/1911.11907>

[^5]: <https://arxiv.org/abs/1912.13200>

[^6]: <https://arxiv.org/abs/1811.03115>

[^7]: <https://www.together.ai/blog/medusa>

[^8]: <https://arxiv.org/abs/2307.08691>
