Commit ef4297a

Fix equation and image tag format
1 parent eb1b760 commit ef4297a

File tree

7 files changed: +150 -142 lines


.DS_Store

0 Bytes
Binary file not shown.

chapter_model_deployment/Advanced_Efficient_Techniques.md

Lines changed: 12 additions & 12 deletions
@@ -22,8 +22,7 @@ based on insights provided by Leviathan et al. [@leviathan2023fast].
 is achieved by processing them with the outputs from the
 approximation models in parallel.

-Figure [1](#fig:ch-deploy/sd){reference-type="ref"
-reference="fig:ch-deploy/sd"} is a brief overview of Speculative
+Figure :numref:`ch-deploy/sd` is a brief overview of Speculative
 Decoding. It involves initially generating a series of tokens using a
 draft model, which is a smaller and less complex model. These generated
 tokens are then verified in parallel with the target model, which is a
@@ -53,8 +52,9 @@ $M_{\text{target}}(\text{prefix} + [x_1 + ... + x_{\gamma}])$. If the
 condition $q(x) < p(x)$ is met, the token is retained. In contrast, if
 not met, the token faces a rejection chance of $1 - \frac{p(x)}{q(x)}$,
 following which it is reselected from an adjusted distribution:
-$$\label{equ:sd_adjusted}
-p'(x) = norm(max(0, p(x) - q(x)))$$ In the paper [@leviathan2023fast],
+$$
+p'(x) = norm(max(0, p(x) - q(x)))$$
+:eqlabel:`equ:sd_adjusted` In the paper [@leviathan2023fast],
 Leviathan et al. have proved the correctness of this adjusted
 distribution for resampling.
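
As a quick numerical check of the acceptance rule edited in this hunk, here is a minimal Python sketch. It assumes `p` and `q` are the target-model and draft-model probability vectors for one proposed token `x`; the function name and all identifiers are illustrative, not repository code.

```python
import numpy as np

def accept_or_resample(x, p, q, rng=None):
    """Verify one draft token x against target probs p and draft probs q.

    Accept x when q(x) <= p(x); otherwise accept with probability
    p(x)/q(x). On rejection, resample from the adjusted distribution
    p'(x) = norm(max(0, p(x) - q(x))).
    """
    rng = rng or np.random.default_rng()
    if q[x] <= p[x] or rng.random() < p[x] / q[x]:
        return x                                # token retained
    residual = np.maximum(0.0, p - q)           # max(0, p(x) - q(x))
    residual /= residual.sum()                  # norm(...)
    return rng.choice(len(p), p=residual)       # reselected token
```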
@@ -154,24 +154,25 @@ algorithm designed to minimize the intensive access to the GPU's high
 bandwidth memory (HBM). This innovation led to significant gains in both
 computational speed and throughput.

-Figure [2](#fig:ch-deploy/memory){reference-type="ref"
-reference="fig:ch-deploy/memory"} shows the memory hierarchy with
+Figure :numref:`ch-deploy/memory` shows the memory hierarchy with
 corresponding bandwidths. The main goal of FlashAttention is to avoid
 reading and writing the large attention matrix to and from HBM. And
 perform computation in SRAM as much as possible.

 The standard Scaled Dot-Product Attention [@attention] formula is
-$$\label{equ:std_attn}
-\textbf{A} = Softmax(\frac{\textbf{QK}^T}{\sqrt{d_k}})\textbf{V}$$
+$$
+\textbf{A} = Softmax(\frac{\textbf{QK}^T}{\sqrt{d_k}})\textbf{V}$$
+:eqlabel:`equ:std_attn`

 As $d_k$ is a scalar, we can simplify it into three parts:

-$$\label{equ:attn_sep}
+$$
 \begin{aligned}
 \textbf{S} = \textbf{QK}^T\\
 \textbf{P} = Softmax(\textbf{S})\\
 \textbf{O} = \textbf{PV}
-\end{aligned}$$
+\end{aligned}$$
+:eqlabel:`equ:attn_sep`

 The matrices **K**, **Q**, **V** are all stored in HBM. The standard
 implementation of attention follows these steps:
@@ -214,8 +215,7 @@ s(x) = e^{m(x_{1})-m(x)}s_{1}(x_1) + e^{m(x_2)-m(x)}s_{1}(x_2)\\
 Softmax(x) = \frac{l(x)}{s(x)}
 \end{aligned}$$

-Figure [3](#fig:ch-deploy/flashattn){reference-type="ref"
-reference="fig:ch-deploy/flashattn"} shows a brief overview of
+Figure :numref:`ch-deploy/flashattn` shows a brief overview of
 FlashAttention with two blocks. Following decomposition, Softmax
 calculations can be executed block by block. Therefore, **K, Q** and
 **V** are initially divided into blocks. Subsequently, compute the
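
The blockwise Softmax decomposition referenced in these hunks can be checked directly. Below is a small sketch under the same two-block setup (illustrative names, not repository code); the per-block maxima and sums are merged exactly as in the equations above.

```python
import numpy as np

def blockwise_softmax(x1, x2):
    """Softmax over the concatenation [x1, x2], computed block by block."""
    m1, m2 = x1.max(), x2.max()                              # per-block maxima m(x_i)
    s1, s2 = np.exp(x1 - m1).sum(), np.exp(x2 - m2).sum()    # per-block sums s_1(x_i)
    m = max(m1, m2)                                          # global max m(x)
    s = np.exp(m1 - m) * s1 + np.exp(m2 - m) * s2            # merged denominator s(x)
    l = np.concatenate([np.exp(x1 - m), np.exp(x2 - m)])     # numerators l(x)
    return l / s

x = np.random.randn(8)
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(blockwise_softmax(x[:4], x[4:]), reference)
```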

chapter_model_deployment/Conversion_to_Inference_Model_and_Model_Optimization.md

Lines changed: 45 additions & 51 deletions
@@ -55,10 +55,8 @@ performed (depending on the backend hardware support) once the
 compilation is complete. However, some optimization operations can only
 be performed in their entirety during the deployment phase.

-![Layered computer storage
-architecture](../img/ch08/ch09-storage.png){#fig:ch-deploy/fusion-storage}
-
-## Operator Fusion {#sec:ch-deploy/kernel-fusion}
+![Layered computer storagearchitecture](../img/ch08/ch09-storage.png)
+:label:`ch-deploy/fusion-storage}## Operator Fusion {#sec:ch-deploy/kernel-fusion`

 Operator fusion involves combining multiple operators in a deep neural
 network (DNN) model into a new operator based on certain rules, reducing
@@ -69,8 +67,7 @@ The two main performance benefits brought by operator fusion are as
 follows: First, it maximizes the utilization of registers and caches.
 And second, because it combines operators, the load/store time between
 the CPU and memory is reduced. Figure
-[1](#fig:ch-deploy/fusion-storage){reference-type="ref"
-reference="fig:ch-deploy/fusion-storage"} shows the architecture of a
+:numref:`ch-deploy/fusion-storage` shows the architecture of a
 computer's storage system. While the storage capacity increases from the
 level-1 cache (L1) to hard disk, so too does the time for reading data.
 After operator fusion is performed, the previous computation result can
@@ -80,57 +77,55 @@ operations on the memory. Furthermore, operator fusion allows some
 computation to be completed in advance, eliminating redundant or even
 cyclic redundant computing during forward computation.

-![Convolution + Batchnorm operator
-fusion](../img/ch08/ch09-conv-bn-fusion.png){#fig:ch-deploy/conv-bn-fusion}
+![Convolution + Batchnorm operatorfusion](../img/ch08/ch09-conv-bn-fusion.png)
+:label:`ch-deploy/conv-bn-fusion`

 To describe the principle of operator fusion, we will use two operators,
 Convolution and Batchnorm, as shown in Figure
-[2](#fig:ch-deploy/conv-bn-fusion){reference-type="ref"
-reference="fig:ch-deploy/conv-bn-fusion"}. In the figure, the
+:numref:`ch-deploy/conv-bn-fusion`. In the figure, the
 solid-colored boxes indicate operators, the resulting operators after
 fusion is performed are represented by hatched boxes, and the weights or
 constant tensors of operators are outlined in white. The fusion can be
 understood as the simplification of an equation. The computation of
 Convolution is expressed as Equation
-[\[equ:ch-deploy/conv-equation\]](#equ:ch-deploy/conv-equation){reference-type="ref"
-reference="equ:ch-deploy/conv-equation"}.
+:eqref:`ch-deploy/conv-equation`.

-$$\mathbf{Y_{\rm conv}}=\mathbf{W_{\rm conv}}\cdot\mathbf{X_{\rm conv}}+\mathbf{B_{\rm conv}}, \text{equ:ch-deploy/conv-equation}$$
+$$
+\bm{Y_{\rm conv}}=\bm{W_{\rm conv}}\cdot\bm{X_{\rm conv}}+\bm{B_{\rm conv}}$$
+:eqlabel:`equ:ch-deploy/conv-equation`

 Here, we do not need to understand what each variable means. Instead, we
 only need to keep in mind that Equation
-[\[equ:ch-deploy/conv-equation\]](#equ:ch-deploy/conv-equation){reference-type="ref"
-reference="equ:ch-deploy/conv-equation"} is an equation for
-$\mathbf{Y_{\rm conv}}$ with respect to $\mathbf{X_{\rm conv}}$, and other
+:eqref:`ch-deploy/conv-equation` is an equation for
+$\bm{Y_{\rm conv}}$ with respect to $\bm{X_{\rm conv}}$, and other
 symbols are constants.

 Equation
-[\[equ:ch-deploy/bn-equation\]](#equ:ch-deploy/bn-equation){reference-type="ref"
-reference="equ:ch-deploy/bn-equation"} is about the computation of
+:eqref:`ch-deploy/bn-equation` is about the computation of
 Batchnorm:

-**equ:ch-deploy/bn-equation:**\
-$$\mathbf{Y_{\rm bn}}=\gamma\frac{\mathbf{X_{\rm bn}}-\mu_{\mathcal{B}}}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}+\beta$$
+$$
+\bm{Y_{\rm bn}}=\gamma\frac{\bm{X_{\rm bn}}-\mu_{\mathcal{B}}}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}+\beta$$
+:eqlabel:`equ:ch-deploy/bn-equation`

-Similarly, it is an equation for $\mathbf{Y_{\rm bn}}$ with respect to
-$\mathbf{X_{\rm bn}}$. Other symbols in the equation represent constants.
+Similarly, it is an equation for $\bm{Y_{\rm bn}}$ with respect to
+$\bm{X_{\rm bn}}$. Other symbols in the equation represent constants.

 As shown in Figure
-[2](#fig:ch-deploy/conv-bn-fusion){reference-type="ref"
-reference="fig:ch-deploy/conv-bn-fusion"}, when the output of
+:numref:`ch-deploy/conv-bn-fusion`, when the output of
 Convolution is used as the input of Batchnorm, the formula of Batchnorm
-is a function for $\mathbf{Y_{\rm bn}}$ with respect to $\mathbf{X_{\rm conv}}$.
-After substituting $\mathbf{Y_{\rm conv}}$ into $\mathbf{X_{\rm bn}}$ and
+is a function for $\bm{Y_{\rm bn}}$ with respect to $\bm{X_{\rm conv}}$.
+After substituting $\bm{Y_{\rm conv}}$ into $\bm{X_{\rm bn}}$ and
 uniting and extracting the constants, we obtain Equation
-[\[equ:ch-deploy/conv-bn-equation-3\]](#equ:ch-deploy/conv-bn-equation-3){reference-type="ref"
-reference="equ:ch-deploy/conv-bn-equation-3"}.
+:eqref:`ch-deploy/conv-bn-equation-3`.

-$$\mathbf{Y_{\rm bn}}=\mathbf{A}\cdot\mathbf{X_{\rm conv}}+\mathbf{B}, \text{equ:ch-deploy/conv-bn-equation-3}$$
+$$
+\bm{Y_{\rm bn}}=\bm{A}\cdot\bm{X_{\rm conv}}+\bm{B}$$
+:eqlabel:`equ:ch-deploy/conv-bn-equation-3`

-Here, $\mathbf{A}$ and $\mathbf{B}$ are two matrices. It can be noticed that
+Here, $\bm{A}$ and $\bm{B}$ are two matrices. It can be noticed that
 Equation
-[\[equ:ch-deploy/conv-bn-equation-3\]](#equ:ch-deploy/conv-bn-equation-3){reference-type="ref"
-reference="equ:ch-deploy/conv-bn-equation-3"} is a formula for computing
+:eqref:`ch-deploy/conv-bn-equation-3` is a formula for computing
 Convolution. The preceding example shows that the computation of
 Convolution and Batchnorm can be fused into an equivalent Convolution
 operator. Such fusion is referred to as formula fusion.
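
The Convolution + Batchnorm formula fusion described in this hunk amounts to folding the Batchnorm constants into the convolution weight and bias. A minimal sketch follows, assuming per-output-channel Batchnorm parameters; the function name and shapes are illustrative, not taken from the repository.

```python
import numpy as np

def fold_bn_into_conv(W, B, gamma, beta, mean, var, eps=1e-5):
    """Fold Batchnorm into Convolution so that Y_bn = A * X_conv + B_new.

    W: conv weight (C_out, C_in, kH, kW);  B: conv bias (C_out,).
    gamma, beta, mean, var: per-channel Batchnorm parameters (C_out,).
    """
    scale = gamma / np.sqrt(var + eps)            # per-channel constant
    W_fused = W * scale[:, None, None, None]      # A = scale * W_conv
    B_fused = scale * (B - mean) + beta           # folded bias B_new
    return W_fused, B_fused
```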
@@ -162,13 +157,14 @@ after the fusion --- by 8.5% and 11.7% respectively. Such improvements
 are achieved without bringing side effects and without requiring
 additional hardware or operator libraries.

-::: {#tab:ch09/ch09-conv-bn-fusion} <br>
-Fusion | Sample | Mobilenet-v2 |
----------------| --------|-------------- |
-Before fusion | 0.035 | 15.415 |
-After fusion | 0.031 | 13.606 |
+::: {#tab:ch09/ch09-conv-bn-fusion}
+Fusion           Sample    Mobilenet-v2
+---------------  --------  --------------
+Before fusion    0.035     15.415
+After fusion     0.031     13.606

-Convolution + Batchnorm inference performance before and after fusion (unit: ms)
+: Convolution + Batchnorm inference performance before and after
+fusion (unit: ms)
 :::

 ## Operator Replacement
@@ -180,20 +176,19 @@ type of operators that have the same computational logic but are more
 suitable for online deployment. In this way, we can reduce the
 computation workload and compress the model.

-![Replacement of
-Batchnorm](../img/ch08/ch09-bn-replace.png){#fig:ch-deploy/bn-replace}
+![Replacement ofBatchnorm](../img/ch08/ch09-bn-replace.png)
+:label:`ch-deploy/bn-replace`

-Figure [3](#fig:ch-deploy/bn-replace){reference-type="ref"
-reference="fig:ch-deploy/bn-replace"} depicts the replacement of
+Figure :numref:`ch-deploy/bn-replace` depicts the replacement of
 Batchnorm with Scale, which is used as an example to describe the
 principle of operator replacement. After decomposing Equation
-[\[equ:ch-deploy/bn-equation\]](#equ:ch-deploy/bn-equation){reference-type="ref"
-reference="equ:ch-deploy/bn-equation"} (the Batchnorm formula) and
+:eqref:`ch-deploy/bn-equation` (the Batchnorm formula) and
 folding the constants, Batchnorm is defined as Equation
-[\[equ:ch-deploy/replace-scale\]](#equ:ch-deploy/replace-scale){reference-type="ref"
-reference="equ:ch-deploy/replace-scale"}
+:eqref:`ch-deploy/replace-scale`

-$$\mathbf{Y_{bn}}=scale\cdot\mathbf{X_{bn}}+offset, \text{equ:ch-deploy/replace-scale} $$
+$$
+\bm{Y_{bn}}=scale\cdot\bm{X_{bn}}+offset$$
+:eqlabel:`equ:ch-deploy/replace-scale`

 where **scale** and **offsets** are scalars. This simplified formula can
 be mapped to a Scale operator.
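
The Batchnorm-to-Scale replacement above is the same constant folding expressed at the operator level. A short sketch of the resulting y = scale * x + offset computation (all names illustrative):

```python
import numpy as np

def batchnorm_as_scale(x, gamma, beta, mean, var, eps=1e-5):
    """Replace Batchnorm with a Scale operator: y = scale * x + offset."""
    scale = gamma / np.sqrt(var + eps)    # folded constant
    offset = beta - scale * mean          # folded constant
    return scale * x + offset             # equivalent to the Batchnorm formula
```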
@@ -218,13 +213,12 @@ Common methods of operator reordering include moving cropping operators
 (e.g., Slice, StrideSlice, and Crop) forward, and reordering Reshape,
 Transpose, and BinaryOp.

-![Reordering of
-Crop](../img/ch08/ch09-crop-reorder.png){#fig:ch-deploy/crop-reorder}
+![Reordering ofCrop](../img/ch08/ch09-crop-reorder.png)
+:label:`ch-deploy/crop-reorder`

 Crop is used to cut a part out of the input feature map as the output.
 After Crop is executed, the size of the feature map is reduced. As shown
-in Figure [4](#fig:ch-deploy/crop-reorder){reference-type="ref"
-reference="fig:ch-deploy/crop-reorder"}, moving Crop forward to cut the
+in Figure :numref:`ch-deploy/crop-reorder`, moving Crop forward to cut the
 feature map before other operators reduces the computation workload of
 subsequent operators, thereby improving the inference performance in the
 deployment phase. Such improvement is related to the operator

chapter_model_deployment/Index.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+# Model Deployment {#ch:deploy}
+
+In earlier chapters, we discussed the basic components of the machine
+learning model training system. In this chapter, we look at the basics
+of model deployment, a process whereby a trained model is deployed in a
+runtime environment for inference. We explore the conversion from a
+training model into an inference model, model compression methods that
+adapt to hardware restrictions, model inference and performance
+optimization, and model security protection.
+
+The key aspects this chapter explores are as follows:
+
+1. Conversion and optimization from a training model to an inference
+model.
+
+2. Common methods for model compression: quantization, sparsification,
+and knowledge distillation.
+
+3. Model inference process and common methods for performance
+optimization.
+
+4. Common methods for model security protection.

chapter_model_deployment/Model_Compression.md

Lines changed: 16 additions & 17 deletions
@@ -14,16 +14,15 @@ Model quantization is a technique that approximates floating-point
 weights of contiguous values (usually float32 or many possibly discrete
 values) at the cost of slightly reducing accuracy to a limited number of
 discrete values (usually int8). As shown in Figure
-[1](#fig:ch-deploy/quant-minmax){reference-type="ref"
-reference="fig:ch-deploy/quant-minmax"}, $T$ represents the data range
+:numref:`ch-deploy/quant-minmax`, $T$ represents the data range
 before quantization. In order to reduce the model size, model
 quantization represents floating-point data with fewer bits. As such,
 the memory usage during inference can be reduced, and the inference on
 processors that are good at processing low-precision operations can be
 accelerated.

-![Principles of
-quantization](../img/ch08/ch09-quant-minmax.png){#fig:ch-deploy/quant-minmax}
+![Principles ofquantization](../img/ch08/ch09-quant-minmax.png)
+:label:`ch-deploy/quant-minmax`

 The number of bits and the range of data represented by different data
 types in a computer are different. Based on service requirements, a
@@ -49,12 +48,12 @@ that linear quantization is more commonly used. The following therefore
 focuses on the principles of linear quantization.

 In Equation
-[\[equ:ch-deploy/quantization-q\]](#equ:ch-deploy/quantization-q){reference-type="ref"
-reference="equ:ch-deploy/quantization-q"}, assume that $r$ represents
+:eqref:`ch-deploy/quantization-q`, assume that $r$ represents
 the floating-point number before quantization. We are then able to
 obtain the integer $q$ after quantization.

-$$q=clip(round(\frac{r}{s}+z),q_{min},q_{max}), \text{equ:ch-deploy/quantization-q}$$
+$$q=clip(round(\frac{r}{s}+z),q_{min},q_{max})$$
+:eqlabel:`equ:ch-deploy/quantization-q`

 $clip(\cdot)$ and $round(\cdot)$ indicate the truncation and rounding
 operations, and $q_{min}$ and $q_{max}$ indicate the minimum and maximum
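
The linear quantization formula in this hunk maps directly to a few lines of code. A hedged sketch (illustrative names), including one common way to pick the scale s and zero point z from the observed range:

```python
import numpy as np

def quantize(r, s, z, q_min=-128, q_max=127):
    """q = clip(round(r / s + z), q_min, q_max)."""
    return np.clip(np.round(r / s + z), q_min, q_max).astype(np.int8)

def dequantize(q, s, z):
    """Approximate recovery of the floating-point value: r ~ s * (q - z)."""
    return s * (q.astype(np.float32) - z)

# Example: min-max calibration of s and z for int8 (one possible choice).
r = np.random.randn(1000).astype(np.float32)
s = (r.max() - r.min()) / 255.0
z = np.round(-128 - r.min() / s)
```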
@@ -170,16 +169,16 @@ before quantization. Assume that the mean value and variance of the
 weight of a channel are $E(w_c)$ and $||w_c-E(w_c)||$, and the mean
 value and variance after quantization are $E(\hat{w_c})$ and
 $||\hat{w_c}-E(\hat{w_c})||$, respectively. Equation
-[\[equ:ch-deploy/post-quantization\]](#equ:ch-deploy/post-quantization){reference-type="ref"
-reference="equ:ch-deploy/post-quantization"} is the calibration of the
+:eqref:`ch-deploy/post-quantization` is the calibration of the
 weight:

 $$
 \begin{aligned}
 \hat{w_c}\leftarrow\zeta_c(\hat{w_c}+u_c) \\
 u_c=E(w_c)-E(\hat{w_c}) \\
 \zeta_c=\frac{||w_c-E(w_c)||}{||\hat{w_c}-E(\hat{w_c})||}
-\end{aligned}, \text{equ:ch-deploy/post-quantization}$$
+\end{aligned}$$
+:eqlabel:`equ:ch-deploy/post-quantization`

 As a general model compression method, quantization can significantly
 improve the memory and compression efficiency of neural networks, and
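
The per-channel calibration in this hunk can be transcribed directly from the equation. A sketch (illustrative names), assuming `w` is one channel of the original weight and `w_hat` its quantized-then-dequantized counterpart:

```python
import numpy as np

def calibrate_channel(w, w_hat):
    """Per-channel correction: w_hat <- zeta_c * (w_hat + u_c)."""
    u_c = w.mean() - w_hat.mean()     # u_c = E(w_c) - E(w_hat_c)
    zeta_c = np.linalg.norm(w - w.mean()) / np.linalg.norm(w_hat - w_hat.mean())
    return zeta_c * (w_hat + u_c)
```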
@@ -281,15 +280,14 @@ identified more efficiently. As such, iterative pruning is widely used.
 To illustrate how to prune a network, we will use Deep
 Compression [@han2015deep] as an example. Removing most weights leads to
 a loss of accuracy of the neural network, as shown in Figure
-[2](#fig:ch-deploy/deepcomp){reference-type="ref"
-reference="fig:ch-deploy/deepcomp"}. Fine-tuning a pruned sparse neural
+:numref:`ch-deploy/deepcomp`. Fine-tuning a pruned sparse neural
 network can help improve model accuracy, and the pruned network may be
 quantized to represent weights using fewer bits. In addition, using
 Huffman coding can further reduce the memory cost of the deep neural
 network.

-![Deep Compression
-algorithm](../img/ch08/ch09-deepcomp.png){#fig:ch-deploy/deepcomp}
+![Deep Compressionalgorithm](../img/ch08/ch09-deepcomp.png)
+:label:`ch-deploy/deepcomp`

 In addition to removing redundant neurons, a dictionary learning-based
 method can be used to remove unnecessary weights on a deep convolutional
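
Magnitude-based pruning of the kind used in pipelines such as Deep Compression keeps only the largest weights and fine-tunes with the resulting mask. A minimal sketch; the sparsity target and helper name are assumptions, not details from this commit:

```python
import numpy as np

def prune_by_magnitude(W, sparsity=0.9):
    """Zero the smallest-magnitude entries so that `sparsity` of W is zero."""
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) > threshold       # keep only large-magnitude weights
    return W * mask, mask              # mask is reused during fine-tuning
```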
@@ -326,7 +324,9 @@ classification result of the teacher network, that is, Equation
 [\[c2Fcn:distill\]](#c2Fcn:distill){reference-type="ref"
 reference="c2Fcn:distill"}.

-$$\mathcal{L}_{KD}(\theta_S) = \mathcal{H}(o_S,\mathbf{y}) +\lambda\mathcal{H}(\tau(o_S),\tau(o_T)), \text{c2Fcn:distill}$$
+$$\mathcal{L}_{KD}(\theta_S) = \mathcal{H}(o_S,\mathbf{y}) +\lambda\mathcal{H}(\tau(o_S),\tau(o_T)),
+$$
+:eqlabel:`c2Fcn:distill`

 where $\mathcal{H}(\cdot,\cdot)$ is the cross-entropy function, $o_S$
 and $o_T$ are outputs of the student network and the teacher network,
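
The distillation loss in this hunk combines a hard-label cross-entropy with a softened student/teacher cross-entropy. A sketch (illustrative names), assuming tau is the temperature-scaled softmax:

```python
import numpy as np

def softened(z, T=4.0):
    """Temperature-scaled softmax tau(z)."""
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(o_S, o_T, y, lam=0.5, T=4.0):
    """L_KD = H(o_S, y) + lambda * H(tau(o_S), tau(o_T))."""
    p_S = softened(o_S, T=1.0)                               # ordinary softmax
    hard = -np.log(p_S[np.arange(len(y)), y]).mean()         # H(o_S, y)
    soft = -(softened(o_T, T) * np.log(softened(o_S, T))).sum(-1).mean()
    return hard + lam * soft
```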
@@ -357,8 +357,7 @@ module. The attention module generates an attention map, which
 identifies the importance of different areas of an input image to the
 classification result. The attention map is then transferred from the
 teacher network to the student network, as depicted in Figure
- [3](#fig:ch-deploy/attentionTS){reference-type="ref"
-reference="fig:ch-deploy/attentionTS"}.
+ :numref:`ch-deploy/attentionTS`.

 KD is an effective method to optimize small networks. It can be combined
 with other compression methods such as pruning and quantization to train
