Commit ef4297a

Fix equation and image tag format
1 parent eb1b760 commit ef4297a

File tree

7 files changed: +150 -142 lines


.DS_Store

0 Bytes
Binary file not shown.

chapter_model_deployment/Advanced_Efficient_Techniques.md

Lines changed: 12 additions & 12 deletions
@@ -22,8 +22,7 @@ based on insights provided by Leviathan et al. [@leviathan2023fast].
 is achieved by processing them with the outputs from the
 approximation models in parallel.

-Figure [1](#fig:ch-deploy/sd){reference-type="ref"
-reference="fig:ch-deploy/sd"} is a brief overview of Speculative
+Figure :numref:`ch-deploy/sd` is a brief overview of Speculative
 Decoding. It involves initially generating a series of tokens using a
 draft model, which is a smaller and less complex model. These generated
 tokens are then verified in parallel with the target model, which is a
@@ -53,8 +52,9 @@ $M_{\text{target}}(\text{prefix} + [x_1 + ... + x_{\gamma}])$. If the
 condition $q(x) < p(x)$ is met, the token is retained. In contrast, if
 not met, the token faces a rejection chance of $1 - \frac{p(x)}{q(x)}$,
 following which it is reselected from an adjusted distribution:
-$$\label{equ:sd_adjusted}
-p'(x) = norm(max(0, p(x) - q(x)))$$ In the paper [@leviathan2023fast],
+$$
+p'(x) = norm(max(0, p(x) - q(x)))$$
+:eqlabel:`equ:sd_adjusted` In the paper [@leviathan2023fast],
 Leviathan et al. have proved the correctness of this adjusted
 distribution for resampling.
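
As a quick numerical check of the acceptance rule edited in this hunk, here is a minimal Python sketch. It assumes `p` and `q` are the target-model and draft-model probability vectors for one proposed token `x`; the function name and all identifiers are illustrative, not repository code.

```python
import numpy as np

def accept_or_resample(x, p, q, rng=None):
    """Verify one draft token x against target probs p and draft probs q.

    Accept x when q(x) <= p(x); otherwise accept with probability
    p(x)/q(x). On rejection, resample from the adjusted distribution
    p'(x) = norm(max(0, p(x) - q(x))).
    """
    rng = rng or np.random.default_rng()
    if q[x] <= p[x] or rng.random() < p[x] / q[x]:
        return x                                # token retained
    residual = np.maximum(0.0, p - q)           # max(0, p(x) - q(x))
    residual /= residual.sum()                  # norm(...)
    return rng.choice(len(p), p=residual)       # reselected token
```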
@@ -154,24 +154,25 @@ algorithm designed to minimize the intensive access to the GPU's high
 bandwidth memory (HBM). This innovation led to significant gains in both
 computational speed and throughput.

-Figure [2](#fig:ch-deploy/memory){reference-type="ref"
-reference="fig:ch-deploy/memory"} shows the memory hierarchy with
+Figure :numref:`ch-deploy/memory` shows the memory hierarchy with
 corresponding bandwidths. The main goal of FlashAttention is to avoid
 reading and writing the large attention matrix to and from HBM. And
 perform computation in SRAM as much as possible.

 The standard Scaled Dot-Product Attention [@attention] formula is
-$$\label{equ:std_attn}
-\textbf{A} = Softmax(\frac{\textbf{QK}^T}{\sqrt{d_k}})\textbf{V}$$
+$$
+\textbf{A} = Softmax(\frac{\textbf{QK}^T}{\sqrt{d_k}})\textbf{V}$$
+:eqlabel:`equ:std_attn`

 As $d_k$ is a scalar, we can simplify it into three parts:

-$$\label{equ:attn_sep}
+$$
 \begin{aligned}
 \textbf{S} = \textbf{QK}^T\\
 \textbf{P} = Softmax(\textbf{S})\\
 \textbf{O} = \textbf{PV}
-\end{aligned}$$
+\end{aligned}$$
+:eqlabel:`equ:attn_sep`

 The matrices **K**, **Q**, **V** are all stored in HBM. The standard
 implementation of attention follows these steps:
@@ -214,8 +215,7 @@ s(x) = e^{m(x_{1})-m(x)}s_{1}(x_1) + e^{m(x_2)-m(x)}s_{1}(x_2)\\
 Softmax(x) = \frac{l(x)}{s(x)}
 \end{aligned}$$

-Figure [3](#fig:ch-deploy/flashattn){reference-type="ref"
-reference="fig:ch-deploy/flashattn"} shows a brief overview of
+Figure :numref:`ch-deploy/flashattn` shows a brief overview of
 FlashAttention with two blocks. Following decomposition, Softmax
 calculations can be executed block by block. Therefore, **K, Q** and
 **V** are initially divided into blocks. Subsequently, compute the
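
The blockwise Softmax decomposition referenced in these hunks can be checked directly. Below is a small sketch under the same two-block setup (illustrative names, not repository code); the per-block maxima and sums are merged exactly as in the equations above.

```python
import numpy as np

def blockwise_softmax(x1, x2):
    """Softmax over the concatenation [x1, x2], computed block by block."""
    m1, m2 = x1.max(), x2.max()                              # per-block maxima m(x_i)
    s1, s2 = np.exp(x1 - m1).sum(), np.exp(x2 - m2).sum()    # per-block sums s_1(x_i)
    m = max(m1, m2)                                          # global max m(x)
    s = np.exp(m1 - m) * s1 + np.exp(m2 - m) * s2            # merged denominator s(x)
    l = np.concatenate([np.exp(x1 - m), np.exp(x2 - m)])     # numerators l(x)
    return l / s

x = np.random.randn(8)
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(blockwise_softmax(x[:4], x[4:]), reference)
```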

chapter_model_deployment/Conversion_to_Inference_Model_and_Model_Optimization.md

Lines changed: 45 additions & 51 deletions
@@ -55,10 +55,8 @@ performed (depending on the backend hardware support) once the
 compilation is complete. However, some optimization operations can only
 be performed in their entirety during the deployment phase.

-![Layered computer storage
-architecture](../img/ch08/ch09-storage.png){#fig:ch-deploy/fusion-storage}
-
-## Operator Fusion {#sec:ch-deploy/kernel-fusion}
+![Layered computer storagearchitecture](../img/ch08/ch09-storage.png)
+:label:`ch-deploy/fusion-storage}## Operator Fusion {#sec:ch-deploy/kernel-fusion`

 Operator fusion involves combining multiple operators in a deep neural
 network (DNN) model into a new operator based on certain rules, reducing
@@ -69,8 +67,7 @@ The two main performance benefits brought by operator fusion are as
 follows: First, it maximizes the utilization of registers and caches.
 And second, because it combines operators, the load/store time between
 the CPU and memory is reduced. Figure
-[1](#fig:ch-deploy/fusion-storage){reference-type="ref"
-reference="fig:ch-deploy/fusion-storage"} shows the architecture of a
+:numref:`ch-deploy/fusion-storage` shows the architecture of a
 computer's storage system. While the storage capacity increases from the
 level-1 cache (L1) to hard disk, so too does the time for reading data.
 After operator fusion is performed, the previous computation result can
@@ -80,57 +77,55 @@ operations on the memory. Furthermore, operator fusion allows some
 computation to be completed in advance, eliminating redundant or even
 cyclic redundant computing during forward computation.

-![Convolution + Batchnorm operator
-fusion](../img/ch08/ch09-conv-bn-fusion.png){#fig:ch-deploy/conv-bn-fusion}
+![Convolution + Batchnorm operatorfusion](../img/ch08/ch09-conv-bn-fusion.png)
+:label:`ch-deploy/conv-bn-fusion`

 To describe the principle of operator fusion, we will use two operators,
 Convolution and Batchnorm, as shown in Figure
-[2](#fig:ch-deploy/conv-bn-fusion){reference-type="ref"
-reference="fig:ch-deploy/conv-bn-fusion"}. In the figure, the
+:numref:`ch-deploy/conv-bn-fusion`. In the figure, the
 solid-colored boxes indicate operators, the resulting operators after
 fusion is performed are represented by hatched boxes, and the weights or
 constant tensors of operators are outlined in white. The fusion can be
 understood as the simplification of an equation. The computation of
 Convolution is expressed as Equation
-[\[equ:ch-deploy/conv-equation\]](#equ:ch-deploy/conv-equation){reference-type="ref"
-reference="equ:ch-deploy/conv-equation"}.
+:eqref:`ch-deploy/conv-equation`.

-$$\mathbf{Y_{\rm conv}}=\mathbf{W_{\rm conv}}\cdot\mathbf{X_{\rm conv}}+\mathbf{B_{\rm conv}}, \text{equ:ch-deploy/conv-equation}$$
+$$
+\bm{Y_{\rm conv}}=\bm{W_{\rm conv}}\cdot\bm{X_{\rm conv}}+\bm{B_{\rm conv}}$$
+:eqlabel:`equ:ch-deploy/conv-equation`

 Here, we do not need to understand what each variable means. Instead, we
 only need to keep in mind that Equation
-[\[equ:ch-deploy/conv-equation\]](#equ:ch-deploy/conv-equation){reference-type="ref"
-reference="equ:ch-deploy/conv-equation"} is an equation for
-$\mathbf{Y_{\rm conv}}$ with respect to $\mathbf{X_{\rm conv}}$, and other
+:eqref:`ch-deploy/conv-equation` is an equation for
+$\bm{Y_{\rm conv}}$ with respect to $\bm{X_{\rm conv}}$, and other
 symbols are constants.

 Equation
-[\[equ:ch-deploy/bn-equation\]](#equ:ch-deploy/bn-equation){reference-type="ref"
-reference="equ:ch-deploy/bn-equation"} is about the computation of
+:eqref:`ch-deploy/bn-equation` is about the computation of
 Batchnorm:

-**equ:ch-deploy/bn-equation:**\
-$$\mathbf{Y_{\rm bn}}=\gamma\frac{\mathbf{X_{\rm bn}}-\mu_{\mathcal{B}}}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}+\beta$$
+$$
+\bm{Y_{\rm bn}}=\gamma\frac{\bm{X_{\rm bn}}-\mu_{\mathcal{B}}}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}+\beta$$
+:eqlabel:`equ:ch-deploy/bn-equation`

-Similarly, it is an equation for $\mathbf{Y_{\rm bn}}$ with respect to
-$\mathbf{X_{\rm bn}}$. Other symbols in the equation represent constants.
+Similarly, it is an equation for $\bm{Y_{\rm bn}}$ with respect to
+$\bm{X_{\rm bn}}$. Other symbols in the equation represent constants.

 As shown in Figure
-[2](#fig:ch-deploy/conv-bn-fusion){reference-type="ref"
-reference="fig:ch-deploy/conv-bn-fusion"}, when the output of
+:numref:`ch-deploy/conv-bn-fusion`, when the output of
 Convolution is used as the input of Batchnorm, the formula of Batchnorm
-is a function for $\mathbf{Y_{\rm bn}}$ with respect to $\mathbf{X_{\rm conv}}$.
-After substituting $\mathbf{Y_{\rm conv}}$ into $\mathbf{X_{\rm bn}}$ and
+is a function for $\bm{Y_{\rm bn}}$ with respect to $\bm{X_{\rm conv}}$.
+After substituting $\bm{Y_{\rm conv}}$ into $\bm{X_{\rm bn}}$ and
 uniting and extracting the constants, we obtain Equation
-[\[equ:ch-deploy/conv-bn-equation-3\]](#equ:ch-deploy/conv-bn-equation-3){reference-type="ref"
-reference="equ:ch-deploy/conv-bn-equation-3"}.
+:eqref:`ch-deploy/conv-bn-equation-3`.

-$$\mathbf{Y_{\rm bn}}=\mathbf{A}\cdot\mathbf{X_{\rm conv}}+\mathbf{B}, \text{equ:ch-deploy/conv-bn-equation-3}$$
+$$
+\bm{Y_{\rm bn}}=\bm{A}\cdot\bm{X_{\rm conv}}+\bm{B}$$
+:eqlabel:`equ:ch-deploy/conv-bn-equation-3`

-Here, $\mathbf{A}$ and $\mathbf{B}$ are two matrices. It can be noticed that
+Here, $\bm{A}$ and $\bm{B}$ are two matrices. It can be noticed that
 Equation
-[\[equ:ch-deploy/conv-bn-equation-3\]](#equ:ch-deploy/conv-bn-equation-3){reference-type="ref"
-reference="equ:ch-deploy/conv-bn-equation-3"} is a formula for computing
+:eqref:`ch-deploy/conv-bn-equation-3` is a formula for computing
 Convolution. The preceding example shows that the computation of
 Convolution and Batchnorm can be fused into an equivalent Convolution
 operator. Such fusion is referred to as formula fusion.
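
The Convolution + Batchnorm formula fusion described in this hunk amounts to folding the Batchnorm constants into the convolution weight and bias. A minimal sketch follows, assuming per-output-channel Batchnorm parameters; the function name and shapes are illustrative, not taken from the repository.

```python
import numpy as np

def fold_bn_into_conv(W, B, gamma, beta, mean, var, eps=1e-5):
    """Fold Batchnorm into Convolution so that Y_bn = A * X_conv + B_new.

    W: conv weight (C_out, C_in, kH, kW);  B: conv bias (C_out,).
    gamma, beta, mean, var: per-channel Batchnorm parameters (C_out,).
    """
    scale = gamma / np.sqrt(var + eps)            # per-channel constant
    W_fused = W * scale[:, None, None, None]      # A = scale * W_conv
    B_fused = scale * (B - mean) + beta           # folded bias B_new
    return W_fused, B_fused
```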
@@ -162,13 +157,14 @@ after the fusion --- by 8.5% and 11.7% respectively. Such improvements
 are achieved without bringing side effects and without requiring
 additional hardware or operator libraries.

-::: {#tab:ch09/ch09-conv-bn-fusion} <br>
-Fusion | Sample | Mobilenet-v2 |
----------------| --------|-------------- |
-Before fusion | 0.035 | 15.415 |
-After fusion | 0.031 | 13.606 |
+::: {#tab:ch09/ch09-conv-bn-fusion}
+Fusion           Sample    Mobilenet-v2
+---------------  --------  --------------
+Before fusion    0.035     15.415
+After fusion     0.031     13.606

-Convolution + Batchnorm inference performance before and after fusion (unit: ms)
+: Convolution + Batchnorm inference performance before and after
+fusion (unit: ms)
 :::

 ## Operator Replacement
@@ -180,20 +176,19 @@ type of operators that have the same computational logic but are more
 suitable for online deployment. In this way, we can reduce the
 computation workload and compress the model.

-![Replacement of
-Batchnorm](../img/ch08/ch09-bn-replace.png){#fig:ch-deploy/bn-replace}
+![Replacement ofBatchnorm](../img/ch08/ch09-bn-replace.png)
+:label:`ch-deploy/bn-replace`

-Figure [3](#fig:ch-deploy/bn-replace){reference-type="ref"
-reference="fig:ch-deploy/bn-replace"} depicts the replacement of
+Figure :numref:`ch-deploy/bn-replace` depicts the replacement of
 Batchnorm with Scale, which is used as an example to describe the
 principle of operator replacement. After decomposing Equation
-[\[equ:ch-deploy/bn-equation\]](#equ:ch-deploy/bn-equation){reference-type="ref"
-reference="equ:ch-deploy/bn-equation"} (the Batchnorm formula) and
+:eqref:`ch-deploy/bn-equation` (the Batchnorm formula) and
 folding the constants, Batchnorm is defined as Equation
-[\[equ:ch-deploy/replace-scale\]](#equ:ch-deploy/replace-scale){reference-type="ref"
-reference="equ:ch-deploy/replace-scale"}
+:eqref:`ch-deploy/replace-scale`

-$$\mathbf{Y_{bn}}=scale\cdot\mathbf{X_{bn}}+offset, \text{equ:ch-deploy/replace-scale} $$
+$$
+\bm{Y_{bn}}=scale\cdot\bm{X_{bn}}+offset$$
+:eqlabel:`equ:ch-deploy/replace-scale`

 where **scale** and **offsets** are scalars. This simplified formula can
 be mapped to a Scale operator.
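
The Batchnorm-to-Scale replacement above is the same constant folding expressed at the operator level. A short sketch of the resulting y = scale * x + offset computation (all names illustrative):

```python
import numpy as np

def batchnorm_as_scale(x, gamma, beta, mean, var, eps=1e-5):
    """Replace Batchnorm with a Scale operator: y = scale * x + offset."""
    scale = gamma / np.sqrt(var + eps)    # folded constant
    offset = beta - scale * mean          # folded constant
    return scale * x + offset             # equivalent to the Batchnorm formula
```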
@@ -218,13 +213,12 @@ Common methods of operator reordering include moving cropping operators
 (e.g., Slice, StrideSlice, and Crop) forward, and reordering Reshape,
 Transpose, and BinaryOp.

-![Reordering of
-Crop](../img/ch08/ch09-crop-reorder.png){#fig:ch-deploy/crop-reorder}
+![Reordering ofCrop](../img/ch08/ch09-crop-reorder.png)
+:label:`ch-deploy/crop-reorder`

 Crop is used to cut a part out of the input feature map as the output.
 After Crop is executed, the size of the feature map is reduced. As shown
-in Figure [4](#fig:ch-deploy/crop-reorder){reference-type="ref"
-reference="fig:ch-deploy/crop-reorder"}, moving Crop forward to cut the
+in Figure :numref:`ch-deploy/crop-reorder`, moving Crop forward to cut the
 feature map before other operators reduces the computation workload of
 subsequent operators, thereby improving the inference performance in the
 deployment phase. Such improvement is related to the operator

chapter_model_deployment/Index.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+# Model Deployment {#ch:deploy}
+
+In earlier chapters, we discussed the basic components of the machine
+learning model training system. In this chapter, we look at the basics
+of model deployment, a process whereby a trained model is deployed in a
+runtime environment for inference. We explore the conversion from a
+training model into an inference model, model compression methods that
+adapt to hardware restrictions, model inference and performance
+optimization, and model security protection.
+
+The key aspects this chapter explores are as follows:
+
+1. Conversion and optimization from a training model to an inference
+model.
+
+2. Common methods for model compression: quantization, sparsification,
+and knowledge distillation.
+
+3. Model inference process and common methods for performance
+optimization.
+
+4. Common methods for model security protection.

chapter_model_deployment/Model_Compression.md

Lines changed: 16 additions & 17 deletions
@@ -14,16 +14,15 @@ Model quantization is a technique that approximates floating-point
 weights of contiguous values (usually float32 or many possibly discrete
 values) at the cost of slightly reducing accuracy to a limited number of
 discrete values (usually int8). As shown in Figure
-[1](#fig:ch-deploy/quant-minmax){reference-type="ref"
-reference="fig:ch-deploy/quant-minmax"}, $T$ represents the data range
+:numref:`ch-deploy/quant-minmax`, $T$ represents the data range
 before quantization. In order to reduce the model size, model
 quantization represents floating-point data with fewer bits. As such,
 the memory usage during inference can be reduced, and the inference on
 processors that are good at processing low-precision operations can be
 accelerated.

-![Principles of
-quantization](../img/ch08/ch09-quant-minmax.png){#fig:ch-deploy/quant-minmax}
+![Principles ofquantization](../img/ch08/ch09-quant-minmax.png)
+:label:`ch-deploy/quant-minmax`

 The number of bits and the range of data represented by different data
 types in a computer are different. Based on service requirements, a
@@ -49,12 +48,12 @@ that linear quantization is more commonly used. The following therefore
 focuses on the principles of linear quantization.

 In Equation
-[\[equ:ch-deploy/quantization-q\]](#equ:ch-deploy/quantization-q){reference-type="ref"
-reference="equ:ch-deploy/quantization-q"}, assume that $r$ represents
+:eqref:`ch-deploy/quantization-q`, assume that $r$ represents
 the floating-point number before quantization. We are then able to
 obtain the integer $q$ after quantization.

-$$q=clip(round(\frac{r}{s}+z),q_{min},q_{max}), \text{equ:ch-deploy/quantization-q}$$
+$$q=clip(round(\frac{r}{s}+z),q_{min},q_{max})$$
+:eqlabel:`equ:ch-deploy/quantization-q`

 $clip(\cdot)$ and $round(\cdot)$ indicate the truncation and rounding
 operations, and $q_{min}$ and $q_{max}$ indicate the minimum and maximum
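
The linear quantization formula in this hunk maps directly to a few lines of code. A hedged sketch (illustrative names), including one common way to pick the scale s and zero point z from the observed range:

```python
import numpy as np

def quantize(r, s, z, q_min=-128, q_max=127):
    """q = clip(round(r / s + z), q_min, q_max)."""
    return np.clip(np.round(r / s + z), q_min, q_max).astype(np.int8)

def dequantize(q, s, z):
    """Approximate recovery of the floating-point value: r ~ s * (q - z)."""
    return s * (q.astype(np.float32) - z)

# Example: min-max calibration of s and z for int8 (one possible choice).
r = np.random.randn(1000).astype(np.float32)
s = (r.max() - r.min()) / 255.0
z = np.round(-128 - r.min() / s)
```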
@@ -170,16 +169,16 @@ before quantization. Assume that the mean value and variance of the
 weight of a channel are $E(w_c)$ and $||w_c-E(w_c)||$, and the mean
 value and variance after quantization are $E(\hat{w_c})$ and
 $||\hat{w_c}-E(\hat{w_c})||$, respectively. Equation
-[\[equ:ch-deploy/post-quantization\]](#equ:ch-deploy/post-quantization){reference-type="ref"
-reference="equ:ch-deploy/post-quantization"} is the calibration of the
+:eqref:`ch-deploy/post-quantization` is the calibration of the
 weight:

 $$
 \begin{aligned}
 \hat{w_c}\leftarrow\zeta_c(\hat{w_c}+u_c) \\
 u_c=E(w_c)-E(\hat{w_c}) \\
 \zeta_c=\frac{||w_c-E(w_c)||}{||\hat{w_c}-E(\hat{w_c})||}
-\end{aligned}, \text{equ:ch-deploy/post-quantization}$$
+\end{aligned}$$
+:eqlabel:`equ:ch-deploy/post-quantization`

 As a general model compression method, quantization can significantly
 improve the memory and compression efficiency of neural networks, and
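
The per-channel calibration in this hunk can be transcribed directly from the equation. A sketch (illustrative names), assuming `w` is one channel of the original weight and `w_hat` its quantized-then-dequantized counterpart:

```python
import numpy as np

def calibrate_channel(w, w_hat):
    """Per-channel correction: w_hat <- zeta_c * (w_hat + u_c)."""
    u_c = w.mean() - w_hat.mean()     # u_c = E(w_c) - E(w_hat_c)
    zeta_c = np.linalg.norm(w - w.mean()) / np.linalg.norm(w_hat - w_hat.mean())
    return zeta_c * (w_hat + u_c)
```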
@@ -281,15 +280,14 @@ identified more efficiently. As such, iterative pruning is widely used.
 To illustrate how to prune a network, we will use Deep
 Compression [@han2015deep] as an example. Removing most weights leads to
 a loss of accuracy of the neural network, as shown in Figure
-[2](#fig:ch-deploy/deepcomp){reference-type="ref"
-reference="fig:ch-deploy/deepcomp"}. Fine-tuning a pruned sparse neural
+:numref:`ch-deploy/deepcomp`. Fine-tuning a pruned sparse neural
 network can help improve model accuracy, and the pruned network may be
 quantized to represent weights using fewer bits. In addition, using
 Huffman coding can further reduce the memory cost of the deep neural
 network.

-![Deep Compression
-algorithm](../img/ch08/ch09-deepcomp.png){#fig:ch-deploy/deepcomp}
+![Deep Compressionalgorithm](../img/ch08/ch09-deepcomp.png)
+:label:`ch-deploy/deepcomp`

 In addition to removing redundant neurons, a dictionary learning-based
 method can be used to remove unnecessary weights on a deep convolutional
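
Magnitude-based pruning of the kind used in pipelines such as Deep Compression keeps only the largest weights and fine-tunes with the resulting mask. A minimal sketch; the sparsity target and helper name are assumptions, not details from this commit:

```python
import numpy as np

def prune_by_magnitude(W, sparsity=0.9):
    """Zero the smallest-magnitude entries so that `sparsity` of W is zero."""
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) > threshold       # keep only large-magnitude weights
    return W * mask, mask              # mask is reused during fine-tuning
```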
@@ -326,7 +324,9 @@ classification result of the teacher network, that is, Equation
 [\[c2Fcn:distill\]](#c2Fcn:distill){reference-type="ref"
 reference="c2Fcn:distill"}.

-$$\mathcal{L}_{KD}(\theta_S) = \mathcal{H}(o_S,\mathbf{y}) +\lambda\mathcal{H}(\tau(o_S),\tau(o_T)), \text{c2Fcn:distill}$$
+$$\mathcal{L}_{KD}(\theta_S) = \mathcal{H}(o_S,\mathbf{y}) +\lambda\mathcal{H}(\tau(o_S),\tau(o_T)),
+$$
+:eqlabel:`c2Fcn:distill`

 where $\mathcal{H}(\cdot,\cdot)$ is the cross-entropy function, $o_S$
 and $o_T$ are outputs of the student network and the teacher network,
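
The distillation loss in this hunk combines a hard-label cross-entropy with a softened student/teacher cross-entropy. A sketch (illustrative names), assuming tau is the temperature-scaled softmax:

```python
import numpy as np

def softened(z, T=4.0):
    """Temperature-scaled softmax tau(z)."""
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(o_S, o_T, y, lam=0.5, T=4.0):
    """L_KD = H(o_S, y) + lambda * H(tau(o_S), tau(o_T))."""
    p_S = softened(o_S, T=1.0)                               # ordinary softmax
    hard = -np.log(p_S[np.arange(len(y)), y]).mean()         # H(o_S, y)
    soft = -(softened(o_T, T) * np.log(softened(o_S, T))).sum(-1).mean()
    return hard + lam * soft
```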
@@ -357,8 +357,7 @@ module. The attention module generates an attention map, which
 identifies the importance of different areas of an input image to the
 classification result. The attention map is then transferred from the
 teacher network to the student network, as depicted in Figure
- [3](#fig:ch-deploy/attentionTS){reference-type="ref"
-reference="fig:ch-deploy/attentionTS"}.
+ :numref:`ch-deploy/attentionTS`.

 KD is an effective method to optimize small networks. It can be combined
 with other compression methods such as pruning and quantization to train
