
Commit 249876d

Upload chapter of model deployment
1 parent 69e13ee commit 249876d


44 files changed: +1649 -0 lines changed

.DS_Store (8 KB): binary file not shown.

ch_model_deployment/Advanced_Efficient_Techniques.md

Lines changed: 344 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 38 additions & 0 deletions
# Chapter Summary

1. Model deployment is restricted by factors including the model size, runtime memory usage, inference latency, and inference power consumption.

2. Models can be compressed using techniques such as quantization, pruning, and knowledge distillation in the offline phase. In addition, some model optimization techniques, such as operator fusion, can also reduce the model size, albeit to a lesser degree.

3. Runtime memory usage can be improved by optimizing the model size, the deployment framework size, and the runtime temporary memory usage. Methods for optimizing the model size have been summarized earlier. Making the framework code simpler and more modular helps optimize the deployment framework. Memory pooling can help implement memory overcommitment to optimize the runtime temporary memory usage.

4. Model inference latency can be optimized from two aspects. In the offline phase, the model computation workload can be reduced using model optimization and compression methods. Furthermore, improving the inference parallelism and optimizing operator implementations can help maximize the utilization of the available computing power. In addition to the computation workload and computing power, consideration should be given to the load/store overhead during inference.

5. Power consumption during inference can be reduced through offline model optimization and compression technologies. By reducing the computational workload, these technologies also reduce power consumption, an effect that coincides with the optimization methods for model inference latency.

6. In addition to optimizing the factors related to model deployment, this chapter also discussed technologies for deployment security, such as model obfuscation and model encryption. Secure deployment protects the model assets of enterprises and prevents hackers from attacking the deployment environment by tampering with models.
Lines changed: 240 additions & 0 deletions
# Conversion to Inference Model and Model Optimization {#sec:ch-deploy/model-optimization}

## Model Conversion

As mentioned earlier, TensorFlow, PyTorch, MindSpore, MXNet, and CNTK define their own model data structures. This means that the inference system needs to convert these structures into a unified one. Open Neural Network Exchange (ONNX) is designed to implement such conversion. It supports an extensive range of machine learning operators and converts models from various frameworks (e.g., TensorFlow and PyTorch) into ONNX models. Because models are structured data, the conversion process is essentially a conversion between data structures. It starts by analyzing the similarities and differences between the two structures. If they are the same, data is transferred directly; if the structures are similar but with slight differences, data is mapped; if the structures differ significantly, extra semantic conversion might be required; and if they are totally incompatible, the conversion fails. ONNX features strong expressive power, meaning that it can convert models from most frameworks in the industry into compatible ONNX models. If a model is abstracted as a graph, its data structure can be defined as follows:

1. **Topological expression of the model:** The topological connections of a model are represented as edges in a graph. From the perspective of the model, these edges define its data flows and control flows. Based on such definitions, we can extend to the expression of subgraphs, model inputs and outputs, and control flow structures. For example, control flow in TensorFlow 1.x is expressed as a cyclic graph built from operators such as Enter, Exit, Switch, LoopCond, and NextIteration, whereas ONNX avoids cycles by expressing control flow with operators such as Loop and If. As such, when converting a TensorFlow 1.x control flow model into an ONNX model, the control flow graph structure in the TensorFlow model must be merged into a Loop or If operator on ONNX.

2. **Operator prototype definition:** Operators can be regarded as the data processing or control flow nodes of a model, or as vertices in a graph. An operator prototype defines the type, inputs, outputs, and attributes of an operator. For instance, Slice has different semantics on Caffe and ONNX. To convert a Caffe model into an ONNX model, we need to map Slice on Caffe to Split on ONNX. FusedBatchnorm on TensorFlow has no corresponding operator on Caffe; rather, Batchnorm and Scale on Caffe need to be combined to express the same semantics as FusedBatchnorm on TensorFlow. Generally, the model conversion process involves converting the topological relationships and mapping the operator prototypes between models.
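In practice, the conversion step is usually a single call in each framework. The following is a minimal sketch, not taken from this chapter, that exports a small PyTorch model to ONNX and checks the result; the network definition, file name, and opset version are illustrative assumptions.

```python
# Minimal sketch: convert a PyTorch model to an ONNX inference model.
# The network, file name, and opset version below are illustrative assumptions.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

model = TinyNet().eval()                   # inference mode: BN uses stored statistics
dummy_input = torch.randn(1, 3, 224, 224)  # example input that fixes the graph shapes

# torch.onnx.export traces the model and maps each PyTorch operator
# to the corresponding ONNX operator prototype.
torch.onnx.export(
    model, dummy_input, "tiny_net.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=13,
)

# Optional sanity check with the onnx package.
import onnx
onnx_model = onnx.load("tiny_net.onnx")
onnx.checker.check_model(onnx_model)
```

The exported graph can then be consumed by an ONNX-compatible inference engine or converted further into a deployment-specific format.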
Following model conversion, some input-agnostic operations are conducted for optimization purposes prior to model deployment, including constant folding, operator fusion, operator replacement, and operator reordering --- optimization methods discussed earlier in this book. For instance, constant folding is usually performed during compilation on the compiler frontend, whereas operator fusion and partitioning are often performed (depending on the backend hardware support) once the compilation is complete. However, some optimization operations can only be performed in their entirety during the deployment phase.
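Since constant folding is only named here, a toy sketch may help recall what the pass does; the `Node` structure and `fold` function below are invented purely for illustration and do not correspond to any framework's intermediate representation.

```python
# Toy constant-folding pass over a tiny expression "graph".
# The node representation here is invented purely for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    op: str                          # "input", "const", "add", or "mul"
    value: Optional[float] = None    # set only for "const" nodes
    lhs: Optional["Node"] = None
    rhs: Optional["Node"] = None

def fold(node: Node) -> Node:
    """Recursively replace operators whose inputs are all constants."""
    if node.op in ("input", "const"):
        return node
    lhs, rhs = fold(node.lhs), fold(node.rhs)
    if lhs.op == "const" and rhs.op == "const":
        value = lhs.value + rhs.value if node.op == "add" else lhs.value * rhs.value
        return Node("const", value=value)    # computed offline, not during inference
    return Node(node.op, lhs=lhs, rhs=rhs)

# x * (2 + 3) is folded to x * 5: the (2 + 3) subtree never runs at inference time.
graph = Node("mul", lhs=Node("input"),
             rhs=Node("add", lhs=Node("const", value=2.0), rhs=Node("const", value=3.0)))
folded = fold(graph)
assert folded.rhs.op == "const" and folded.rhs.value == 5.0
```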
![Layered computer storage architecture](../img/ch08/ch09-storage.png){#fig:ch-deploy/fusion-storage}
## Operator Fusion {#sec:ch-deploy/kernel-fusion}

Operator fusion involves combining multiple operators in a deep neural network (DNN) model into a new operator based on certain rules, reducing the inference latency and power consumption by lowering the computation workload and load/store overhead during online inference.

The two main performance benefits of operator fusion are as follows: first, it maximizes the utilization of registers and caches; second, because it combines operators, the load/store traffic between the CPU and memory is reduced. Figure [1](#fig:ch-deploy/fusion-storage){reference-type="ref" reference="fig:ch-deploy/fusion-storage"} shows the architecture of a computer's storage system. While the storage capacity increases from the level-1 cache (L1) to the hard disk, so too does the time needed to read data. After operator fusion is performed, the previous computation result can be temporarily stored in a CPU register or cache, where the next computation can read it directly, reducing the number of I/O operations on the memory. Furthermore, operator fusion allows some computation to be completed in advance, eliminating redundant or even cyclically redundant computing during forward computation.

![Convolution + Batchnorm operator fusion](../img/ch08/ch09-conv-bn-fusion.png){#fig:ch-deploy/conv-bn-fusion}

To describe the principle of operator fusion, we will use two operators, Convolution and Batchnorm, as shown in Figure [2](#fig:ch-deploy/conv-bn-fusion){reference-type="ref" reference="fig:ch-deploy/conv-bn-fusion"}. In the figure, the solid-colored boxes indicate operators, the operators resulting from fusion are represented by hatched boxes, and the weights or constant tensors of operators are outlined in white. The fusion can be understood as the simplification of an equation. The computation of Convolution is expressed as Equation [\[equ:ch-deploy/conv-equation\]](#equ:ch-deploy/conv-equation){reference-type="ref" reference="equ:ch-deploy/conv-equation"}.

$$\label{equ:ch-deploy/conv-equation}
\bm{Y_{\rm conv}}=\bm{W_{\rm conv}}\cdot\bm{X_{\rm conv}}+\bm{B_{\rm conv}}$$

Here, we do not need to understand what each variable means. Instead, we only need to keep in mind that Equation [\[equ:ch-deploy/conv-equation\]](#equ:ch-deploy/conv-equation){reference-type="ref" reference="equ:ch-deploy/conv-equation"} is an equation for $\bm{Y_{\rm conv}}$ with respect to $\bm{X_{\rm conv}}$, and the other symbols are constants.

Equation [\[equ:ch-deploy/bn-equation\]](#equ:ch-deploy/bn-equation){reference-type="ref" reference="equ:ch-deploy/bn-equation"} describes the computation of Batchnorm:

$$\label{equ:ch-deploy/bn-equation}
\bm{Y_{\rm bn}}=\gamma\frac{\bm{X_{\rm bn}}-\mu_{\mathcal{B}}}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}+\beta$$

Similarly, it is an equation for $\bm{Y_{\rm bn}}$ with respect to $\bm{X_{\rm bn}}$; the other symbols in the equation represent constants.

As shown in Figure [2](#fig:ch-deploy/conv-bn-fusion){reference-type="ref" reference="fig:ch-deploy/conv-bn-fusion"}, when the output of Convolution is used as the input of Batchnorm, the formula of Batchnorm becomes a function for $\bm{Y_{\rm bn}}$ with respect to $\bm{X_{\rm conv}}$. After substituting $\bm{Y_{\rm conv}}$ for $\bm{X_{\rm bn}}$ and collecting and extracting the constants, we obtain Equation [\[equ:ch-deploy/conv-bn-equation-3\]](#equ:ch-deploy/conv-bn-equation-3){reference-type="ref" reference="equ:ch-deploy/conv-bn-equation-3"}.

$$\label{equ:ch-deploy/conv-bn-equation-3}
\bm{Y_{\rm bn}}=\bm{A}\cdot\bm{X_{\rm conv}}+\bm{B}$$

Here, $\bm{A}$ and $\bm{B}$ are two constant matrices. It can be noticed that Equation [\[equ:ch-deploy/conv-bn-equation-3\]](#equ:ch-deploy/conv-bn-equation-3){reference-type="ref" reference="equ:ch-deploy/conv-bn-equation-3"} has the same form as the Convolution computation. The preceding example shows that the computation of Convolution and Batchnorm can be fused into an equivalent Convolution operator. Such fusion is referred to as formula fusion.
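For completeness, carrying out the substitution explicitly (a step the text leaves implicit) shows what the fused constants are:

$$\bm{Y_{\rm bn}}=\gamma\frac{\left(\bm{W_{\rm conv}}\cdot\bm{X_{\rm conv}}+\bm{B_{\rm conv}}\right)-\mu_{\mathcal{B}}}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}+\beta$$

$$\bm{A}=\frac{\gamma}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}\,\bm{W_{\rm conv}},\qquad
\bm{B}=\frac{\gamma\left(\bm{B_{\rm conv}}-\mu_{\mathcal{B}}\right)}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}+\beta$$

In other words, the fused Convolution simply rescales the original weights and adjusts the bias; both $\bm{A}$ and $\bm{B}$ can be precomputed once offline, so no Batchnorm computation remains in the inference graph.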
The fusion of Convolution and Batchnorm eliminates a Batchnorm operation, thereby reducing the quantity of parameters and the computation workload, which in turn reduces the load/store operations. In general, this fusion not only optimizes the power consumption and performance during model deployment, but also brings certain benefits in compressing the model size.

Symbols that are treated as constants in the Convolution and Batchnorm formulas during fusion are trainable parameters during training. Performing the fusion during training would therefore remove model parameters: because the fusion eliminates a Batchnorm operator and its corresponding parameters from the network, the algorithm of the DNN is changed, degrading the accuracy to unacceptable levels. Therefore, the fusion of Convolution and Batchnorm is an optimization method typically used during deployment. To evaluate the optimization effect, we constructed a sample network with Convolution and Batchnorm using MindSpore Lite. We ran the sample network and the mobilenet-v2 network for inference in dual threads on a Huawei Mate 30 smartphone to compare the time of running 3,000 inference epochs before and after the fusion. As shown in Table [1](#tab:ch09/ch09-conv-bn-fusion){reference-type="ref" reference="tab:ch09/ch09-conv-bn-fusion"}, the inference performance of the sample network and the mobilenet-v2 network is improved considerably after the fusion --- by 8.5% and 11.7%, respectively. Such improvements are achieved without side effects and without requiring additional hardware or operator libraries.

::: {#tab:ch09/ch09-conv-bn-fusion}
  Fusion          Sample   Mobilenet-v2
  --------------- -------- --------------
  Before fusion   0.035    15.415
  After fusion    0.031    13.606

  : Convolution + Batchnorm inference performance before and after fusion (unit: ms)
:::
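In code, formula fusion amounts to rewriting the stored weights once, offline. Below is a minimal NumPy sketch (the shapes, names, and tolerance are assumptions for illustration; this is not MindSpore Lite code) that folds Batchnorm parameters into the preceding Convolution's weights and bias.

```python
# Minimal sketch: fold Batchnorm parameters into the preceding Convolution.
# Shapes and names are illustrative; this is not framework code.
import numpy as np

def fold_bn_into_conv(w_conv, b_conv, gamma, beta, mean, var, eps=1e-5):
    """Return fused (weight, bias) such that conv(x, w', b') == bn(conv(x, w, b)).

    w_conv: (out_channels, in_channels, kh, kw) convolution weights
    b_conv: (out_channels,) convolution bias
    gamma, beta, mean, var: (out_channels,) Batchnorm parameters/statistics
    """
    scale = gamma / np.sqrt(var + eps)                 # per-output-channel factor
    w_fused = w_conv * scale[:, None, None, None]      # A = scale * W_conv
    b_fused = (b_conv - mean) * scale + beta           # B = scale * (B_conv - mu) + beta
    return w_fused, b_fused

# Random parameters for a toy check.
out_c, in_c = 8, 3
w = np.random.randn(out_c, in_c, 3, 3)
b = np.random.randn(out_c)
gamma, beta = np.random.randn(out_c), np.random.randn(out_c)
mean, var = np.random.randn(out_c), np.abs(np.random.randn(out_c))
w_f, b_f = fold_bn_into_conv(w, b, gamma, beta, mean, var)

# Verify equivalence at a single spatial position (one 3x3 patch), where the
# convolution reduces to a matrix-vector product.
patch = np.random.randn(in_c, 3, 3).reshape(-1)
conv_out = w.reshape(out_c, -1) @ patch + b
bn_out = gamma * (conv_out - mean) / np.sqrt(var + 1e-5) + beta
fused_out = w_f.reshape(out_c, -1) @ patch + b_f
assert np.allclose(bn_out, fused_out)
```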
## Operator Replacement

The principle of operator replacement is to simplify an operator's formula by combining like terms, extracting common factors, and applying other mathematical methods, and then map the simplified formula to a type of operator that has the same computational logic but is better suited to online deployment. In this way, we can reduce the computation workload and compress the model.

![Replacement of Batchnorm](../img/ch08/ch09-bn-replace.png){#fig:ch-deploy/bn-replace}

Figure [3](#fig:ch-deploy/bn-replace){reference-type="ref" reference="fig:ch-deploy/bn-replace"} depicts the replacement of Batchnorm with Scale, which we use as an example to describe the principle of operator replacement. After decomposing Equation [\[equ:ch-deploy/bn-equation\]](#equ:ch-deploy/bn-equation){reference-type="ref" reference="equ:ch-deploy/bn-equation"} (the Batchnorm formula) and folding the constants, Batchnorm is rewritten as Equation [\[equ:ch-deploy/replace-scale\]](#equ:ch-deploy/replace-scale){reference-type="ref" reference="equ:ch-deploy/replace-scale"}

$$\label{equ:ch-deploy/replace-scale}
\bm{Y_{\rm bn}}=scale\cdot\bm{X_{\rm bn}}+offset$$

where **scale** and **offset** are scalars. This simplified formula can be mapped to a Scale operator.

Compared with the original Batchnorm formula, the simplified formula has fewer parameters and involves less computation. This indicates that operator replacement is an effective approach to optimizing the power consumption and performance of a model during deployment. Symbols that are treated as constants in Batchnorm during deployment are not constants during training, meaning that the replacement can be performed only during deployment. Because operator replacement reduces the quantity of parameters and changes the structure of the model, applying it during training would weaken the expressive power of the model and reduce the accuracy it converges to.
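A minimal sketch of the corresponding precomputation, assuming per-channel Batchnorm parameters and invented variable names, looks as follows:

```python
# Minimal sketch: replace an inference-time Batchnorm with a Scale operator
# by precomputing scale and offset from the trained BN parameters.
import numpy as np

def batchnorm_to_scale(gamma, beta, mean, var, eps=1e-5):
    """Precompute (scale, offset) so that scale * x + offset == batchnorm(x)."""
    scale = gamma / np.sqrt(var + eps)
    offset = beta - mean * scale
    return scale, offset

# Toy per-channel check.
gamma, beta = np.array([1.2, 0.8]), np.array([0.1, -0.3])
mean, var = np.array([0.5, -1.0]), np.array([2.0, 0.5])
x = np.random.randn(4, 2)                          # (batch, channels)
scale, offset = batchnorm_to_scale(gamma, beta, mean, var)
bn_out = gamma * (x - mean) / np.sqrt(var + 1e-5) + beta
assert np.allclose(scale * x + offset, bn_out)
```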
## Operator Reordering

Another way of reducing the computation workload of an inference model is to adjust the topological order of its operators according to certain rules, on the condition that the inference accuracy is not degraded. Common methods of operator reordering include moving cropping operators (e.g., Slice, StrideSlice, and Crop) forward, and reordering Reshape, Transpose, and BinaryOp operators.

![Reordering of Crop](../img/ch08/ch09-crop-reorder.png){#fig:ch-deploy/crop-reorder}

Crop cuts a part out of the input feature map and passes it on as the output, so the feature map shrinks after Crop is executed. As shown in Figure [4](#fig:ch-deploy/crop-reorder){reference-type="ref" reference="fig:ch-deploy/crop-reorder"}, moving Crop forward so that the feature map is cut before other operators run reduces the computation workload of those subsequent operators, thereby improving the inference performance in the deployment phase. The size of the improvement depends on the operator parameters. Note, however, that Crop can be moved forward only across element-wise operators.
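To see why the rule is restricted to element-wise operators, it helps to check the commutation numerically. The sketch below (NumPy only, with illustrative names) verifies that cropping after ReLU equals ReLU after cropping:

```python
# Minimal sketch: cropping commutes with element-wise operators such as ReLU,
# which is why Crop can be moved in front of them; names are illustrative only.
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def crop(x, h0, h1, w0, w1):
    """Cut a spatial window out of a (channels, height, width) feature map."""
    return x[:, h0:h1, w0:w1]

feature_map = np.random.randn(8, 32, 32)

# ReLU on the full 32x32 map, then crop ...
a = crop(relu(feature_map), 4, 20, 4, 20)
# ... equals crop first, then ReLU on the smaller 16x16 map (less computation).
b = relu(crop(feature_map, 4, 20, 4, 20))
assert np.allclose(a, b)
```

Operators that mix information across spatial positions, such as pooling or convolution, do not commute with Crop in general, which is why the reordering rule stops at element-wise operators.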
The experimental results above demonstrate that optimizing models before inference can significantly reduce the latency, power consumption, and memory usage.
Lines changed: 36 additions & 0 deletions
# Further Reading

1. A Distributed Graph-Theoretic Framework for Automatic Parallelization in Multi-Core Systems[^1]

2. SCOP: Scientific Control for Reliable Neural Network Pruning[^2]

3. Searching for Low-Bit Weights in Quantized Neural Networks[^3]

4. GhostNet: More Features from Cheap Operations[^4]

5. AdderNet: Do We Really Need Multiplications in Deep Learning?[^5]

6. Blockwise Parallel Decoding for Deep Autoregressive Models[^6]

7. Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads[^7]

8. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning[^8]

[^1]: <https://proceedings.mlsys.org/paper/2021/file/a5e00132373a7031000fd987a3c9f87b-Paper.pdf>

[^2]: <https://arxiv.org/abs/2010.10732>

[^3]: <https://arxiv.org/abs/2009.08695>

[^4]: <https://arxiv.org/abs/1911.11907>

[^5]: <https://arxiv.org/abs/1912.13200>

[^6]: <https://arxiv.org/abs/1811.03115>

[^7]: <https://www.together.ai/blog/medusa>

[^8]: <https://arxiv.org/abs/2307.08691>
