Commit b46e00e
Upload sections
1 parent 4f49c83 commit b46e00e

4 files changed: +296 -0 lines changed

Lines changed: 38 additions & 0 deletions
# Chapter Summary

1.  Model deployment is restricted by factors including the model size,
    runtime memory usage, inference latency, and inference power
    consumption.

2.  Models can be compressed using techniques such as quantization,
    pruning, and knowledge distillation in the offline phase. In
    addition, some model optimization techniques, such as operator
    fusion, can also reduce the model size, albeit to a lesser degree.
    (A minimal pruning sketch is shown after this list.)

3.  Runtime memory usage can be improved by optimizing the model size,
    the deployment framework size, and the runtime temporary memory
    usage. Methods for optimizing the model size were summarized above.
    Making the framework code simpler and more modular helps reduce the
    deployment framework size, and memory pooling can implement memory
    overcommitment to optimize the runtime temporary memory usage.

4.  Model inference latency can be optimized from two aspects. In the
    offline phase, the model computation workload can be reduced using
    model optimization and compression methods. Furthermore, improving
    inference parallelism and optimizing operator implementations can
    help maximize the utilization of the available computing power. In
    addition to the computation workload and computing power,
    consideration should also be given to the load/store overhead
    during inference.

5.  Power consumption during inference can be reduced through offline
    model optimization and compression technologies. By reducing the
    computational workload, these technologies also lower power
    consumption, which coincides with the optimization methods for
    inference latency.

6.  In addition to optimizing the factors related to model deployment,
    this chapter also discussed technologies for deployment security,
    such as model obfuscation and model encryption. Secure deployment
    protects the model assets of enterprises and prevents hackers from
    attacking the deployment environment by tampering with models.
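As a concrete illustration of the pruning mentioned in item 2, the following is a minimal sketch of magnitude-based weight pruning in NumPy. The function name and the 70% sparsity target are arbitrary choices for the example and are not tied to any particular deployment framework.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` of them are zero."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # Threshold = k-th smallest |w|; everything at or below it is pruned.
    threshold = np.partition(np.abs(weights), k - 1, axis=None)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

# Toy usage: prune 70% of a random fully connected layer's weights.
w = np.random.randn(256, 128).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.7)
print("zero fraction:", 1.0 - np.count_nonzero(w_pruned) / w_pruned.size)
```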

chapter_model_deployment/Index.md

Lines changed: 22 additions & 0 deletions
# Model Deployment {#ch:deploy}

In earlier chapters, we discussed the basic components of the machine
learning model training system. In this chapter, we look at the basics
of model deployment, a process whereby a trained model is deployed in a
runtime environment for inference. We explore the conversion from a
training model into an inference model, model compression methods that
adapt to hardware restrictions, model inference and performance
optimization, and model security protection.

The key aspects this chapter explores are as follows:

1.  Conversion and optimization from a training model to an inference
    model.

2.  Common methods for model compression: quantization, sparsification,
    and knowledge distillation.

3.  Model inference process and common methods for performance
    optimization.

4.  Common methods for model security protection.
Lines changed: 71 additions & 0 deletions
# Overview

After training a model, we need to save it and its parameters to files
to make them persistent. However, because different training frameworks
adopt different data structures for such files, the inference system
must support models trained using different training frameworks and
convert the data in the files into a unified data structure. During the
conversion from a training model to an inference model, optimization
operations such as operator fusion and constant folding can be
performed on the model to improve inference performance.
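As a concrete illustration of operator fusion, the sketch below folds a batch-normalization layer into the preceding convolution's weights and bias, so that only one operator remains at inference time. This is a minimal NumPy example under the usual assumption that the BN statistics are frozen after training; the variable and function names are illustrative.

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold frozen BatchNorm parameters into conv weights/bias.

    w: conv weights of shape (out_channels, in_channels, kh, kw)
    b: conv bias of shape (out_channels,)
    gamma, beta, mean, var: per-channel BN parameters, shape (out_channels,)
    """
    scale = gamma / np.sqrt(var + eps)            # per output channel
    w_fused = w * scale[:, None, None, None]      # rescale each output filter
    b_fused = (b - mean) * scale + beta           # shift the bias accordingly
    return w_fused, b_fused

# Sanity check on a 1x1 convolution, which is just a matrix multiply per pixel.
rng = np.random.default_rng(0)
out_c, in_c = 4, 3
w = rng.standard_normal((out_c, in_c, 1, 1)); b = rng.standard_normal(out_c)
gamma, beta = rng.standard_normal(out_c), rng.standard_normal(out_c)
mean, var = rng.standard_normal(out_c), rng.random(out_c) + 0.1
x = rng.standard_normal(in_c)

conv_out = w[:, :, 0, 0] @ x + b
bn_out = gamma * (conv_out - mean) / np.sqrt(var + 1e-5) + beta
w_f, b_f = fuse_conv_bn(w, b, gamma, beta, mean, var)
assert np.allclose(bn_out, w_f[:, :, 0, 0] @ x + b_f)  # fused == conv + BN
```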
The hardware restrictions of different production environments must be
considered when we deploy an inference model. For instance, a
large-scale model needs to be deployed on a server in a computing or
data center with strong computing power, whereas a mid-scale model
should be deployed on an edge server, PC, or smartphone, where
computing resources and memory are often limited. For simple,
small-scale models, ultra-low-power microcontrollers can be used. In
addition, different hardware supports different data types (such as
float32, float16, bfloat16, and int8). To adapt to these hardware
restrictions, a trained model may need to be compressed in order to
reduce its complexity, lower its data precision, or shrink its number
of parameters.
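To make the precision-reduction point concrete, the following is a minimal sketch of symmetric, per-tensor post-training int8 quantization of a float32 weight tensor in NumPy. The scheme shown here is only one of several common ones; function names are illustrative.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: float32 -> int8 plus a scale factor."""
    scale = np.max(np.abs(x)) / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs quantization error:", np.max(np.abs(w - w_hat)))  # roughly scale / 2
```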
Before a model can be used for inference, it needs to be deployed in
the runtime environment. To optimize model inference, which is
constrained by latency, memory usage, and power consumption, one
approach is to design chips dedicated to machine learning; such
dedicated chips usually outperform general-purpose ones in terms of
energy efficiency. Another approach is to fully leverage hardware
capabilities through software-hardware collaboration. Take a CPU as an
example. When designing and optimizing models for a specific CPU
architecture, we can divide data into blocks that fit the cache size,
rearrange data to facilitate contiguous memory access during
computation, reduce data dependencies to improve the parallelism of
hardware pipelines, and use extended instruction sets to improve
computing performance.
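The loop-blocking idea can be sketched as follows: a matrix multiplication is tiled so that each block of the operands is reused while it is resident in cache. The Python/NumPy version below only illustrates the access pattern (a production kernel would be written in C or assembly with vector intrinsics); the block size of 64 is an arbitrary choice for the example.

```python
import numpy as np

def blocked_matmul(a: np.ndarray, b: np.ndarray, block: int = 64) -> np.ndarray:
    """Tiled matrix multiplication: process (block x block) sub-matrices so each
    tile of a, b, and c can stay resident in cache while it is being reused."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for k0 in range(0, k, block):
                c[i0:i0 + block, j0:j0 + block] += (
                    a[i0:i0 + block, k0:k0 + block] @ b[k0:k0 + block, j0:j0 + block]
                )
    return c

a = np.random.rand(256, 192).astype(np.float32)
b = np.random.rand(192, 128).astype(np.float32)
assert np.allclose(blocked_matmul(a, b), a @ b, atol=1e-4)
```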
Because models are an important enterprise asset, their security must
be ensured after they are deployed in the runtime environment. This
chapter will discuss some of the common protection measures, using
model obfuscation as an example.

Some of the common methods used in the industry to address the
preceding challenges are as follows:

1.  **Model compression:** Technologies that reduce the model size and
    computation complexity by means of quantization and pruning. Such
    technologies can be categorized according to whether retraining is
    required.

2.  **Operator fusion:** Technologies that combine multiple operators
    into one by simplifying expressions and fusing attributes, aiming
    to reduce the computation complexity and size of the model.

3.  **Constant folding:** Forward computation of operators that meet
    certain conditions is completed in the offline phase, reducing the
    computation complexity and size of a model. This requires that the
    inputs of such operators be constants in the offline phase. (A toy
    constant-folding pass is sketched after this list.)

4.  **Data format:** Based on the operator library, hardware
    restrictions, and an exploration of the optimal data format for
    each layer of the network, data is rearranged or data-rearrangement
    operators are inserted in order to reduce the inference latency
    during model deployment.

5.  **Model obfuscation:** Network nodes or branches are added and
    operator names are changed in a trained model, so that it is
    difficult for attackers to understand the original model structure
    even if they steal the model. An obfuscated model can be executed
    directly in the deployment environment, thereby ensuring the
    security of the model during execution.
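The sketch below shows the idea behind constant folding on a toy graph representation: any node whose inputs are all constants is evaluated offline and replaced by a constant node. The graph format (a dict of nodes with op names and inputs) is invented purely for this illustration and is not the format of any particular framework.

```python
import numpy as np

# Toy graph: each node has an op, its input node names, and (for constants) a value.
graph = {
    "w":   {"op": "Const", "inputs": [], "value": np.array([2.0, 3.0])},
    "s":   {"op": "Const", "inputs": [], "value": np.array([10.0, 20.0])},
    "ws":  {"op": "Mul",   "inputs": ["w", "s"]},   # both inputs constant -> foldable
    "x":   {"op": "Input", "inputs": []},            # runtime input, not foldable
    "out": {"op": "Add",   "inputs": ["x", "ws"]},
}

EVAL = {"Mul": np.multiply, "Add": np.add}

def constant_fold(graph):
    """Repeatedly evaluate nodes whose inputs are all Const and turn them into Const."""
    changed = True
    while changed:
        changed = False
        for name, node in graph.items():
            if node["op"] in EVAL and all(
                graph[i]["op"] == "Const" for i in node["inputs"]
            ):
                value = EVAL[node["op"]](*(graph[i]["value"] for i in node["inputs"]))
                graph[name] = {"op": "Const", "inputs": [], "value": value}
                changed = True
    return graph

constant_fold(graph)
print(graph["ws"])  # now a Const node holding [20., 60.]; only Add remains at runtime
```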
Lines changed: 165 additions & 0 deletions
# Security Protection of Models

After training and optimizing models locally, AI service providers
deploy the models on third-party platforms (such as mobile devices,
edge devices, and cloud servers) to provide inference services. The
design and training of AI models require a large amount of time, data,
and computing power. This is why model and service providers must
protect the intellectual property of their models (including model
structures and parameters) from being stolen during transfer, storage,
and execution in the deployment phase.
13+
14+
The security protection of models can be divided into static protection
15+
and dynamic protection. Static protection refers to protecting models
16+
during transfer and storage. At present, it is widely implemented based
17+
on file encryption, in which AI model files are transferred and stored
18+
in ciphertext and are decrypted in the memory before being used for
19+
inference. However, throughout the inference process, models remain in
20+
plaintext in the memory, making it possible for theft. Dynamic
21+
protection refers to protecting models during runtime. Dynamic
22+
protection methods currently available can be classified into three
23+
categories. The first is trusted execution environment-based (TEE-based)
24+
protection. TEEs are usually secure zones isolated on trusted hardware,
25+
and AI model files are stored and transferred in non-secure zones and
26+
running after decryption in the secure zones. Although this method
27+
involves only a short inference latency on the CPU, it requires specific
28+
trusted hardware, making it difficult to implement. In addition, due to
29+
constraints on hardware resources, protecting large-scale deep models is
30+
difficult and heterogeneous hardware acceleration is still challenging.
31+
The second is a cryptographic computing-based protection, which ensures
32+
that models remain in ciphertext during transfer, storage, and running
33+
using cryptographic techniques (such as homomorphic encryption and
34+
secure multi-party computation). Although this method is free from
35+
hardware constraints, it has large computation or communications
36+
overheads and cannot protect model structure information. The third is
37+
obfuscation-based protection. This method scrambles the computational
38+
logic of models with fake nodes, so that attackers cannot understand the
39+
models even if they obtain them. Compared with the former two methods,
40+
obfuscation-based protection brings a smaller overhead to the
41+
performance and neglectable loss of accuracy. Furthermore, it is
42+
hardware-agnostic, and can support protection of very large models. We
43+
will focus on protection using the obfuscation-based method.
44+
45+
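As a minimal illustration of static protection by file encryption, the sketch below encrypts a serialized model file for transfer and storage and decrypts it in memory just before loading it for inference. It assumes the third-party `cryptography` package (its Fernet recipe provides authenticated symmetric encryption); the file name `model.bin` is hypothetical, and key management, which is the hard part in practice, is left out.

```python
from cryptography.fernet import Fernet  # pip install cryptography

# Offline: encrypt the serialized model before shipping it to the device.
key = Fernet.generate_key()            # must be provisioned securely to the device
cipher = Fernet(key)
with open("model.bin", "rb") as f:     # hypothetical serialized inference model
    ciphertext = cipher.encrypt(f.read())
with open("model.bin.enc", "wb") as f:
    f.write(ciphertext)

# On the device: decrypt only in memory, right before building the inference session.
with open("model.bin.enc", "rb") as f:
    plaintext_model = Fernet(key).decrypt(f.read())
# plaintext_model (bytes) would now be passed to the runtime's model loader;
# note that from this point on the model is in plaintext in process memory.
```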
## Model Obfuscation

Model obfuscation automatically obfuscates the computational logic of
plaintext AI models so that attackers cannot understand the models even
if they obtain them during transfer or storage. In addition, models can
run while still being obfuscated, thereby ensuring their
confidentiality at runtime. Obfuscation does not affect the inference
results and incurs only a small performance overhead.

![Procedure of model obfuscation](../img/ch08/model_obfuscate.png)
:label:`ch-deploy/model_obfuscate`

Figure :numref:`ch-deploy/model_obfuscate` depicts the model
obfuscation procedure, which is described as follows.
1.  **Interpret the given model into a computational graph:** Based on
    the structure of the trained model, interpret the model file into a
    graph expression (computational graph) of the model's computational
    logic for subsequent operations. The resulting computational graph
    contains information such as node identifiers, node operator types,
    node parameters, and the network structure.

2.  **Scramble the network structure of the computational graph[^1]:**
    Scramble the relationships between nodes in the computational graph
    using graph compression, graph augmentation, and other techniques
    in order to conceal the true computational logic. In graph
    compression, key subgraph structures are matched against the entire
    graph, and each matched subgraph is compressed and replaced with a
    single new computing node. Graph augmentation adds new input/output
    edges to the compressed graph in order to further conceal the
    dependencies between nodes. Such an edge either comes from or
    points to an existing node in the graph, or comes from or points to
    a new obfuscation node added in this step.

3.  **Anonymize nodes in the computational graph:** Traverse the
    computational graph processed in Step (2) and select the nodes to
    be protected. For a node to be protected, replace the node
    identifier, operator type, and other attributes that describe the
    computational logic of the model with non-semantic symbols. For
    node identifier anonymization, the anonymized identifier must
    remain unique in order to distinguish different nodes. For operator
    type anonymization, to avoid an explosion of operator types when
    anonymizing large-scale computational graphs, nodes with the same
    operator type can be divided into several disjoint sets, and the
    operator type of all nodes in the same set is replaced with the
    same symbol. Step (5) ensures that the model can still be
    identified and executed after node anonymization.
4.  **Scramble weights of the computational graph:** Add random noise
    and mapping functions to the weights to be protected. The random
    noise and mapping functions can vary from weight to weight. Step
    (6) ensures that the weight noise does not change the model
    execution result (a numeric sketch of this noise-and-inverse-mapping
    idea is given after the procedure). The computational graph
    processed in Steps (2), (3), and (4) is then saved as a model file
    for subsequent operations.

5.  **Transform operator interfaces:** Steps (5) and (6) transform the
    operators to be protected in order to generate candidate obfuscated
    operators. An original operator may correspond to multiple
    obfuscated operators; the number of candidates depends on how many
    sets the nodes are grouped into in Step (3). In this step, the
    operator interfaces are transformed based on the anonymized
    operator types and the operator input/output relationships obtained
    after Steps (2), (3), and (4). Such a transformation can be
    implemented by changing the inputs, outputs, or interface name.
    Changing the inputs and outputs involves modifying the input and
    output data, making the form of the obfuscated operator different
    from that of the original operator. The added data includes the
    data dependencies introduced by graph augmentation in Step (2) and
    the random noise introduced by weight scrambling in Step (4). The
    operator name is changed to the anonymized operator name obtained
    in Step (3), ensuring that the model can still be identified and
    executed after the nodes are anonymized and that the operator name
    does not reveal the computational logic.

6.  **Transform the operator implementation:** Transform the operator
    code implementation by encrypting strings, adding redundant code,
    and employing other code obfuscation techniques, keeping the
    computational logic of the obfuscated operator consistent with that
    of the original operator while making it more difficult to
    understand. A combination of different code obfuscation techniques
    may be applied to different operators. In addition to such
    equivalent code transformations, the obfuscated operators implement
    some additional computational logic. For example, because noise was
    added to the weights of an operator in Step (4), the corresponding
    obfuscated operator also implements the inverse mapping function of
    that weight noise, dynamically eliminating the noise during
    operator execution and ensuring that the computation result is the
    same as that of the original model. The generated obfuscated
    operators are then saved as a library file for subsequent
    operations.
7.  **Deploy the model and operator library:** Deploy the obfuscated
    model and the corresponding operator library file on the target
    device.

8.  **Load the obfuscated model:** Parse the obfuscated model file to
    obtain the graph expression of the model's computational logic,
    that is, the obfuscated computational graph produced by Steps (2),
    (3), and (4).

9.  **Initialize the computational graph:** Initialize the
    computational graph to generate an execution task sequence,
    according to the security configuration options. If the model must
    remain protected at runtime, the obfuscated graph is initialized
    directly, and each compute unit in the resulting sequence
    corresponds to the execution of one obfuscated operator or original
    operator. If protection is required only during model transfer and
    storage, the obfuscated graph is first restored in memory to the
    source graph, which is then initialized, so that each compute unit
    corresponds to the execution of an original operator. In this way,
    performance overheads during inference can be further reduced.

10. **Execute inference tasks:** The model executes the compute units
    sequentially on the input of the AI application in order to obtain
    an inference result. If a compute unit corresponds to an obfuscated
    operator, the obfuscated operator library is invoked; otherwise,
    the original operator library is invoked.
[^1]: Scrambling refers to adding noise to the computational graph.
    Common methods include adding redundant nodes and edges and merging
    some subgraphs.
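The weight scrambling of Step (4) and the inverse mapping of Step (6) can be illustrated numerically as follows. This is a deliberately simplified sketch: a per-weight affine mapping stands in for the "random noise and mapping functions", and the obfuscated operator applies its inverse before computing, so the inference result is unchanged. Real obfuscation schemes use more elaborate transformations; the names here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)

def scramble_weights(w):
    """Step (4): store an affine-scrambled copy of the weights, w' = a * w + b."""
    a = rng.uniform(0.5, 2.0, size=w.shape)   # per-weight scale noise
    b = rng.normal(size=w.shape)              # per-weight offset noise
    return a * w + b, (a, b)                  # (a, b) is embedded in the obfuscated operator

def obfuscated_matmul(x, w_scrambled, noise):
    """Step (6): the obfuscated operator inverts the mapping on the fly, w = (w' - b) / a."""
    a, b = noise
    return x @ ((w_scrambled - b) / a)

w = rng.standard_normal((8, 4))               # original weights of a small dense layer
x = rng.standard_normal((3, 8))               # a batch of inputs
w_scrambled, noise = scramble_weights(w)

# The stored model only contains w_scrambled, which is useless without the inverse
# mapping, yet the obfuscated operator reproduces the original result (up to rounding).
assert np.allclose(obfuscated_matmul(x, w_scrambled, noise), x @ w)
```

In an actual deployment, the mapping parameters would be hidden inside the obfuscated operator's code produced in Step (6) rather than stored alongside the model file.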
