Commit b46e00e
Upload sections
1 parent 4f49c83 commit b46e00e

4 files changed: +296 -0 lines changed

Lines changed: 38 additions & 0 deletions
# Chapter Summary

1.  Model deployment is restricted by factors including the model size,
    runtime memory usage, inference latency, and inference power
    consumption.

2.  Models can be compressed using techniques such as quantization,
    pruning, and knowledge distillation in the offline phase. In
    addition, some model optimization techniques, such as operator
    fusion, can also reduce the model size, albeit to a lesser degree.
    (A minimal pruning sketch is shown after this list.)

3.  Runtime memory usage can be improved by optimizing the model size,
    the deployment framework size, and the runtime temporary memory
    usage. Methods for optimizing the model size were summarized above.
    Making the framework code simpler and more modular helps reduce the
    deployment framework size, and memory pooling can implement memory
    overcommitment to optimize the runtime temporary memory usage.

4.  Model inference latency can be optimized from two aspects. In the
    offline phase, the model computation workload can be reduced using
    model optimization and compression methods. Furthermore, improving
    inference parallelism and optimizing operator implementations can
    help maximize the utilization of the available computing power. In
    addition to the computation workload and computing power,
    consideration should also be given to the load/store overhead
    during inference.

5.  Power consumption during inference can be reduced through offline
    model optimization and compression technologies. By reducing the
    computational workload, these technologies also lower power
    consumption, which coincides with the optimization methods for
    inference latency.

6.  In addition to optimizing the factors related to model deployment,
    this chapter also discussed technologies for deployment security,
    such as model obfuscation and model encryption. Secure deployment
    protects the model assets of enterprises and prevents hackers from
    attacking the deployment environment by tampering with models.
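As a concrete illustration of the pruning mentioned in item 2, the following is a minimal sketch of magnitude-based weight pruning in NumPy. The function name and the 70% sparsity target are arbitrary choices for the example and are not tied to any particular deployment framework.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` of them are zero."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # Threshold = k-th smallest |w|; everything at or below it is pruned.
    threshold = np.partition(np.abs(weights), k - 1, axis=None)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

# Toy usage: prune 70% of a random fully connected layer's weights.
w = np.random.randn(256, 128).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.7)
print("zero fraction:", 1.0 - np.count_nonzero(w_pruned) / w_pruned.size)
```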

chapter_model_deployment/Index.md

Lines changed: 22 additions & 0 deletions
# Model Deployment {#ch:deploy}

In earlier chapters, we discussed the basic components of the machine
learning model training system. In this chapter, we look at the basics
of model deployment, a process whereby a trained model is deployed in a
runtime environment for inference. We explore the conversion from a
training model into an inference model, model compression methods that
adapt to hardware restrictions, model inference and performance
optimization, and model security protection.

The key aspects this chapter explores are as follows:

1.  Conversion and optimization from a training model to an inference
    model.

2.  Common methods for model compression: quantization, sparsification,
    and knowledge distillation.

3.  Model inference process and common methods for performance
    optimization.

4.  Common methods for model security protection.
Lines changed: 71 additions & 0 deletions
# Overview

After training a model, we need to save it and its parameters to files
to make them persistent. However, because different training frameworks
adopt different data structures for such files, the inference system
must support models trained using different training frameworks and
convert the data in the files into a unified data structure. During the
conversion from a training model to an inference model, optimization
operations such as operator fusion and constant folding can be
performed on the model to improve inference performance.
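As a concrete illustration of operator fusion, the sketch below folds a batch-normalization layer into the preceding convolution's weights and bias, so that only one operator remains at inference time. This is a minimal NumPy example under the usual assumption that the BN statistics are frozen after training; the variable and function names are illustrative.

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold frozen BatchNorm parameters into conv weights/bias.

    w: conv weights of shape (out_channels, in_channels, kh, kw)
    b: conv bias of shape (out_channels,)
    gamma, beta, mean, var: per-channel BN parameters, shape (out_channels,)
    """
    scale = gamma / np.sqrt(var + eps)            # per output channel
    w_fused = w * scale[:, None, None, None]      # rescale each output filter
    b_fused = (b - mean) * scale + beta           # shift the bias accordingly
    return w_fused, b_fused

# Sanity check on a 1x1 convolution, which is just a matrix multiply per pixel.
rng = np.random.default_rng(0)
out_c, in_c = 4, 3
w = rng.standard_normal((out_c, in_c, 1, 1)); b = rng.standard_normal(out_c)
gamma, beta = rng.standard_normal(out_c), rng.standard_normal(out_c)
mean, var = rng.standard_normal(out_c), rng.random(out_c) + 0.1
x = rng.standard_normal(in_c)

conv_out = w[:, :, 0, 0] @ x + b
bn_out = gamma * (conv_out - mean) / np.sqrt(var + 1e-5) + beta
w_f, b_f = fuse_conv_bn(w, b, gamma, beta, mean, var)
assert np.allclose(bn_out, w_f[:, :, 0, 0] @ x + b_f)  # fused == conv + BN
```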
The hardware restrictions of different production environments must be
considered when we deploy an inference model. For instance, a
large-scale model needs to be deployed on a server in a computing or
data center with strong computing power, whereas a mid-scale model
should be deployed on an edge server, PC, or smartphone, where
computing resources and memory are often limited. For simple,
small-scale models, ultra-low-power microcontrollers can be used. In
addition, different hardware supports different data types (such as
float32, float16, bfloat16, and int8). To adapt to these hardware
restrictions, a trained model may need to be compressed in order to
reduce its complexity, lower its data precision, or shrink its number
of parameters.
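To make the precision-reduction point concrete, the following is a minimal sketch of symmetric, per-tensor post-training int8 quantization of a float32 weight tensor in NumPy. The scheme shown here is only one of several common ones; function names are illustrative.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: float32 -> int8 plus a scale factor."""
    scale = np.max(np.abs(x)) / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs quantization error:", np.max(np.abs(w - w_hat)))  # roughly scale / 2
```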
Before a model can be used for inference, it needs to be deployed in
the runtime environment. To optimize model inference, which is
constrained by latency, memory usage, and power consumption, one
approach is to design chips dedicated to machine learning; such
dedicated chips usually outperform general-purpose ones in terms of
energy efficiency. Another approach is to fully leverage hardware
capabilities through software-hardware collaboration. Take a CPU as an
example. When designing and optimizing models for a specific CPU
architecture, we can divide data into blocks that fit the cache size,
rearrange data to facilitate contiguous memory access during
computation, reduce data dependencies to improve the parallelism of
hardware pipelines, and use extended instruction sets to improve
computing performance.
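The loop-blocking idea can be sketched as follows: a matrix multiplication is tiled so that each block of the operands is reused while it is resident in cache. The Python/NumPy version below only illustrates the access pattern (a production kernel would be written in C or assembly with vector intrinsics); the block size of 64 is an arbitrary choice for the example.

```python
import numpy as np

def blocked_matmul(a: np.ndarray, b: np.ndarray, block: int = 64) -> np.ndarray:
    """Tiled matrix multiplication: process (block x block) sub-matrices so each
    tile of a, b, and c can stay resident in cache while it is being reused."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for k0 in range(0, k, block):
                c[i0:i0 + block, j0:j0 + block] += (
                    a[i0:i0 + block, k0:k0 + block] @ b[k0:k0 + block, j0:j0 + block]
                )
    return c

a = np.random.rand(256, 192).astype(np.float32)
b = np.random.rand(192, 128).astype(np.float32)
assert np.allclose(blocked_matmul(a, b), a @ b, atol=1e-4)
```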
Because models are an important enterprise asset, their security must
be ensured after they are deployed in the runtime environment. This
chapter will discuss some of the common protection measures, using
model obfuscation as an example.

Some of the common methods used in the industry to address the
preceding challenges are as follows:

1.  **Model compression:** Technologies that reduce the model size and
    computation complexity by means of quantization and pruning. Such
    technologies can be categorized according to whether retraining is
    required.

2.  **Operator fusion:** Technologies that combine multiple operators
    into one by simplifying expressions and fusing attributes, aiming
    to reduce the computation complexity and size of the model.

3.  **Constant folding:** Forward computation of operators that meet
    certain conditions is completed in the offline phase, reducing the
    computation complexity and size of a model. This requires that the
    inputs of such operators be constants in the offline phase. (A toy
    constant-folding pass is sketched after this list.)

4.  **Data format:** Based on the operator library, hardware
    restrictions, and an exploration of the optimal data format for
    each layer of the network, data is rearranged or data-rearrangement
    operators are inserted in order to reduce the inference latency
    during model deployment.

5.  **Model obfuscation:** Network nodes or branches are added and
    operator names are changed in a trained model, so that it is
    difficult for attackers to understand the original model structure
    even if they steal the model. An obfuscated model can be executed
    directly in the deployment environment, thereby ensuring the
    security of the model during execution.
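The sketch below shows the idea behind constant folding on a toy graph representation: any node whose inputs are all constants is evaluated offline and replaced by a constant node. The graph format (a dict of nodes with op names and inputs) is invented purely for this illustration and is not the format of any particular framework.

```python
import numpy as np

# Toy graph: each node has an op, its input node names, and (for constants) a value.
graph = {
    "w":   {"op": "Const", "inputs": [], "value": np.array([2.0, 3.0])},
    "s":   {"op": "Const", "inputs": [], "value": np.array([10.0, 20.0])},
    "ws":  {"op": "Mul",   "inputs": ["w", "s"]},   # both inputs constant -> foldable
    "x":   {"op": "Input", "inputs": []},            # runtime input, not foldable
    "out": {"op": "Add",   "inputs": ["x", "ws"]},
}

EVAL = {"Mul": np.multiply, "Add": np.add}

def constant_fold(graph):
    """Repeatedly evaluate nodes whose inputs are all Const and turn them into Const."""
    changed = True
    while changed:
        changed = False
        for name, node in graph.items():
            if node["op"] in EVAL and all(
                graph[i]["op"] == "Const" for i in node["inputs"]
            ):
                value = EVAL[node["op"]](*(graph[i]["value"] for i in node["inputs"]))
                graph[name] = {"op": "Const", "inputs": [], "value": value}
                changed = True
    return graph

constant_fold(graph)
print(graph["ws"])  # now a Const node holding [20., 60.]; only Add remains at runtime
```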
Lines changed: 165 additions & 0 deletions
# Security Protection of Models

After training and optimizing models locally, AI service providers
deploy the models on third-party platforms (such as mobile devices,
edge devices, and cloud servers) to provide inference services. The
design and training of AI models require a large amount of time, data,
and computing power. This is why model and service providers must
protect the intellectual property of their models (including model
structures and parameters) from being stolen during transfer, storage,
and execution in the deployment phase.
13+
14+
The security protection of models can be divided into static protection
15+
and dynamic protection. Static protection refers to protecting models
16+
during transfer and storage. At present, it is widely implemented based
17+
on file encryption, in which AI model files are transferred and stored
18+
in ciphertext and are decrypted in the memory before being used for
19+
inference. However, throughout the inference process, models remain in
20+
plaintext in the memory, making it possible for theft. Dynamic
21+
protection refers to protecting models during runtime. Dynamic
22+
protection methods currently available can be classified into three
23+
categories. The first is trusted execution environment-based (TEE-based)
24+
protection. TEEs are usually secure zones isolated on trusted hardware,
25+
and AI model files are stored and transferred in non-secure zones and
26+
running after decryption in the secure zones. Although this method
27+
involves only a short inference latency on the CPU, it requires specific
28+
trusted hardware, making it difficult to implement. In addition, due to
29+
constraints on hardware resources, protecting large-scale deep models is
30+
difficult and heterogeneous hardware acceleration is still challenging.
31+
The second is a cryptographic computing-based protection, which ensures
32+
that models remain in ciphertext during transfer, storage, and running
33+
using cryptographic techniques (such as homomorphic encryption and
34+
secure multi-party computation). Although this method is free from
35+
hardware constraints, it has large computation or communications
36+
overheads and cannot protect model structure information. The third is
37+
obfuscation-based protection. This method scrambles the computational
38+
logic of models with fake nodes, so that attackers cannot understand the
39+
models even if they obtain them. Compared with the former two methods,
40+
obfuscation-based protection brings a smaller overhead to the
41+
performance and neglectable loss of accuracy. Furthermore, it is
42+
hardware-agnostic, and can support protection of very large models. We
43+
will focus on protection using the obfuscation-based method.
44+
45+
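As a minimal illustration of static protection by file encryption, the sketch below encrypts a serialized model file for transfer and storage and decrypts it in memory just before loading it for inference. It assumes the third-party `cryptography` package (its Fernet recipe provides authenticated symmetric encryption); the file name `model.bin` is hypothetical, and key management, which is the hard part in practice, is left out.

```python
from cryptography.fernet import Fernet  # pip install cryptography

# Offline: encrypt the serialized model before shipping it to the device.
key = Fernet.generate_key()            # must be provisioned securely to the device
cipher = Fernet(key)
with open("model.bin", "rb") as f:     # hypothetical serialized inference model
    ciphertext = cipher.encrypt(f.read())
with open("model.bin.enc", "wb") as f:
    f.write(ciphertext)

# On the device: decrypt only in memory, right before building the inference session.
with open("model.bin.enc", "rb") as f:
    plaintext_model = Fernet(key).decrypt(f.read())
# plaintext_model (bytes) would now be passed to the runtime's model loader;
# note that from this point on the model is in plaintext in process memory.
```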
## Model Obfuscation

Model obfuscation automatically obfuscates the computational logic of
plaintext AI models so that attackers cannot understand the models even
if they obtain them during transfer or storage. In addition, models can
run while still being obfuscated, thereby ensuring their
confidentiality at runtime. Obfuscation does not affect the inference
results and incurs only a small performance overhead.

![Procedure of model obfuscation](../img/ch08/model_obfuscate.png)
:label:`ch-deploy/model_obfuscate`

Figure :numref:`ch-deploy/model_obfuscate` depicts the model
obfuscation procedure, which is described as follows.
1.  **Interpret the given model into a computational graph:** Based on
    the structure of the trained model, interpret the model file into a
    graph expression (computational graph) of the model's computational
    logic for subsequent operations. The resulting computational graph
    contains information such as node identifiers, node operator types,
    node parameters, and the network structure.

2.  **Scramble the network structure of the computational graph[^1]:**
    Scramble the relationships between nodes in the computational graph
    using graph compression, graph augmentation, and other techniques
    in order to conceal the true computational logic. In graph
    compression, key subgraph structures are matched against the entire
    graph, and each matched subgraph is compressed and replaced with a
    single new computing node. Graph augmentation adds new input/output
    edges to the compressed graph in order to further conceal the
    dependencies between nodes. Such an edge either comes from or
    points to an existing node in the graph, or comes from or points to
    a new obfuscation node added in this step.

3.  **Anonymize nodes in the computational graph:** Traverse the
    computational graph processed in Step (2) and select the nodes to
    be protected. For a node to be protected, replace the node
    identifier, operator type, and other attributes that describe the
    computational logic of the model with non-semantic symbols. For
    node identifier anonymization, the anonymized identifier must
    remain unique in order to distinguish different nodes. For operator
    type anonymization, to avoid an explosion of operator types when
    anonymizing large-scale computational graphs, nodes with the same
    operator type can be divided into several disjoint sets, and the
    operator type of all nodes in the same set is replaced with the
    same symbol. Step (5) ensures that the model can still be
    identified and executed after node anonymization.
4.  **Scramble weights of the computational graph:** Add random noise
    and mapping functions to the weights to be protected. The random
    noise and mapping functions can vary from weight to weight. Step
    (6) ensures that the weight noise does not change the model
    execution result (a numeric sketch of this noise-and-inverse-mapping
    idea is given after the procedure). The computational graph
    processed in Steps (2), (3), and (4) is then saved as a model file
    for subsequent operations.

5.  **Transform operator interfaces:** Steps (5) and (6) transform the
    operators to be protected in order to generate candidate obfuscated
    operators. An original operator may correspond to multiple
    obfuscated operators; the number of candidates depends on how many
    sets the nodes are grouped into in Step (3). In this step, the
    operator interfaces are transformed based on the anonymized
    operator types and the operator input/output relationships obtained
    after Steps (2), (3), and (4). Such a transformation can be
    implemented by changing the inputs, outputs, or interface name.
    Changing the inputs and outputs involves modifying the input and
    output data, making the form of the obfuscated operator different
    from that of the original operator. The added data includes the
    data dependencies introduced by graph augmentation in Step (2) and
    the random noise introduced by weight scrambling in Step (4). The
    operator name is changed to the anonymized operator name obtained
    in Step (3), ensuring that the model can still be identified and
    executed after the nodes are anonymized and that the operator name
    does not reveal the computational logic.

6.  **Transform the operator implementation:** Transform the operator
    code implementation by encrypting strings, adding redundant code,
    and employing other code obfuscation techniques, keeping the
    computational logic of the obfuscated operator consistent with that
    of the original operator while making it more difficult to
    understand. A combination of different code obfuscation techniques
    may be applied to different operators. In addition to such
    equivalent code transformations, the obfuscated operators implement
    some additional computational logic. For example, because noise was
    added to the weights of an operator in Step (4), the corresponding
    obfuscated operator also implements the inverse mapping function of
    that weight noise, dynamically eliminating the noise during
    operator execution and ensuring that the computation result is the
    same as that of the original model. The generated obfuscated
    operators are then saved as a library file for subsequent
    operations.
7.  **Deploy the model and operator library:** Deploy the obfuscated
    model and the corresponding operator library file on the target
    device.

8.  **Load the obfuscated model:** Parse the obfuscated model file to
    obtain the graph expression of the model's computational logic,
    that is, the obfuscated computational graph produced by Steps (2),
    (3), and (4).

9.  **Initialize the computational graph:** Initialize the
    computational graph to generate an execution task sequence,
    according to the security configuration options. If the model must
    remain protected at runtime, the obfuscated graph is initialized
    directly, and each compute unit in the resulting sequence
    corresponds to the execution of one obfuscated operator or original
    operator. If protection is required only during model transfer and
    storage, the obfuscated graph is first restored in memory to the
    source graph, which is then initialized, so that each compute unit
    corresponds to the execution of an original operator. In this way,
    performance overheads during inference can be further reduced.

10. **Execute inference tasks:** The model executes the compute units
    sequentially on the input of the AI application in order to obtain
    an inference result. If a compute unit corresponds to an obfuscated
    operator, the obfuscated operator library is invoked; otherwise,
    the original operator library is invoked.
[^1]: Scrambling refers to adding noise to the computational graph.
    Common methods include adding redundant nodes and edges and merging
    some subgraphs.
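The weight scrambling of Step (4) and the inverse mapping of Step (6) can be illustrated numerically as follows. This is a deliberately simplified sketch: a per-weight affine mapping stands in for the "random noise and mapping functions", and the obfuscated operator applies its inverse before computing, so the inference result is unchanged. Real obfuscation schemes use more elaborate transformations; the names here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)

def scramble_weights(w):
    """Step (4): store an affine-scrambled copy of the weights, w' = a * w + b."""
    a = rng.uniform(0.5, 2.0, size=w.shape)   # per-weight scale noise
    b = rng.normal(size=w.shape)              # per-weight offset noise
    return a * w + b, (a, b)                  # (a, b) is embedded in the obfuscated operator

def obfuscated_matmul(x, w_scrambled, noise):
    """Step (6): the obfuscated operator inverts the mapping on the fly, w = (w' - b) / a."""
    a, b = noise
    return x @ ((w_scrambled - b) / a)

w = rng.standard_normal((8, 4))               # original weights of a small dense layer
x = rng.standard_normal((3, 8))               # a batch of inputs
w_scrambled, noise = scramble_weights(w)

# The stored model only contains w_scrambled, which is useless without the inverse
# mapping, yet the obfuscated operator reproduces the original result (up to rounding).
assert np.allclose(obfuscated_matmul(x, w_scrambled, noise), x @ w)
```

In an actual deployment, the mapping parameters would be hidden inside the obfuscated operator's code produced in Step (6) rather than stored alongside the model file.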
