# Model Inference

After conversion and compression, a trained model needs to be deployed on the computation hardware in order to execute inference. Such execution involves the following steps:

1. Preprocessing: Process raw data to suit the network input.

2. Inference execution: Deploy the model resulting from offline conversion on the device to execute inference and compute the output based on the input.

3. Postprocessing: Further process the output of the model, for example, by threshold filtering.

## Preprocessing and Postprocessing

**1. Preprocessing**

Raw data, such as images, voices, and texts, is so disordered that machine learning models cannot identify or extract useful information from it. Preprocessing is intended to convert such data into tensors that machine learning networks can work with, eliminate irrelevant information, restore useful information, enhance the detectability of relevant information, and simplify the data as much as possible. In this way, the reliability of feature extraction, image segmentation, matching, and recognition performed by the models can be improved.

The following techniques are often used in data preprocessing:

1. Feature encoding: Encode the raw data that describes features into numbers before feeding it to machine learning models, which can process only numerical values. Common encoding approaches include discretization, ordinal encoding, one-hot encoding, and binary encoding.

2. Normalization: Modify features so that they are on the same scale without changing the correlation between them, eliminating the impact of differing dimensions between data indicators. Common approaches include Min-Max normalization, which normalizes the data range, and Z-score normalization, which normalizes the data distribution (see the sketch after this list).

3. Outlier processing: An outlier is a data point that is distant from all others in the distribution. Eliminating outliers can improve the accuracy of a model.
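
The following C++ sketch illustrates the two normalization approaches mentioned above. The function names and the choice of operating on a single feature column stored in a `std::vector<float>` are illustrative assumptions, not part of any particular framework.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Min-Max normalization: rescale one feature column into the range [0, 1].
std::vector<float> MinMaxNormalize(const std::vector<float>& x) {
  const auto [min_it, max_it] = std::minmax_element(x.begin(), x.end());
  const float range = *max_it - *min_it;
  std::vector<float> out(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    out[i] = range > 0.0f ? (x[i] - *min_it) / range : 0.0f;
  }
  return out;
}

// Z-score normalization: shift to zero mean and scale to unit variance.
std::vector<float> ZScoreNormalize(const std::vector<float>& x) {
  const float mean = std::accumulate(x.begin(), x.end(), 0.0f) / x.size();
  float var = 0.0f;
  for (float v : x) var += (v - mean) * (v - mean);
  const float stddev = std::sqrt(var / x.size());
  std::vector<float> out(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    out[i] = stddev > 0.0f ? (x[i] - mean) / stddev : 0.0f;
  }
  return out;
}
```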

**2. Postprocessing**

After model inference, the output data is passed to the user for postprocessing. Common postprocessing techniques include:

1. Discretization of continuous data: Suppose we want to predict discrete data, such as the quantity of a good, using a model, but a regression model only provides continuous prediction values, which therefore have to be rounded or bounded (as in the sketch after this list).

2. Data visualization: This technique uses graphics and tables to represent data so that we can find relationships in the data and select an analysis strategy accordingly.

3. Prediction range widening: Most values predicted by a regression model are concentrated around the center of the range, and few fall in the tails. For example, abnormal values of hospital laboratory data are used to diagnose diseases. To increase the accuracy of such predictions, we can enlarge the values in both tails by widening the prediction range, multiplying the values that deviate from the normal range by a coefficient.
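
As a minimal sketch of the first technique, the helper below rounds a continuous regression output to a discrete, non-negative quantity; the function name and range bounds are hypothetical.

```cpp
#include <algorithm>
#include <cmath>

// Map a continuous regression output to a discrete quantity by rounding
// to the nearest integer and clamping to a plausible range.
int DiscretizeQuantity(float prediction, int min_qty = 0, int max_qty = 1000) {
  const int rounded = static_cast<int>(std::lround(prediction));
  return std::clamp(rounded, min_qty, max_qty);
}
```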

## Parallel Computing
:label:`ch-deploy/parallel-inference`

Most inference frameworks have a multi-thread mechanism that leverages the capabilities of multiple cores in order to improve performance. In this mechanism, the input data of an operator is partitioned, and multiple threads process different data partitions. This allows the operator to be computed in parallel, thereby multiplying the operator performance.

![Data partitioning for matrix multiplication](../img/ch08/ch09-parallel.png)
:label:`ch09_parallel`

In Figure :numref:`ch09_parallel`, the matrix multiplication can be partitioned according to the rows of matrix A. Three threads can then be used to compute A1 \* B, A2 \* B, and A3 \* B (one thread per computation), implementing multi-thread parallel execution of the matrix multiplication.
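
A minimal sketch of this row-partitioning scheme using `std::thread` is shown below; the matrix layout (row-major), the function name, and the default thread count of three are illustrative assumptions.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Multiply an M x K matrix A by a K x N matrix B (both row-major),
// splitting the rows of A evenly across `num_threads` worker threads.
void ParallelMatMul(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int M, int K, int N,
                    int num_threads = 3) {
  auto worker = [&](int row_begin, int row_end) {
    for (int i = row_begin; i < row_end; ++i) {
      for (int j = 0; j < N; ++j) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
        C[i * N + j] = acc;
      }
    }
  };
  std::vector<std::thread> threads;
  const int rows_per_thread = (M + num_threads - 1) / num_threads;
  for (int t = 0; t < num_threads; ++t) {
    const int begin = t * rows_per_thread;
    const int end = std::min(M, begin + rows_per_thread);
    if (begin < end) threads.emplace_back(worker, begin, end);
  }
  for (auto& th : threads) th.join();  // wait for all partitions to finish
}
```

Each thread writes to a disjoint set of rows of C, so no synchronization beyond the final join is required.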

To facilitate parallel computing of operators and avoid the overhead of frequently creating and destroying threads, inference frameworks usually provide a thread pooling mechanism. There are two common practices:

1. Open Multi-Processing (OpenMP) API: OpenMP is a cross-platform API that supports shared-memory concurrency. It provides interfaces that are commonly used to implement operator parallelism; an example is the `parallel for` directive, which allows the iterations of a `for` loop to be executed concurrently by multiple threads (see the sketch after this list).

2. Framework-provided thread pools: Compared with the OpenMP interfaces, such pools are more lightweight and more targeted at the AI domain, and can therefore deliver better performance.
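
As a minimal sketch of the OpenMP approach, the element-wise addition below splits its loop iterations across threads with the `parallel for` directive; the function name is illustrative, and the code assumes a compiler with OpenMP support (e.g., built with `-fopenmp`).

```cpp
#include <vector>

// The OpenMP runtime divides the iteration space [0, n) among the threads
// in its pool; each thread processes a disjoint chunk of the loop.
void AddVectors(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& out) {
  const int n = static_cast<int>(out.size());
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}
```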

## Operator Optimization
:label:`ch-deploy/kernel-optimization`

When deploying an AI model, we want model training and inference to be performed as fast as possible in order to obtain better performance. For a deep learning network, the scheduling performed by the framework takes only a short period of time, whereas operator execution is often the performance bottleneck. This section introduces how to optimize operators from the perspectives of hardware instructions and algorithms.

**1. Hardware instruction optimization**

Given that most devices have CPUs, the time that CPUs spend processing operators has a direct impact on performance. Here we look at methods for optimizing hardware instructions on ARM CPUs.

**1) Assembly language**

High-level programming languages such as C++ and Java are compiled into machine instruction sequences by compilers, and the compiler therefore has a direct influence on how well these languages can exploit the hardware. Assembly languages are close to machine code and can express any instruction sequence in a one-to-one manner. Programs written in assembly languages occupy less memory and are faster and more efficient than those written in high-level languages.

In order to exploit the advantages of both types of languages, we can write the parts of a program that require the best performance in assembly language and the remaining parts in high-level languages. Because the convolution and matrix multiplication operators in deep learning involve a large amount of computation, writing the code that performs such computation in assembly language can improve model training and inference performance by dozens or even hundreds of times.

Next, we use ARMv8 CPUs to illustrate optimization related to hardware instructions.

**2) Registers and NEON instructions**

Each ARMv8 CPU has 32 NEON registers, v0 to v31. As shown in Figure :numref:`ch-deploy/register`, the NEON register v0 can hold 128 bits of data, which corresponds to 4 float32, 8 float16, or 16 int8 values.

![Structure of the NEON register v0 of an ARMv8 CPU](../img/ch08/ch09-register.png)
:label:`ch-deploy/register`

The single instruction, multiple data (SIMD) method can be used to improve the speed of data access and computation on this CPU. Compared with single instruction, single data (SISD) processing, a NEON instruction can process multiple data values held in a NEON register at a time. For example, the `fmla` instruction for floating-point data is used as `fmla v0.4s, v1.4s, v2.4s`. As depicted in Figure :numref:`ch-deploy/fmla`, the products of the corresponding floating-point values in registers v1 and v2 are added to the values in v0.

![fmla instruction computing](../img/ch08/ch09-fmla.png)
:label:`ch-deploy/fmla`
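
The same fused multiply-accumulate can be reached from C++ through NEON intrinsics, as in the minimal sketch below; the function name is illustrative, and compiling it requires an AArch64 toolchain providing `arm_neon.h`.

```cpp
#include <arm_neon.h>

// Multiply-accumulate four float32 lanes at once: acc[i] += a[i] * b[i].
// With optimization enabled, vfmaq_f32 typically lowers to an
// `fmla v0.4s, v1.4s, v2.4s`-style instruction.
void FmlaExample(const float* a, const float* b, float* acc) {
  float32x4_t va = vld1q_f32(a);      // load 4 floats from a
  float32x4_t vb = vld1q_f32(b);      // load 4 floats from b
  float32x4_t vacc = vld1q_f32(acc);  // load the 4 accumulator values
  vacc = vfmaq_f32(vacc, va, vb);     // vacc = vacc + va * vb, per lane
  vst1q_f32(acc, vacc);               // store the result back
}
```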

**3) Assembly language optimization**

For assembly language programs with known functionality, the computational instructions are usually fixed. In this case, non-computational instructions are the main source of performance bottlenecks. The structure of computer storage devices resembles a pyramid, as shown in Figure :numref:`ch-deploy/fusion-storage`: the top layer has the fastest speed but the smallest space, and conversely, the bottom layer has the largest space but the slowest speed. L1 to L3 are referred to as caches. When accessing data, the CPU first attempts to read it from one of its caches; if the data is not found there, the CPU accesses the external main memory. The cache hit rate measures the proportion of accesses that are served from the cache, and it must be maximized to improve program performance.
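
To make the effect of the cache hit rate concrete, the following sketch contrasts two traversal orders over the same row-major matrix; the function names are illustrative.

```cpp
#include <vector>

// Summing a row-major matrix. Traversing along rows touches contiguous
// memory, so most accesses hit the cache; traversing along columns jumps
// by `cols` elements at each step and misses far more often.
float SumRowOrder(const std::vector<float>& m, int rows, int cols) {
  float sum = 0.0f;
  for (int i = 0; i < rows; ++i)
    for (int j = 0; j < cols; ++j)
      sum += m[i * cols + j];  // contiguous, cache-friendly
  return sum;
}

float SumColumnOrder(const std::vector<float>& m, int rows, int cols) {
  float sum = 0.0f;
  for (int j = 0; j < cols; ++j)
    for (int i = 0; i < rows; ++i)
      sum += m[i * cols + j];  // strided, cache-unfriendly
  return sum;
}
```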

There are several techniques for improving the cache hit rate and optimizing assembly performance:

1. Loop unrolling: Use as many registers as possible to achieve better performance, at the cost of a larger code size (see the sketch after this list).

2. Instruction reordering: Reorder instructions that use different execution units to improve pipeline utilization, allowing instructions that incur latency to be issued earlier. In addition to hiding latency, this also reduces data dependencies between adjacent instructions.

3. Register blocking: Block the NEON registers appropriately to reduce the number of idle registers and reuse registers as much as possible.

4. Data rearrangement: Rearrange the computational data to ensure contiguous memory reads and writes and improve the cache hit rate.

5. Instruction prefetching: Load the required data from the main memory into the cache in advance to reduce the access latency.
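
As a minimal sketch of loop unrolling on NEON, the dot product below processes 16 floats per iteration and keeps four independent accumulator registers live; the function name and the assumption that `n` is a multiple of 16 are illustrative.

```cpp
#include <arm_neon.h>

// Dot product unrolled by a factor of four: four independent NEON
// accumulators keep the floating-point pipeline busy. A real kernel
// would also handle the tail when n is not a multiple of 16.
float DotUnrolled(const float* a, const float* b, int n) {
  float32x4_t acc0 = vdupq_n_f32(0.0f), acc1 = vdupq_n_f32(0.0f);
  float32x4_t acc2 = vdupq_n_f32(0.0f), acc3 = vdupq_n_f32(0.0f);
  for (int i = 0; i < n; i += 16) {
    acc0 = vfmaq_f32(acc0, vld1q_f32(a + i), vld1q_f32(b + i));
    acc1 = vfmaq_f32(acc1, vld1q_f32(a + i + 4), vld1q_f32(b + i + 4));
    acc2 = vfmaq_f32(acc2, vld1q_f32(a + i + 8), vld1q_f32(b + i + 8));
    acc3 = vfmaq_f32(acc3, vld1q_f32(a + i + 12), vld1q_f32(b + i + 12));
  }
  float32x4_t acc = vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
  return vaddvq_f32(acc);  // horizontal sum of the four lanes (AArch64)
}
```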