# Model Inference

After conversion and compression, a trained model needs to be deployed
on the computation hardware in order to execute inference. Such
execution involves the following steps:

1. Preprocessing: Process raw data to suit the network input.

2. Inference execution: Deploy the model resulting from offline
   conversion on the device to execute inference and compute the output
   based on the input.

3. Postprocessing: Further process the output of the model, for
   example, by threshold filtering.

## Preprocessing and Postprocessing

**1. Preprocessing**

Raw data, such as images, audio, and text, is so disordered that
machine learning models cannot identify or extract useful information
from it. Preprocessing converts such data into tensors that machine
learning networks can consume, eliminates irrelevant information,
recovers the useful information, enhances the detectability of relevant
information, and simplifies the data as much as possible. In this way,
reliability indicators related to feature extraction, image
segmentation, matching, and recognition of the models can be improved.

The following techniques are often used in data preprocessing:

1. Feature encoding: Encode the raw data that describes features into
   numbers and input them to machine learning models, which can process
   only numerical values. Common encoding approaches include
   discretization, ordinal encoding, one-hot encoding, and binary
   encoding.

2. Normalization: Modify features to be on the same scale without
   changing the correlation between them, eliminating the impact of
   different dimensions between data indicators. Common approaches
   include Min-Max normalization, which rescales the data range, and
   Z-score normalization, which standardizes the data distribution (see
   the sketch after this list).

3. Outlier processing: An outlier is a data point that is distant from
   all others in the distribution. Eliminating outliers can improve the
   accuracy of a model.
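As a minimal sketch of the normalization technique above, the two
functions below rescale a single feature column with Min-Max and
Z-score normalization. The function names, the use of
`std::vector<float>`, and the small epsilon guards are illustrative
assumptions rather than any particular framework's API.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Min-Max normalization: rescale one feature column into [0, 1].
std::vector<float> min_max_normalize(const std::vector<float>& x) {
    const auto [lo, hi] = std::minmax_element(x.begin(), x.end());
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = (x[i] - *lo) / (*hi - *lo + 1e-12f);  // epsilon avoids /0
    return out;
}

// Z-score normalization: shift to zero mean and unit variance.
std::vector<float> z_score_normalize(const std::vector<float>& x) {
    const float mean = std::accumulate(x.begin(), x.end(), 0.0f) / x.size();
    float var = 0.0f;
    for (float v : x) var += (v - mean) * (v - mean);
    const float stddev = std::sqrt(var / x.size() + 1e-12f);
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = (x[i] - mean) / stddev;
    return out;
}
```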

**2. Postprocessing**

After model inference, the output data is transferred to users for
postprocessing. Common postprocessing techniques include:

1. Discretization of continuous data: Suppose we want to predict
   discrete data, such as the quantity of a good, using a model. A
   regression model only provides continuous prediction values, which
   have to be rounded or bounded (see the sketch after this list).

2. Data visualization: This technique uses graphics and tables to
   represent data so that we can find relationships in the data in
   order to support analysis strategy selection.

3. Prediction range widening: Most values predicted by a regression
   model are concentrated in the center of the range, and few fall in
   the tails. For example, abnormal values of hospital laboratory data
   are used to diagnose diseases. To increase the accuracy of such
   predictions, we can widen the prediction range by multiplying the
   values that deviate from the normal range by a coefficient,
   enlarging the values in both tails.
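A minimal illustration of the first technique, assuming the prediction
is a quantity of goods: the continuous regression output is rounded to
the nearest integer and clamped to an illustrative valid range.

```cpp
#include <algorithm>
#include <cmath>

// Round a continuous prediction (e.g., a predicted quantity of goods)
// to the nearest integer and clamp it to a plausible range.
int discretize_prediction(float y, int lower = 0, int upper = 1000) {
    int rounded = static_cast<int>(std::lround(y));
    return std::clamp(rounded, lower, upper);
}
```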

## Parallel Computing
:label:`ch-deploy/parallel-inference`

Most inference frameworks provide a multi-thread mechanism that
leverages the capabilities of multiple cores in order to achieve
performance improvements. In this mechanism, the input data of an
operator is partitioned, and multiple threads process the different
data partitions. This allows operators to be computed in parallel,
scaling operator performance with the number of cores.


:label:`ch09_parallel`

In Figure :numref:`ch09_parallel`, the matrix multiplication can be
partitioned along the rows of matrix A. Three threads can then be used
to compute A1 \* B, A2 \* B, and A3 \* B (one thread per computation),
implementing multi-thread parallel execution of the matrix
multiplication, as sketched below.
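The following sketch mirrors this row-wise split using `std::thread`: A
is partitioned into row blocks and each worker computes its block of C.
The flat row-major layout, the plain nested-loop kernel, and the choice
of three threads are simplifying assumptions.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// C = A * B with A (M x K) split into row blocks, one thread per block.
void parallel_matmul(const std::vector<float>& A, const std::vector<float>& B,
                     std::vector<float>& C, int M, int K, int N,
                     int num_threads = 3) {
    auto worker = [&](int row_begin, int row_end) {
        for (int i = row_begin; i < row_end; ++i)
            for (int j = 0; j < N; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < K; ++k)
                    acc += A[i * K + k] * B[k * N + j];
                C[i * N + j] = acc;
            }
    };
    std::vector<std::thread> threads;
    int rows_per_thread = (M + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; ++t) {
        int begin = t * rows_per_thread;
        int end = std::min(M, begin + rows_per_thread);
        if (begin < end) threads.emplace_back(worker, begin, end);
    }
    for (auto& th : threads) th.join();  // wait for all row blocks
}
```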

To facilitate parallel computing of operators and avoid the overhead of
frequent thread creation and destruction, inference frameworks usually
have a thread pooling mechanism. There are two common practices:

1. Open Multi-Processing (OpenMP) API: OpenMP is an API that supports
   shared-memory concurrency across multiple platforms. It provides
   interfaces that are commonly used to implement operator parallelism.
   An example of such an interface is `parallel for`, which allows
   `for` loops to be executed concurrently by multiple threads (see the
   sketch after this list).

2. Framework-provided thread pools: Such pools are more lightweight and
   targeted at the AI domain compared with OpenMP interfaces, and can
   deliver better performance.
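A minimal sketch of the first practice: the `#pragma omp parallel for`
directive splits the iterations of a loop across the threads in
OpenMP's pool (built with an OpenMP-enabled compiler flag such as
`-fopenmp`). The element-wise ReLU operator here is only an
illustrative workload.

```cpp
#include <omp.h>

// Element-wise ReLU over a flat buffer. OpenMP partitions the loop
// iterations across the threads in its pool.
void relu_forward(const float* input, float* output, int size) {
    #pragma omp parallel for
    for (int i = 0; i < size; ++i) {
        output[i] = input[i] > 0.0f ? input[i] : 0.0f;
    }
}
```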

## Operator Optimization
:label:`ch-deploy/kernel-optimization`

When deploying an AI model, we want model training and inference to be
performed as fast as possible in order to obtain better performance.
For a deep learning network, the scheduling done by the framework takes
only a short period of time, whereas operator execution is often the
performance bottleneck. This section introduces how to optimize
operators from the perspectives of hardware instructions and
algorithms.

**1. Hardware instruction optimization**

Given that most devices have CPUs, the time that CPUs spend processing
operators has a direct impact on performance. Here we look at the
methods for optimizing hardware instructions on ARM CPUs.

**1) Assembly language**

High-level programming languages such as C++ and Java are compiled into
machine instruction sequences by compilers, and the compiler often has
a direct influence on the capabilities these languages can offer.
Assembly languages are close to machine code: each statement maps
one-to-one onto a machine instruction, so any instruction sequence can
be expressed directly. Programs written in assembly languages occupy
less memory, and are faster and more efficient than those written in
high-level languages.

In order to exploit the advantages of both types of languages, we can
write the parts of a program that require better performance in
assembly languages and the other parts in high-level languages. Because
convolution and matrix multiplication operators in deep learning
involve a large amount of computation, writing the code that performs
such computation in assembly languages can improve model training and
inference performance by dozens or even hundreds of times.

Next, we use ARMv8 CPUs to illustrate the optimization related to
hardware instructions.

**2) Registers and NEON instructions**

Each ARMv8 CPU has 32 NEON registers, v0 to v31. As shown in Figure
:numref:`ch-deploy/register`, NEON register v0 can store 128 bits,
which is enough for 4 float32 values, 8 float16 values, or 16 int8
values.


:label:`ch-deploy/register`

The single instruction multiple data (SIMD) method can be used to
improve the data access and computing speed on this CPU. Compared with
single instruction single data (SISD), a NEON instruction processes
multiple data values held in a NEON register at a time. For example,
the `fmla` instruction for floating-point data is used as
`fmla v0.4s, v1.4s, v2.4s`. As depicted in Figure
:numref:`ch-deploy/fmla`, the products of the corresponding
floating-point values in registers v1 and v2 are added to the values in
v0.


:label:`ch-deploy/fmla`
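In C or C++ code, the same operation is available through NEON
intrinsics: `vfmaq_f32` performs the lane-wise multiply-accumulate that
`fmla v0.4s, v1.4s, v2.4s` expresses in assembly. The loop structure
and scalar tail below are illustrative.

```cpp
#include <arm_neon.h>

// out[i] += a[i] * b[i], four float32 lanes per iteration (fmla).
void fused_multiply_add(const float* a, const float* b, float* out, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va   = vld1q_f32(a + i);    // load 4 floats from a
        float32x4_t vb   = vld1q_f32(b + i);    // load 4 floats from b
        float32x4_t vacc = vld1q_f32(out + i);  // load the accumulator
        vacc = vfmaq_f32(vacc, va, vb);         // fmla: vacc += va * vb
        vst1q_f32(out + i, vacc);               // store the result
    }
    for (; i < n; ++i)                          // scalar tail
        out[i] += a[i] * b[i];
}
```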

**3) Assembly language optimization**

For assembly language programs with known functions, the computational
instructions are usually fixed. In this case, non-computational
instructions are the source of the performance bottleneck. The
structure of computer storage devices resembles a pyramid, as shown in
Figure :numref:`ch-deploy/fusion-storage`. The top layer has the
fastest speed but the smallest space; conversely, the bottom layer has
the largest space but the slowest speed. L1 to L3 are referred to as
caches. When accessing data, the CPU first attempts to access the data
from one of its caches. If the data is not found, the CPU then accesses
the external main memory. The cache hit rate measures the proportion of
data accesses that are served from the cache, so the cache hit rate
must be maximized to improve program performance.
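As a small illustration of why the hit rate matters, the two loops
below sum the same row-major matrix: the first walks memory
contiguously and mostly hits the cache, while the second strides across
rows and misses far more often. The row-major layout is an assumption
made for the example.

```cpp
// data points to an M x N matrix stored in row-major order.

// Contiguous (row-wise) traversal: consecutive accesses fall in the
// same cache line, so the cache hit rate is high.
float sum_row_wise(const float* data, int M, int N) {
    float s = 0.0f;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            s += data[i * N + j];
    return s;
}

// Strided (column-wise) traversal: each access jumps N floats ahead,
// frequently touching a new cache line and lowering the hit rate.
float sum_column_wise(const float* data, int M, int N) {
    float s = 0.0f;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < M; ++i)
            s += data[i * N + j];
    return s;
}
```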

There are some techniques to improve the cache hit rate and optimize
the assembly performance:

1. Loop unrolling: Use as many registers as possible to achieve better
   performance at the cost of a larger code size (see the sketch after
   this list).

2. Instruction reordering: Reorder the instructions of different
   execution units to improve pipeline utilization, allowing
   instructions that incur latency to be issued earlier. In addition to
   hiding the latency, this method also reduces data dependency between
   neighboring instructions.

3. Register blocking: Block the NEON registers appropriately to reduce
   the number of idle registers and reuse registers as much as
   possible.

4. Data rearrangement: Rearrange the computational data to ensure
   contiguous memory reads and writes and improve the cache hit rate.

5. Instruction prefetching: Load the required data from the main memory
   into the cache in advance to reduce the access latency.
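A minimal sketch of the first technique on a dot product: the unrolled
loop keeps four independent accumulators live in registers, trading a
larger code body for fewer branches and better pipeline utilization.
The unrolling factor of four is an illustrative choice.

```cpp
// Dot product with the inner loop unrolled by a factor of four.
float dot_unrolled(const float* a, const float* b, int n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4) {   // four independent accumulators
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)             // scalar tail
        acc0 += a[i] * b[i];
    return acc0 + acc1 + acc2 + acc3;
}
```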