5. Instruction prefetching: Load the required data from the main memory into the cache in advance to reduce the access latency.

**2. Algorithm optimization**

For most AI models, 90% or more of the inference time of the entire network is spent computing convolution and matrix multiplication operators. This section focuses on optimizing convolution operator algorithms, and the techniques apply across hardware devices. The computation of a convolution can be converted into the multiplication of two matrices, and we elaborated on the optimization of the GEMM algorithm in Section :ref:`ch-deploy/parallel-inference`. For a given hardware target, appropriate matrix blocking improves data load/store efficiency and instruction-level parallelism, which helps maximize the utilization of the hardware's computing power and thereby improves inference performance.

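As a minimal illustration of matrix blocking, the NumPy sketch below multiplies two matrices tile by tile. The function name `blocked_matmul` and the default `tile` size are illustrative assumptions; a real GEMM implementation chooses tile sizes to match the cache hierarchy and register capacity of the target hardware.

```python
import numpy as np

def blocked_matmul(a, b, tile=64):
    """Multiply a (M, K) by b (K, N) in tile-by-tile blocks.

    Working on small blocks keeps the operands of each partial product
    in cache, improving data reuse compared with a naive triple loop.
    """
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):            # rows of the output block
        for j in range(0, n, tile):        # columns of the output block
            for p in range(0, k, tile):    # reduction (inner) dimension
                c[i:i + tile, j:j + tile] += (
                    a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
                )
    return c
```
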
**(1) Img2col**

Img2col is commonly used to convert a convolution into a matrix multiplication. Convolutional layers typically operate on 4D inputs in NHWC format. Figure :numref:`ch-deploy/conv_nhwc` shows such a convolution: the input shape is (1, IH, IW, IC), the convolution kernel shape is (OC, KH, KW, IC), and the output shape is (1, OH, OW, OC).

![General convolution](../img/ch08/ch09-conv_nhwc.png)
:label:`ch-deploy/conv_nhwc`

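To make the shapes concrete, here is a small NumPy sketch of a direct (naive) convolution on an NHWC input, assuming stride 1 and no padding; the sizes are arbitrary example values, and the result serves as a reference for the Img2col version developed below.

```python
import numpy as np

# Hypothetical example sizes (stride 1, no padding assumed).
IH, IW, IC = 6, 6, 3                   # input height, width, channels
KH, KW, OC = 3, 3, 4                   # kernel height, width, output channels
OH, OW = IH - KH + 1, IW - KW + 1      # output spatial size

x = np.random.rand(1, IH, IW, IC)      # input, NHWC with N = 1
w = np.random.rand(OC, KH, KW, IC)     # one kernel per output channel

# Direct convolution: each output value is the dot product of a
# KH x KW x IC input patch with one kernel.
out_direct = np.zeros((1, OH, OW, OC))
for oh in range(OH):
    for ow in range(OW):
        patch = x[0, oh:oh + KH, ow:ow + KW, :]        # (KH, KW, IC)
        for oc in range(OC):
            out_direct[0, oh, ow, oc] = np.sum(patch * w[oc])
```
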
As shown in Figure :numref:`ch-deploy/img2col_input`, Img2col reorders the convolution input to obtain the matrix on the right. The matrix has one row per output position, giving OH \* OW rows. Within each row, Img2col lays out the KH \* KW data points of each input channel in sequence, from the first channel through channel IC.

![Img2col on the convolution input](../img/ch08/ch09-img2col_input.png)
:label:`ch-deploy/img2col_input`

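Continuing the sketch above, the input rearrangement can be written as follows: one row per output position, and within each row the KH \* KW values of channel 0, then channel 1, and so on up to channel IC.

```python
# Img2col on the input: OH * OW rows, KH * KW * IC columns.
rows = []
for oh in range(OH):
    for ow in range(OW):
        patch = x[0, oh:oh + KH, ow:ow + KW, :]        # (KH, KW, IC)
        # Channel-major ordering: all KH * KW values of channel 0,
        # then channel 1, ..., up to channel IC.
        rows.append(patch.transpose(2, 0, 1).reshape(-1))
a_mat = np.stack(rows)                                 # (OH * OW, KH * KW * IC)
```
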
As shown in Figure :numref:`ch-deploy/img2col_weight`, the weights are rearranged in a similar way: each convolution kernel is expanded into one column of the weight matrix, so there are OC columns in total. Within each column, the KH \* KW values of the first input channel are arranged first, followed by the subsequent channels up to channel IC. In this manner, the convolution operation is converted into the multiplication of two matrices. In practice, the Img2col data rearrangement and the GEMM are performed simultaneously to save time.

![Img2col on the convolution kernel](../img/ch08/ch09-img2col_weight.png)
:label:`ch-deploy/img2col_weight`

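Completing the sketch, the weights are rearranged into a (KH \* KW \* IC, OC) matrix with the same channel-major ordering as the input rows, and the convolution reduces to a single GEMM whose result can be checked against the direct convolution computed earlier.

```python
# Img2col on the weights: each kernel becomes one column (OC columns),
# using the same channel-major ordering as the input rows.
b_mat = np.stack(
    [w[oc].transpose(2, 0, 1).reshape(-1) for oc in range(OC)], axis=1
)                                                      # (KH * KW * IC, OC)

# The convolution is now a single matrix multiplication.
out_gemm = (a_mat @ b_mat).reshape(1, OH, OW, OC)
assert np.allclose(out_gemm, out_direct)               # matches the direct convolution
```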