doc/documents/Examples_Tutorials/Examples_Tutorials.rst
Our assumption is that you are familiar with:
- **Caffe framework basics**
The proposed development process of an MLI-based embedded application is depicted in the following diagram:
.. image:: ../images/1_depl_process.png
   :align: center
   :alt: MLI-Based Application Development Process
..
1. Model definition and training in an appropriate framework. Ensure that you consider all limitations of the target platform here, including memory restrictions and the frequency budget.
2. Model deployment implies construction of a tested and verified ML module with a defined interface. It is recommended to wrap the module into a file-to-file application for convenient debugging and verification.
   The MLI CIFAR-10 example is exactly this “unit-testing” kind of application.
3. Integrate this module into the target embedded application code with real data.
This tutorial focuses on the second step – model deployment.
Manual deployment consists of two main parts:
- Deploying data — training implies tuning of model parameters, so the trained parameters must be exported.
- Deploying operations — the model consists not only of parameters but also of an algorithm that uses basic operations, or machine learning primitives.
Using defined pieces of Python code, you can extract all the required data from the Caffe model.
Collect Data Range Statistics for Each Layer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The quantization process is not only meant to convert weights data to fixed-point representation, but also to define the ranges of all intermediate data for each layer. For this purpose, run the model on some representative data subset and gather statistics for all intermediate results. It is recommended to use the full training subset.
To accomplish this using previously defined instruments, see this sample code:
For convolution layer X, the number of integer bits is defined as before. For each output value, the following number of sequential accumulations is required: 32 [number of channels] * (5*5) [kernel size] + 1 [bias] = 801 operations. Hence, 10 extra bits are required for accumulation, while only 9 are available. For this reason, the number of integer bits for the layer input is increased.
For the following fully connected layer, 11 extra bits are required, so 2 extra bits need to be distributed. It is recommended to do this evenly between operands. Note that the number of the convolution’s output fractional bits also needs to be changed to stay aligned with the next fully connected layer’s input.
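As a quick sanity check of this arithmetic, the number of extra accumulator bits needed for a series of N accumulations can be estimated as ceil(log2(N)). The following standalone C++ sketch is only an illustration (it is not part of MLI; the 9 bits of available headroom for 16-bit operands is taken from the text above):

.. code:: c++

   #include <cmath>
   #include <cstdio>

   // Extra accumulator bits needed to sum num_accum worst-case products.
   static int guard_bits(int num_accum) {
       return static_cast<int>(std::ceil(std::log2(static_cast<double>(num_accum))));
   }

   int main() {
       const int conv_accum = 32 * 5 * 5 + 1;  // 801 accumulations: channels * kernel + bias
       const int available  = 9;               // headroom for 16-bit operands (see text)

       const int needed = guard_bits(conv_accum);  // ceil(log2(801)) = 10
       std::printf("%d accumulations -> %d guard bits needed, %d available\n",
                   conv_accum, needed, available);
       if (needed > available) {
           std::printf("shift operands by %d extra bit(s) of integer part\n",
                       needed - available);
       }
       return 0;
   }
..

The same check applied to the fully connected layer’s accumulation length should reproduce the 11 extra bits mentioned above.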
For 8-bit operands, you do not need to perform this adjustment unless your MAC series is longer than 131072 operations, in which case you apply a similar approach. After considering the accumulator restrictions for the CIFAR-10 example with 16-bit operands, you get the following table:
.. note::
   Defining the Q format in this way, you can guarantee that the accumulator is not saturated while a single output is being calculated. The restriction may be loosened if you are sure about your data. For example, look at the final fully connected layer above: 9 bits (512 MACs) are enough if we do not consider the bias addition. Analyze how likely it is that one extra addition will overflow the defined range. Moreover, saturation of results might have only a minor effect on the network accuracy.
..
Quantize Weights According to Defined Q-Format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
After extracting coefficients into numpy array objects and defining the Qm.n format for the data, define MLI structures for kernels and export the quantized data.
Consider a static allocation of data. To extract weights, you can make the pre-processor quantize the data for you at compile time by wrapping each coefficient into a macro function. It is slower and uses more memory resources of your machine during compilation, but it is worth it if the model is not too big.
.. code:: c++
};
..
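The macro used in the example is not reproduced here. Purely as an illustration, a compile-time quantization macro of this kind might look like the sketch below; the names QF and FRAC_BITS and the coefficient values are hypothetical, and saturation handling is omitted:

.. code:: c++

   #include <stdint.h>

   // Hypothetical compile-time quantization to a Qm.n value with FRAC_BITS
   // fractional bits: scale, round to nearest, cast to the 16-bit container.
   #define FRAC_BITS 13
   #define QF(x) ((int16_t)((x) * (1 << FRAC_BITS) + ((x) >= 0 ? 0.5f : -0.5f)))

   // Each coefficient is wrapped into the macro, so the compiler performs the
   // float-to-fixed conversion at build time.
   static const int16_t hypothetical_weights[] = {
       QF(0.0125f), QF(-0.3047f), QF(0.1172f) /* , ... */
   };
..

This reflects the trade-off described above: quantization happens at compile time, at the cost of build time and memory.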
Alternatively, you can quantize data externally in the same way and just put it into code.
.. code:: c++
..
MLI pooling behavior differs from the default Caffe behavior. In Caffe, padding is implied for some combinations of layer parameters, even if it is not specified. In MLI, you should state the padding explicitly wherever Caffe implies it; this behavior was chosen for compatibility with other frameworks.
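The implicit padding comes from Caffe rounding the pooling output size up (ceil mode, the Caffe default), so the last pooling window may reach past the input edge. The sketch below only illustrates that arithmetic; the 32x32 input, 3x3 kernel, and stride 2 are example values, not parameters quoted from this model:

.. code:: c++

   #include <cstdio>

   // Caffe pooling rounds the output size up, which can implicitly pad the input.
   static int out_size_ceil(int in, int kernel, int stride, int pad) {
       return (in + 2 * pad - kernel + stride - 1) / stride + 1;  // ceil division
   }
   // Floor-based rounding never reads past the (explicitly padded) input edge.
   static int out_size_floor(int in, int kernel, int stride, int pad) {
       return (in + 2 * pad - kernel) / stride + 1;
   }

   int main() {
       const int in = 32, kernel = 3, stride = 2, pad = 0;
       std::printf("ceil (Caffe): %d   floor: %d\n",
                   out_size_ceil(in, kernel, stride, pad),    // 16
                   out_size_floor(in, kernel, stride, pad));  // 15
       // To get the same 16 outputs with floor rounding, one row/column of
       // padding has to be specified explicitly on the bottom/right.
       return 0;
   }
..

This is why the padding that Caffe adds silently has to be written out explicitly in the MLI layer configuration.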
.. table:: Example Pooling Layer with Padding
   :widths: 20, 130
Consider the last two operations:
..
Fully connected (referred to as Inner Product in Caffe) and softmax layers don’t require any specific analysis.
.. table:: Example of Function Choosing Optimal Specialization