Commit 67c5b59

[Docs] Add english documents for BF16 and oneDNN optimization. (#554)

1 parent b7baf8a commit 67c5b59

4 files changed: +241 -0 lines changed

Lines changed: 94 additions & 0 deletions

# Operator Optimization

## Hardware and Software Configuration

Hardware: [Alibaba Cloud ECS general purpose instance family with high clock speeds - **ecs.hfg7.2xlarge**](https://help.aliyun.com/document_detail/25378.html?spm=5176.2020520101.vmBInfo.instanceType.4a944df5PvCcED#hfg7).

Number of CPU cores: 8

Baseline version: TensorFlow v1.15.5

Optimized version: DeepRec

GCC version: 7.5.0

## Performance Data

| Op Name | Input Tensor Shape | Baseline Perf (latency/ms) | Optimized Perf (latency/ms) | Speedup |
| ----------------- | --------------------------------------------------------- | -------------------------- | --------------------------- | ------- |
| Select | condition: (1024, 64), x: (1024, 64), y: (1024, 64) | 2.080 | 0.564 | +3.68X |
| Dynamic_stitch | indices: (40, 2500), data: (40, 2500, 64) | 82.14 | 24.77 | +3.31X |
| Transpose | data: (1024, 64) | 1.504 | 0.366 | +4.11X |
| Tile | input: (512, 50), multiples: (2, 50) | 1.68 | 0.125 | +13.44X |
| BiasAddGrad | data: (51200, 512) | 26.84 | 1.67 | +16.07X |
| SparseSegmentMean | data: (51200, 128), indices: (51200), seg index: (51200) | 1.93 | 0.445 | +4.34X |
| Unique | | | | |
| Gather | | | | |
| BiasAdd | | | | |
| where | | | | |
| DynamicPartition | | | | |
| SparseConcat | | | | |

## Case Study: Select

The computation process of the Select operator:

![select.png](../docs_zh/Operator-Optimization/select.png)
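
For reference, a minimal sketch of the Select semantics being optimized here, written with the TF 1.x `tf.where` API (the shapes and values are illustrative, not taken from the benchmark): with a rank-1 condition, each output row is taken from `x` or `y` as a whole.

```python
import numpy as np
import tensorflow as tf

cond = tf.constant(np.random.rand(1024) > 0.5)  # (1024,) per-row condition
x = tf.random.uniform([1024, 64])               # "then" rows
y = tf.random.uniform([1024, 64])               # "else" rows

# Row i of the output comes from x if cond[i] is true, otherwise from y;
# in TF 1.x graph mode this maps to the Select kernel discussed in this case study.
out = tf.where(cond, x, y)

with tf.Session() as sess:
    print(sess.run(out).shape)  # (1024, 64)
```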

Original TensorFlow implementation: Broadcast + Elementwise Select.

```cpp
template <typename Device, typename T, int NDIMS>
struct BCastSelectFunctorBase {
  void operator()(const Device& d,
                  typename TTypes<T, NDIMS>::Tensor output_tensor,
                  typename TTypes<bool, NDIMS>::ConstTensor cond_tensor,
                  typename TTypes<T, NDIMS>::ConstTensor then_tensor,
                  typename TTypes<T, NDIMS>::ConstTensor else_tensor,
                  typename Eigen::array<Eigen::DenseIndex, NDIMS> cond_bcast,
                  typename Eigen::array<Eigen::DenseIndex, NDIMS> then_bcast,
                  typename Eigen::array<Eigen::DenseIndex, NDIMS> else_bcast) {
    output_tensor.device(d) = cond_tensor.broadcast(cond_bcast)
                                  .select(then_tensor.broadcast(then_bcast),
                                          else_tensor.broadcast(else_bcast));
  }
};
```

PAI-TF (merged into the community): Row Select, which removes the redundant broadcast operations in the original TensorFlow implementation.

```cpp
// Row i of the output is copied as a whole from t (then) or e (else),
// depending on the condition c[i].
if (c[i]) {
  for (size_t j = 0; j < batch_size; ++j) {
    output[offset + j] = t[offset + j];
  }
} else {
  for (size_t j = 0; j < batch_size; ++j) {
    output[offset + j] = e[offset + j];
  }
}
```

DeepRec: vectorized Row Select, which uses AVX-512 mask instructions to further optimize the select operation and improves the performance of this operator by 3.68x.

```cpp
__mmask16 cmask = (c[i] == false) ? 0xffff : 0x0000;  // select t/e
size_t ofs = 0;

// Main loop: process the row 16 floats at a time with AVX-512 masked loads.
for (size_t j = 0; j < quotient; ++j) {
  __m512 src = _mm512_loadu_ps(t + offset + ofs);
  __m512 tmp = _mm512_mask_loadu_ps(src, cmask, e + offset + ofs);
  _mm512_storeu_ps(output + offset + ofs, tmp);
  ofs += float_alignment;
}

// Tail: handle the remaining elements that do not fill a full 16-float vector.
if (remainder != 0) {
  __mmask16 mask = (remainder >= float_alignment)
      ? 0xffff : 0xffff >> (float_alignment - remainder);
  cmask &= mask;
  __m512 src = _mm512_mask_loadu_ps(_mm512_setzero_ps(), mask, t + offset + ofs);
  __m512 tmp = _mm512_mask_loadu_ps(src, cmask, e + offset + ofs);
  _mm512_mask_storeu_ps(output + offset + ofs, mask, tmp);
}
```

docs/docs_en/bf16.md

Lines changed: 82 additions & 0 deletions

# BFloat16

BFloat16 (BF16) is a computation format and instruction set for accelerating deep learning training and inference. It is supported on the third-generation Intel® Xeon® Scalable processor Cooper Lake, available as the [AliCloud hfg7 instance family](https://help.aliyun.com/document_detail/25378.html?spm=5176.2020520101.vmBInfo.instanceType.4a944df5PvCcED#hfg7), and on its successor processors. The figure below shows a comparison with other commonly used data formats:

![img_1.png](../docs_zh/oneDNN/BF16.png)

## Requirements and Methods

Requirements: The cloud instance must use the third-generation Intel® Xeon® Scalable processor Cooper Lake, e.g. the [AliCloud hfg7 instance family](https://help.aliyun.com/document_detail/25378.html?spm=5176.2020520101.vmBInfo.instanceType.4a944df5PvCcED#hfg7). In addition, DeepRec must be compiled with oneDNN optimization enabled in order to provide BF16 instruction acceleration; details can be found in the oneDNN section.

Method: Since recommendation scenarios are extremely demanding in terms of model accuracy, users can freely control which parts of the computing graph run in BF16 in the following way, improving model performance while taking accuracy into account:

- Step 1: Add `.keep_weights(dtype=tf.float32)` after `tf.variable_scope(…)` to keep the weights of the current scope in FP32.
- Step 2: Add `tf.cast(…, dtype=tf.bfloat16)` to cast the input tensors to BF16.
- Step 3: Add `tf.cast(…, dtype=tf.float32)` to cast the output tensors back to FP32.

```
with tf.variable_scope(…).keep_weights(dtype=tf.float32):
    inputs_bf16 = tf.cast(inputs, dtype=tf.bfloat16)
    …  # BF16 graph, FP32 weights
    outputs = tf.cast(outputs_bf16, dtype=tf.float32)
```

Example:

```python
import tensorflow as tf

inputs = tf.ones([4, 8], tf.float32)

with tf.variable_scope('dnn', reuse=tf.AUTO_REUSE).keep_weights(dtype=tf.float32):
    # cast inputs to BF16
    inputs = tf.cast(inputs, dtype=tf.bfloat16)
    outputs = tf.layers.dense(inputs, units=8, activation=tf.nn.relu)
    outputs = tf.layers.dense(outputs, units=1, activation=None)
    # cast outputs back to FP32
    outputs = tf.cast(outputs, dtype=tf.float32)

outputs = tf.nn.softmax(outputs)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(outputs))
```

Special reminder: according to parameter-tuning experience, the last layer of a multi-layer DNN usually has the largest impact on model accuracy while accounting for only a small fraction of the computation. The last DNN layer can therefore be kept in FP32, which improves the computational performance of training while preserving model accuracy.
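
For example, building on the example above, a minimal sketch (the layer sizes are illustrative) that computes the hidden layer in BF16 while keeping the accuracy-sensitive last layer in FP32:

```python
import tensorflow as tf

inputs = tf.ones([4, 8], tf.float32)

with tf.variable_scope('dnn', reuse=tf.AUTO_REUSE).keep_weights(dtype=tf.float32):
    # hidden layer runs in BF16
    hidden = tf.cast(inputs, dtype=tf.bfloat16)
    hidden = tf.layers.dense(hidden, units=8, activation=tf.nn.relu)
    # cast back to FP32 so that the last (accuracy-sensitive) layer runs in FP32
    hidden = tf.cast(hidden, dtype=tf.float32)
    outputs = tf.layers.dense(hidden, units=1, activation=None)

outputs = tf.nn.softmax(outputs)
```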

To keep accuracy consistent with the model without BF16 optimization, DeepRec provides the `keep_weights(dtype=dtypes.float32)` method on variable_scope. With this method, all variables in the scope are stored in FP32, which significantly reduces the accumulated summation error on the variables, while cast operations are automatically added to the graph to convert them to BF16 for computation. To reduce the extra overhead introduced by these cast operations, DeepRec automatically fuses each cast operator with its neighboring operator to improve performance. DeepRec performs the following fusions on cast-related operators:

- MatMul + Cast
- Concat + Cast
- Split + Cast

## Performance Comparison

Models from the DeepRec Modelzoo are used to compare BF16-enabled DeepRec with FP32 and show the performance improvement. Models in the Modelzoo can enable the BF16 feature by adding the `--bf16` parameter.

The benchmark machine is an Aliyun ECS cloud server with a 3rd-generation Intel Xeon Scalable processor (Cooper Lake), instance type [ecs.hfg7.2xlarge](https://help.aliyun.com/document_detail/25378.html?spm=5176.2020520101.vmBInfo.instanceType.4a944df5PvCcED#hfg7).

- Hardware configuration:
  - Intel(R) Xeon(R) Platinum 8369HC CPU @ 3.30GHz
  - CPU(s): 8
  - Socket(s): 1
  - Core(s) per socket: 4
  - Thread(s) per core: 2
  - Memory: 32 GB
- Software configuration:
  - Kernel: 4.18.0-348.2.1.el8_5.x86_64
  - OS: CentOS Linux release 8.5.2111
  - GCC: 8.5.0
  - Docker: 20.10.12
  - Python: 3.6.8

Performance results:

| **Throughput** | **WDL** | **DeepFM** | **DSSM** |
| -------------- | -------- | ---------- | --------- |
| FP32 | 15792.49 | 30718.6 | 114436.87 |
| FP32+BF16 | 22633.8 | 34554.67 | 125995.83 |
| Speedup | 1.43x | 1.12x | 1.10x |

BF16 has little effect on the AUC metric of model training; more details of the differences can be found in the documentation of each model in the Modelzoo.

docs/docs_en/index.md

Lines changed: 17 additions & 0 deletions

DeepRec has been deeply cultivated since 2016 and supports core businesses such as Taobao search, recommendation, and advertising. It has accumulated a rich set of features on top of the basic framework and delivers excellent performance in sparse model training. Facing a wide variety of external needs and a deep learning framework ecosystem that embraces open source, open-sourcing DeepRec helps establish standardized interfaces, cultivate user habits, greatly reduce the cost for external customers working on the cloud, and build brand value.

# Getting started

# Features

```{toctree}
:maxdepth: 2
:caption: Operator & Hardware Acceleration

oneDNN
Operator-Optimization
```

```{toctree}
:maxdepth: 2
:caption: Model Quantification

BFloat16
```

docs/docs_en/oneDNN.md

Lines changed: 48 additions & 0 deletions

# oneDNN

## Introduction

[oneDNN](https://github.com/oneapi-src/oneDNN) is Intel's open-source, cross-platform performance acceleration library for deep learning. The [documentation](https://oneapi-src.github.io/oneDNN/) lists which primitives are supported. oneDNN has been integrated into DeepRec and can be enabled with a compile option: `--config=mkl_threadpool` enables oneDNN-accelerated operator computation. Adding the compile option `--config=opt` additionally enables `--copt=-march=native`, which can further accelerate operator performance on CPUs that support AVX-512, for example Skylake, Cascade Lake, and Ice Lake.

Tips: MKL was first renamed DNNL and then renamed oneDNN. TensorFlow initially used MKL to accelerate operator computation; over subsequent versions oneDNN gradually took the place of MKL, but the MKL macro definitions were retained.

Macro definitions of oneDNN in DeepRec:

| Macro Definition | Values (bold = default) | Explanation |
| :------------------------------- | --------------------------------------------- | ------------------------------------------------------------ |
| TF_MKL_PRIMITIVE_ONLY_FOR_RECO | **1/true**, 0/false | 1: only replace the [operators](https://github.com/alibaba/DeepRec/blob/main/tensorflow/core/graph/mkl_layout_pass.cc#L824-L840) commonly used in recommendation models with their oneDNN implementations; 0: replace all operators supported by oneDNN. |
| TF_MKL_OPTIMIZE_PRIMITIVE_MEMUSE | **1/true**, 0/false | 1: reduce main-memory usage by releasing primitives; 0: do not release primitives. |
| TF_DISABLE_MKL | **0**, 1 | 0: enable MKL; 1: disable MKL. |
| TF_MKL_NUM_INTRAOP | Integer, such as 14; **not set by default** | Integer: the number of intra-op threads used by oneDNN; not set: use at most the number of TF intra-op threads. |
| ONEDNN_VERBOSE | **0**/1/2 | The [verbose level](https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html) of log output printed by oneDNN primitives. |
| DNNL_MAX_CPU_ISA | **ALL**, AVX512_CORE_AMX, AVX512_CORE_BF16, … | The [highest ISA](https://oneapi-src.github.io/oneDNN/v2.4/dev_guide_cpu_dispatcher_control.html#run-time-controls) used by oneDNN (for oneDNN versions below 2.5.0). |
| ONEDNN_MAX_CPU_ISA | **ALL**, AVX512_CORE_AMX, AVX512_CORE_BF16, … | The [highest ISA](https://oneapi-src.github.io/oneDNN/v2.4/dev_guide_cpu_dispatcher_control.html#run-time-controls) used by oneDNN (for oneDNN versions 2.5.0 and above). |
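
These switches are ordinary environment variables, so they can be exported in the shell before launching the job. As an illustrative sketch (not from the DeepRec docs), they can also be set from Python, assuming they are set before TensorFlow is imported and the graph is executed so that oneDNN picks them up:

```python
import os

# Illustrative values; the variable names come from the table above.
os.environ["TF_MKL_PRIMITIVE_ONLY_FOR_RECO"] = "1"  # only replace recommendation-related ops
os.environ["TF_MKL_NUM_INTRAOP"] = "8"              # intra-op threads used by oneDNN
os.environ["ONEDNN_VERBOSE"] = "1"                  # print oneDNN primitive execution logs

import tensorflow as tf  # import after the environment is configured
```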

Primitives supported by oneDNN:

| Primitive | Available Types | Available Post-Operations |
| -------------------------------------- | --------------------------- | --------------------------------- |
| Matrix Multiplication | f32, bf16, f16, u8, s8 | Scale, Zero, Eltwise, Sum, Binary |
| Inner Product | f32, bf16, f16, u8, s8 | Scale, Eltwise, Sum, Binary |
| Layer Normalization | f32, bf16, f16 | / |
| Batch Normalization | f32, bf16, f16, s8 | Eltwise |
| Local Response Normalization (LRN) | f32, bf16, f16 | / |
| Binary (+, =, *, /, >, <, min, max...) | f32, bf16, f16, u8, s8 | Scale, Eltwise, Sum, Binary |
| Eltwise (relu, gelu, tanh, linear...) | f32, s32, bf16, f16, u8, s8 | Binary |
| PReLU | f32, s32, bf16, s8, u8 | / |
| Sum | f32, s32, bf16, f16, u8, s8 | / |
| Reduction | f32, bf16, u8, s8 | Eltwise, Sum, Binary |
| Softmax | f32, bf16, f16 | / |
| LogSoftmax | f32, bf16 | / |
| Reorder | f32, s32, bf16, f16, u8, s8 | Scale, Sum |
| Concat | f32, s32, bf16, f16, u8, s8 | / |
| Convolution | f32, bf16, f16, u8, s8 | Scale, Zero, Eltwise, Sum, Binary |
| Pooling | f32, s32, bf16, f16, u8, s8 | Binary |
| RNN (LSTM, GRU, Vanilla RNN...) | f32, bf16, f16, u8, s8 | / |
| Resampling | f32, s32, bf16, f16, s8, u8 | Eltwise, Sum, Binary |
| Shuffle | f32, s32, bf16, s8, u8 | / |
