|
70 | 70 | "source": [
|
71 | 71 | "## Overview\n",
|
72 | 72 | "\n",
|
73 |
| - "Mixed precision is the use of both 16-bit and 32-bit floating-point types in a model during training to make it run faster and use less memory. By keeping certain parts of the model in the 32-bit types for numeric stability, the model will have a lower step time and train equally as well in terms of the evaluation metrics such as accuracy. This guide describes how to use the Keras mixed precision API to speed up your models. Using this API can improve performance by more than 3 times on modern GPUs and 60% on TPUs." |
| 73 | + "Mixed precision is the use of both 16-bit and 32-bit floating-point types in a model during training to make it run faster and use less memory. By keeping certain parts of the model in the 32-bit types for numeric stability, the model will have a lower step time and train equally as well in terms of the evaluation metrics such as accuracy. This guide describes how to use the Keras mixed precision API to speed up your models. Using this API can improve performance by more than 3 times on modern GPUs, 60% on TPUs and more than 2 times on latest Intel CPUs." |
74 | 74 | ]
|
75 | 75 | },
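As a quick preview (a minimal sketch, not taken from the guide's own cells), enabling the API is a single policy call; layers created afterwards compute in 16-bit while keeping their variables in float32, which is the mechanism described above:

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Enable mixed precision globally; layers created afterwards compute in float16
# while keeping their variables in float32 for numeric stability.
mixed_precision.set_global_policy('mixed_float16')

layer = tf.keras.layers.Dense(10)
print(layer.compute_dtype)   # float16 -- dtype used for the layer's computations
print(layer.variable_dtype)  # float32 -- dtype of the layer's weights
```

The guide walks through each of these pieces in detail below.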
|
76 | 76 | {
|
|
81 | 81 | "source": [
|
82 | 82 | "Today, most models use the float32 dtype, which takes 32 bits of memory. However, there are two lower-precision dtypes, float16 and bfloat16, each which take 16 bits of memory instead. Modern accelerators can run operations faster in the 16-bit dtypes, as they have specialized hardware to run 16-bit computations and 16-bit dtypes can be read from memory faster.\n",
|
83 | 83 | "\n",
|
84 |
| - "NVIDIA GPUs can run operations in float16 faster than in float32, and TPUs can run operations in bfloat16 faster than float32. Therefore, these lower-precision dtypes should be used whenever possible on those devices. However, variables and a few computations should still be in float32 for numeric reasons so that the model trains to the same quality. The Keras mixed precision API allows you to use a mix of either float16 or bfloat16 with float32, to get the performance benefits from float16/bfloat16 and the numeric stability benefits from float32.\n", |
| 84 | + "NVIDIA GPUs can run operations in float16 faster than in float32, and TPUs and supporting Intel CPUs can run operations in bfloat16 faster than float32. Therefore, these lower-precision dtypes should be used whenever possible on those devices. However, variables and a few computations should still be in float32 for numeric reasons so that the model trains to the same quality. The Keras mixed precision API allows you to use a mix of either float16 or bfloat16 with float32, to get the performance benefits from float16/bfloat16 and the numeric stability benefits from float32.\n", |
85 | 85 | "\n",
|
86 | 86 | "Note: In this guide, the term \"numeric stability\" refers to how a model's quality is affected by the use of a lower-precision dtype instead of a higher precision dtype. An operation is \"numerically unstable\" in float16 or bfloat16 if running it in one of those dtypes causes the model to have worse evaluation accuracy or other metrics compared to running the operation in float32."
|
87 | 87 | ]
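The practical difference between the two 16-bit dtypes is their dynamic range: float16 trades range for precision, while bfloat16 keeps float32's exponent range at the cost of precision. A small illustration (the values here are chosen purely for demonstration):

```python
import tensorflow as tf

# float16 can represent values only up to about 65504, so larger values overflow...
print(tf.cast(70000.0, tf.float16).numpy())   # inf
# ...while bfloat16 keeps float32's exponent range and represents it, just coarsely.
print(tf.cast(70000.0, tf.bfloat16).numpy())  # ~70144 (reduced precision, but finite)
```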
|
|
118 | 118 | "source": [
|
119 | 119 | "## Supported hardware\n",
|
120 | 120 | "\n",
|
121 |
| - "While mixed precision will run on most hardware, it will only speed up models on recent NVIDIA GPUs and Cloud TPUs. NVIDIA GPUs support using a mix of float16 and float32, while TPUs support a mix of bfloat16 and float32.\n", |
| 121 | + "While mixed precision will run on most hardware, it will only speed up models on recent NVIDIA GPUs, Cloud TPUs and recent Intel CPUs. NVIDIA GPUs support using a mix of float16 and float32, while TPUs and Intel CPUs support a mix of bfloat16 and float32.\n", |
122 | 122 | "\n",
|
123 |
| - "Among NVIDIA GPUs, those with compute capability 7.0 or higher will see the greatest performance benefit from mixed precision because they have special hardware units, called Tensor Cores, to accelerate float16 matrix multiplications and convolutions. Older GPUs offer no math performance benefit for using mixed precision, however memory and bandwidth savings can enable some speedups. You can look up the compute capability for your GPU at NVIDIA's [CUDA GPU web page](https://developer.nvidia.com/cuda-gpus). Examples of GPUs that will benefit most from mixed precision include RTX GPUs, the V100, and the A100." |
| 123 | + "Among NVIDIA GPUs, those with compute capability 7.0 or higher will see the greatest performance benefit from mixed precision because they have special hardware units, called Tensor Cores, to accelerate float16 matrix multiplications and convolutions. Older GPUs offer no math performance benefit for using mixed precision, however memory and bandwidth savings can enable some speedups. You can look up the compute capability for your GPU at NVIDIA's [CUDA GPU web page](https://developer.nvidia.com/cuda-gpus). Examples of GPUs that will benefit most from mixed precision include RTX GPUs, the V100, and the A100.\n", |
| 124 | + "\n", |
| 125 | + "Among Intel CPUs, starting with the 4th Gen Intel Xeon Processors (code name Sapphire Rapids), will see the greatest performance benefit from mixed precision as they can accelerate bfloat16 computations using AMX instructions (requires Tensorflow 2.12 or later)." |
124 | 126 | ]
|
125 | 127 | },
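If you prefer to check the compute capability programmatically rather than on the web page, something along these lines works on TensorFlow 2.4 and later (a sketch; the returned keys can vary by device and build):

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Returns a dict such as {'device_name': 'Tesla V100-SXM2-16GB',
    # 'compute_capability': (7, 0)} when the information is available.
    details = tf.config.experimental.get_device_details(gpus[0])
    print(details.get('device_name'), details.get('compute_capability'))
else:
    print('No GPU visible to TensorFlow')
```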
|
126 | 128 | {
|
|
129 | 131 | "id": "-q2hisD60F0_"
|
130 | 132 | },
|
131 | 133 | "source": [
|
132 |
| - "Note: If running this guide in Google Colab, the GPU runtime typically has a P100 connected. The P100 has compute capability 6.0 and is not expected to show a significant speedup.\n", |
| 134 | + "Note: If running this guide in Google Colab, the GPU runtime typically has a P100 connected. The P100 has compute capability 6.0 and is not expected to show a significant speedup. And if running on CPU runtime, there may be a slow down as the runtime likely has a CPU without AMX.\n", |
133 | 135 | "\n",
|
134 | 136 | "You can check your GPU type with the following. The command only exists if the\n",
|
135 | 137 | "NVIDIA drivers are installed, so the following will raise an error otherwise."
|
|
154 | 156 | "source": [
|
155 | 157 | "All Cloud TPUs support bfloat16.\n",
|
156 | 158 | "\n",
|
157 |
| - "Even on CPUs and older GPUs, where no speedup is expected, mixed precision APIs can still be used for unit testing, debugging, or just to try out the API. On CPUs, mixed precision will run significantly slower, however." |
| 159 | + "Even on older Intel CPUs, other x86 CPUs without AMX, and older GPUs, where no speedup is expected, mixed precision APIs can still be used for unit testing, debugging, or just to try out the API. However, mixed_bfloat16 on CPUs without AMX instructions and mixed_float16 on all x86 CPUs will run significantly slower." |
158 | 160 | ]
|
159 | 161 | },
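TensorFlow does not expose a dedicated API for querying AMX support, but on Linux the feature usually shows up in the CPU flags. A rough, Linux-only heuristic (the flag names `amx_bf16` and `amx_tile` are what AMX-capable CPUs and kernels commonly report; their absence may also just mean an older kernel):

```python
# Rough, Linux-only heuristic: look for AMX feature flags in /proc/cpuinfo.
def cpu_reports_amx(path='/proc/cpuinfo'):
    try:
        with open(path) as f:
            info = f.read()
    except OSError:
        return False
    return 'amx_bf16' in info or 'amx_tile' in info

print('AMX flags present:', cpu_reports_amx())
```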
|
160 | 162 | {
|
|
235 | 237 | "id": "MOFEcna28o4T"
|
236 | 238 | },
|
237 | 239 | "source": [
|
238 |
| - "As mentioned before, the `mixed_float16` policy will most significantly improve performance on NVIDIA GPUs with compute capability of at least 7.0. The policy will run on other GPUs and CPUs but may not improve performance. For TPUs, the `mixed_bfloat16` policy should be used instead." |
| 240 | + "As mentioned before, the `mixed_float16` policy will most significantly improve performance on NVIDIA GPUs with compute capability of at least 7.0. The policy will run on other GPUs and CPUs but may not improve performance. For TPUs and CPUs, the `mixed_bfloat16` policy should be used instead." |
239 | 241 | ]
|
240 | 242 | },
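One way to pick between the two policies based on the visible hardware is sketched below. This is a simple heuristic rather than an official recipe: it prefers float16 when an NVIDIA GPU is present and falls back to bfloat16 otherwise, which also covers the TPU and CPU cases described above.

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Heuristic policy selection: float16 Tensor Cores on NVIDIA GPUs,
# bfloat16 on TPUs and bfloat16-capable CPUs.
if tf.config.list_physical_devices('GPU'):
    mixed_precision.set_global_policy('mixed_float16')
else:
    mixed_precision.set_global_policy('mixed_bfloat16')

print(mixed_precision.global_policy())
```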
|
241 | 243 | {
|
|
480 | 482 | "source": [
|
481 | 483 | "## Loss scaling\n",
|
482 | 484 | "\n",
|
483 |
| - "Loss scaling is a technique which `tf.keras.Model.fit` automatically performs with the `mixed_float16` policy to avoid numeric underflow. This section describes what loss scaling is and the next section describes how to use it with a custom training loop." |
| 485 | + "Loss scaling is a technique which `tf.keras.Model.fit` automatically performs with the `mixed_float16` policy to avoid numeric underflow. This section describes what loss scaling is and the next section describes how to use it with a custom training loop.\n", |
| 486 | + "\n", |
| 487 | + "Note: When using `mixed_bfloat16` policy, there is no need to do loss scaling." |
484 | 488 | ]
|
485 | 489 | },
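To see why scaling helps: float16 cannot represent values much below 6e-8, so very small gradients silently become zero. Multiplying the loss (and therefore the gradients) by a large factor keeps them representable, and they are divided back down before the weight update. A small illustration (the specific numbers are chosen only for demonstration):

```python
import tensorflow as tf

# A value below float16's smallest subnormal (~6e-8) underflows to zero.
tiny_grad = tf.cast(1e-8, tf.float16)
print(tiny_grad.numpy())                  # 0.0 -- the gradient information is lost

# Scaling by e.g. 1024 keeps it representable; dividing by 1024 later recovers it.
scaled = tf.cast(1e-8 * 1024, tf.float16)
print(scaled.numpy())                     # ~1.0e-05 -- nonzero, so it survives
```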
|
486 | 490 | {
|
|
806 | 810 | "source": [
|
807 | 811 | "## Summary\n",
|
808 | 812 | "\n",
|
809 |
| - "- You should use mixed precision if you use TPUs or NVIDIA GPUs with at least compute capability 7.0, as it will improve performance by up to 3x.\n", |
| 813 | + "- You should use mixed precision if you use TPUs, NVIDIA GPUs with at least compute capability 7.0, or Intel CPUs with support for AMX instructions, as it will improve performance by up to 3x.\n", |
810 | 814 | "- You can use mixed precision with the following lines:\n",
|
811 | 815 | "\n",
|
812 | 816 | " ```python\n",
|
813 |
| - " # On TPUs, use 'mixed_bfloat16' instead\n", |
| 817 | + " # On TPUs and CPUs, use 'mixed_bfloat16' instead\n", |
814 | 818 | " mixed_precision.set_global_policy('mixed_float16')\n",
|
815 | 819 | " ```\n",
|
816 | 820 | "\n",
|
817 | 821 | "* If your model ends in softmax, make sure it is float32. And regardless of what your model ends in, make sure the output is float32.\n",
|
818 | 822 | "* If you use a custom training loop with `mixed_float16`, in addition to the above lines, you need to wrap your optimizer with a `tf.keras.mixed_precision.LossScaleOptimizer`. Then call `optimizer.get_scaled_loss` to scale the loss, and `optimizer.get_unscaled_gradients` to unscale the gradients.\n",
|
| 823 | + "* If you use a custom training loop with `mixed_bfloat16`, setting the global_policy mentioned above is sufficient.\n", |
819 | 824 | "* Double the training batch size if it does not reduce evaluation accuracy\n",
|
820 | 825 | "* On GPUs, ensure most tensor dimensions are a multiple of $8$ to maximize performance\n",
|
821 | 826 | "\n",
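For reference, the custom-training-loop steps in the list above look roughly like this. It is a condensed sketch: the model, loss, and optimizer are illustrative choices, and the guide's earlier sections show the full version.

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy('mixed_float16')

# Illustrative model and loss; any Keras model would do. Note the float32 softmax.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10),
    tf.keras.layers.Activation('softmax', dtype='float32'),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

# Wrap the optimizer so the loss can be scaled up and the gradients unscaled.
optimizer = mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam())

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        loss = loss_fn(y, predictions)
        scaled_loss = optimizer.get_scaled_loss(loss)        # scale to avoid underflow
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = optimizer.get_unscaled_gradients(scaled_grads)   # undo the scaling
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```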
|
|