Commit df45a16

Merge pull request #2419 from AI-Hypercomputer:mohit/quant_doc
PiperOrigin-RevId: 825272158
2 parents d352bc9 + 7189d42

1 file changed: docs/explanations/quantization.md (76 additions, 30 deletions)

@@ -1,5 +1,5 @@
<!--
Copyright 2024-2025 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

@@ -16,55 +16,78 @@

# Quantization

Quantization in deep learning is the process of reducing the precision of the numbers used to represent a model's weights and/or activations. Instead of using higher-precision floating-point formats like 32-bit floats (`float32`) or 16-bit brain floats (`bfloat16`), quantization maps these values to lower-precision numerical formats, most commonly 8-bit integers (`int8`) or floats (`fp8`).
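
To make this concrete, here is a minimal, illustrative sketch (not MaxText's implementation) of symmetric `int8` quantization with an absolute-maximum ("absmax") scale, the same idea behind the `absmax` calibration method mentioned later in this document:

```python
import jax.numpy as jnp

def absmax_quantize_int8(x):
    # One scale for the whole tensor: the largest magnitude maps to 127.
    scale = jnp.max(jnp.abs(x)) / 127.0
    q = jnp.clip(jnp.round(x / scale), -127, 127).astype(jnp.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original high-precision values.
    return q.astype(jnp.float32) * scale

x = jnp.array([0.02, -1.5, 0.7, 3.2], dtype=jnp.bfloat16)
q, scale = absmax_quantize_int8(x)
x_hat = dequantize(q, scale)  # close to x, but rounded to multiples of the scale
```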


MaxText supports quantization via both the [AQT](https://github.com/google/aqt) and [Qwix](https://github.com/google/qwix) libraries. Qwix is the recommended approach, providing a non-intrusive way to apply Quantized Training (QT).

## Why use quantization?

The drive to use lower-precision formats like `int8` or `fp8` stems from significant performance advantages:

**Faster computation**: Hardware accelerators like TPUs and GPUs often have specialized units for low-precision arithmetic. Operations on lower-precision data like `int8` or `fp8` can be significantly faster than on `bfloat16` or `float32`. For example, matrix multiplications in these formats can often be 2x or more faster on hardware with native low-precision tensor cores.

**Reduced memory footprint**: Storing weights and activations in `int8` or `fp8` requires half as much memory as `bfloat16`. This reduces:
- **HBM usage**: Less memory is needed on the accelerator itself.
- **Communication costs**: Less data needs to be transferred between memory and compute units, or across devices in distributed training, which makes these transfers faster and uses less bandwidth.
- **Power consumption**: Lower-precision operations and reduced memory access use less energy, which is crucial for deploying models on edge devices and for sustainable AI.

The primary trade-off with quantization is between model accuracy and computational performance:

* **Reduced dynamic range and precision**: Lower-precision formats like `int8` or `fp8` can represent a much smaller range of values, and with less precision, than `bfloat16`. This can be problematic for models with wide distributions of weights or activations, potentially clipping large values or losing fine-grained detail (see the sketch after this list).
* **Impact on gradients**: Gradients during backpropagation can have very different, often wider, distributions than weights or activations, making them more sensitive to quantization errors.
* **Convergence issues**: The approximations introduced by quantization can sometimes hinder the model's ability to converge during training.
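
As a toy illustration of the dynamic-range problem (not taken from MaxText): a single outlier inflates the absmax scale, so the remaining small values round away to zero.

```python
import jax.numpy as jnp

x = jnp.array([0.01, 0.02, -0.03, 8.0])    # one large outlier
scale = jnp.max(jnp.abs(x)) / 127.0         # ~0.063, dominated by the outlier
x_hat = jnp.clip(jnp.round(x / scale), -127, 127) * scale
# x_hat ≈ [0.0, 0.0, 0.0, 8.0]: the small values are lost,
# which is why wide weight/activation distributions are hard to quantize.
```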

To overcome these challenges, libraries like Google's Accurate Quantized Training (AQT) and its successor Qwix (used in MaxText) employ a suite of advanced techniques. These methods ensure that models can be trained with low-precision arithmetic without significant loss in accuracy and with stable convergence.

## How Quantized Training (QT) works with Qwix

Quantized Training (QT) incorporates the effects of quantization into the training loop. This allows the model to learn and adapt to the reduced precision of quantized weights and activations.

Here's how it works:

1. **Forward Pass**: During the forward pass, high-precision weights and activations are converted to a lower-precision format. This step simulates the information loss that occurs during quantization. The model then performs its computations using these lower-precision representations before they are converted back to a higher precision for the rest of the network. This process forces the model to become robust to the noise and reduced range of quantized values.

2. **Backward Pass**: Standard backpropagation cannot flow through the non-differentiable quantization operations (like rounding). To solve this, QT uses the **Straight-Through Estimator (STE)**, sketched below. The STE essentially "ignores" the non-differentiable quantization step during the backward pass, passing the gradients through as if the operation were an identity function. This allows the high-precision weights to be updated based on the loss, enabling the model to learn effectively.

By integrating the quantization simulation directly into training, the model learns to minimize the impact of precision loss, resulting in a more accurate quantized model.
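
The following is a minimal JAX sketch of these two ideas, not the Qwix implementation: the forward pass uses fake-quantized values, while `jax.lax.stop_gradient` makes the backward pass treat quantization as the identity function (the straight-through estimator).

```python
import jax
import jax.numpy as jnp

def fake_quant_int8(x):
    # Forward-pass simulation: quantize to int8 steps, then dequantize.
    scale = jnp.max(jnp.abs(x)) / 127.0
    return jnp.clip(jnp.round(x / scale), -127, 127) * scale

def fake_quant_ste(x):
    # Straight-through estimator: the value is the fake-quantized one,
    # but the gradient of (fake_quant(x) - x) is blocked, so d(out)/dx = 1.
    return x + jax.lax.stop_gradient(fake_quant_int8(x) - x)

w = jnp.array([0.5, -1.2, 2.0])
loss_grad = jax.grad(lambda w: jnp.sum(fake_quant_ste(w) ** 2))(w)
# Gradients flow back to the high-precision weights as if quantization were the identity.
```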

## Using Quantization in MaxText

You can enable quantization in MaxText by setting flags in your configuration file (e.g., `base.yml`) or via the command line. MaxText supports two quantization libraries: Qwix (recommended) and AQT.

### Configuration Flags

The primary flags to control quantization are:

* `use_qwix_quantization`: A boolean flag.
  * Set to `True` to enable quantization using the Qwix library.
  * Set to `False` (or omit) to use the AQT library if `quantization` is set.
* `quantization`: A string that specifies the type of quantization to apply. The accepted values depend on whether you are using Qwix or AQT.
* `quantization_calibration_method`: The calibration method for weights and activations (e.g., `"absmax"`). This is mainly for Qwix.

### Qwix Quantization (Recommended)

To use Qwix, you must set `use_qwix_quantization=True`. Qwix is a powerful and non-intrusive library for Quantized Training.

#### `quantization` values for Qwix

Common options for the `quantization` flag when using Qwix include:

* `"int8"`: 8-bit integer quantization.
* `"fp8"`: 8-bit floating-point quantization.
* `"fp8_full"`: FP8 quantization with static scaling.
* `"fp8_gpu"`: FP8 for NVIDIA GPUs.
* `"fp8_nanoo"`: FP8 for AMD MI300/MI325 GPUs.

#### Example command for Qwix

Here is an example of how to run a training job with int8 quantization enabled via Qwix:

```bash
python3 -m MaxText.train src/MaxText/configs/base.yml run_name=$YOUR_JOB_NAME base_output_directory=gs://<my-bucket> dataset_type=synthetic use_qwix_quantization=true quantization='int8'
```

#### The Qwix Interception API

MaxText integrates Qwix using its powerful and non-intrusive Interception API. This approach allows you to enable Quantized Training for your models without modifying the original model source code. You don't need to manually replace `nn.Dense` with `QuantizedDense` or other quantized layer types.

@@ -96,3 +119,26 @@

This rule is then used within a `QtProvider` to quantize the model automatically:

```python
model = qwix.quantize_model(model, qwix.QtProvider(rule))
```
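
For context, `rule` is defined earlier in the document, outside the lines changed by this commit. Purely as an illustrative sketch, and with the caveat that the `QtRule` argument names below are assumptions based on the Qwix documentation rather than MaxText's actual configuration, such a rule might look like:

```python
import jax.numpy as jnp
import qwix

# Hypothetical rule: quantize weights and activations of every matching module to int8.
# Argument names follow the Qwix docs and may differ from what MaxText actually uses.
rule = qwix.QtRule(
    module_path=".*",        # regex matched against module paths in the model
    weight_qtype=jnp.int8,   # quantized dtype for weights
    act_qtype=jnp.int8,      # quantized dtype for activations
)
```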

### AQT Quantization

If `use_qwix_quantization` is `False` or not set, you can still apply quantization using the AQT library by setting the `quantization` flag. You can read more about AQT on this [Google Cloud blog](https://cloud.google.com/blog/products/compute/accurate-quantized-training-aqt-for-tpu-v5e).

#### `quantization` values for AQT

When using AQT, you can pass one of the following values to the `quantization` flag:

- `'int8'` for dynamic-range quantization using 8 bits
- `'int8w'` for weight-only quantization using 8 bits
- `'int4w'` for weight-only quantization using 4 bits
- `'intmp'` for mixed-precision, weight-only quantization based on a config file
- `'fp8'` for 8-bit floating-point GeMMs on NVIDIA GPUs

#### Example command for AQT

```bash
python3 -m MaxText.train src/MaxText/configs/base.yml run_name=$YOUR_JOB_NAME base_output_directory=gs://<my-bucket> dataset_type=synthetic use_qwix_quantization=false quantization='int8'
```

Note that `use_qwix_quantization` is set to `false` here, so the AQT path is used.

For further reading, please refer to the [Qwix Read the Docs website](https://qwix.readthedocs.io/en/latest/get_started.html).
