- > - Users should provide a tokentized data file based on their need. This is just one example to demonstrate what data format `fms_mo` is expecting.
+ > - Users should provide a tokenized data file based on their need. This is just one example to demonstrate what data format `fms_mo` is expecting (a sketch of producing such a file follows after this note).
> - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
> - If you have trouble downloading the Llama family of models from Hugging Face ([Llama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead
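For illustration only, such a tokenized train/test pair could be produced roughly as in the sketch below. The dataset, text column, and sequence length are assumptions made for this sketch; the exact schema `fms_mo` expects is not shown in this diff.

```python
# Hypothetical sketch only: tokenize a small text dataset and save it to
# <path_to_save>_train and <path_to_save>_test. The dataset, column name,
# and max_length are illustrative assumptions, not fms_mo requirements.
from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer taken from the alternative model suggested in the note above.
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-8b-code")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

raw = load_dataset("wikitext", "wikitext-2-raw-v1")
train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
test = raw["test"].map(tokenize, batched=True, remove_columns=["text"])

path_to_save = "data/tokenized"  # placeholder path
train.save_to_disk(f"{path_to_save}_train")
test.save_to_disk(f"{path_to_save}_test")
```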
examples/FP8_QUANT/README.md (1 addition, 1 deletion)
@@ -38,7 +38,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_mo`
```

> [!NOTE]
- > - The quantized model and tokenizer will be saved to `output_dir`, but some additional temperary storage space may be needed.
+ > - The quantized model and tokenizer will be saved to `output_dir`, but some additional temporary storage space may be needed.
> - Runtime ~ 1 min on A100. (model download time not included)
> - If you have trouble downloading the Llama family of models from Hugging Face ([Llama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-3.0-8b-instruct` instead
- > - Users should provide a tokentized data file based on their need. This is just one example to demonstrate what data format `fms_mo` is expecting.
+ > - Users should provide a tokenized data file based on their need. This is just one example to demonstrate what data format `fms_mo` is expecting.
> - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
> - If you have trouble downloading the Llama family of models from Hugging Face ([Llama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead
examples/PTQ_INT8/README.md (2 additions, 2 deletions)
@@ -10,7 +10,7 @@ This is an example of [block sequential PTQ](https://arxiv.org/abs/2102.05426).

- [FMS Model Optimizer requirements](../../README.md#requirements)
- The inferencing step requires Nvidia GPUs with compute capability > 8.0 (A100 family or higher); a quick check is sketched after this list
- - NVIDIA cutlass package (Need to clone the source, not pip install). Preferrably place in user's home directory: `cd ~ && git clone https://github.com/NVIDIA/cutlass.git`
+ - NVIDIA cutlass package (Need to clone the source, not pip install). Preferably place in user's home directory: `cd ~ && git clone https://github.com/NVIDIA/cutlass.git`
- [Ninja](https://ninja-build.org/)
- `PyTorch 2.3.1` (as newer versions will cause issues for the custom CUDA kernel)
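A quick way to confirm the compute-capability requirement above from Python (a minimal sketch; it only checks the first visible GPU):

```python
# Minimal check that the first visible GPU meets the compute capability
# requirement stated above (A100 family or newer, i.e. >= 8.0).
import torch

assert torch.cuda.is_available(), "No CUDA device visible to PyTorch"
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 compute capability: {major}.{minor}")
assert (major, minor) >= (8, 0), "Inferencing step needs an A100-class GPU or newer"
```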
- > The script can take up to 20 mins to run (on a single A100). By default, it is configured for detailed logging.You can disable the logging by removing the `with_tracking` and `report_to` flags in the script.
+ > The script can take up to 20 mins to run (on a single A100). By default, it is configured for detailed logging. You can disable the logging by removing the `with_tracking` and `report_to` flags in the script.

#### **2. Apply PTQ** on the fine-tuned model, which converts the precision data to 8-bit integer (INT8):
examples/QAT_INT8/README.md (7 additions, 7 deletions)
@@ -1,6 +1,6 @@
# Model Optimization Using Quantization-Aware Training (QAT)

- FMS Model Optimizer supports [quantization](https://www.ibm.com/think/topics/quantization) of models which will enable the utilization of reduced-precision numerical format and specialiazed hardware to accelerate inference performance (i.e., make "calling a model" faster).
+ FMS Model Optimizer supports [quantization](https://www.ibm.com/think/topics/quantization) of models which will enable the utilization of reduced-precision numerical format and specialized hardware to accelerate inference performance (i.e., make "calling a model" faster).

Generally speaking, matrix multiplication (matmul) is the main operation in a neural network. The goal of quantization is to convert a floating-point (FP) matmul into an integer (INT) matmul, which runs much faster and consumes less energy. A simplified example would be:

$$X @ W \approx s_x s_w \left( \lfloor X / s_x \rceil @ \lfloor W / s_w \rceil \right)$$

- where $X$, $W$ are FP tensors whose elements are all within a certain range, e.g. $[-5.0, 5.0]$, $@$ is the matmul operation, $\lfloor \rceil$ is the rounding operation, and the scaling factors $s_x, s_w$ in this case are simply $5/127$.
- On the right-hand side, after scaling and rounding, the tensors will only contain integers in the range $[-127, 127]$, which can be stored as 8-bit integers.
- We may now use an INT8 matmul instead of an FP32 matmul to perform the task, then multiply the scaling factors afterward (a minimal sketch of this arithmetic follows after this list).
- - **Important** The benefit from INT matmul should outweight the overhead from scaling, rounding, and descaling. But rounding will inevitably introduce approximation errors. Luckily, we can mitigate the errors by taking these quantization related operations into account during the training process, hence the Quantization-aware training ([QAT](https://arxiv.org/pdf/1712.05877))!
+ - **Important** The benefit from INT matmul should outweigh the overhead from scaling, rounding, and descaling. But rounding will inevitably introduce approximation errors. Luckily, we can mitigate the errors by taking these quantization related operations into account during the training process, hence the Quantization-aware training ([QAT](https://arxiv.org/pdf/1712.05877))!
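To make the arithmetic above concrete, here is a minimal PyTorch sketch of this symmetric INT8 scheme. It is illustrative only, not how `fms_mo` implements quantization, and the tensor shapes are arbitrary:

```python
# Minimal sketch of the symmetric INT8 matmul described above (illustrative only,
# not the fms_mo implementation). X and W are assumed to lie in [-5.0, 5.0].
import torch

X = torch.empty(64, 128).uniform_(-5.0, 5.0)
W = torch.empty(128, 256).uniform_(-5.0, 5.0)

s_x = 5.0 / 127  # scaling factors for the fixed [-5, 5] range
s_w = 5.0 / 127

# Scale and round: the results lie in [-127, 127] and fit in int8 storage.
X_int8 = torch.round(X / s_x).clamp(-127, 127).to(torch.int8)
W_int8 = torch.round(W / s_w).clamp(-127, 127).to(torch.int8)

# Emulate the INT matmul in float for portability; a real deployment would run an
# actual INT8 kernel (e.g. cutlass) and multiply the scaling factors afterward.
Y_approx = s_x * s_w * (X_int8.float() @ W_int8.float())

# The difference from the FP32 matmul is the rounding (approximation) error
# that QAT is designed to mitigate.
print((Y_approx - X @ W).abs().max())
```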
In the following example, we will first create a fine-tuned FP16 model, and then quantize this model from FP16 to INT8 using QAT. Once the model is tuned and QAT'ed, you can observe the accuracy and the inference-time acceleration of the model.

@@ -18,7 +18,7 @@ In the following example, we will first create a fine-tuned FP16 model, and then

- [FMS Model Optimizer requirements](../../README.md#requirements)
- The inferencing step requires Nvidia GPUs with compute capability > 8.0 (A100 family or higher)
- - NVIDIA cutlass package (Need to clone the source, not pip install). Preferrably place in user's home directory: `cd ~ && git clone https://github.com/NVIDIA/cutlass.git`
+ - NVIDIA cutlass package (Need to clone the source, not pip install). Preferably place in user's home directory: `cd ~ && git clone https://github.com/NVIDIA/cutlass.git`
- [Ninja](https://ninja-build.org/)
- `PyTorch 2.3.1` (as newer versions will cause issues for the custom CUDA kernel)
- > The script can take up to 40 mins to run (on a single A100). By default, it is configured for detailed logging.You can disable the logging by removing the `with_tracking` and `report_to` flags in the script. This can reduce the runtime by around 20 mins.
+ > The script can take up to 40 mins to run (on a single A100). By default, it is configured for detailed logging. You can disable the logging by removing the `with_tracking` and `report_to` flags in the script. This can reduce the runtime by around 20 mins.

#### **2. Apply QAT** on the fine-tuned model, which converts the precision data to 8-bit integer (INT8):
@@ -96,7 +96,7 @@ Checkout [Example Test Results](#example-test-results) to compare against your results

## Example Test Results

- For comparsion purposes, here are some of the results we found during testing when tested with PyTorch 2.3.1:
+ For comparison purposes, here are some of the results we found during testing with PyTorch 2.3.1:

- Accuracy could vary by about ±0.2 from run to run.
- `INT8` matmuls are ~2x faster than `FP16` matmuls. However, `INT8` models will have additional overhead compared to `FP16` models, for example converting FP tensors to INT before the INT matmul.
@@ -190,7 +190,7 @@ return # Stop the run here, no further training loop

In this example:

- - By default, QAT will run `calibration` to initialize the quantization related parameters (with a small number of training data). At the end of QAT, these paramaters are saved with the ckpt, as we DO NOT want to run calibration at deployment stage. Hence, `qcfg['qmodel_calibration'] = 0`.
- - Quantization related parameters will not be automatically loaded by the HiggingFace method, as those are not part of the original BERT model. Hence calling `qmodel_prep(..., ckpt_reload=[path to qat ckpt])`.
+ - By default, QAT will run `calibration` to initialize the quantization related parameters (with a small number of training data). At the end of QAT, these parameters are saved with the checkpoint, as we DO NOT want to run calibration at deployment stage. Hence, `qcfg['qmodel_calibration'] = 0`.
+ - Quantization related parameters will not be automatically loaded by the HuggingFace method, as those are not part of the original BERT model. Hence calling `qmodel_prep(..., ckpt_reload=[path to qat ckpt])`.
- Replacing `QLinear` layers with `QLinearINT8Deploy` makes the model call the external kernel instead of `torch.matmul`.
- `torch.compile` with the `reduce-overhead` option will use CUDA graphs and achieve the best speed-up. However, some models may not be fully compatible with this option (a combined sketch of these deployment steps follows after this list).
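Taken together, these bullets describe a deployment flow roughly like the sketch below. It is an illustrative outline rather than verbatim `fms_mo` usage: the import line, the `qconfig_init` helper, the BERT checkpoint, and the paths are assumptions, and the `QLinear`-to-`QLinearINT8Deploy` swap is left as a comment since the supporting code lives in the example script.

```python
# Illustrative outline of the deployment steps described above. Import paths,
# the qconfig_init helper, model choice, and file paths are assumptions made
# for this sketch; the QAT_INT8 example script is the authoritative reference.
import torch
from transformers import AutoModelForSequenceClassification

from fms_mo import qconfig_init, qmodel_prep  # assumed import location

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

qcfg = qconfig_init()            # assumed helper returning a default quant config
qcfg["qmodel_calibration"] = 0   # calibration already ran during QAT; skip it at deployment

# Reload the quantization parameters saved at the end of QAT instead of
# re-calibrating; the dummy token IDs below are only used to trace the model.
example_inputs = torch.randint(0, 30522, (1, 128))
qmodel_prep(model, example_inputs, qcfg, ckpt_reload="path/to/qat_checkpoint")

# Swap QLinear layers for QLinearINT8Deploy so matmuls call the external INT8
# kernel instead of torch.matmul (handled by the example script; the exact
# helper is not shown in this diff).

# Compile with reduce-overhead to use CUDA graphs where the model is compatible.
model = torch.compile(model, mode="reduce-overhead")
```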