
Commit e77c693

Fix typos
Signed-off-by: Thara Palanivel <[email protected]>
1 parent 6e2ba25 commit e77c693

File tree: 5 files changed (+12, -12 lines changed)


examples/DQ_SQ/README.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ seq_len = 2048
 get_tokenized_data("wiki", num_samples, seq_len, tokenizer, path_to_save='data')
 ```
 > [!NOTE]
-> - Users should provide a tokentized data file based on their need. This is just one example to demonstrate what data format `fms_mo` is expecting.
+> - Users should provide a tokenized data file based on their need. This is just one example to demonstrate what data format `fms_mo` is expecting.
 > - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
 > - If you have trouble downloading Llama family of models from Hugging Face ([LLama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead
 

examples/FP8_QUANT/README.md

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 ```
 
 > [!NOTE]
-> - The quantized model and tokenizer will be saved to `output_dir`, but some additional temperary storage space may be needed.
+> - The quantized model and tokenizer will be saved to `output_dir`, but some additional temporary storage space may be needed.
 > - Runtime ~ 1 min on A100. (model download time not included)
 > - If you have trouble downloading Llama family of models from Hugging Face ([LLama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-3.0-8b-instruct` instead
 

examples/GPTQ/README.md

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 get_tokenized_data("wiki", num_samples, seq_len, tokenizer, gptq_style=True, path_to_save='data')
 ```
 > [!NOTE]
-> - Users should provide a tokentized data file based on their need. This is just one example to demonstrate what data format `fms_mo` is expecting.
+> - Users should provide a tokenized data file based on their need. This is just one example to demonstrate what data format `fms_mo` is expecting.
 > - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
 > - If you have trouble downloading Llama family of models from Hugging Face ([LLama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead
 

examples/PTQ_INT8/README.md

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@ This is an example of [block sequential PTQ](https://arxiv.org/abs/2102.05426).
 
 - [FMS Model Optimizer requirements](../../README.md#requirements)
 - The inferencing step requires Nvidia GPUs with compute capability > 8.0 (A100 family or higher)
-- NVIDIA cutlass package (Need to clone the source, not pip install). Preferrably place in user's home directory: `cd ~ && git clone https://github.com/NVIDIA/cutlass.git`
+- NVIDIA cutlass package (Need to clone the source, not pip install). Preferably place in user's home directory: `cd ~ && git clone https://github.com/NVIDIA/cutlass.git`
 - [Ninja](https://ninja-build.org/)
 - `PyTorch 2.3.1` (as newer version will cause issue for the custom CUDA kernel)
 
@@ -43,7 +43,7 @@ python run_qa_no_trainer_ptq.py \
 ```
 
 > [!TIP]
-> The script can take up to 20 mins to run (on a single A100). By default, it is configured for detailed logging.You can disable the logging by removing the `with_tracking` and `report_to` flags in the script.
+> The script can take up to 20 mins to run (on a single A100). By default, it is configured for detailed logging. You can disable the logging by removing the `with_tracking` and `report_to` flags in the script.
 
 #### **2. Apply PTQ** on the fine-tuned model, which converts the precision data to 8-bit integer (INT8):
 

examples/QAT_INT8/README.md

Lines changed: 7 additions & 7 deletions
@@ -1,6 +1,6 @@
 # Model Optimization Using Quantization-Aware Training (QAT)
 
-FMS Model Optimizer supports [quantization](https://www.ibm.com/think/topics/quantization) of models which will enable the utilization of reduced-precision numerical format and specialiazed hardware to accelerate inference performance (i.e., make "calling a model" faster).
+FMS Model Optimizer supports [quantization](https://www.ibm.com/think/topics/quantization) of models which will enable the utilization of reduced-precision numerical format and specialized hardware to accelerate inference performance (i.e., make "calling a model" faster).
 
 Generally speaking, matrix multiplication (matmul) is the main operation in a neural network. The goal of quantization is to convert a floating-point (FP) matmul into an integer (INT) matmul, which runs much faster and requires lower energy consumption. A simplified example would be:
 
@@ -9,7 +9,7 @@ $$X@W \approx \lfloor \frac{X}{s_x} \rceil @ \lfloor \frac{W}{s_w} \rceil*s_xs_w
 - where $X$, $W$ are FP tensors whose elements are all within a certain range, e.g. $[-5.0, 5.0]$, $@$ is matmul operation, $\lfloor \rceil$ is rounding operation, scaling factor $s_x, s_w$ in this case is simply $5/127$.
 - On the right hand side, after scaling and rounding the tensors will only contain integers in the range of $[-127, 127]$, which can be stored as a 8-bit integer.
 - We may now use an INT8 matmul instead of a FP32 matmul to perform the task then multiply the scaling factors afterward.
-- **Important** The benefit from INT matmul should outweight the overhead from scaling, rounding, and descaling. But rounding will inevitably introduce approximation errors. Luckily, we can mitigate the errors by taking these quantization related operations into account during the training process, hence the Quantization-aware training ([QAT](https://arxiv.org/pdf/1712.05877))!
+- **Important** The benefit from INT matmul should outweigh the overhead from scaling, rounding, and descaling. But rounding will inevitably introduce approximation errors. Luckily, we can mitigate the errors by taking these quantization related operations into account during the training process, hence the Quantization-aware training ([QAT](https://arxiv.org/pdf/1712.05877))!
 
 In the following example, we will first create a fine-tuned FP16 model, and then quantize this model from FP16 to INT8 using QAT. Once the model is tuned and QAT'ed, you can observe the accuracy and the acceleration at inference time of the model.
 
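(The context lines above walk through the scale-round-matmul-descale idea behind QAT. As an aside, not part of this commit and not the `fms_mo` API, a minimal PyTorch sketch of that arithmetic, assuming simple per-tensor max-abs scaling, could look like this:)

```python
# Minimal illustration of X @ W ≈ round(X/s_x) @ round(W/s_w) * s_x * s_w.
# Standalone sketch, not fms_mo code; a real INT8 kernel does the middle
# matmul in integer arithmetic, which is emulated in float here.
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)            # FP activations
W = torch.randn(8, 4)            # FP weights

s_x = X.abs().max() / 127        # per-tensor scaling factors (max-abs / 127)
s_w = W.abs().max() / 127

X_q = torch.round(X / s_x).clamp(-127, 127)   # integer-valued tensors in [-127, 127]
W_q = torch.round(W / s_w).clamp(-127, 127)

Y_approx = (X_q @ W_q) * s_x * s_w            # descale after the "INT" matmul
print((X @ W - Y_approx).abs().max())         # small rounding error, as the README notes
```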

@@ -18,7 +18,7 @@ In the following example, we will first create a fine-tuned FP16 model, and then
 
 - [FMS Model Optimizer requirements](../../README.md#requirements)
 - The inferencing step requires Nvidia GPUs with compute capability > 8.0 (A100 family or higher)
-- NVIDIA cutlass package (Need to clone the source, not pip install). Preferrably place in user's home directory: `cd ~ && git clone https://github.com/NVIDIA/cutlass.git`
+- NVIDIA cutlass package (Need to clone the source, not pip install). Preferably place in user's home directory: `cd ~ && git clone https://github.com/NVIDIA/cutlass.git`
 - [Ninja](https://ninja-build.org/)
 - `PyTorch 2.3.1` (as newer version will cause issue for the custom CUDA kernel)
 
@@ -50,7 +50,7 @@ python run_qa_no_trainer_qat.py \
 ```
 
 > [!TIP]
-> The script can take up to 40 mins to run (on a single A100). By default, it is configured for detailed logging.You can disable the logging by removing the `with_tracking` and `report_to` flags in the script. This can reduce the runtime by around 20 mins.
+> The script can take up to 40 mins to run (on a single A100). By default, it is configured for detailed logging. You can disable the logging by removing the `with_tracking` and `report_to` flags in the script. This can reduce the runtime by around 20 mins.
 
 #### **2. Apply QAT** on the fine-tuned model, which converts the precision data to 8-bit integer (INT8):
 

@@ -96,7 +96,7 @@ Checkout [Example Test Results](#example-test-results) to compare against your r
 
 ## Example Test Results
 
-For comparsion purposes, here are some of the results we found during testing when tested with PyTorch 2.3.1:
+For comparison purposes, here are some of the results we found during testing when tested with PyTorch 2.3.1:
 
 - Accuracy could vary ~ +-0.2 from run to run.
 - `INT8` matmuls are ~2x faster than `FP16` matmuls, However, `INT8` models will have additional overhead compared to `FP16` models. For example, converting FP tensors to INT before INT matmul.

@@ -190,7 +190,7 @@ return # Stop the run here, no further training loop
 
 In this example:
 
-- By default, QAT will run `calibration` to initialize the quantization related parameters (with a small number of training data). At the end of QAT, these paramaters are saved with the ckpt, as we DO NOT want to run calibration at deployment stage. Hence, `qcfg['qmodel_calibration'] = 0`.
-- Quantization related parameters will not be automatically loaded by the HiggingFace method, as those are not part of the original BERT model. Hence calling `qmodel_prep(..., ckpt_reload=[path to qat ckpt])`.
+- By default, QAT will run `calibration` to initialize the quantization related parameters (with a small number of training data). At the end of QAT, these parameters are saved with the checkpoint, as we DO NOT want to run calibration at deployment stage. Hence, `qcfg['qmodel_calibration'] = 0`.
+- Quantization related parameters will not be automatically loaded by the HuggingFace method, as those are not part of the original BERT model. Hence calling `qmodel_prep(..., ckpt_reload=[path to qat ckpt])`.
 - By replacing `QLinear` layers with `QLinearINT8Deploy`, it will call the external kernel instead of `torch.matmul`.
 - `torch.compile` with `reduce-overhead` option will use CUDAGRAPH and achieve the most ideal speed-up. However, some models may not be fully compatible with this option.
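(The bullets above describe the deploy-time steps: skip calibration, reload the QAT checkpoint via `qmodel_prep(..., ckpt_reload=...)`, swap `QLinear` for `QLinearINT8Deploy`, then `torch.compile`. As a hedged aside, not part of this commit and not the `fms_mo` helper itself, the layer-swap step follows the usual PyTorch module-substitution pattern, sketched here with plain `nn.Linear` standing in for the fms_mo classes:)

```python
# Generic module-substitution pattern; fms_mo's actual QLinear -> QLinearINT8Deploy
# swap is done by the example script, this is only an illustration of the idea.
import torch.nn as nn

def swap_modules(model: nn.Module, src_cls, build_replacement):
    """Recursively replace every submodule of type `src_cls` with build_replacement(old)."""
    for name, child in model.named_children():
        if isinstance(child, src_cls):
            setattr(model, name, build_replacement(child))
        else:
            swap_modules(child, src_cls, build_replacement)
    return model

# Toy usage with nn.Linear as a stand-in for the quantized layer classes:
toy = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
toy = swap_modules(
    toy, nn.Linear,
    lambda m: nn.Linear(m.in_features, m.out_features, bias=m.bias is not None),
)
print(toy)
```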
