
Commit 6e2ba25

Minor edits to improve readability
Signed-off-by: Thara Palanivel <[email protected]>
1 parent 0e62df5 commit 6e2ba25

File tree: 6 files changed, +82 -81 lines changed


README.md

Lines changed: 5 additions & 5 deletions
@@ -6,11 +6,11 @@ FMS Model Optimizer is a framework for developing reduced precision neural netwo
 
 ## Highlights
 
-- **Python API to enable model quantization:** With addition of a few lines of codes, module-level and/or function-level operations replacement will be performed.
-- **Robust:** Verified for INT 8/4/2-bit quantization on Vision/Speech/NLP/Object Detection/LLM
-- **Flexible:** This package can analyze the network using PyTorch Dynamo, apply best practices, such as clip_val initialization, layer-level precision setting, optimizer param group setting, etc. Users can also easily customize any of the settings through a JSON config file, and even bypass the Dynamo tracing if preferred.
-- **State-of-the-art INT and FP quantization techniques:** For weights and activations, such as SAWB+ and PACT+, comparable or better than other published works.
-- **Supports key compute-intensive operations:** Conv2d, Linear, LSTM, MM, BMM
+- **Python API to enable model quantization:** With the addition of a few lines of code, module-level and/or function-level operation replacement is performed.
+- **Robust:** Verified for INT 8/4-bit quantization on important vision, speech, NLP, object detection, and LLM models
+- **Flexible:** Options to analyze the network using PyTorch Dynamo and apply best practices during quantization, such as clip_val initialization, layer-level precision setting, and optimizer param group setting.
+- **State-of-the-art INT and FP quantization techniques** for weights and activations, such as SmoothQuant, SAWB+ and PACT+.
+- **Supports key compute-intensive operations** like Conv2d, Linear, LSTM, MM and BMM
 
 ## Supported Models

examples/DQ_SQ/README.md

Lines changed: 29 additions & 29 deletions
@@ -3,10 +3,10 @@ Direct quantization enables the quantization of large language models (LLMs) wit
 
 Here, we provide an example of direct quantization. In this case, we demonstrate DQ of the `llama3-8b` model into INT8 and FP8 for weights, activations, and/or KV-cache. This example is referred to as the **experimental FP8** in the other [FP8 example](../FP8_QUANT/README.md), which means the quantization configurations and corresponding behavior can be studied this way, but the saved model cannot be directly served by `vllm` at the moment.
 
-## Requirement
+## Requirements
 - [FMS Model Optimizer requirements](../../README.md#requirements)
 
-## Steps
+## Quickstart
 
 **1. Prepare Data** for the calibration process by converting it into its tokenized form. An example of tokenization using `LLAMA-3-8B`'s tokenizer is below.

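The tokenization example itself sits outside this hunk. As a rough sketch of what such data preparation can look like (the dataset choice, text column, and output path below are illustrative assumptions, not the example's actual values):

```python
# Hypothetical sketch only: tokenize a small calibration split with the
# LLAMA-3-8B tokenizer and save it to disk. Dataset, column, and path are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
calib = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:128]")

def tokenize(batch):
    # Truncate long documents so every calibration sample fits the context window
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = calib.map(tokenize, batched=True, remove_columns=calib.column_names)
tokenized.save_to_disk("data_train_token")  # output path is an assumption
```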
@@ -48,45 +48,45 @@ python -m fms_mo.run_quant \
 **3. Compare the Perplexity score** For user convenience, the code will print out perplexity (controlled by the `eval_ppl` flag) at the end of the run, so no additional steps are needed (if the logging level is set to `INFO` in the terminal). You can also check the output in the log file `./fms_mo.log`.
 
 ## Example Test Results
-The perplexity of the INT8 and FP8 quantized models on the wikitext dataset is shown below:
+The perplexity of the INT8 and FP8 quantized models on the `wikitext` dataset is shown below:
 
 | Model     |Type |QA            |QW            |DQ  |SQ  |Perplexity|
 |:---------:|:---:|:------------:|:------------:|:--:|:--:|:--------:|
 |`Llama3-8b`|INT8 |maxpertoken   |maxperCh      |yes |yes |6.21      |
 |           |FP8  |fp8_e4m3_scale|fp8_e4m3_scale|yes |yes |6.19      |
 
-## Example explained
+## Code Walkthrough
 
 **1. KV caching**
 
-In large language models (LLMs), key/value pairs are frequently cached during token generation, a process known as KV caching, to prevent redundant computations due to the autoregressive nature of token generation. However, the size of the KV cache increases with both batch size and context length, which can slow down model inference due to the need to access a large amount of data in memory. Quantizing the KV cache effectively reduces this memory bandwidth limitation, improving inference speed. To study the quantization behavior of KV cache, we can simply set the nbits_kvcache argument to 8 bit, then the KV cache will be quantized together with weights and activations. In addition, the `bmm1_qm1_mode`, `bmm1_qm2_mode`, and `bmm2_qm2_mode` [arguments](../../fms_mo/training_args.py) must be set to the same quantizer mode as `qa_mode`. **NOTE**: `bmm2_qm1_mode` should be kept as `minmax`.
+In large language models (LLMs), key/value pairs are frequently cached during token generation, a process known as KV caching, to prevent redundant computations due to the autoregressive nature of token generation. However, the size of the KV cache increases with both batch size and context length, which can slow down model inference due to the need to access a large amount of data in memory. Quantizing the KV cache effectively reduces this memory bandwidth limitation, improving inference speed. To study the quantization behavior of the KV cache, we can simply set the `nbits_kvcache` argument to 8-bit; the KV cache will then be quantized together with the weights and activations. In addition, the `bmm1_qm1_mode`, `bmm1_qm2_mode`, and `bmm2_qm2_mode` [arguments](../../fms_mo/training_args.py) must be set to the same quantizer mode as `qa_mode`. **NOTE**: `bmm2_qm1_mode` should be kept as `minmax`.
 
-The effect of setting the nbits_kvcache to 8 and its relevant code sections are:
+The effects of setting `nbits_kvcache` to 8 and the relevant code sections are:
 
 - Enables eager attention for the quantization of attention operations, including KV cache.
-```python
-#for attention or kv-cache quantization, need to use eager attention
-attn_bits = [fms_mo_args.nbits_bmm1, fms_mo_args.nbits_bmm2, fms_mo_args.nbits_kvcache]
-if any(attn_bits) != 32:
-    attn_implementation = "eager"
-else:
-    attn_implementation = None
-```
+```python
+# For attention or kv-cache quantization, need to use eager attention
+attn_bits = [fms_mo_args.nbits_bmm1, fms_mo_args.nbits_bmm2, fms_mo_args.nbits_kvcache]
+if any(x != 32 for x in attn_bits):
+    attn_implementation = "eager"
+else:
+    attn_implementation = None
+```
 - Enables Dynamo for quantized model preparation. We use PyTorch's Dynamo tracer to identify the bmm and KV cache inside the attention block.
-```python
-if any(x != 32 for x in attn_bits):
-    logger.info("Quantize attention bmms or kvcache, use dynamo for prep")
-    use_layer_name_pattern_matching = False
-    qcfg["qlayer_name_pattern"] = []
-    assert (
-        qcfg["qlayer_name_pattern"] == []
-    ), "ensure nothing in qlayer_name_pattern when use dynamo"
-    use_dynamo = True
-else:
-    logger.info("Do not quantize attention bmms")
-    use_layer_name_pattern_matching = True
-    use_dynamo = False
-```
+```python
+if any(x != 32 for x in attn_bits):
+    logger.info("Quantize attention bmms or kvcache, use dynamo for prep")
+    use_layer_name_pattern_matching = False
+    qcfg["qlayer_name_pattern"] = []
+    assert (
+        qcfg["qlayer_name_pattern"] == []
+    ), "ensure nothing in qlayer_name_pattern when use dynamo"
+    use_dynamo = True
+else:
+    logger.info("Do not quantize attention bmms")
+    use_layer_name_pattern_matching = True
+    use_dynamo = False
+```
 
 **2. Define quantization config** including quantizers and hyperparameters. Here we simply use the default [dq recipe](../../fms_mo/recipies/dq.json).

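To make this concrete, a hedged sketch of how such a config might be assembled is shown below. `qconfig_init` appears in this repo's PTQ example, but the "dq" recipe name and the exact key names are assumptions based on this page's text rather than verified API:

```python
# Hedged sketch, not verbatim example code: start from the default DQ recipe,
# then apply the KV-cache settings described in step 1 above.
from fms_mo import qconfig_init

qcfg = qconfig_init(recipe="dq", args=fms_mo_args)   # recipe name is an assumption
qcfg["nbits_kvcache"] = 8                  # quantize the KV cache along with weights/activations
qcfg["bmm1_qm1_mode"] = qcfg["qa_mode"]    # these three must match qa_mode (key names assumed)
qcfg["bmm1_qm2_mode"] = qcfg["qa_mode"]
qcfg["bmm2_qm2_mode"] = qcfg["qa_mode"]
qcfg["bmm2_qm1_mode"] = "minmax"           # keep as minmax (see the NOTE above)
```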
@@ -154,7 +154,7 @@ model.save_pretrained(output_dir, use_safetensors=True)
 tokenizer.save_pretrained(output_dir)
 ```
 
-**6. Check perplexity** (a simple way to evaluate the model quality.)
+**6. Check perplexity** (a simple method to evaluate the model quality)
 
 ``` python
 if fms_mo_args.eval_ppl:

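The hunk ends before the body of that snippet. As a standalone reference (this is not fms_mo's own evaluation code), perplexity is just the exponential of the mean per-token negative log-likelihood, which can be computed along these lines for a Hugging Face causal LM and a 2-D tensor of evaluation token ids:

```python
# Standalone sketch: perplexity = exp(mean negative log-likelihood per token).
import torch

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor, window: int = 2048) -> float:
    nlls, n_tokens = [], 0
    for start in range(0, input_ids.size(1), window):
        chunk = input_ids[:, start : start + window]
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)   # HF causal LMs shift labels internally
        n = chunk.size(1) - 1              # number of predicted tokens in this chunk
        nlls.append(out.loss * n)
        n_tokens += n
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()
```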
examples/FP8_QUANT/README.md

Lines changed: 7 additions & 7 deletions
@@ -7,7 +7,7 @@ There are two types of FP8 support in FMS Model Optimizer:
 
 This is an example of mature FP8, which under the hood leverages some functionalities in [llm-compressor](https://github.com/vllm-project/llm-compressor), a third-party library, to perform FP8 quantization. An example for the experimental FP8 can be found [here](../DQ_SQ/README.md).
 
-## Requirement
+## Requirements
 
 - [FMS Model Optimizer requirements](../../README.md#requirements)
 - Nvidia A100 family or higher
@@ -16,16 +16,16 @@ This is an example of mature FP8, which under the hood leverages some functional
 ```bash
 pip install llmcompressor
 ```
-- To evaluate the FP8 quantized model, [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) and [vllm](https://github.com/vllm-project/vllm) libraries are also required.
+- To evaluate the FP8 quantized model, the [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) and [vllm](https://github.com/vllm-project/vllm) libraries are also required.
 ```bash
-pip install vllm lm_eval==0.4.3
+pip install vllm lm_eval
 ```
 
 > [!CAUTION]
 > `vllm` may require a specific PyTorch version that is different from what is installed in your current environment, and it may force the installation without asking. Make sure it is compatible with your setup or create a new environment if needed.
 
-## Steps
-Three simple steps to perform FP8 quantization using FMS Model Optimizer:
+## Quickstart
+This end-to-end example utilizes the common set of interfaces provided by `fms_mo` for easily applying multiple quantization algorithms, with FP8 being the focus of this example. The steps involved are:
 
 1. **FP8 quantization through CLI**. Other arguments can be found here: [FP8Args](../../fms_mo/training_args.py#L84).
@@ -60,7 +60,7 @@ Three simple steps to perform FP8 quantization using FMS Model Optimizer:
 > [!NOTE]
 > FP16 model file size on storage is ~16.07 GB while FP8 is ~8.6 GB.
 
-3. **Evaluate the quantized model** performance on a selected NLP task (lambada_openai) using [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library. The evaluation metrics on this task are perplexity and accuracy. The model will be run on GPU.
+3. **Evaluate the quantized model**'s performance on a selected task using the `lm-eval` library. The command below will run evaluation on the [`lambada_openai`](https://huggingface.co/datasets/EleutherAI/lambada_openai) task and show the perplexity/accuracy at the end.
 
 ```bash
 lm_eval --model vllm \
@@ -88,7 +88,7 @@ Three simple steps to perform FP8 quantization using FMS Model Optimizer:
 | | |none | 5|perplexity||3.8915|± |0.3727|
 ```
 
-## Example Explained
+## Code Walkthrough
 
 1. The non-quantized pre-trained model is loaded using the model wrapper from `llm-compressor`. The corresponding tokenizer is constructed as well.

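As a hedged illustration of step 1 (the wrapper class name and the model id below are assumptions and may differ across `llm-compressor` versions):

```python
# Sketch only: load the FP16 checkpoint with the llm-compressor model wrapper
# and build the matching tokenizer. Model id and wrapper name are assumptions.
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"
model = SparseAutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```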
examples/GPTQ/README.md

Lines changed: 11 additions & 11 deletions
@@ -7,14 +7,14 @@ For generative LLMs, very often the bottleneck of inference is no longer the com
 
 - [FMS Model Optimizer requirements](../../README.md#requirements)
 - `auto-gptq` is needed for this example. Use `pip install auto-gptq` or [install from source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source)
-- Optionally for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness/tree/main)
-```
-pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git
-```
+- Optionally, for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness)
+```
+pip install lm-eval
+```
 
 
 ## Quickstart
-The end-to-end example utilizes the common set of interfaces provided by fms_mo for easily applying multiple quantization algorithms with GPTQ being the focus of this example. The steps involved are:
+This end-to-end example utilizes the common set of interfaces provided by `fms_mo` for easily applying multiple quantization algorithms, with GPTQ being the focus of this example. The steps involved are:
 
 1. **Convert the dataset into its tokenized form.** An example of tokenization using `LLAMA-3-8B`'s tokenizer is below.
@@ -68,7 +68,7 @@ The end-to-end example utilizes the common set of interfaces provided by fms_mo
 torch.int32 672 3521.904640
 ```
 
-4. Further to **evaluate the quantized model**'s performance on a selected task using `lm-eval` library, the command below will run evaluation on [`lambada_openai`](https://huggingface.co/datasets/EleutherAI/lambada_openai) task and show the perplexity/accuracy at the end.
+4. **Evaluate the quantized model**'s performance on a selected task using the `lm-eval` library. The command below will run evaluation on the [`lambada_openai`](https://huggingface.co/datasets/EleutherAI/lambada_openai) task and show the perplexity/accuracy at the end.
 
 ```bash
 lm_eval --model hf \
@@ -79,7 +79,7 @@ The end-to-end example utilizes the common set of interfaces provided by fms_mo
 --batch_size auto
 ```
 
-## Summary of results
+## Example Test Results
 
 - Unquantized Model
 ```bash
@@ -98,20 +98,20 @@ The end-to-end example utilizes the common set of interfaces provided by fms_mo
 ```
 
 
-- Quantized model with `desc_act` set to True (could improve the model quality, but at the cost of inference speed.)
+- Quantized model with `desc_act` set to `True` (could improve the model quality, but at the cost of inference speed)
 ```bash
 |Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
 |------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
 | LLAMA3-8B |lambada_openai| 1|none | 5|acc ||0.6193 |± |0.0068|
 | | | |none | 5|perplexity||5.8879 |± |0.1546|
 ```
 > [!NOTE]
-> There are some randomness in generating the model and data, the resulting accuracy may vary ~$\pm$ 0.05.
+> There is some randomness in generating the model and data, so the resulting accuracy may vary by ~$\pm$ 0.05.
 
 ## Code Walkthrough
 
-1. Command line arguments will be used to create a GPTQ quantization config. (Information about the required arguments and their default values can be found in [fms_mo/training_args.py](../../fms_mo/training_args.py) )
+1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py).
 
 ```python
 from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
@@ -122,7 +122,7 @@ The end-to-end example utilizes the common set of interfaces provided by fms_mo
 damp_percent=gptq_args.damp_percent)
 ```
 
-2. Load the pre_trained model with `auto_gptq` class/wrapper. (tokenizer is optional because we already tokenized the data in a previous step.)
+2. Load the pre-trained model with the `auto_gptq` class/wrapper. The tokenizer is optional because we already tokenized the data in a previous step.
 
 ```python
 model = AutoGPTQForCausalLM.from_pretrained(

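The remainder of that snippet falls outside this hunk. For orientation only, a typical `auto_gptq` load/quantize/save sequence (the model id, calibration data, and output directory below are assumptions, not this example's actual values) looks roughly like this:

```python
# Hedged sketch of a standard auto_gptq flow, not this example's exact code.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"          # model id is an assumption
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A tiny stand-in calibration set; the real example uses the data tokenized in step 1.
calibration_examples = [tokenizer("FMS Model Optimizer example text.", return_tensors="pt")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(calibration_examples)             # runs GPTQ calibration
model.save_quantized("llama3-8b-gptq", use_safetensors=True)   # output dir is an assumption
```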
examples/PTQ_INT8/README.md

Lines changed: 10 additions & 8 deletions
@@ -9,13 +9,13 @@ This is an example of [block sequential PTQ](https://arxiv.org/abs/2102.05426).
 ## Requirements
 
 - [FMS Model Optimizer requirements](../../README.md#requirements)
-- The inferencing step requires Nvidia GPUs with compute capability > 8.0 (A100 family or higher).
+- The inferencing step requires Nvidia GPUs with compute capability > 8.0 (A100 family or higher)
 - NVIDIA cutlass package (needs to be cloned from source, not pip installed). Preferably place it in the user's home directory: `cd ~ && git clone https://github.com/NVIDIA/cutlass.git`
 - [Ninja](https://ninja-build.org/)
 - `PyTorch 2.3.1` (as newer versions will cause issues for the custom CUDA kernel)
 
 
-## Steps
+## Quickstart
 
 > [!NOTE]
 > This example is based on the HuggingFace [Transformers Question answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering). Unlike our [QAT example](../QAT_INT8/README.md), which utilizes the training loop of the original code, our PTQ function will control the loop and the program will end before entering the original loop. Make sure the model doesn't get "tuned" twice!
@@ -87,6 +87,8 @@ python run_qa_no_trainer_ptq.py \
 --do_lowering
 ```
 
+Check out the [Example Test Results](#example-test-results) to compare against your results.
+
 ## Example Test Results
 
 The table below shows results obtained for the conditions listed:
@@ -104,13 +106,13 @@ The table below shows results obtained for the conditions listed:
 `Nouterloop` and `ptq_nbatch` are PTQ-specific hyper-parameters.
 The above experiments were run on a V100 machine.
 
-## Example Explained
+## Code Walkthrough
 
 In this section, we will take a deep dive into what happens during the example steps.
 
 There are three parts to the example:
 
-**1. Fine-tuned a model** with 16-bit floating point (FP16) precision:
+**1. Fine-tune a model with 16-bit floating point (FP16) precision**
 
 Fine-tunes a BERT model on the question answering dataset, SQuAD. This step is based on the HuggingFace [Transformers Question answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering). It was modified to collect additional training information in case we would like to tweak the hyper-parameters later.

@@ -124,7 +126,7 @@ In a nutshell, PTQ simply quantizes the weight and activation tensors in a block
 from fms_mo import qmodel_prep, qconfig_init
 
 # Create a config dict using a default recipe and CLI args
-# if same item exists in both, args take precedence over recipe.
+# If same item exists in both, args take precedence over recipe.
 qcfg = qconfig_init(recipe = 'ptq_int8', args=args)
 qcfg["tb_writer"] = accelerator.get_tracker("tensorboard", unwrap=True)
 qcfg["loader.batchsize"] = args.per_device_train_batch_size
@@ -146,13 +148,13 @@ logger.info(f"--- Accuracy of {args.model_name_or_path} before QAT/PTQ")
 > [!NOTE]
 > This step will compile an external kernel for INT matmul, which currently only works with `PyTorch 2.3.1`.
 
-Here is snippet of example code of the evaluation:
+Here is an example code snippet used for evaluation:
 
 ```python
 from fms_mo.modules.linear import QLinear, QLinearINT8Deploy
 # ...
 
-# only need 1 batch (not a list) this time, will be used by `torch.compile` as well.
+# Only need 1 batch (not a list) this time, will be used by `torch.compile` as well.
 exam_inp = next(iter(train_dataloader))
 
 qcfg = qconfig_init(recipe = 'qat_int8', args=args)
@@ -176,5 +178,5 @@ with torch.no_grad():
 
 # ...
 
-return # stop the run here, no further training loop
+return # Stop the run here, no further training loop
 ```
