
Commit a6b2a74

Merge pull request #19 from tharapalanivel/readme_edits
Improve README readability
2 parents 1ffc73a + 4077505 commit a6b2a74

6 files changed: +101 −96 lines changed

README.md

Lines changed: 5 additions & 5 deletions
@@ -6,11 +6,11 @@ FMS Model Optimizer is a framework for developing reduced precision neural netwo
  ## Highlights

- - **Python API to enable model quantization:** With addition of a few lines of codes, module-level and/or function-level operations replacement will be performed.
- - **Robust:** Verified for INT 8/4/2-bit quantization on Vision/Speech/NLP/Object Detection/LLM
- - **Flexible:** This package can analyze the network using PyTorch Dynamo, apply best practices, such as clip_val initialization, layer-level precision setting, optimizer param group setting, etc. Users can also easily customize any of the settings through a JSON config file, and even bypass the Dynamo tracing if preferred.
- - **State-of-the-art INT and FP quantization techniques:** For weights and activations, such as SAWB+ and PACT+, comparable or better than other published works.
- - **Supports key compute-intensive operations:** Conv2d, Linear, LSTM, MM, BMM
+ - **Python API to enable model quantization:** With the addition of a few lines of code, module-level and/or function-level operations replacement will be performed.
+ - **Robust:** Verified for INT 8/4-bit quantization on important vision/speech/NLP/object-detection models and LLMs.
+ - **Flexible:** Options to analyze the network using PyTorch Dynamo and to apply best practices during quantization, such as clip_val initialization, layer-level precision setting, and optimizer param group setting.
+ - **State-of-the-art INT and FP quantization techniques** for weights and activations, such as SmoothQuant, SAWB+ and PACT+.
+ - **Supports key compute-intensive operations** like Conv2d, Linear, LSTM, MM and BMM.
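
The "Python API" highlight above boils down to preparing an existing PyTorch model for quantization with a couple of calls. Below is a minimal, hypothetical sketch of that flow; the entry-point names (`qconfig_init`, `qmodel_prep`), their signatures, and the config keys are assumptions and should be checked against the package's own README and API docs.

```python
# Hypothetical sketch of the "few lines of code" workflow described above.
# The entry points (qconfig_init, qmodel_prep) and config keys are assumptions;
# verify them against fms_mo's documentation before use.
import torch
from fms_mo import qconfig_init, qmodel_prep  # assumed import path

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)
calib_batch = torch.randn(8, 128)  # small batch used for tracing/calibration

qcfg = qconfig_init()   # assumed: start from a default quantization config
qcfg["nbits_w"] = 8     # illustrative: 8-bit weights
qcfg["nbits_a"] = 8     # illustrative: 8-bit activations

# Replaces supported modules/functions (Linear, Conv2d, ...) with quantized versions.
qmodel_prep(model, calib_batch, qcfg)
```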

  ## Supported Models

examples/DQ_SQ/README.md

Lines changed: 30 additions & 30 deletions
@@ -3,10 +3,10 @@ Direct quantization enables the quantization of large language models (LLMs) wit
  Here, we provide an example of direct quantization. In this case, we demonstrate DQ of the `llama3-8b` model into INT8 and FP8 for weights, activations, and/or KV-cache. This example is referred to as the **experimental FP8** in the other [FP8 example](../FP8_QUANT/README.md), which means the quantization configurations and corresponding behavior can be studied this way, but the saved model cannot be directly served by `vllm` at the moment.

- ## Requirement
+ ## Requirements
  - [FMS Model Optimizer requirements](../../README.md#requirements)

- ## Steps
+ ## Quickstart

  **1. Prepare Data** for the calibration process by converting it into tokenized form. An example of tokenization using `LLAMA-3-8B`'s tokenizer is below.
@@ -20,7 +20,7 @@ seq_len = 2048
  get_tokenized_data("wiki", num_samples, seq_len, tokenizer, path_to_save='data')
  ```
  > [!NOTE]
- > - Users should provide a tokentized data file based on their need. This is just one example to demonstrate what data format `fms_mo` is expecting.
+ > - Users should provide a tokenized data file based on their need. This is just one example to demonstrate what data format `fms_mo` is expecting.
  > - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
  > - If you have trouble downloading Llama family of models from Hugging Face ([LLama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead
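
For readers who want to run this data-preparation step end to end, a minimal sketch is below. It assumes `get_tokenized_data` is importable from `fms_mo`'s calibration-data utilities (check the repository for the exact module path) and that you have access to the Llama 3 tokenizer; the `num_samples` value is illustrative.

```python
# Minimal data-preparation sketch for the DQ example above.
# Assumption: get_tokenized_data lives in fms_mo's calibration-data utilities;
# adjust the import to match the installed package layout.
from transformers import AutoTokenizer

from fms_mo.utils.calib_data import get_tokenized_data  # assumed module path

model_name = "meta-llama/Meta-Llama-3-8B"  # or "ibm-granite/granite-8b-code"
tokenizer = AutoTokenizer.from_pretrained(model_name)

num_samples = 128   # illustrative calibration-set size
seq_len = 2048      # sequence length used in this example

# Tokenizes the "wiki" dataset and saves it under data_train / data_test.
get_tokenized_data("wiki", num_samples, seq_len, tokenizer, path_to_save="data")
```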
@@ -48,45 +48,45 @@ python -m fms_mo.run_quant \
  **3. Compare the Perplexity score.** For user convenience, the code will print out perplexity (controlled by the `eval_ppl` flag) at the end of the run, so no additional steps are needed (if the logging level is set to `INFO` in the terminal). You can check the output in the logging file `./fms_mo.log`.

  ## Example Test Results
- The perplexity of the INT8 and FP8 quantized models on the wikitext dataset is shown below:
+ The perplexity of the INT8 and FP8 quantized models on the `wikitext` dataset is shown below:

  | Model |Type |QA |QW |DQ |SQ |Perplexity|
  |:---------:|:---:|:------------:|:------------:|:--:|:--:|:--------:|
  |`Llama3-8b`|INT8 |maxpertoken |maxperCh |yes |yes |6.21 |
  | |FP8 |fp8_e4m3_scale|fp8_e4m3_scale|yes |yes |6.19 |

- ## Example explained
+ ## Code Walkthrough

  **1. KV caching**

- In large language models (LLMs), key/value pairs are frequently cached during token generation, a process known as KV caching, to prevent redundant computations due to the autoregressive nature of token generation. However, the size of the KV cache increases with both batch size and context length, which can slow down model inference due to the need to access a large amount of data in memory. Quantizing the KV cache effectively reduces this memory bandwidth limitation, improving inference speed. To study the quantization behavior of KV cache, we can simply set the nbits_kvcache argument to 8 bit, then the KV cache will be quantized together with weights and activations. In addition, the `bmm1_qm1_mode`, `bmm1_qm2_mode`, and `bmm2_qm2_mode` [arguments](../../fms_mo/training_args.py) must be set to the same quantizer mode as `qa_mode`. **NOTE**: `bmm2_qm1_mode` should be kept as `minmax`.
+ In large language models (LLMs), key/value pairs are frequently cached during token generation, a process known as KV caching, to prevent redundant computations due to the autoregressive nature of token generation. However, the size of the KV cache increases with both batch size and context length, which can slow down model inference due to the need to access a large amount of data in memory. Quantizing the KV cache effectively reduces this memory bandwidth limitation, improving inference speed. To study the quantization behavior of KV cache, we can simply set the `nbits_kvcache` argument to 8-bit, then the KV cache will be quantized together with weights and activations. In addition, the `bmm1_qm1_mode`, `bmm1_qm2_mode`, and `bmm2_qm2_mode` [arguments](../../fms_mo/training_args.py) must be set to the same quantizer mode as `qa_mode`. **NOTE**: `bmm2_qm1_mode` should be kept as `minmax`.

- The effect of setting the nbits_kvcache to 8 and its relevant code sections are:
+ The effect of setting the `nbits_kvcache` to 8 and its relevant code sections are:

  - Enables eager attention for the quantization of attention operations, including KV cache.
- ```python
- #for attention or kv-cache quantization, need to use eager attention
- attn_bits = [fms_mo_args.nbits_bmm1, fms_mo_args.nbits_bmm2, fms_mo_args.nbits_kvcache]
- if any(attn_bits) != 32:
-     attn_implementation = "eager"
- else:
-     attn_implementation = None
- ```
+ ```python
+ # For attention or kv-cache quantization, need to use eager attention
+ attn_bits = [fms_mo_args.nbits_bmm1, fms_mo_args.nbits_bmm2, fms_mo_args.nbits_kvcache]
+ if any(attn_bits) != 32:
+     attn_implementation = "eager"
+ else:
+     attn_implementation = None
+ ```
  - Enables Dynamo for quantized model preparation. We use PyTorch's Dynamo tracer to identify the bmm and KV cache inside the attention block.
- ```python
- if any(x != 32 for x in attn_bits):
-     logger.info("Quantize attention bmms or kvcache, use dynamo for prep")
-     use_layer_name_pattern_matching = False
-     qcfg["qlayer_name_pattern"] = []
-     assert (
-         qcfg["qlayer_name_pattern"] == []
-     ), "ensure nothing in qlayer_name_pattern when use dynamo"
-     use_dynamo = True
- else:
-     logger.info("Do not quantize attention bmms")
-     use_layer_name_pattern_matching = True
-     use_dynamo = False
- ```
+ ```python
+ if any(x != 32 for x in attn_bits):
+     logger.info("Quantize attention bmms or kvcache, use dynamo for prep")
+     use_layer_name_pattern_matching = False
+     qcfg["qlayer_name_pattern"] = []
+     assert (
+         qcfg["qlayer_name_pattern"] == []
+     ), "ensure nothing in qlayer_name_pattern when use dynamo"
+     use_dynamo = True
+ else:
+     logger.info("Do not quantize attention bmms")
+     use_layer_name_pattern_matching = True
+     use_dynamo = False
+ ```
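
Taken together, the KV-cache settings described in step 1 fit together roughly as shown below. This is an illustrative sketch only: the argument names come from the text and [training_args.py](../../fms_mo/training_args.py), the `qa_mode` value is taken from the INT8 row of the results table, and how the values are actually passed (CLI flags vs. config file) should follow the repository's examples.

```python
# Illustrative grouping of the KV-cache-related settings described above.
# These names mirror the fms_mo arguments referenced in the text; treat this as
# a sketch of the relationships, not the exact launcher code.
qa_mode = "maxpertoken"          # activation quantizer mode (INT8 row of the table above)

kv_cache_settings = {
    "nbits_kvcache": 8,          # quantize the KV cache to 8-bit
    "bmm1_qm1_mode": qa_mode,    # must match qa_mode
    "bmm1_qm2_mode": qa_mode,    # must match qa_mode
    "bmm2_qm2_mode": qa_mode,    # must match qa_mode
    "bmm2_qm1_mode": "minmax",   # keep as minmax, per the note above
}
```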

  **2. Define quantization config** including quantizers and hyperparameters. Here we simply use the default [dq recipe](../../fms_mo/recipies/dq.json).

@@ -154,7 +154,7 @@ model.save_pretrained(output_dir, use_safetensors=True)
  tokenizer.save_pretrained(output_dir)
  ```

- **6. Check perplexity** (a simple way to evaluate the model quality.)
+ **6. Check perplexity** (simple method to evaluate the model quality)

  ``` python
  if fms_mo_args.eval_ppl:

examples/FP8_QUANT/README.md

Lines changed: 9 additions & 9 deletions
@@ -7,25 +7,25 @@ There are two types of FP8 support in FMS Model Optimizer:

  This is an example of mature FP8, which under the hood leverages some functionalities in [llm-compressor](https://github.com/vllm-project/llm-compressor), a third-party library, to perform FP8 quantization. An example for the experimental FP8 can be found [here](../DQ_SQ/README.md).

- ## Requirement
+ ## Requirements

- - FMS Model Optimizer requirements](../../README.md#requirements)
+ - [FMS Model Optimizer requirements](../../README.md#requirements)
  - Nvidia A100 family or higher
  - The [llm-compressor](https://github.com/vllm-project/llm-compressor) library can be installed using pip:

  ```bash
  pip install llmcompressor
  ```
- - To evaluate the FP8 quantized model, [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) and [vllm](https://github.com/vllm-project/vllm) libraries are also required.
+ - To evaluate the FP8 quantized model, [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) and [vllm](https://github.com/vllm-project/vllm) libraries are also required.
  ```bash
- pip install vllm lm_eval==0.4.3
+ pip install vllm lm_eval
  ```

  > [!CAUTION]
  > `vllm` may require a specific PyTorch version that is different from what is installed in your current environment and it may force install without asking. Make sure it's compatible with your settings or create a new environment if needed.
- ## Steps
- Three simple steps to perform FP8 quantization using FMS Model Optimizer:
+ ## Quickstart
+ This end-to-end example utilizes the common set of interfaces provided by `fms_mo` for easily applying multiple quantization algorithms, with FP8 being the focus of this example. The steps involved are:

  1. **FP8 quantization through CLI**. Other arguments can be found in [FP8Args](../../fms_mo/training_args.py#L84).
@@ -38,7 +38,7 @@ Three simple steps to perform FP8 quantization using FMS Model Optimizer:
  ```

  > [!NOTE]
- > - The quantized model and tokenizer will be saved to `output_dir`, but some additional temperary storage space may be needed.
+ > - The quantized model and tokenizer will be saved to `output_dir`, but some additional temporary storage space may be needed.
  > - Runtime ~ 1 min on A100. (model download time not included)
  > - If you have trouble downloading Llama family of models from Hugging Face ([LLama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-3.0-8b-instruct` instead
6060
> [!NOTE]
6161
> FP16 model file size on storage is ~16.07 GB while FP8 is ~8.6 GB.
6262
63-
3. **Evaluate the quantized model** performance on a selected NLP task (lambada_openai) using [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library. The evaluation metrics on this task are perplexity and accuracy. The model will be run on GPU.
63+
3. **Evaluate the quantized model**'s performance on a selected task using `lm-eval` library, the command below will run evaluation on [`lambada_openai`](https://huggingface.co/datasets/EleutherAI/lambada_openai) task and show the perplexity/accuracy at the end.
6464

6565
```bash
6666
lm_eval --model vllm \
@@ -88,7 +88,7 @@ Three simple steps to perform FP8 quantization using FMS Model Optimizer:
8888
| | |none | 5|perplexity||3.8915|± |0.3727|
8989
```
9090

91-
## Example Explained
91+
## Code Walkthrough
9292

9393
1. The non-quantized pre-trained model is loaded using model wrapper from `llm-compressor`. The corresponding tokenizer is constructed as well.
9494

examples/GPTQ/README.md

Lines changed: 13 additions & 13 deletions
@@ -5,16 +5,16 @@ For generative LLMs, very often the bottleneck of inference is no longer the com

  ## Requirements

- - FMS Model Optimizer requirements](../../README.md#requirements)
+ - [FMS Model Optimizer requirements](../../README.md#requirements)
  - `auto-gptq` is needed for this example. Use `pip install auto-gptq` or [install from source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source)
- - Optionally for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness/tree/main)
- ```
- pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git
- ```
+ - Optionally for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness)
+ ```
+ pip install lm-eval
+ ```

  ## Quickstart
- The end-to-end example utilizes the common set of interfaces provided by fms_mo for easily applying multiple quantization algorithms with GPTQ being the focus of this example. The steps involved are:
+ This end-to-end example utilizes the common set of interfaces provided by `fms_mo` for easily applying multiple quantization algorithms, with GPTQ being the focus of this example. The steps involved are:

  1. **Convert the dataset into its tokenized form.** An example of tokenization using `LLAMA-3-8B`'s tokenizer is below.
@@ -28,7 +28,7 @@ The end-to-end example utilizes the common set of interfaces provided by fms_mo
  get_tokenized_data("wiki", num_samples, seq_len, tokenizer, gptq_style=True, path_to_save='data')
  ```
  > [!NOTE]
- > - Users should provide a tokentized data file based on their need. This is just one example to demonstrate what data format `fms_mo` is expecting.
+ > - Users should provide a tokenized data file based on their need. This is just one example to demonstrate what data format `fms_mo` is expecting.
  > - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
  > - If you have trouble downloading Llama family of models from Hugging Face ([LLama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead
@@ -68,7 +68,7 @@ The end-to-end example utilizes the common set of interfaces provided by fms_mo
  torch.int32 672 3521.904640
  ```

- 4. Further to **evaluate the quantized model**'s performance on a selected task using `lm-eval` library, the command below will run evaluation on [`lambada_openai`](https://huggingface.co/datasets/EleutherAI/lambada_openai) task and show the perplexity/accuracy at the end.
+ 4. **Evaluate the quantized model**'s performance on a selected task using the `lm-eval` library. The command below will run evaluation on the [`lambada_openai`](https://huggingface.co/datasets/EleutherAI/lambada_openai) task and show the perplexity/accuracy at the end.

  ```bash
  lm_eval --model hf \
@@ -79,7 +79,7 @@ The end-to-end example utilizes the common set of interfaces provided by fms_mo
  --batch_size auto
  ```
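
If you prefer to drive the same evaluation from Python rather than the CLI, recent `lm-eval` releases expose a `simple_evaluate` entry point. The snippet below is a sketch under that assumption (keyword names may differ slightly across versions), and the local model path is hypothetical.

```python
# Sketch of the same evaluation via lm-eval's Python API (lm_eval >= 0.4).
# The local model path is hypothetical; verify keyword names against your
# installed lm-eval version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./llama3-8b-gptq,dtype=auto",  # hypothetical output dir
    tasks=["lambada_openai"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["lambada_openai"])  # accuracy and perplexity
```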

- ## Summary of results
+ ## Example Test Results

  - Unquantized Model
  ```bash
@@ -98,20 +98,20 @@ The end-to-end example utilizes the common set of interfaces provided by fms_mo
  ```

- - Quantized model with `desc_act` set to True (could improve the model quality, but at the cost of inference speed.)
+ - Quantized model with `desc_act` set to `True` (could improve the model quality, but at the cost of inference speed)
  ```bash
  |Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
  |------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
  | LLAMA3-8B |lambada_openai| 1|none | 5|acc ||0.6193 |± |0.0068|
  | | | |none | 5|perplexity||5.8879 |± |0.1546|
  ```
  > [!NOTE]
- > There are some randomness in generating the model and data, the resulting accuracy may vary ~$\pm$ 0.05.
+ > There is some randomness in generating the model and data, so the resulting accuracy may vary by ~$\pm$ 0.05.

  ## Code Walkthrough

- 1. Command line arguments will be used to create a GPTQ quantization config. (Information about the required arguments and their default values can be found in [fms_mo/training_args.py](../../fms_mo/training_args.py) )
+ 1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found in [training_args.py](../../fms_mo/training_args.py).

  ```python
  from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
@@ -122,7 +122,7 @@ The end-to-end example utilizes the common set of interfaces provided by fms_mo
  damp_percent=gptq_args.damp_percent)
  ```
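
The two fragments above only show the edges of the config-creation call, so here is a rough, self-contained sketch of how a GPTQ config and model load typically look with `auto-gptq`. The concrete values and the `gptq_args` fields are illustrative assumptions, not the exact defaults used by `fms_mo`.

```python
# Illustrative auto-gptq usage corresponding to steps 1-2 of this walkthrough.
# Field values (bits, group_size, ...) are placeholders; fms_mo fills them from
# its command-line arguments (see training_args.py).
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,             # weight precision
    group_size=128,     # per-group quantization granularity
    desc_act=False,     # set True for (possibly) better quality at lower speed
    damp_percent=0.01,  # dampening used during the GPTQ solve
)

# Load the pre-trained model with the auto_gptq wrapper; the tokenizer is not
# required here because the calibration data was tokenized in step 1.
model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # or "ibm-granite/granite-8b-code"
    quantize_config,
)

# examples: a list of dicts with "input_ids"/"attention_mask" from the tokenized data
# model.quantize(examples)
# model.save_quantized("llama3-8b-gptq", use_safetensors=True)
```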

- 2. Load the pre_trained model with `auto_gptq` class/wrapper. (tokenizer is optional because we already tokenized the data in a previous step.)
+ 2. Load the pre-trained model with `auto_gptq` class/wrapper. The tokenizer is optional because we already tokenized the data in a previous step.

  ```python
  model = AutoGPTQForCausalLM.from_pretrained(