**docs/fms_mo_design.md** (1 addition & 1 deletion)
@@ -82,7 +82,7 @@ FMS Model Optimizer supports FP8 in two ways:
### GPTQ (weight-only compression, or sometimes referred to as W4A16)
-For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such cases, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and decompress it only at the GPU cache level, right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs needed to serve the model by 4X without sacrificing inference speed. (Some constraints may apply; for example, the batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing the `auto_gptq` package. See this [example](../examples/GPTQ/).
+For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such cases, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and decompress it only at the GPU cache level, right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs needed to serve the model by 4X without sacrificing inference speed. (Some constraints may apply; for example, the batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing the `gptqmodel` package. See this [example](../examples/GPTQ/).
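To make the flow above concrete, here is a minimal, hedged sketch of weight-only (W4A16) quantization with `gptqmodel`, mirroring the snippets in the example README below. The model ID, calibration text, and save path are placeholders, and `quantize()`/`save_quantized()` follow the auto_gptq-style API that `gptqmodel` exposes, so exact method names may differ between versions.

```python
# Minimal sketch (not part of this diff): weight-only 4-bit (W4A16) compression with gptqmodel.
# Placeholders: model id, calibration text, and output directory. In practice you would use a
# proper calibration set, as in the GPTQ example README.
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder; any causal LM supported by gptqmodel
quantize_config = QuantizeConfig(bits=4, group_size=128, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(model_id)
calibration = [tokenizer("GPTQ calibrates the 4-bit weights on a small set of sample text.")]

model = GPTQModel.from_pretrained(model_id, quantize_config=quantize_config)
model.quantize(calibration)                   # computes and packs the 4-bit weights
model.save_quantized("llama3-8b-gptq-w4a16")  # compressed checkpoint; decompressed on the fly at inference
```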
**examples/GPTQ/README.md** (22 additions & 18 deletions)
@@ -1,12 +1,12 @@
# Generative Pre-Trained Transformer Quantization (GPTQ) of LLAMA-3-8B Model
-For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such cases, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and decompress it only at the GPU cache level, right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs needed to serve the model by 4X without sacrificing inference speed (some constraints may apply; for example, the batch size cannot exceed a certain number). FMS Model Optimizer supports this "weight-only compression", sometimes referred to as W4A16 or [GPTQ](https://arxiv.org/pdf/2210.17323), by leveraging `auto_gptq`, a third-party library, to perform quantization.
+For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such cases, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and decompress it only at the GPU cache level, right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs needed to serve the model by 4X without sacrificing inference speed (some constraints may apply; for example, the batch size cannot exceed a certain number). FMS Model Optimizer supports this "weight-only compression", sometimes referred to as W4A16 or [GPTQ](https://arxiv.org/pdf/2210.17323), by leveraging `gptqmodel`, a third-party library, to perform quantization.
## Requirements
- [FMS Model Optimizer requirements](../../README.md#requirements)
-`auto-gptq` is needed for this example. Use `pip install auto-gptq` or [install from source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source)
+- `gptqmodel` is needed for this example. Use `pip install gptqmodel` or [install from source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file)
- Optionally for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness)
```
pip install lm-eval
```
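The evaluation section itself is not shown in this hunk. Purely as an illustration, and with a checkpoint path and task name of our own choosing (both placeholders), lm-eval can also be driven from Python:

```python
# Hedged illustration (not from the diff): invoking lm-eval from Python.
# The checkpoint path and task are placeholders; see the README's evaluation section for the
# exact commands used in the example.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=llama3-8b-gptq-w4a16",  # placeholder path to a quantized model
    tasks=["lambada_openai"],
    batch_size=8,
)
print(results["results"])
```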
@@ -32,7 +32,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
> - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
> - If you have trouble downloading Llama family of models from Hugging Face ([LLama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead
-2. **Quantize the model** using the data generated above. The following command will kick off the quantization job (by invoking `auto_gptq` under the hood). Additional acceptable arguments can be found in [GPTQArguments](../../fms_mo/training_args.py#L127).
+2. **Quantize the model** using the data generated above. The following command will kick off the quantization job (by invoking `gptqmodel` under the hood). Additional acceptable arguments can be found in [GPTQArguments](../../fms_mo/training_args.py#L127).
```bash
python -m fms_mo.run_quant \
```
@@ -49,8 +49,8 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
> - In GPTQ, `group_size` is a trade-off between accuracy and speed, but there is an additional constraint that the `in_features` of the Linear layer to be quantized needs to be an **integer multiple** of `group_size`, i.e. some models may have to use a smaller `group_size` than the default (a quick divisibility check is sketched after the tips below).
> [!TIP]
-> 1. If you see error messages regarding `exllama_kernels` or `undefined symbol`, try install `auto-gptq` from [source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source).
-> 2. If you need to work on a custom model that is not supported by AutoGPTQ, please add your class wrapper [here](../../fms_mo/utils/custom_gptq_models.py). Additional information [here](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#customize-model).
+> 1. If you see error messages regarding `exllama_kernels` or `undefined symbol`, try installing `gptqmodel` from [source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file).
+> 2. If you need to work on a custom model that is not supported by GPTQModel, please add your class wrapper [here](../../fms_mo/utils/custom_gptq_models.py). Additional information [here](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file#how-to-add-support-for-a-new-model).
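As a quick way to apply the `group_size` note above, the following hedged sketch (not part of fms-mo) checks whether every `nn.Linear` layer's `in_features` is an integer multiple of a candidate `group_size`; the model ID is just the Granite model suggested earlier in this README.

```python
# Hypothetical helper (not part of fms-mo): verify the group_size constraint from the note above,
# i.e. in_features of each Linear layer to be quantized must be an integer multiple of group_size.
import torch.nn as nn
from transformers import AutoModelForCausalLM

def check_group_size(model: nn.Module, group_size: int = 128) -> None:
    bad = [
        (name, module.in_features)
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear) and module.in_features % group_size != 0
    ]
    if bad:
        print(f"group_size={group_size} is incompatible with {len(bad)} layers, e.g. {bad[:3]}")
    else:
        print(f"group_size={group_size} divides in_features of every Linear layer.")

# Placeholder model id; swap in the model you plan to quantize.
model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-8b-code")
check_group_size(model, group_size=128)
```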
3. **Inspect the GPTQ checkpoint**
```python
```
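The inspection code for this step is truncated by the hunk boundary above. As a rough, hedged sketch of what inspecting a GPTQ checkpoint can look like (the file path is an assumption; GPTQ checkpoints typically store packed `qweight`/`qzeros`/`scales` tensors per quantized Linear layer):

```python
# Hedged sketch (not from the diff): peek at the tensors inside a saved GPTQ checkpoint.
# The path is a placeholder for wherever the quantized model was saved.
from safetensors import safe_open

with safe_open("llama3-8b-gptq-w4a16/model.safetensors", framework="pt") as f:
    for name in list(f.keys())[:8]:  # first few tensors only
        tensor = f.get_tensor(name)
        print(f"{name:60s} {tuple(tensor.shape)} {tensor.dtype}")
```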
@@ -114,21 +114,25 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py)
```python
-from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
-quantize_config = BaseQuantizeConfig(
-    bits=gptq_args.bits,
-    group_size=gptq_args.group_size,
-    desc_act=gptq_args.desc_act,
-    damp_percent=gptq_args.damp_percent)
+from gptqmodel import GPTQModel, QuantizeConfig
+
+quantize_config = QuantizeConfig(
+    bits=gptq_args.bits,
+    group_size=gptq_args.group_size,
+    desc_act=gptq_args.desc_act,
+    damp_percent=gptq_args.damp_percent,
+)
```
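As general background on these knobs (not specific to this diff): `bits` and `group_size` set the weight precision and quantization granularity, `desc_act` enables activation-order quantization, which usually improves accuracy at some inference-speed cost, and `damp_percent` controls the dampening added to the Hessian diagonal during the GPTQ solve to keep it numerically stable.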
-2. Load the pre-trained model with the `auto_gptq` class/wrapper. The tokenizer is optional because we already tokenized the data in a previous step.
+2. Load the pre-trained model with the `gptqmodel` class/wrapper. The tokenizer is optional because we already tokenized the data in a previous step.
```python
-model = AutoGPTQForCausalLM.from_pretrained(
-    model_args.model_name_or_path,
-    quantize_config=quantize_config,
-    torch_dtype=model_args.torch_dtype)
+model = GPTQModel.from_pretrained(
+    model_args.model_name_or_path,
+    quantize_config=quantize_config,
+    torch_dtype=model_args.torch_dtype,
+)
```
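Note that in both the old and new libraries this `from_pretrained` call only wraps the full-precision model together with the quantization config; the actual compression happens in a later `quantize(...)` call on the tokenized calibration data, after which the packed checkpoint is saved (method names may vary slightly between `auto_gptq` and `gptqmodel` versions).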
3. Load the tokenized dataset from disk.
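The code for this step is not visible in the hunk. As a rough sketch, assuming the tokenized splits from step 1 were saved with the `datasets` library (paths are placeholders following the `<path_to_save>_train`/`<path_to_save>_test` note earlier):

```python
# Hedged sketch (not from the diff): reload the tokenized splits produced in step 1.
from datasets import load_from_disk

train_data = load_from_disk("data/tokenized_train")  # placeholder for <path_to_save>_train
test_data = load_from_disk("data/tokenized_test")    # placeholder for <path_to_save>_test
print(train_data)
```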
@@ -143,9 +147,9 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m