examples/GPTQ/README.md
- [FMS Model Optimizer requirements](../../README.md#requirements)
- `gptqmodel` is needed for this example. Use `pip install gptqmodel` or [install from source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file)
  - It is advised to install from source if you plan to use `GPTQv2`
- Optionally for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness)

  ```
  pip install lm-eval
  ```

> - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
> - If you have trouble downloading the Llama family of models from Hugging Face ([Llama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead

2. **Quantize the model** using the data generated above. The following command will kick off the `GPTQv1` quantization job (by invoking `gptqmodel` under the hood). Additional acceptable arguments can be found in [GPTQArguments](../../fms_mo/training_args.py#L127).

    ```bash
    python -m fms_mo.run_quant \
        ... \
        --quant_method gptq \
        --output_dir Meta-Llama-3-8B-GPTQ \
        --bits 4 \
        --group_size 128
    ```

    The model written to the specified output directory (`Meta-Llama-3-8B-GPTQ` in our case) can be deployed and inferenced via `vLLM`. To enable `GPTQv2`, set the `quant_method` argument to `gptqv2`.
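
    For example, a `GPTQv2` run would look like the sketch below; the output directory name here is only an illustrative choice, and the elided arguments are the same as in the `GPTQv1` command above.

    ```bash
    python -m fms_mo.run_quant \
        ... \
        --quant_method gptqv2 \
        --output_dir Meta-Llama-3-8B-GPTQv2 \
        --bits 4 \
        --group_size 128
    ```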

> [!NOTE]
> - In GPTQ, `group_size` is a trade-off between accuracy and speed, but there is an additional constraint that the `in_features` of the Linear layer to be quantized needs to be an **integer multiple** of `group_size`, i.e. some models may have to use a smaller `group_size` than the default.
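
One way to check this constraint up front is with a small helper like the hypothetical sketch below (not part of `fms_mo`); it lists the Linear layers whose `in_features` is not a multiple of the chosen `group_size`.

```python
import torch
from transformers import AutoModelForCausalLM

def incompatible_layers(model_name_or_path: str, group_size: int) -> list[str]:
    """Return names of Linear layers whose in_features is not a multiple of group_size."""
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype=torch.float16)
    return [
        name
        for name, module in model.named_modules()
        if isinstance(module, torch.nn.Linear) and module.in_features % group_size != 0
    ]

# An empty list means the chosen group_size is safe for every Linear layer.
print(incompatible_layers("ibm-granite/granite-8b-code", 128))
```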

> There is some randomness in generating the model and data, so the resulting accuracy may vary by ~$\pm$ 0.05.
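
For reference, the quantized checkpoint can then be scored with `lm-eval`. The command below is only an illustrative sketch: the task and batch size are arbitrary choices, and it assumes the GPTQ checkpoint loads through the Hugging Face backend.

```bash
lm_eval --model hf \
    --model_args pretrained=Meta-Llama-3-8B-GPTQ \
    --tasks lambada_openai \
    --batch_size 8
```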

## Code Walk-through

1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py). Both `GPTQv1` and `GPTQv2` are supported.

    - To use `GPTQv1`, set the parameter `quant_method` to `gptq` in the command line.

      ```python
      from gptqmodel import GPTQModel, QuantizeConfig

      quantize_config = QuantizeConfig(
          bits=gptq_args.bits,
          group_size=gptq_args.group_size,
          desc_act=gptq_args.desc_act,
          damp_percent=gptq_args.damp_percent,
      )
      ```

    - To use `GPTQv2`, simply set `quant_method` to `gptqv2` in the command line. Under the hood, two additional arguments will be added to `QuantizeConfig`, i.e. `v2=True` and `v2_memory_device='cpu'`.

      ```python
      from gptqmodel import GPTQModel, QuantizeConfig

      quantize_config = QuantizeConfig(
          bits=gptq_args.bits,
          group_size=gptq_args.group_size,
          desc_act=gptq_args.desc_act,
          damp_percent=gptq_args.damp_percent,
          v2=True,
          v2_memory_device='cpu',
      )
      ```
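
    Putting the two cases together, the selection could be expressed roughly as in the sketch below. This is an illustrative paraphrase, not the literal `fms_mo` code; assume `quant_method` holds the command-line value.

    ```python
    from gptqmodel import QuantizeConfig

    # Hypothetical mapping from the CLI flag to the two configs shown above.
    extra_kwargs = (
        {"v2": True, "v2_memory_device": "cpu"} if quant_method == "gptqv2" else {}
    )

    quantize_config = QuantizeConfig(
        bits=gptq_args.bits,
        group_size=gptq_args.group_size,
        desc_act=gptq_args.desc_act,
        damp_percent=gptq_args.damp_percent,
        **extra_kwargs,
    )
    ```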

2. Load the pre-trained model with the `gptqmodel` class/wrapper. The tokenizer is optional because we already tokenized the data in a previous step.

    ```python
    ...
    tokenizer.save_pretrained(output_dir)  # optional
    ```
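
    Since most of this step is elided above, the sketch below shows the general load-quantize-save flow following `gptqmodel`'s documented interface; the variable names (`model_name_or_path`, `calibration_dataset`, `output_dir`) are assumptions, and the exact calls in `fms_mo` may differ.

    ```python
    from gptqmodel import GPTQModel
    from transformers import AutoTokenizer

    # Load the pre-trained model together with the QuantizeConfig built earlier.
    model = GPTQModel.load(model_name_or_path, quantize_config)

    # Run GPTQ calibration on the tokenized samples prepared earlier.
    model.quantize(calibration_dataset)

    # Persist the quantized checkpoint (and, optionally, the tokenizer).
    model.save(output_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    tokenizer.save_pretrained(output_dir)  # optional
    ```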

> [!NOTE]
> 1. GPTQ of a 70B model usually takes ~4-10 hours on A100 with `GPTQv1`.