
Commit 50aea62

Move to gptqmodel
Signed-off-by: Thara Palanivel <[email protected]>
1 parent 2e1e58d commit 50aea62

11 files changed: +58, -55 lines

.pylintrc

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ ignore-patterns=^\.#
 # (useful for modules/projects where namespaces are manipulated during runtime
 # and thus existing member attributes cannot be deduced by static analysis). It
 # supports qualified module names, as well as Unix pattern matching.
-ignored-modules=auto_gptq,
+ignored-modules=gptqmodel,
                 exllama_kernels,
                 exllamav2_kernels,
                 llmcompressor,

.spellcheck-en-custom.txt

Lines changed: 1 addition & 2 deletions
@@ -1,7 +1,6 @@
 activations
 ADR
 Args
-AutoGPTQ
 autoregressive
 backpropagation
 bmm
@@ -31,8 +30,8 @@ frac
 gptq
 GPTQ
 GPTQArguments
+GPTQModel
 graphviz
-GPTQ
 hyperparameters
 Inductor
 inferenced

README.md

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ FMS Model Optimizer is a framework for developing reduced precision neural netwo
 *Optional packages based on optimization functionality required:*

 - **GPTQ** is a popular compression method for LLMs:
-  - [auto_gptq](https://pypi.org/project/auto-gptq/) or build from [source](https://github.com/AutoGPTQ/AutoGPTQ)
+  - [gptqmodel](https://pypi.org/project/gptqmodel/) or build from [source](https://github.com/ModelCloud/GPTQModel)
 - If you want to experiment with **INT8** deployment in [QAT](./examples/QAT_INT8/) and [PTQ](./examples/PTQ_INT8/) examples:
   - Nvidia GPU with compute capability > 8.0 (A100 family or higher)
   - [Ninja](https://ninja-build.org/)

docs/fms_mo_design.md

Lines changed: 1 addition & 1 deletion
@@ -82,7 +82,7 @@ FMS Model Optimizer supports FP8 in two ways:

 ### GPTQ (weight-only compression, or sometimes referred to as W4A16)

-For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed. (Some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing `auto_gptq` package. See this [example](../examples/GPTQ/)
+For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed. (Some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing `gptqmodel` package. See this [example](../examples/GPTQ/)


 ## Specification
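
Editor's note: as a companion to the design-doc paragraph above, the following is a minimal, hedged sketch of what a W4A16 GPTQ run looks like with the `gptqmodel` package this commit migrates to. The model id, calibration text, and output path are placeholders, and the save method name is assumed to be the AutoGPTQ-style alias retained by gptqmodel; fms-model-optimizer's actual entry point is `python -m fms_mo.run_quant` (see examples/GPTQ/README.md).

```python
# Minimal W4A16 GPTQ sketch using gptqmodel (placeholders, not fms_mo's entry point).
from gptqmodel import GPTQModel, QuantizeConfig

# 4-bit weights, quantization scales shared per group of 128 input columns.
quantize_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.from_pretrained(
    "ibm-granite/granite-8b-code",   # any causal LM id; Granite is used in the GPTQ example
    quantize_config=quantize_config,
)

# GPTQ needs a small calibration set; a few hundred short texts is typical.
model.quantize(["Placeholder calibration text; replace with real samples."])

# ASSUMPTION: AutoGPTQ-style save alias kept by gptqmodel; exact name may vary by version.
model.save_quantized("granite-8b-code-w4a16")
```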

examples/GPTQ/README.md

Lines changed: 22 additions & 18 deletions
@@ -1,12 +1,12 @@
 # Generative Pre-Trained Transformer Quantization (GPTQ) of LLAMA-3-8B Model


-For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed (some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this "weight-only compression", or sometimes referred to as W4A16 or [GPTQ](https://arxiv.org/pdf/2210.17323) by leveraging `auto_gptq`, a third party library, to perform quantization.
+For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed (some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this "weight-only compression", or sometimes referred to as W4A16 or [GPTQ](https://arxiv.org/pdf/2210.17323) by leveraging `gptqmodel`, a third party library, to perform quantization.

 ## Requirements

 - [FMS Model Optimizer requirements](../../README.md#requirements)
-- `auto-gptq` is needed for this example. Use `pip install auto-gptq` or [install from source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source)
+- `gptqmodel` is needed for this example. Use `pip install gptqmodel` or [install from source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file)
 - Optionally for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness)
 ```
 pip install lm-eval
@@ -32,7 +32,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 > - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
 > - If you have trouble downloading Llama family of models from Hugging Face ([LLama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead

-2. **Quantize the model** using the data generated above, the following command will kick off the quantization job (by invoking `auto_gptq` under the hood.) Additional acceptable arguments can be found here in [GPTQArguments](../../fms_mo/training_args.py#L127).
+2. **Quantize the model** using the data generated above, the following command will kick off the quantization job (by invoking `gptqmodel` under the hood.) Additional acceptable arguments can be found here in [GPTQArguments](../../fms_mo/training_args.py#L127).

 ```bash
 python -m fms_mo.run_quant \
@@ -49,8 +49,8 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 > - In GPTQ, `group_size` is a trade-off between accuracy and speed, but there is an additional constraint that `in_features` of the Linear layer to be quantized needs to be an **integer multiple** of `group_size`, i.e. some models may have to use smaller `group_size` than default.

 > [!TIP]
-> 1. If you see error messages regarding `exllama_kernels` or `undefined symbol`, try install `auto-gptq` from [source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source).
-> 2. If you need to work on a custom model that is not supported by AutoGPTQ, please add your class wrapper [here](../../fms_mo/utils/custom_gptq_models.py). Additional information [here](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#customize-model).
+> 1. If you see error messages regarding `exllama_kernels` or `undefined symbol`, try installing `gptqmodel` from [source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file).
+> 2. If you need to work on a custom model that is not supported by GPTQModel, please add your class wrapper [here](../../fms_mo/utils/custom_gptq_models.py). Additional information [here](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file#how-to-add-support-for-a-new-model).

 3. **Inspect the GPTQ checkpoint**
 ```python
@@ -114,21 +114,25 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py)

 ```python
-from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
-quantize_config = BaseQuantizeConfig(
-    bits=gptq_args.bits,
-    group_size=gptq_args.group_size,
-    desc_act=gptq_args.desc_act,
-    damp_percent=gptq_args.damp_percent)
+from gptqmodel import GPTQModel, QuantizeConfig
+
+quantize_config = QuantizeConfig(
+    bits=gptq_args.bits,
+    group_size=gptq_args.group_size,
+    desc_act=gptq_args.desc_act,
+    damp_percent=gptq_args.damp_percent,
+)
+
 ```

-2. Load the pre_trained model with `auto_gptq` class/wrapper. Tokenizer is optional because we already tokenized the data in a previous step.
+2. Load the pre_trained model with `gptqmodel` class/wrapper. Tokenizer is optional because we already tokenized the data in a previous step.

 ```python
-model = AutoGPTQForCausalLM.from_pretrained(
-    model_args.model_name_or_path,
-    quantize_config=quantize_config,
-    torch_dtype=model_args.torch_dtype)
+model = GPTQModel.from_pretrained(
+    model_args.model_name_or_path,
+    quantize_config=quantize_config,
+    torch_dtype=model_args.torch_dtype,
+)
 ```

 3. Load the tokenized dataset from disk.
@@ -143,9 +147,9 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 ```python
 model.quantize(
     data,
-    use_triton=gptq_args.use_triton,
+    backend=BACKEND.TRITON if gptq_args.use_triton else BACKEND.AUTO,
     batch_size=gptq_args.batch_size,
-    cache_examples_on_gpu=gptq_args.cache_examples_on_gpu,
+    calibration_enable_gpu_cache=gptq_args.cache_examples_on_gpu,
 )
 ```
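
Editor's note: the final hunk above uses `BACKEND` but the diff never shows its import, and the snippets rely on `gptq_args`, `model_args`, and `data` produced in earlier steps of the README. The sketch below consolidates the migrated flow under stated assumptions: the `BACKEND` import path is a guess, and the `SimpleNamespace` placeholders only stand in for the objects fms_mo's argument parsing would normally provide.

```python
# Consolidated sketch of the gptqmodel-based flow from the hunks above.
# Placeholders stand in for fms_mo's parsed arguments and tokenized calibration data.
from types import SimpleNamespace

import torch
from gptqmodel import BACKEND, GPTQModel, QuantizeConfig  # ASSUMPTION: BACKEND import path

gptq_args = SimpleNamespace(           # illustrative values only
    bits=4, group_size=128, desc_act=False, damp_percent=0.01,
    use_triton=False, batch_size=1, cache_examples_on_gpu=True,
)
model_args = SimpleNamespace(
    model_name_or_path="ibm-granite/granite-8b-code", torch_dtype=torch.float16,
)
data = ["Placeholder calibration text; the README loads a tokenized dataset instead."]

quantize_config = QuantizeConfig(
    bits=gptq_args.bits,
    group_size=gptq_args.group_size,
    desc_act=gptq_args.desc_act,
    damp_percent=gptq_args.damp_percent,
)

model = GPTQModel.from_pretrained(
    model_args.model_name_or_path,
    quantize_config=quantize_config,
    torch_dtype=model_args.torch_dtype,
)

model.quantize(
    data,
    backend=BACKEND.TRITON if gptq_args.use_triton else BACKEND.AUTO,
    batch_size=gptq_args.batch_size,
    calibration_enable_gpu_cache=gptq_args.cache_examples_on_gpu,
)
```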

fms_mo/custom_ext_kernels/utils.py

Lines changed: 14 additions & 14 deletions
@@ -14,7 +14,7 @@


 """This file contains external kernel registrations, compilation, and packing functions.
-Some functions may require additional packages, e.g. auto_gptq, cutlass (source clone)
+Some functions may require additional packages, e.g. gptqmodel, cutlass (source clone)
 """

 # pylint: disable=ungrouped-imports,unused-argument,c-extension-no-member
@@ -491,27 +491,27 @@ def create_test_tensors(Nbatch, M, N, K, ele_type, accum_type):


 def exllama_ops_load_and_reg(qcfg=None, run_unit_test=False):
-    """Register Exllama kernels borrowed from auto-gptq
+    """Register Exllama kernels borrowed from gptqmodel
     Args:
         qcfg: dict. quant config
         run_unit_test: bool. Run unit tests after Op registration. (if unit tests defined.)

     NOTE:
-        1. need to install auto-gptq python package
+        1. need to install gptqmodel python package
         2. Op registration signature changed drastically from torch 2.1 - 2.4. TODO: add 2.4 support

-        see https://github.com/AutoGPTQ/AutoGPTQ for installation instruction
+        see https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file for installation instructions
     """
     if qcfg is None:
         qcfg = {}
     elif qcfg:
-        qcfg["AUTOGPTQ_AVAILABLE"] = False
+        qcfg["GPTQMODEL_AVAILABLE"] = False

-    namespace = "autogptq_gemm"
+    namespace = "gptqmodel_gemm"
     # check before compile
-    if hasattr(torch.ops, namespace) and hasattr(torch.ops.autogptq_gemm, "exv1_i4f16"):
-        logger.info("Custom AutoGPTQ functions have been loaded already!")
-        qcfg["AUTOGPTQ_AVAILABLE"] = True
+    if hasattr(torch.ops, namespace) and hasattr(torch.ops.gptqmodel_gemm, "exv1_i4f16"):
+        logger.info("Custom GPTQModel functions have been loaded already!")
+        qcfg["GPTQMODEL_AVAILABLE"] = True
         need_registration = False
     else:
         need_registration = (
@@ -521,7 +521,7 @@ def exllama_ops_load_and_reg(qcfg=None, run_unit_test=False):

     if not need_registration:
         logger.warning(
-            "Please check the installation of AutoGPTQ package."
+            "Please check the installation of GPTQModel package."
             "External kernels cannot be used this time."
         )
         return
@@ -623,10 +623,10 @@ def exv2_i4f16_fxinputs_abstract(
     )

     logger.info(
-        f"New AutoGPTQ gemm functions have been loaded and registered to torch.ops.{namespace}."
+        f"New GPTQModel gemm functions have been loaded and registered to torch.ops.{namespace}."
     )
     if qcfg:
-        qcfg["AUTOGPTQ_AVAILABLE"] = True
+        qcfg["GPTQMODEL_AVAILABLE"] = True

     if run_unit_test:
         return NotImplemented
@@ -1110,10 +1110,10 @@ def swap_nnlinear_to_quantlinear(model, qconfig, prefix=None, qlinear2use=None):
         QuantLinear = qlinear2use
     elif exVer == 1:
         # Third Party
-        from auto_gptq.nn_modules.qlinear.qlinear_exllama import QuantLinear
+        from gptqmodel.nn_modules.qlinear.exllama import ExllamaQuantLinear as QuantLinear
     else:
         # Third Party
-        from auto_gptq.nn_modules.qlinear.qlinear_exllamav2 import QuantLinear
+        from gptqmodel.nn_modules.qlinear.exllamav2 import ExllamaV2QuantLinear as QuantLinear

     num_swapped = 0
     for n, m in model.named_modules():
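
Editor's note: the renamed op namespace (`autogptq_gemm` to `gptqmodel_gemm`) only matters through `torch.ops` attribute lookups, so the availability check above can be reproduced in isolation. A minimal sketch follows; the helper name is mine, not from the diff.

```python
import torch


def gptq_gemm_ops_registered(namespace: str = "gptqmodel_gemm", op: str = "exv1_i4f16") -> bool:
    """Mirror the check in exllama_ops_load_and_reg: True only once the custom op
    has actually been loaded into torch.ops under the renamed namespace."""
    ns = getattr(torch.ops, namespace)  # torch.ops creates namespaces lazily on access
    return hasattr(ns, op)              # ...but individual ops appear only after registration


if not gptq_gemm_ops_registered():
    print("gptqmodel gemm kernels not registered; exllama_ops_load_and_reg would register them.")
```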

fms_mo/fx/utils.py

Lines changed: 7 additions & 7 deletions
@@ -40,9 +40,9 @@
     # Local
     from fms_mo.modules.linear import QLinearExv1WI4AF16, QLinearExv2WI4AF16

-    autogptq_available = True
+    gptqmodel_available = True
 except ImportError:
-    autogptq_available = False
+    gptqmodel_available = False


 MIN_BLOCK_SIZE = 5
@@ -90,7 +90,7 @@ def check_qclass_fallback_based_on_min_feat(
     ]
     if cutlass_available:
         qclass_has_constraints += [QLinearCutlassI8I32NT]
-    if autogptq_available:
+    if gptqmodel_available:
         qclass_has_constraints += [QLinearExv1WI4AF16, QLinearExv2WI4AF16]

     qclass = type(ref_module)
@@ -128,7 +128,7 @@ def lower_qmodel_to_ext_kernels(
     1. user need to define a mapping thru qcfg["ext_kernel_mapping_mod"]
     2. to make it simple, only swap user specified qclass, nothing else
     3. move the module to GPU before swapping to accelerate scale/zp calculations
-    4. autogptq_post_init() must be done at model level, or OOM and incorrect results easily
+    4. gptq_post_init() must be done at model level, or OOM and incorrect results easily

     Args:
         mod (torch.nn.Module): model to be 'lowered'
@@ -155,7 +155,7 @@ def lower_qmodel_to_ext_kernels(
     qclass_must_start_from_cpu = None
     using_gptq = False
     if (
-        available_packages["auto_gptq"]
+        available_packages["gptqmodel"]
         and available_packages["exllama_kernels"]
         and available_packages["exllamav2_kernels"]
     ):
@@ -211,9 +211,9 @@ def lower_qmodel_to_ext_kernels(

     if using_gptq:
         # Third Party
-        from auto_gptq.modeling._utils import autogptq_post_init
+        from gptqmodel.utils.model import hf_gptqmodel_post_init as gptq_post_init

-        mod_tmp = autogptq_post_init(mod_tmp, use_act_order=False)  # see Note 4
+        mod_tmp = gptq_post_init(mod_tmp, use_act_order=False)  # see Note 4

     mod.to(currDev)
     logger.info(mod)
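
Editor's note: the module-level try/except plus a renamed availability flag is what keeps the exllama-backed wrappers optional. A small, hedged illustration of the same idiom follows, written standalone rather than as fms_mo's actual module; the gptqmodel import path is the one introduced elsewhere in this commit.

```python
# Optional-dependency guard, mirroring fms_mo/fx/utils.py after this commit:
# attempt the gptqmodel-backed import once, remember the outcome, and only
# offer kernel-backed QLinear classes when the package is importable.
try:
    from gptqmodel.nn_modules.qlinear.exllama import ExllamaQuantLinear

    gptqmodel_available = True
except ImportError:
    gptqmodel_available = False


def constrained_qclasses() -> list:
    """Return quantized-linear classes that carry kernel-imposed feature constraints."""
    qclasses = []
    if gptqmodel_available:
        # fms_mo's QLinearExv1WI4AF16 / QLinearExv2WI4AF16 wrap this kernel family.
        qclasses.append(ExllamaQuantLinear)
    return qclasses
```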

fms_mo/modules/linear.py

Lines changed: 8 additions & 8 deletions
@@ -1402,13 +1402,13 @@ def forward(self, x: torch.Tensor) -> torch.Tensor:

 try:
     # Third Party
-    from auto_gptq.nn_modules.qlinear.qlinear_exllama import (
-        QuantLinear as QLinearExllamaV1,
+    from gptqmodel.nn_modules.qlinear.exllama import (
+        ExllamaQuantLinear as QLinearExllamaV1,
     )
-    from auto_gptq.nn_modules.qlinear.qlinear_exllamav2 import (
-        QuantLinear as QLinearExllamaV2,
+    from gptqmodel.nn_modules.qlinear.exllamav2 import (
+        ExllamaV2QuantLinear as QLinearExllamaV2,
     )
-    from auto_gptq.nn_modules.qlinear.qlinear_exllamav2 import ext_gemm_half_q_half
+    from gptqmodel.nn_modules.qlinear.exllamav2 import ext_gemm_half_q_half
     from exllama_kernels import prepare_buffers, set_tuning_params
     from transformers.pytorch_utils import Conv1D

@@ -1515,7 +1515,7 @@ def forward(self, x):
             Tensor: Output tensor of shape (batch_size, out_features).
         """
         with torch.no_grad():
-            x = torch.ops.autogptq_gemm.exv1_i4f16(x.half(), self.q4, self.width)
+            x = torch.ops.gptqmodel_gemm.exv1_i4f16(x.half(), self.q4, self.width)

             if self.bias is not None:
                 x.add_(self.bias)
@@ -1665,7 +1665,7 @@ def from_fms_mo(cls, fms_mo_qlinear, **kwargs):
         if kwargs.get(
             "useInductor", False
         ):  # anything other than False or None will use torch wrapped version
-            qlinear_ex.extOp = torch.ops.autogptq_gemm.exv2_i4f16
+            qlinear_ex.extOp = torch.ops.gptqmodel_gemm.exv2_i4f16
         else:
             qlinear_ex.extOp = ext_gemm_half_q_half

@@ -1701,7 +1701,7 @@ def forward(self, x, force_cuda=False):

 except ModuleNotFoundError:
     logger.warning(
-        "AutoGPTQ is not properly installed. "
+        "GPTQModel is not properly installed. "
         "QLinearExv1WI4AF16 and QLinearExv2WI4AF16 wrappers will not be available."
     )

fms_mo/run_quant.py

Lines changed: 1 addition & 1 deletion
@@ -141,7 +141,7 @@ def run_gptq(model_args, data_args, opt_args, gptq_args):
         damp_percent=gptq_args.damp_percent,
     )

-    # Add custom model_type mapping to auto_gptq LUT so GPTQModel can recognize them.
+    # Add custom model_type mapping to gptqmodel LUT so GPTQModel can recognize them.
     for mtype, cls in custom_gptq_classes.items():
         SUPPORTED_MODELS.append(mtype)
         MODEL_MAP[mtype] = cls
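
Editor's note: for context on the lookup-table registration above, here is a hedged sketch of the idea in isolation. The import paths for `SUPPORTED_MODELS`, `MODEL_MAP`, and `custom_gptq_classes` are assumptions; this hunk does not show where `run_quant.py` actually imports them from.

```python
# Hedged sketch: register fms_mo's custom model wrappers with gptqmodel's lookup
# tables so GPTQModel can resolve model_type values it does not support natively.
# ASSUMPTION: both import locations below are guesses, not taken from the diff.
from gptqmodel.models.auto import MODEL_MAP, SUPPORTED_MODELS  # assumed path
from fms_mo.utils.custom_gptq_models import custom_gptq_classes  # assumed name

for mtype, cls in custom_gptq_classes.items():
    if mtype not in SUPPORTED_MODELS:   # avoid duplicate entries if run twice
        SUPPORTED_MODELS.append(mtype)
    MODEL_MAP[mtype] = cls              # map HF config.model_type -> wrapper class
```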

fms_mo/training_args.py

Lines changed: 1 addition & 1 deletion
@@ -172,7 +172,7 @@ class FMSMOArguments(TypeChecker):

 @dataclass
 class GPTQArguments(TypeChecker):
-    """Dataclass for GPTQ related arguments that will be used by auto-gptq."""
+    """Dataclass for GPTQ related arguments that will be used by gptqmodel."""

     bits: int = field(default=4, metadata={"choices": [2, 3, 4, 8]})
     group_size: int = field(default=-1)
