**docs/fms_mo_design.md** (1 addition & 1 deletion)
@@ -82,7 +82,7 @@ FMS Model Optimizer supports FP8 in two ways:
### GPTQ (weight-only compression, or sometimes referred to as W4A16)
-For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such cases, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and decompress it only at the GPU cache level, right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs needed to serve the model by 4X without sacrificing inference speed. (Some constraints may apply; for example, the batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing the `auto_gptq` package. See this [example](../examples/GPTQ/).
+For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such cases, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and decompress it only at the GPU cache level, right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs needed to serve the model by 4X without sacrificing inference speed. (Some constraints may apply; for example, the batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing the `gptqmodel` package. See this [example](../examples/GPTQ/).
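To make the flow above concrete, here is a minimal, hedged sketch of weight-only (W4A16) quantization with `gptqmodel`, mirroring the snippets in the example README below. The model ID, calibration text, and save path are placeholders, and `quantize()`/`save_quantized()` follow the auto_gptq-style API that `gptqmodel` exposes, so exact method names may differ between versions.

```python
# Minimal sketch (not part of this diff): weight-only 4-bit (W4A16) compression with gptqmodel.
# Placeholders: model id, calibration text, and output directory. In practice you would use a
# proper calibration set, as in the GPTQ example README.
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder; any causal LM supported by gptqmodel
quantize_config = QuantizeConfig(bits=4, group_size=128, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(model_id)
calibration = [tokenizer("GPTQ calibrates the 4-bit weights on a small set of sample text.")]

model = GPTQModel.from_pretrained(model_id, quantize_config=quantize_config)
model.quantize(calibration)                   # computes and packs the 4-bit weights
model.save_quantized("llama3-8b-gptq-w4a16")  # compressed checkpoint; decompressed on the fly at inference
```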
**examples/GPTQ/README.md** (22 additions & 18 deletions)
@@ -1,12 +1,12 @@
# Generative Pre-Trained Transformer Quantization (GPTQ) of LLAMA-3-8B Model
-For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such cases, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and decompress it only at the GPU cache level, right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs needed to serve the model by 4X without sacrificing inference speed (some constraints may apply; for example, the batch size cannot exceed a certain number). FMS Model Optimizer supports this "weight-only compression", sometimes referred to as W4A16 or [GPTQ](https://arxiv.org/pdf/2210.17323), by leveraging `auto_gptq`, a third-party library, to perform quantization.
+For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such cases, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and decompress it only at the GPU cache level, right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs needed to serve the model by 4X without sacrificing inference speed (some constraints may apply; for example, the batch size cannot exceed a certain number). FMS Model Optimizer supports this "weight-only compression", sometimes referred to as W4A16 or [GPTQ](https://arxiv.org/pdf/2210.17323), by leveraging `gptqmodel`, a third-party library, to perform quantization.
## Requirements
- [FMS Model Optimizer requirements](../../README.md#requirements)
-`auto-gptq` is needed for this example. Use `pip install auto-gptq` or [install from source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source)
+- `gptqmodel` is needed for this example. Use `pip install gptqmodel` or [install from source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file)
- Optionally for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness)
```
pip install lm-eval
```
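The evaluation section itself is not shown in this hunk. Purely as an illustration, and with a checkpoint path and task name of our own choosing (both placeholders), lm-eval can also be driven from Python:

```python
# Hedged illustration (not from the diff): invoking lm-eval from Python.
# The checkpoint path and task are placeholders; see the README's evaluation section for the
# exact commands used in the example.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=llama3-8b-gptq-w4a16",  # placeholder path to a quantized model
    tasks=["lambada_openai"],
    batch_size=8,
)
print(results["results"])
```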
@@ -32,7 +32,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
> - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
> - If you have trouble downloading Llama family of models from Hugging Face ([LLama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead
-2. **Quantize the model** using the data generated above. The following command will kick off the quantization job (by invoking `auto_gptq` under the hood). Additional acceptable arguments can be found in [GPTQArguments](../../fms_mo/training_args.py#L127).
+2. **Quantize the model** using the data generated above. The following command will kick off the quantization job (by invoking `gptqmodel` under the hood). Additional acceptable arguments can be found in [GPTQArguments](../../fms_mo/training_args.py#L127).
```bash
python -m fms_mo.run_quant \
```
@@ -49,8 +49,8 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
> - In GPTQ, `group_size` is a trade-off between accuracy and speed, but there is an additional constraint that the `in_features` of the Linear layer to be quantized needs to be an **integer multiple** of `group_size`, i.e. some models may have to use a smaller `group_size` than the default (a quick divisibility check is sketched after the tips below).
> [!TIP]
-> 1. If you see error messages regarding `exllama_kernels` or `undefined symbol`, try install `auto-gptq` from [source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source).
-> 2. If you need to work on a custom model that is not supported by AutoGPTQ, please add your class wrapper [here](../../fms_mo/utils/custom_gptq_models.py). Additional information [here](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#customize-model).
+> 1. If you see error messages regarding `exllama_kernels` or `undefined symbol`, try installing `gptqmodel` from [source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file).
+> 2. If you need to work on a custom model that is not supported by GPTQModel, please add your class wrapper [here](../../fms_mo/utils/custom_gptq_models.py). Additional information [here](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file#how-to-add-support-for-a-new-model).
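As a quick way to apply the `group_size` note above, the following hedged sketch (not part of fms-mo) checks whether every `nn.Linear` layer's `in_features` is an integer multiple of a candidate `group_size`; the model ID is just the Granite model suggested earlier in this README.

```python
# Hypothetical helper (not part of fms-mo): verify the group_size constraint from the note above,
# i.e. in_features of each Linear layer to be quantized must be an integer multiple of group_size.
import torch.nn as nn
from transformers import AutoModelForCausalLM

def check_group_size(model: nn.Module, group_size: int = 128) -> None:
    bad = [
        (name, module.in_features)
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear) and module.in_features % group_size != 0
    ]
    if bad:
        print(f"group_size={group_size} is incompatible with {len(bad)} layers, e.g. {bad[:3]}")
    else:
        print(f"group_size={group_size} divides in_features of every Linear layer.")

# Placeholder model id; swap in the model you plan to quantize.
model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-8b-code")
check_group_size(model, group_size=128)
```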
3. **Inspect the GPTQ checkpoint**
```python
```
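The inspection code for this step is truncated by the hunk boundary above. As a rough, hedged sketch of what inspecting a GPTQ checkpoint can look like (the file path is an assumption; GPTQ checkpoints typically store packed `qweight`/`qzeros`/`scales` tensors per quantized Linear layer):

```python
# Hedged sketch (not from the diff): peek at the tensors inside a saved GPTQ checkpoint.
# The path is a placeholder for wherever the quantized model was saved.
from safetensors import safe_open

with safe_open("llama3-8b-gptq-w4a16/model.safetensors", framework="pt") as f:
    for name in list(f.keys())[:8]:  # first few tensors only
        tensor = f.get_tensor(name)
        print(f"{name:60s} {tuple(tensor.shape)} {tensor.dtype}")
```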
@@ -114,21 +114,25 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py)
```python
-from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
-quantize_config = BaseQuantizeConfig(
-    bits=gptq_args.bits,
-    group_size=gptq_args.group_size,
-    desc_act=gptq_args.desc_act,
-    damp_percent=gptq_args.damp_percent)
+from gptqmodel import GPTQModel, QuantizeConfig
+
+quantize_config = QuantizeConfig(
+    bits=gptq_args.bits,
+    group_size=gptq_args.group_size,
+    desc_act=gptq_args.desc_act,
+    damp_percent=gptq_args.damp_percent,
+)
```
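As general background on these knobs (not specific to this diff): `bits` and `group_size` set the weight precision and quantization granularity, `desc_act` enables activation-order quantization, which usually improves accuracy at some inference-speed cost, and `damp_percent` controls the dampening added to the Hessian diagonal during the GPTQ solve to keep it numerically stable.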
-2. Load the pre-trained model with the `auto_gptq` class/wrapper. The tokenizer is optional because we already tokenized the data in a previous step.
+2. Load the pre-trained model with the `gptqmodel` class/wrapper. The tokenizer is optional because we already tokenized the data in a previous step.
```python
-model = AutoGPTQForCausalLM.from_pretrained(
-    model_args.model_name_or_path,
-    quantize_config=quantize_config,
-    torch_dtype=model_args.torch_dtype)
+model = GPTQModel.from_pretrained(
+    model_args.model_name_or_path,
+    quantize_config=quantize_config,
+    torch_dtype=model_args.torch_dtype,
+)
```
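Note that in both the old and new libraries this `from_pretrained` call only wraps the full-precision model together with the quantization config; the actual compression happens in a later `quantize(...)` call on the tokenized calibration data, after which the packed checkpoint is saved (method names may vary slightly between `auto_gptq` and `gptqmodel` versions).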
3. Load the tokenized dataset from disk.
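The code for this step is not visible in the hunk. As a rough sketch, assuming the tokenized splits from step 1 were saved with the `datasets` library (paths are placeholders following the `<path_to_save>_train`/`<path_to_save>_test` note earlier):

```python
# Hedged sketch (not from the diff): reload the tokenized splits produced in step 1.
from datasets import load_from_disk

train_data = load_from_disk("data/tokenized_train")  # placeholder for <path_to_save>_train
test_data = load_from_disk("data/tokenized_test")    # placeholder for <path_to_save>_test
print(train_data)
```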
@@ -143,9 +147,9 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m