| ✅ Quantized Model Loading | Load HuggingFace models with various quantization techniques (including AWQ, GPTQ, GGUF) in 4-bit or 8-bit precision, featuring customizable settings. |
| ✅ Advanced Dataset Management | Load, preprocess, and split datasets with flexible configurations |
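
As a concrete reference for the quantized-loading feature above, here is a minimal sketch of 4-bit loading via the standard `transformers` + `bitsandbytes` route; the checkpoint name is a placeholder, and this project's own loader may expose different options.

```python
# Illustrative 4-bit load using stock transformers/bitsandbytes; the
# checkpoint is a placeholder and the project's loader may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights (use load_in_8bit=True for 8-bit)
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for matmul compute
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",                   # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
```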
- `percdamp (float)`: Dampening percentage for the Hessian update. Default: 0.01.
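
For context, the dampening step in GPTQ-style solvers conventionally adds a small multiple of the mean Hessian diagonal before inversion; the sketch below shows that standard technique, not necessarily this project's exact code.

```python
# Standard GPTQ-style dampening (sketch): add percdamp * mean(diag(H)) to the
# Hessian diagonal so the inverse/Cholesky step stays numerically stable.
import torch

def dampen_hessian(H: torch.Tensor, percdamp: float = 0.01) -> torch.Tensor:
    damp = percdamp * torch.mean(torch.diagonal(H))
    idx = torch.arange(H.shape[0], device=H.device)
    H = H.clone()
    H[idx, idx] += damp
    return H
```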
AWQ adapts quantization based on activation patterns.
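
As a rough illustration of that idea (a sketch of the general technique, not this project's `AWQQuantizer` internals), per-input-channel scales can be derived from calibration activations and folded into the weights before rounding:

```python
# Illustrative activation-aware scaling in the spirit of AWQ.
import torch

def activation_aware_scales(weight: torch.Tensor,
                            acts: torch.Tensor,
                            alpha: float = 0.5) -> torch.Tensor:
    """weight: (out_features, in_features); acts: (n_samples, in_features)."""
    act_mag = acts.abs().mean(dim=0)   # per-input-channel activation magnitude
    w_mag = weight.abs().mean(dim=0)   # per-input-channel weight magnitude
    # Interpolate between activation- and weight-driven scaling: channels with
    # large activations get up-scaled weights, shrinking their relative
    # quantization error.
    s = act_mag.pow(alpha) / w_mag.pow(1.0 - alpha).clamp(min=1e-8)
    return s.clamp(min=1e-5)

# The scaled weight (weight * s) is what gets quantized; at inference the
# layer input is divided by s, or s is folded into the previous layer.
```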
:inherited-members:
:undoc-members:
**Inference with AWQ Quantized Models:** Models quantized using `AWQQuantizer` (or via the high-level API with the 'awq' method) are returned as standard Hugging Face `PreTrainedModel` instances. The quantization is handled transparently by the custom `QuantizedLinear` layers. Therefore, inference can be performed using the usual methods like `.generate()` or by directly calling the model, with no special steps required for AWQ-quantized layers.
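
A minimal sketch follows; the checkpoint name is a placeholder and `model` stands for whatever the quantizer returned. The point is that no AWQ-specific call is needed:

```python
# `model` is assumed to be the PreTrainedModel returned by AWQQuantizer (or
# the high-level API with method='awq'); it is used like any other HF model.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")  # placeholder checkpoint
inputs = tokenizer("Quantization lets large models", return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```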
**Key `__init__` Parameters for `AWQQuantizer`:**
- ``group_size (int)``: Size of the quantization group. Default: 128.
- ``zero_point (bool)``: Whether to use zero-point quantization for activations. Default: True.
- ``version (str)``: AWQ algorithm version.
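
Putting the documented defaults together, a hypothetical construction sketch; only the three parameters above are documented, so the import path and any omitted arguments are assumptions:

```python
# Hypothetical sketch: only group_size, zero_point, and version are documented
# above; adjust the import to the actual package layout.
# from <your package> import AWQQuantizer

quantizer = AWQQuantizer(
    group_size=128,    # documented default: quantization group size
    zero_point=True,   # documented default: zero-point for activations
    # `version` is left at its default; its value is not documented above.
)
```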