
Commit bfb5167

Fix: Correct device placement for QuantizedLinear in AWQ
Addresses an AttributeError in AWQ quantization where QuantizedLinear, an nn.Module, was incorrectly passed to move_to_device, which expects a tensor. This change ensures QuantizedLinear modules are moved to the target device using the correct .to(device) method.

Additionally, this commit includes updates to the documentation:

- Docs for AWQ quantization were updated to include parameters like scale_dtype, enable_mnn_kernel, and batch_size.
- Clarified inference procedures for AWQ-quantized models.
- README.md was updated to list AWQ as a supported method and the roadmap was revised.
1 parent 8e517be commit bfb5167

File tree

3 files changed: +14 -6 lines changed

README.md

Lines changed: 2 additions & 2 deletions

@@ -21,7 +21,7 @@ The goal of QuantLLM is to **democratize LLM training**, especially in low-resou
 
 | Feature | Description |
 |----------------------------------|-------------|
-| ✅ Quantized Model Loading | Load any HuggingFace model in 4-bit or 8-bit precision with customizable quantization settings |
+| ✅ Quantized Model Loading | Load HuggingFace models with various quantization techniques (including AWQ, GPTQ, GGUF) in 4-bit or 8-bit precision, featuring customizable settings. |
 | ✅ Advanced Dataset Management | Load, preprocess, and split datasets with flexible configurations |
 | ✅ LoRA / QLoRA Fine-Tuning | Memory-efficient fine-tuning with customizable LoRA parameters |
 | ✅ Comprehensive Training | Advanced training loop with mixed precision, gradient accumulation, and early stopping |

@@ -76,7 +76,7 @@ For detailed usage examples and API documentation, please refer to our:
 
 - [ ] Multi-GPU training support
 - [ ] AutoML for hyperparameter tuning
-- [ ] More quantization methods
+- [ ] Integration of additional advanced quantization algorithms and techniques.
 - [ ] Custom model architecture support
 - [ ] Enhanced logging and visualization
 - [ ] Model compression techniques

docs/api_reference/quantization.rst

Lines changed: 11 additions & 3 deletions

@@ -173,6 +173,9 @@ Main Parameters of `quantize_from_pretrained`
 - **AWQ Specific Keys:**
   - `zero_point (bool)`: Enable/disable zero-point for activations. Default: True.
   - `awq_version (str)`: AWQ algorithm version (e.g., "v1", "v2"). Default: "v2". (Maps to `version` in `AWQQuantizer`).
+  - `scale_dtype (str)`: Data type for scales (e.g., "fp32", "bf16"). Default: "fp32". (Passed to `AWQQuantizer`).
+  - `enable_mnn_kernel (bool)`: Enable MNN kernel optimizations, if applicable. Default: False. (Passed to `AWQQuantizer`).
+  - Note: `batch_size` from the common keys is used by AWQ for its calibration processing.
 - **GPTQ Specific Keys:**
   - `actorder (bool)`: Enable activation-order quantization. Default: True.
   - `percdamp (float)`: Dampening percentage for Hessian update. Default: 0.01.
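For reference, the AWQ-specific keys documented in the hunk above can be collected into a single config mapping. The sketch below is illustrative only: the key names and defaults are taken from this section, but how the dict is wired into `quantize_from_pretrained` (argument name, return values) is intentionally not shown.

```python
# Illustrative AWQ config built only from the keys documented above.
# How this mapping is passed to quantize_from_pretrained is left out;
# consult the library docs for the exact call signature.
awq_config = {
    "zero_point": True,          # enable zero-point for activations (default: True)
    "awq_version": "v2",         # maps to `version` in AWQQuantizer (default: "v2")
    "scale_dtype": "fp32",       # data type for scales, e.g. "fp32" or "bf16"
    "enable_mnn_kernel": False,  # MNN kernel optimizations, if applicable
    "batch_size": 2,             # common key, reused for AWQ calibration batching
}
```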
@@ -260,10 +263,15 @@ AWQ adapts quantization based on activation patterns.
    :inherited-members:
    :undoc-members:
 
+**Inference with AWQ Quantized Models:** Models quantized using `AWQQuantizer` (or via the high-level API with the 'awq' method) are returned as standard Hugging Face `PreTrainedModel` instances. The quantization is handled transparently by the custom `QuantizedLinear` layers. Therefore, inference can be performed using the usual methods like `.generate()` or by directly calling the model, with no special steps required for AWQ-quantized layers.
+
 **Key `__init__` Parameters for `AWQQuantizer`:**
-- ``group_size (int)``: Group size for quantization.
-- ``zero_point (bool)``: Enable zero-point computation for activations.
-- ``version (str)``: AWQ algorithm version.
+- ``group_size (int)``: Size of the quantization group. Default: 128.
+- ``zero_point (bool)``: Whether to use zero-point quantization for activations. Default: True.
+- ``version (str)``: AWQ algorithm version (e.g., "v1", "v2"). Default: "v2".
+- ``scale_dtype (str)``: Data type for scales (e.g., "fp32", "bf16"). Default: "fp32".
+- ``enable_mnn_kernel (bool)``: Whether to enable MNN kernel optimizations, if applicable. Default: False.
+- ``batch_size (int)``: Batch size for calibration data processing during the `quantize` method. Default: 2.
 
 **Usage Example (Direct):**

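The inference note added in this file can be illustrated with ordinary Hugging Face generation code. In the sketch below, `model` and `tokenizer` are assumed to come from the AWQ quantization step (for example the high-level API with the 'awq' method); nothing in the snippet is AWQ-specific, which is exactly what the added documentation states.

```python
import torch

# `model` is assumed to be the PreTrainedModel returned by the AWQ path,
# with its QuantizedLinear layers already in place; `tokenizer` matches it.
model.eval()

prompt = "Quantization lets large models run on"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)

# Decoding needs no AWQ-specific handling either.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```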
quantllm/quant/awq.py

Lines changed: 1 addition & 1 deletion

@@ -192,7 +192,7 @@ def _quantize_layer(
                 format="awq"
             )
         )
-        quantized = move_to_device(quantized, target_device)
+        quantized = quantized.to(target_device)
 
         # Ensure layer parameters are on the target_device for computation
         layer = move_to_device(layer, target_device)
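The one-line change above comes down to moving a module rather than a tensor. Below is a minimal sketch of the distinction; the `move_to_device` body shown is a simplified stand-in written only to show why handing an `nn.Module` to a tensor-oriented helper can raise an AttributeError, and is not the actual quantllm implementation.

```python
import torch
import torch.nn as nn

def move_to_device(tensor: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Stand-in for a tensor-oriented helper: it touches tensor-only attributes,
    # so passing an nn.Module here fails (modules have no `.is_cuda`).
    if tensor.is_cuda and device.type == "cpu":
        return tensor.cpu()
    return tensor.to(device, non_blocking=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
quantized = nn.Linear(16, 16)        # stand-in for a QuantizedLinear module

# move_to_device(quantized, device)  # would raise AttributeError on a module
quantized = quantized.to(device)     # correct: nn.Module.to() moves every
                                     # parameter and buffer of the module
```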

0 commit comments
