
Commit bfb5167

Fix: Correct device placement for QuantizedLinear in AWQ
Addresses an AttributeError in AWQ quantization where QuantizedLinear, an nn.Module, was incorrectly passed to move_to_device, which expects a tensor. This change ensures QuantizedLinear modules are moved to the target device using the correct .to(device) method.

Additionally, this commit includes updates to the documentation:

- Docs for AWQ quantization were updated to include parameters like scale_dtype, enable_mnn_kernel, and batch_size.
- Clarified inference procedures for AWQ-quantized models.
- README.md was updated to list AWQ as a supported method and the roadmap was revised.
1 parent 8e517be commit bfb5167

File tree

3 files changed: +14 -6 lines changed

README.md

Lines changed: 2 additions & 2 deletions

@@ -21,7 +21,7 @@ The goal of QuantLLM is to **democratize LLM training**, especially in low-resou
 
 | Feature | Description |
 |----------------------------------|-------------|
-| ✅ Quantized Model Loading | Load any HuggingFace model in 4-bit or 8-bit precision with customizable quantization settings |
+| ✅ Quantized Model Loading | Load HuggingFace models with various quantization techniques (including AWQ, GPTQ, GGUF) in 4-bit or 8-bit precision, featuring customizable settings. |
 | ✅ Advanced Dataset Management | Load, preprocess, and split datasets with flexible configurations |
 | ✅ LoRA / QLoRA Fine-Tuning | Memory-efficient fine-tuning with customizable LoRA parameters |
 | ✅ Comprehensive Training | Advanced training loop with mixed precision, gradient accumulation, and early stopping |

@@ -76,7 +76,7 @@ For detailed usage examples and API documentation, please refer to our:
 
 - [ ] Multi-GPU training support
 - [ ] AutoML for hyperparameter tuning
-- [ ] More quantization methods
+- [ ] Integration of additional advanced quantization algorithms and techniques.
 - [ ] Custom model architecture support
 - [ ] Enhanced logging and visualization
 - [ ] Model compression techniques

docs/api_reference/quantization.rst

Lines changed: 11 additions & 3 deletions

@@ -173,6 +173,9 @@ Main Parameters of `quantize_from_pretrained`
 - **AWQ Specific Keys:**
   - `zero_point (bool)`: Enable/disable zero-point for activations. Default: True.
   - `awq_version (str)`: AWQ algorithm version (e.g., "v1", "v2"). Default: "v2". (Maps to `version` in `AWQQuantizer`).
+  - `scale_dtype (str)`: Data type for scales (e.g., "fp32", "bf16"). Default: "fp32". (Passed to `AWQQuantizer`).
+  - `enable_mnn_kernel (bool)`: Enable MNN kernel optimizations, if applicable. Default: False. (Passed to `AWQQuantizer`).
+  - Note: `batch_size` from the common keys is used by AWQ for its calibration processing.
 - **GPTQ Specific Keys:**
   - `actorder (bool)`: Enable activation-order quantization. Default: True.
   - `percdamp (float)`: Dampening percentage for Hessian update. Default: 0.01.
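For reference, the AWQ-specific keys documented in the hunk above can be collected into a single config mapping. The sketch below is illustrative only: the key names and defaults are taken from this section, but how the dict is wired into `quantize_from_pretrained` (argument name, return values) is intentionally not shown.

```python
# Illustrative AWQ config built only from the keys documented above.
# How this mapping is passed to quantize_from_pretrained is left out;
# consult the library docs for the exact call signature.
awq_config = {
    "zero_point": True,          # enable zero-point for activations (default: True)
    "awq_version": "v2",         # maps to `version` in AWQQuantizer (default: "v2")
    "scale_dtype": "fp32",       # data type for scales, e.g. "fp32" or "bf16"
    "enable_mnn_kernel": False,  # MNN kernel optimizations, if applicable
    "batch_size": 2,             # common key, reused for AWQ calibration batching
}
```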
@@ -260,10 +263,15 @@ AWQ adapts quantization based on activation patterns.
    :inherited-members:
    :undoc-members:
 
+**Inference with AWQ Quantized Models:** Models quantized using `AWQQuantizer` (or via the high-level API with the 'awq' method) are returned as standard Hugging Face `PreTrainedModel` instances. The quantization is handled transparently by the custom `QuantizedLinear` layers. Therefore, inference can be performed using the usual methods like `.generate()` or by directly calling the model, with no special steps required for AWQ-quantized layers.
+
 **Key `__init__` Parameters for `AWQQuantizer`:**
-- ``group_size (int)``: Group size for quantization.
-- ``zero_point (bool)``: Enable zero-point computation for activations.
-- ``version (str)``: AWQ algorithm version.
+- ``group_size (int)``: Size of the quantization group. Default: 128.
+- ``zero_point (bool)``: Whether to use zero-point quantization for activations. Default: True.
+- ``version (str)``: AWQ algorithm version (e.g., "v1", "v2"). Default: "v2".
+- ``scale_dtype (str)``: Data type for scales (e.g., "fp32", "bf16"). Default: "fp32".
+- ``enable_mnn_kernel (bool)``: Whether to enable MNN kernel optimizations, if applicable. Default: False.
+- ``batch_size (int)``: Batch size for calibration data processing during the `quantize` method. Default: 2.
 
 **Usage Example (Direct):**

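The inference note added in this file can be illustrated with ordinary Hugging Face generation code. In the sketch below, `model` and `tokenizer` are assumed to come from the AWQ quantization step (for example the high-level API with the 'awq' method); nothing in the snippet is AWQ-specific, which is exactly what the added documentation states.

```python
import torch

# `model` is assumed to be the PreTrainedModel returned by the AWQ path,
# with its QuantizedLinear layers already in place; `tokenizer` matches it.
model.eval()

prompt = "Quantization lets large models run on"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)

# Decoding needs no AWQ-specific handling either.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```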
quantllm/quant/awq.py

Lines changed: 1 addition & 1 deletion

@@ -192,7 +192,7 @@ def _quantize_layer(
                 format="awq"
             )
         )
-        quantized = move_to_device(quantized, target_device)
+        quantized = quantized.to(target_device)
 
         # Ensure layer parameters are on the target_device for computation
         layer = move_to_device(layer, target_device)
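The one-line change above comes down to moving a module rather than a tensor. Below is a minimal sketch of the distinction; the `move_to_device` body shown is a simplified stand-in written only to show why handing an `nn.Module` to a tensor-oriented helper can raise an AttributeError, and is not the actual quantllm implementation.

```python
import torch
import torch.nn as nn

def move_to_device(tensor: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Stand-in for a tensor-oriented helper: it touches tensor-only attributes,
    # so passing an nn.Module here fails (modules have no `.is_cuda`).
    if tensor.is_cuda and device.type == "cpu":
        return tensor.cpu()
    return tensor.to(device, non_blocking=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
quantized = nn.Linear(16, 16)        # stand-in for a QuantizedLinear module

# move_to_device(quantized, device)  # would raise AttributeError on a module
quantized = quantized.to(device)     # correct: nn.Module.to() moves every
                                     # parameter and buffer of the module
```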

0 commit comments
