Commit 087d085

update doc AWQ quantization (#1795)
1 parent 6ebddf3 commit 087d085

File tree

1 file changed: +12 −4 lines changed


docs/quantization.md

Lines changed: 12 additions & 4 deletions
@@ -165,18 +165,26 @@ In this mode, all model weights are stored in BF16 and all layers are run with t
 
 ### 4-bit AWQ
 
-The compute type would be `int32_float16`
-
 **Supported on:**
 
 * NVIDIA GPU with Compute Capability >= 7.5
 
+CTranslate2 internally handles the compute type for AWQ quantization.
 In this mode, all model weights are stored in half precision and all layers are run in half precision. Other parameters like scale and zero are stored in ``int32``.
 
-For example,
+**Steps to use AWQ Quantization:**
+
+* Download an AWQ-quantized model from Hugging Face, for example [TheBloke/Llama-2-7B-AWQ](https://huggingface.co/TheBloke/Llama-2-7B-AWQ), or quantize your own model using this [AutoAWQ example](https://casper-hansen.github.io/AutoAWQ/examples/).
 
+* Convert the AWQ-quantized model to a CTranslate2 model:
 ```bash
 ct2-transformers-converter --model TheBloke/Llama-2-7B-AWQ --copy_files tokenizer.model --output_dir ct2_model
 ```
 
-We have to quantize the model with AWQ first, then convert it to CT2 format.
+* Run inference as usual with CTranslate2:
+```python
+model = ctranslate2.Generator('ct2_model', device='cuda')
+outputs = model.generate_batch([tokens])
+```
+
+Currently, CTranslate2 only supports the GEMM and GEMV kernels for AWQ quantization.
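
For the "quantize your own model" route in the new steps, the linked AutoAWQ example boils down to something like the sketch below. The model name, output directory, and quantization settings here are placeholders, and the exact AutoAWQ arguments may differ between versions:

```python
# Sketch of producing a 4-bit AWQ checkpoint with AutoAWQ before converting it
# to CTranslate2 format. Paths and settings below are illustrative only.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"   # hypothetical source model
quant_path = "llama-2-7b-awq"             # output directory for the AWQ model

# 4-bit weights with the GEMM kernel variant, one of the kernels CTranslate2 supports
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate and quantize the weights, then save the AWQ checkpoint
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The resulting directory can then be passed to `ct2-transformers-converter --model llama-2-7b-awq ...` as in the conversion step above.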

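The two inference lines in the diff assume `ctranslate2` is already imported and `tokens` is already prepared. A slightly fuller sketch, assuming the converted model sits in `ct2_model` and the original Hugging Face tokenizer is reused for encoding and decoding (the prompt and sampling parameters are just examples):

```python
# End-to-end generation with the converted AWQ model; paths, prompt, and
# sampling settings are illustrative.
import ctranslate2
import transformers

generator = ctranslate2.Generator("ct2_model", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-AWQ")

prompt = "What is AWQ quantization?"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch([tokens], max_length=128, sampling_topk=10)

# Decode the generated token ids back to text
output_text = tokenizer.decode(results[0].sequences_ids[0])
print(output_text)
```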