This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Commit 62c1b7a ("add doc"), parent 5c74843

1 file changed: +62 −0 lines changed


docs/quantization.md

Lines changed: 62 additions & 0 deletions
@@ -118,6 +118,68 @@ python3 torchchat.py export llama3 --quantize '{"embedding": {"bitwidth": 4, "gr
python3 torchchat.py generate llama3 --pte-path llama3.pte --prompt "Hello my name is"
```

## Experimental TorchAO lowbit kernels

### Use

The a8wxdq quantization scheme dynamically quantizes activations to 8 bits and quantizes the weights groupwise with a specified bitwidth and groupsize.

It takes the arguments bitwidth (2, 3, 4, 5, 6, or 7), groupsize, and has_weight_zeros (true or false).

The argument has_weight_zeros indicates whether the weights are quantized with scales only (has_weight_zeros: false) or with both scales and zeros (has_weight_zeros: true).

Roughly speaking, {bitwidth: 4, groupsize: 256, has_weight_zeros: false} is similar to GGML's Q4_0 quantization scheme.
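To make the scales-only case concrete, here is a minimal sketch of groupwise symmetric quantization: each group of weights shares a single scale, and each weight becomes a small signed integer. This is an illustration only, not torchao's implementation; the function names are hypothetical.

```python
# Sketch of groupwise, scales-only weight quantization (has_weight_zeros: false).
# Illustrative only -- not torchao's implementation.

def quantize_group(weights, bitwidth=4):
    # Symmetric quantization: one scale per group, integers in
    # [-2**(bitwidth-1), 2**(bitwidth-1) - 1].
    qmax = 2 ** (bitwidth - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid scale of 0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def quantize_groupwise(weights, groupsize, bitwidth=4):
    # Split the weight vector into groups and quantize each independently.
    assert len(weights) % groupsize == 0
    return [
        quantize_group(weights[i:i + groupsize], bitwidth)
        for i in range(0, len(weights), groupsize)
    ]

if __name__ == "__main__":
    groups = quantize_groupwise([0.5, -1.0, 0.25, 0.75, 2.0, -2.0, 1.0, 0.0], groupsize=4)
    for q, scale in groups:
        print(q, scale)
```

With has_weight_zeros: true, each group would additionally store a zero point, letting the integer range represent asymmetric weight distributions.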

You should expect high performance on ARM CPUs when bitwidth is 2, 3, 4, or 5 and groupsize is divisible by 16. With other platforms or argument choices, a slow fallback kernel is used, and you will see warnings about this during quantization.

### Setup

To use a8wxdq, you must first set up the torchao experimental kernels. These kernels only work on devices with ARM CPUs, for example Mac computers with Apple Silicon.
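A quick way to check whether your machine is in the supported category is to inspect the CPU architecture. This helper is illustrative, not part of torchchat, and assumes a POSIX shell with `uname` available:

```shell
# Illustrative helper (not part of torchchat): report whether this machine's
# CPU architecture gets the fast lowbit kernels.
supports_lowbit_kernels() {
  case "$1" in
    arm64|aarch64) echo "yes" ;;  # Apple Silicon / ARM Linux
    *) echo "no" ;;               # slow fallback kernel territory
  esac
}

supports_lowbit_kernels "$(uname -m)"
```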

From the torchchat root directory, run:

```
sh torchchat/utils/scripts/build_torchao_experimental.sh
```

This should take about 10 seconds to complete. Once finished, you can use a8wxdq in torchchat.

Note: if you want to use the new kernels in the AOTI and C++ runners, you must pass the flag link_torchao when running the scripts that build the runners:

```
sh torchchat/utils/scripts/build_native.sh aoti link_torchao
```

```
sh torchchat/utils/scripts/build_native.sh et link_torchao
```

### Examples

#### Eager mode

```
python3 torchchat.py generate llama3 --device cpu --dtype float32 --quantize '{"linear:a8wxdq": {"bitwidth": 4, "groupsize": 256, "has_weight_zeros": false}}'
```

#### torch.compile

```
python3 torchchat.py generate llama3 --device cpu --dtype float32 --quantize '{"linear:a8wxdq": {"bitwidth": 4, "groupsize": 256, "has_weight_zeros": false}}' --compile
```

As with PyTorch in general, you can experiment with performance on a different number of threads by setting OMP_NUM_THREADS. For example:

```
OMP_NUM_THREADS=6 python3 torchchat.py generate llama3 --device cpu --dtype float32 --quantize '{"linear:a8wxdq": {"bitwidth": 4, "groupsize": 256, "has_weight_zeros": false}}' --compile
```

#### AOTI

```
python3 torchchat.py export llama3 --device cpu --dtype float32 --quantize '{"linear:a8wxdq": {"bitwidth": 4, "groupsize": 256, "has_weight_zeros": false}}' --output-dso llama3.so
python3 torchchat.py generate llama3 --dso-path llama3.so --prompt "Hello my name is"
```

#### ExecuTorch

```
python3 torchchat.py export llama3 --device cpu --dtype float32 --quantize '{"linear:a8wxdq": {"bitwidth": 4, "groupsize": 256, "has_weight_zeros": false}}' --output-pte llama3.pte
```

Note: the exported *.pte file can only be run with torchchat's ExecuTorch C++ runner, built using the setup instructions above.

Also note that the ExecuTorch op that wraps the new torchao kernel is currently single threaded.
## Quantization Profiles

Four [sample profiles](https://github.com/pytorch/torchchat/tree/main/torchchat/quant_config/) are included with the torchchat distribution: `cuda.json`, `desktop.json`, `mobile.json`, `pi5.json`
