python3 torchchat.py generate llama3 --pte-path llama3.pte --prompt "Hello my name is"
```
## Experimental TorchAO lowbit kernels
### Use
The quantization scheme a8wxdq dynamically quantizes activations to 8 bits and quantizes the weights in a groupwise manner with a specified bitwidth and groupsize.

It takes arguments bitwidth (2, 3, 4, 5, 6, 7), groupsize, and has_weight_zeros (true, false).

The argument has_weight_zeros indicates whether the weights are quantized with scales only (has_weight_zeros: false) or with both scales and zeros (has_weight_zeros: true).

Roughly speaking, {bitwidth: 4, groupsize: 256, has_weight_zeros: false} is similar to GGML's Q4_0 quantization scheme.

You should expect high performance on ARM CPUs if bitwidth is 2, 3, 4, or 5 and groupsize is divisible by 16. On other platforms, or with other argument choices, a slow fallback kernel is used, and you will see warnings about this during quantization.
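As an illustration of how these arguments fit together, here is a sketch of a quantize config matching the Q4_0-like example above. The linear:a8wxdq key is an assumption based on the scheme name; verify the exact key against the quantization docs in your checkout.

```
{
  "linear:a8wxdq": {
    "bitwidth": 4,
    "groupsize": 256,
    "has_weight_zeros": false
  }
}
```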
### Setup
To use a8wxdq, you must set up the torchao experimental kernels. These will only work on devices with ARM CPUs, for example on Mac computers with Apple Silicon.
From the torchchat root directory, run
```
sh torchchat/utils/scripts/build_torchao_experimental.sh
```
This should take about 10 seconds to complete. Once finished, you can use a8wxdq in torchchat.
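For example, a quantized eager-mode run might look like the following sketch; again, the linear:a8wxdq config key is assumed from the scheme name, so check it against your version's docs:

```
python3 torchchat.py generate llama3 --dtype float32 --quantize '{"linear:a8wxdq": {"bitwidth": 4, "groupsize": 256, "has_weight_zeros": false}}' --prompt "Hello my name is"
```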
Note: if you want to use the new kernels in the AOTI and C++ runners, you must pass the flag link_torchao when running the scripts that build the runners.
```
sh torchchat/utils/scripts/build_native.sh aoti link_torchao
```
```
sh torchchat/utils/scripts/build_native.sh et link_torchao
```

Note: only the ExecuTorch C++ runner in torchchat, when built using the instructions above, can run the exported *.pte file.

Also note that the ExecuTorch op that wraps the new torchao kernel is currently single-threaded.
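As a rough sketch of invoking that runner — the binary path cmake-out/et_run and the flags -z (tokenizer path) and -i (prompt) are assumptions based on torchchat's llama2.c-style runners, so check your build output:

```
cmake-out/et_run llama3.pte -z /path/to/tokenizer.model -i "Hello my name is"
```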
## Quantization Profiles
Four [sample profiles](https://github.com/pytorch/torchchat/tree/main/torchchat/quant_config/) are included with the torchchat distribution: `cuda.json`, `desktop.json`, `mobile.json`, and `pi5.json`.
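A profile can stand in for an inline JSON config. For instance — a sketch, assuming --quantize accepts a path to one of these JSON files as in recent torchchat versions:

```
python3 torchchat.py generate llama3 --quantize torchchat/quant_config/mobile.json --prompt "Hello my name is"
```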