<summary>Additional Model Inventory Management Commands</summary>

### Where
-This subcommand shows location of a particular model.
+This subcommand shows the location of a particular model.
```bash
python3 torchchat.py where llama3.1
```
@@ -216,7 +216,6 @@ This mode generates text based on an input prompt.
python3 torchchat.py generate llama3.1 --prompt "write me a story about a boy and his bear"
```

-[skip default]: end

### Server
This mode exposes a REST API for interacting with a model.
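For illustration, interacting with the server might look like the sketch below; the default port (5000) and the OpenAI-style `/v1/chat/completions` endpoint are assumptions and may differ from your configuration:

```bash
# Start the server in one terminal (port assumed to be 5000 by default).
python3 torchchat.py server llama3.1

# In another terminal, send an OpenAI-style chat completion request.
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Write a haiku about bears."}],
        "max_tokens": 128
      }'
```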
@@ -286,14 +285,16 @@ First, follow the steps in the Server section above to start a local server. The
streamlit run torchchat/usages/browser.py
```

+[skip default]: end
+
Use the "Max Response Tokens" slider to limit the maximum number of tokens generated by the model for each response. Click the "Reset Chat" button to remove the message history and start a fresh chat.

## Desktop/Server Execution

### AOTI (AOT Inductor)
[AOTI](https://pytorch.org/blog/pytorch2-2/) compiles models before execution for faster inference. The process creates a [DSO](https://en.wikipedia.org/wiki/Shared_library) model (represented by a file with extension `.so`)
-that is then loaded for inference. This can be done with both Python and C++ enviroments.
+that is then loaded for inference. This can be done with both Python and C++ environments.

The following example exports and executes the Llama3.1 8B Instruct
model. The first command compiles and performs the actual export.
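A sketch of those two steps might look like the following; the exact flags (`--output-dso-path`, `--dso-path`) and the output path are assumptions and may not match the current CLI exactly:

```bash
# Compile/export the model to a shared-library (.so) artifact.
python3 torchchat.py export llama3.1 --output-dso-path exportedModels/llama3.1.so

# Run inference with the exported artifact loaded for execution.
python3 torchchat.py generate llama3.1 --dso-path exportedModels/llama3.1.so --prompt "Hello my name is"
```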
docs/quantization.md: 17 additions & 8 deletions
@@ -120,23 +120,32 @@ python3 torchchat.py generate llama3 --pte-path llama3.pte --prompt "Hello my n

## Experimental TorchAO lowbit kernels

+WARNING: These kernels only work on devices with ARM CPUs, for example on Mac computers with Apple Silicon.
+
### Use
-The quantization scheme a8wxdq dynamically quantizes activations to 8 bits, and quantizes the weights in a groupwise manner with a specified bitwidth and groupsize.
+
+#### linear:a8wxdq
+The quantization scheme linear:a8wxdq dynamically quantizes activations to 8 bits, and quantizes the weights in a groupwise manner with a specified bitwidth and groupsize.
It takes arguments bitwidth (1, 2, 3, 4, 5, 6, 7), groupsize, and has_weight_zeros (true, false).
The argument has_weight_zeros indicates whether the weights are quantized with scales only (has_weight_zeros: false) or with both scales and zeros (has_weight_zeros: true).
Roughly speaking, {bitwidth: 4, groupsize: 32, has_weight_zeros: false} is similar to GGML's Q4_0 quantization scheme.

-You should expect high performance on ARM CPU if bitwidth is 1, 2, 3, 4, or 5 and groupsize is divisible by 16. With other platforms and argument choices, a slow fallback kernel will be used. You will see warnings about this during quantization.
+You should expect high performance on ARM CPU if groupsize is divisible by 16. With other platforms and argument choices, a slow fallback kernel will be used. You will see warnings about this during quantization.
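For reference, the arguments above map onto the `--quantize` JSON config roughly as in the sketch below, using the Q4_0-like settings mentioned earlier; full runnable commands appear in the usage section further down:

```bash
# Illustrative only: quantize linear layers with the Q4_0-like settings described above.
python3 torchchat.py generate llama3.1 --device cpu --dtype float32 \
  --quantize '{"linear:a8wxdq": {"bitwidth": 4, "groupsize": 32, "has_weight_zeros": false}}' \
  --prompt "Once upon a time,"
```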
+
+#### embedding:wx
+The quantization scheme embedding:wx quantizes embeddings in a groupwise manner with the specified bitwidth and groupsize. It takes arguments bitwidth (1, 2, 3, 4, 5, 6, 7) and groupsize. Unlike linear:a8wxdq, embedding:wx always quantizes with scales and zeros.
+
+You should expect high performance on ARM CPU if groupsize is divisible by 32. With other platforms and argument choices, a slow fallback kernel will be used. You will see warnings about this during quantization.
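Similarly, a minimal sketch of an embedding-only config (values chosen to satisfy the groupsize-divisible-by-32 guidance above, not an official recommendation):

```bash
# Illustrative only: quantize embeddings alone, 4-bit with groupsize 32.
python3 torchchat.py generate llama3.1 --device cpu --dtype float32 \
  --quantize '{"embedding:wx": {"bitwidth": 4, "groupsize": 32}}' \
  --prompt "Once upon a time,"
```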

### Setup
-To use a8wxdq, you must set up the torchao experimental kernels. These will only work on devices with ARM CPUs, for example on Mac computers with Apple Silicon.
+To use linear:a8wxdq and embedding:wx, you must set up the torchao experimental kernels. These will only work on devices with ARM CPUs, for example on Mac computers with Apple Silicon.

From the torchchat root directory, run
```
sh torchchat/utils/scripts/build_torchao_ops.sh
```

-This should take about 10 seconds to complete. Once finished, you can use a8wxdq in torchchat.
+This should take about 10 seconds to complete.

Note: if you want to use the new kernels in the AOTI and C++ runners, you must pass the flag link_torchao_ops when running the scripts that build the runners.

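For example, the runner build scripts might be invoked along these lines; the script name and the way the flag is passed are assumptions based on torchchat's build scripts and may differ:

```bash
# Assumed invocation: build the AOTI runner with the torchao ops linked in.
sh torchchat/utils/scripts/build_native.sh aoti link_torchao_ops

# Assumed invocation: build the ExecuTorch runner with the torchao ops linked in.
sh torchchat/utils/scripts/build_native.sh et link_torchao_ops
```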
@@ -156,17 +165,17 @@ Below we show how to use the new kernels. Except for ExecuTorch, you can specif

#### Eager mode
```
-OMP_NUM_THREADS=6 python3 torchchat.py generate llama3.1 --device cpu --dtype float32 --quantize '{"linear:a8wxdq": {"bitwidth": 4, "groupsize": 256, "has_weight_zeros": false}}' --prompt "Once upon a time," --num-samples 5
+OMP_NUM_THREADS=6 python3 torchchat.py generate llama3.1 --device cpu --dtype float32 --quantize '{"embedding:wx": {"bitwidth": 2, "groupsize": 32}, "linear:a8wxdq": {"bitwidth": 3, "groupsize": 128, "has_weight_zeros": false}}' --prompt "Once upon a time," --num-samples 5
```

#### torch.compile
```
-OMP_NUM_THREADS=6 python3 torchchat.py generate llama3.1 --device cpu --dtype float32 --quantize '{"linear:a8wxdq": {"bitwidth": 4, "groupsize": 256, "has_weight_zeros": false}}' --compile --prompt "Once upon a time," --num-samples 5
+OMP_NUM_THREADS=6 python3 torchchat.py generate llama3.1 --device cpu --dtype float32 --quantize '{"embedding:wx": {"bitwidth": 2, "groupsize": 32}, "linear:a8wxdq": {"bitwidth": 3, "groupsize": 128, "has_weight_zeros": false}}' --compile --prompt "Once upon a time," --num-samples 5
```
Note: only the ExecuTorch C++ runner in torchchat, when built using the instructions in the setup, can run the exported *.pte file. It will not work with the `python torchchat.py generate` command.
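For context, a hedged sketch of how such a *.pte file might be produced with this quantization config; the `--output-pte-path` flag is an assumption here, so defer to torchchat's ExecuTorch documentation for the authoritative command:

```bash
# Assumed export command: produce an ExecuTorch .pte artifact with the lowbit quantization config.
python3 torchchat.py export llama3.1 --dtype float32 \
  --quantize '{"embedding:wx": {"bitwidth": 2, "groupsize": 32}, "linear:a8wxdq": {"bitwidth": 3, "groupsize": 128, "has_weight_zeros": false}}' \
  --output-pte-path llama3_1.pte
```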