docs/source/features/quantization (1 file changed: 0 additions, 6 deletions)

# AutoAWQ

- :::{warning}
- Please note that AWQ support in vLLM is currently under-optimized. We recommend using the unquantized version of the model for better
- accuracy and higher throughput. For now, AWQ mainly serves to reduce the memory footprint, and it is best suited to low-latency
- inference with a small number of concurrent requests. vLLM's AWQ implementation has lower throughput than the unquantized version.
- :::
-
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
Quantizing reduces the model's precision from FP16 to INT4, which effectively reduces the file size by ~70%.
The main benefits are lower latency and memory usage.
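
To illustrate the workflow, here is a minimal quantization sketch following the AutoAWQ README; the model name, output directory, and `quant_config` values below are illustrative placeholders, not part of this change:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Placeholder paths: substitute the model you actually want to quantize.
model_path = "mistralai/Mistral-7B-Instruct-v0.2"
quant_path = "mistral-7b-instruct-v0.2-awq"

# Common 4-bit AWQ settings; group size and kernel version may vary by model.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize the weights to INT4 and write out the quantized checkpoint.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The resulting directory can then be loaded by vLLM with `quantization="awq"` in the `LLM` constructor (or `--quantization awq` on the command line).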