@@ -18,27 +18,19 @@ pip install -e .
## Usage
### Quantize model
First, add a config file named "quant_config.json" to the model path.
- For Baichuan or Llama models, config should be like:
+ For currently supported models, the config should look like this:

```json
{
25- "qkv_proj" : " per-tensor" ,
26- "o_proj" : " per-tensor" ,
27- "gate_up_proj" : " per-tensor" ,
28- "down_proj" : " per-tensor"
29- }
30- ```
31-
32- As for Opt model, config should be like:
33-
34- ``` json
35- {
36- "qkv_proj" : " per-tensor" ,
37- "o_proj" : " per-tensor" ,
25+ "qkv" : " per-tensor" ,
26+ "out" : " per-tensor" ,
3827 "fc1" : " per-tensor" ,
3928 "fc2" : " per-tensor"
4029}
```
+
+ "qkv" stands for the QKV matmul of attention, and "out" stands for the attention output matmul.
+ "fc1" and "fc2" are the FFN layers, which may be referred to as "gate_up" and "down" in Llama-like models.
You can set each value to "per-tensor" or "per-token" to choose the quantization granularity you want.
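+
+ For example, a config that keeps the attention matmuls per-tensor but quantizes both FFN layers per-token (an illustrative combination, not a recommended default) would be:
+
+ ```json
+ {
+     "qkv": "per-tensor",
+     "out": "per-tensor",
+     "fc1": "per-token",
+     "fc2": "per-token"
+ }
+ ```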
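+
+ If the two granularity settings are unfamiliar: "per-tensor" shares a single quantization scale across a whole activation tensor, while "per-token" computes one scale per token (row). A minimal PyTorch illustration of the difference, assuming symmetric int8 quantization (simplified; not the exact code used by this project):
+
+ ```python
+ import torch
+
+ # toy activation tensor of shape [num_tokens, hidden_dim]
+ x = torch.randn(4, 8)
+
+ # "per-tensor": a single scale shared by every element of x
+ scale_tensor = x.abs().max() / 127.0
+ q_tensor = (x / scale_tensor).round().clamp(-128, 127).to(torch.int8)
+
+ # "per-token": one scale per row, so each token is quantized independently
+ scale_token = x.abs().amax(dim=-1, keepdim=True) / 127.0
+ q_token = (x / scale_token).round().clamp(-128, 127).to(torch.int8)
+ ```
+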
Once the config is set, generate scales and quantize the model with the following command:
@@ -72,10 +64,24 @@ Model support list:
| ---------| ----------------------------|
| LLaMA-2 | 7B/13B/70B |
| LLaMA | 7B/13B/30B/65B |
- | Mistral | Soon |
- | OPT | 6.7B/13B/30B |
- | Baichuan-2 | 13B (7B Soon) |
- | Baichuan | 13B (7B Soon) |
+ | Mixtral | 8x7B |
+ | OPT | 6.7B/13B/30B |
+ | Baichuan-2 | 7B/13B |
+ | Baichuan | 7B/13B |
+
+ ## Performance and inference efficiency
+ Detailed data coming soon.
+
+ Cases:
+
+ [codellama-13b with A40](https://github.com/vllm-project/vllm/pull/1508#issuecomment-1824133140). Tested with vLLM.
+
+ [llama-13b with A100](https://github.com/vllm-project/vllm/pull/1508#issuecomment-1853826414). Tested with vLLM.
+
## Reference
If you find SmoothQuant useful or relevant to your research, please cite their paper: