
Commit 046359a
Merge pull request #6 in wm_ai/autosmoothquant from model_support2 to master - <merge-MERGE #PR-6 ~fix baichuan 7B>
2 parents: 735f96c + 23f57a8

File tree: 9 files changed, +275 −229 lines


README.md

Lines changed: 24 additions & 18 deletions
@@ -18,27 +18,19 @@ pip install -e .
 ## Usage
 ### quantize model
 First add a config file named "quant_config.json" to model path.
-For Baichuan or Llama model, config should be like:
+For currenttly supported models, config should be like:
 
 ```json
 {
-  "qkv_proj": "per-tensor",
-  "o_proj": "per-tensor",
-  "gate_up_proj": "per-tensor",
-  "down_proj": "per-tensor"
-}
-```
-
-As for Opt model, config should be like:
-
-```json
-{
-  "qkv_proj": "per-tensor",
-  "o_proj": "per-tensor",
+  "qkv": "per-tensor",
+  "out": "per-tensor",
   "fc1": "per-tensor",
   "fc2": "per-tensor"
 }
 ```
+
+"qkv" stands for QKV matmul of attention, "out" stands for out matmul of attention.
+"fc1" and "fc2" are the layers of the FFNs, which might be referred to as "gate_up" and "down" in Llama-like models.
 You can set the value to "per-tensor" or "per-token" to perform the quant granularity you want.
 
 Once config is set, generate scales and do model quantization with following command:
@@ -72,10 +64,24 @@ Model support list:
 | ---------| ----------------------------|
 | LLaMA-2 | 7B/13B/70B |
 | LLaMA | 7B/13B/30B/65B |
-| Mistral | Soon |
-| OPT | 6.7B/13B/30B |
-| Baichuan-2 | 13B (7B Soon) |
-| Baichuan | 13B (7B Soon) |
+| Mixtral | 8*7B |
+| OPT | 6.7B/13B/30B |
+| Baichuan-2 | 7B/13B |
+| Baichuan | 7B/13B |
+
+## Performance and inference efficency
+Detailed data comming soon
+
+Cases:
+
+[codellama-13b with A40](https://github.com/vllm-project/vllm/pull/1508#issuecomment-1824133140). Tested with vLLM
+
+[llama-13b with A100](https://github.com/vllm-project/vllm/pull/1508#issuecomment-1853826414). Tested with vLLM
+
+
+
+
+
 
 ## Reference
 If you find SmoothQuant useful or relevant to your research, please cite their paper:
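As a concrete illustration of the quantization config introduced in the README hunk above, here is a minimal sketch of writing `quant_config.json` into a model directory before quantizing. The file name and the four keys (`qkv`, `out`, `fc1`, `fc2`) come from the diff itself; the model path and the per-token choice for `fc2` are only example values, not values prescribed by the repository.

```python
import json
from pathlib import Path

# Example model directory -- replace with the path of the model to be quantized.
model_path = Path("models/llama-13b")

# Keys follow the README above: "qkv" and "out" are the attention matmuls,
# "fc1"/"fc2" are the FFN layers ("gate_up"/"down" in Llama-like models).
# Each value may be "per-tensor" or "per-token"; the mix below is illustrative.
quant_config = {
    "qkv": "per-tensor",
    "out": "per-tensor",
    "fc1": "per-tensor",
    "fc2": "per-token",
}

model_path.mkdir(parents=True, exist_ok=True)
with open(model_path / "quant_config.json", "w") as f:
    json.dump(quant_config, f, indent=2)
```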

autosmoothquant/examples/smoothquant_model.py

Lines changed: 2 additions & 2 deletions
@@ -28,7 +28,7 @@ def parse_args():
                         help='where to save the act scales, activate when generating scales')
     parser.add_argument("--scale-input", type=str, default='scales/llama-13b',
                         help='where to save the act scales, activate when quantizing models')
-    parser.add_argument('--num-samples', type=int, default=4)
+    parser.add_argument('--num-samples', type=int, default=512)
     parser.add_argument('--seq-len', type=int, default=512)
     parser.add_argument("--model-output", type=str, default='quantized_model/llama-13b',
                         help='where to save the quantized models, activate when quantizing models')
@@ -114,4 +114,4 @@ def main():
     int8_model.save_pretrained(output_path)
 
 if __name__ == '__main__':
-    main()
+    main()
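The `--num-samples` default raised above controls how many calibration samples feed activation-scale collection before quantization. The repository's own scale-collection code is not part of this diff, so the following is only a rough, self-contained sketch of what the "per-tensor" vs "per-token" granularities named in the README mean for symmetric int8 scaling; the function and variable names here are hypothetical.

```python
import torch

def int8_scales(x: torch.Tensor, granularity: str) -> torch.Tensor:
    """Toy illustration of the two granularities named in the README.

    x: activation tensor of shape (tokens, hidden_dim).
    Returns symmetric int8 scales: a single scalar for "per-tensor",
    or one scale per token (row) for "per-token".
    """
    if granularity == "per-tensor":
        return x.abs().max() / 127.0
    elif granularity == "per-token":
        return x.abs().amax(dim=-1, keepdim=True) / 127.0
    raise ValueError(f"unknown granularity: {granularity}")

# Pretend calibration activations: in practice these would come from running
# --num-samples calibration sequences through the model; 8 tokens x 16 dims
# here just keeps the toy example small.
acts = torch.randn(8, 16)
s_tensor = int8_scales(acts, "per-tensor")   # 0-dim tensor, one scale overall
s_token = int8_scales(acts, "per-token")     # shape (8, 1), one scale per token
q = torch.clamp((acts / s_token).round(), -128, 127).to(torch.int8)
```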
