
Commit 2552ae2

update the llm and woq doc (#4977) (#4997)

Authored by rogerxfeng8 and Zhu Yuhua
Co-authored-by: Zhu Yuhua <[email protected]>
1 parent 5f3c8c2 commit 2552ae2

File tree: 2 files changed (+80, -62 lines)


docs/tutorials/llm.rst

Lines changed: 59 additions & 47 deletions
@@ -24,48 +24,78 @@ LLM Inference
    :header-rows: 1

    * - Model Family
-     - Verified < MODEL ID > (Huggingface hub)
+     - Verified models from Huggingface hub
+     - Dynamic KV-Cache
+     - Static KV-Cache
      - FP16
-     - INT4 WOQ
+     - INT4 WoQ
    * - Llama2
-     - "meta-llama/Llama-2-7b-hf", "meta-llama/Llama-2-13b-hf", "meta-llama/Llama-2-70b-hf"
+     - meta-llama/Llama-2-7b-hf, meta-llama/Llama-2-13b-hf, meta-llama/Llama-2-70b-hf
+     - ✅
+     - ✅
      - ✅
      - ✅
    * - Llama3
-     - "meta-llama/Meta-Llama-3-8B"
+     - meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B-Instruct
+     - ✅
+     - ✅
      - ✅
      - ✅
    * - Phi-3 mini
-     - "microsoft/Phi-3-mini-128k-instruct"
+     - microsoft/Phi-3-mini-4k-instruct, microsoft/Phi-3-mini-128k-instruct
+     - ✅
+     - ✅
      - ✅
      - ✅
    * - GPT-J
-     - "EleutherAI/gpt-j-6b"
+     - EleutherAI/gpt-j-6b
+     - ✅
+     - ✅
      - ✅
      - ✅
    * - Qwen
-     - "Qwen/Qwen-7B"
+     - Qwen/Qwen2-VL-7B-Instruct
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - GLM-Chat
+     - THUDM/glm-4-9b-chat
+     - ✅
      - ✅
      - ✅
-   * - OPT
-     - "facebook/opt-30b", "facebook/opt-1.3b"
      - ✅
-     - ❎
    * - Bloom
-     - "bigscience/bloom-7b1", "bigscience/bloom"
+     - bigscience/bloom-7b1
+     - ✅
+     - ✅
+     - ✅
+     -
+   * - Baichuan2
+     - baichuan-inc/Baichuan2-13B-Chat
+     - ✅
+     - ✅
      - ✅
-     -
-   * - ChatGLM3-6B
-     - "THUDM/chatglm3-6b"
+     -
+   * - Falcon
+     - tiiuae/falcon-40b-instruct
      - ✅
-     - ❎
-   * - Baichuan2-13B
-     - "baichuan-inc/Baichuan2-13B-Chat"
+     -
      - ✅
-     - ❎
+     -
+   * - OPT
+     - facebook/opt-6.7b, facebook/opt-30b
+     - ✅
+     -
+     - ✅
+     -

+Platforms
+~~~~~~~~~~~~~
+All above workloads are validated on Intel® Data Center Max 1550 GPU.
+The WoQ (Weight Only Quantization) INT4 workloads are also partially validated on Intel® Core™ Ultra series (Lunar Lake) with Intel® Arc™ Graphics. Refer to the Weight Only Quantization INT4 section.

-*Note*: The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp16). For other LLMs families, we are working in progress to cover those optimizations, which will expand the model list above.
+*Note*: The above verified models (including other models in the same model family, like "meta-llama/Llama-2-7b-hf" from the Llama family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp16). For other LLM families, we are working to cover those optimizations, which will expand the model list above.

 LLM fine-tuning on Intel® Data Center Max 1550 GPU
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
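For readers coming to this doc from the table above, a minimal FP16 generation sketch may help. It is not part of this patch: it assumes Intel® Extension for PyTorch* is installed with XPU support, that the checkpoint is one of the verified IDs listed above, and that the prompt and generation length are only illustrative. The `ipex.llm.optimize` call mirrors the usage shown later in the WoQ example.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # any verified model from the table above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.eval().to("xpu")

# Apply the LLM-specific optimizations mentioned above
# (indirect access KV cache, fused RoPE, prepacked TPP Linear).
model = ipex.llm.optimize(model, inplace=True, dtype=torch.float16, device="xpu")

inputs = tokenizer("Once upon a time,", return_tensors="pt").input_ids.to("xpu")
with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=32)
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```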
@@ -75,55 +105,37 @@ LLM fine-tuning on Intel® Data Center Max 1550 GPU
    :header-rows: 1

    * - Model Family
-     - Verified < MODEL ID > (Huggingface hub)
+     - Verified models from Huggingface hub
      - Mixed Precision (BF16+FP32)
      - Full fine-tuning
      - LoRA
    * - Llama2
-     - "meta-llama/Llama-2-7b-hf"
+     - meta-llama/Llama-2-7b-hf
      - ✅
      - ✅
      - ✅
    * - Llama2
-     - "meta-llama/Llama-2-70b-hf",
+     - meta-llama/Llama-2-70b-hf
      - ✅
-     -
+     -
      - ✅
    * - Llama3
-     - "meta-llama/Meta-Llama-3-8B"
+     - meta-llama/Meta-Llama-3-8B
      - ✅
      - ✅
      - ✅
    * - Qwen
-     - "Qwen/Qwen-7B"
+     - Qwen/Qwen-1.5B
      - ✅
      - ✅
      - ✅
    * - Phi-3-mini 3.8B
-     - "Phi-3-mini-4k-instruct"
-     - ✅
-     - ✅
-     - ✅
-
-LLM fine-tuning on Intel® Core™ Ultra Processors with Intel® Arc™ Graphics
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. list-table::
-   :widths: auto
-   :header-rows: 1
-
-   * - Model Family
-     - Verified < MODEL ID > (Huggingface hub)
-     - Mixed Precision (BF16+FP32)
-     - Full fine-tuning
-     - LoRA
-   * - Phi-3-mini 3.8B
-     - "Phi-3-mini-4k-instruct"
+     - Phi-3-mini-4k-instruct
      - ✅
      - ✅
      - ✅

-Check `LLM best known practice <https://github.com/intel/intel-extension-for-pytorch/tree/release/xpu/2.3.110/examples/gpu/llm>`_ for instructions to install/setup environment and example scripts..
+Check `LLM best known practice <https://github.com/intel/intel-extension-for-pytorch/tree/release/xpu/2.5.10/examples/gpu/llm>`_ for instructions to install/set up the environment and example scripts.

 Optimization Methodologies
 --------------------------
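As a rough illustration of what the LoRA column in the fine-tuning table means, the sketch below attaches low-rank adapters to one of the verified models using the Hugging Face peft library. peft and the chosen target modules are assumptions of this illustration, not something this document prescribes; the official fine-tuning scripts live behind the LLM best known practice link above.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
).to("xpu")

# LoRA freezes the base weights and trains small low-rank adapter matrices,
# here attached to the attention query/value projections.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters are trainable
# ...then run a standard training loop (e.g. transformers.Trainer) in BF16 on the XPU device.
```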
@@ -198,9 +210,9 @@ Large Language Models (LLMs) have shown remarkable performance in various natura

 However, deploying them on devices with limited resources is challenging due to their high computational and memory requirements.

-To overcome this issue, we propose quantization methods that reduce the size and complexity of LLMs. Unlike `normal quantization <https://github.com/intel/intel-extension-for-transformers/blob/main/docs/quantization.md>`_, such as w8a8, that quantizes both weights and activations, we focus on Weight-Only Quantization (WOQ), which only quantizes the weights statically. WOQ is a better trade-off between efficiency and accuracy, as the main bottleneck of deploying LLMs is the memory bandwidth and WOQ usually preserves more accuracy. Experiments on Qwen-7B, a large-scale LLM, show that we can obtain accurate quantized models with minimal loss of quality.
+To overcome this issue, we propose quantization methods that reduce the size and complexity of LLMs. Unlike `normal quantization <https://github.com/intel/intel-extension-for-transformers/blob/main/docs/quantization.md>`_, such as w8a8, that quantizes both weights and activations, we focus on Weight-Only Quantization (WoQ), which only quantizes the weights statically. WoQ is a better trade-off between efficiency and accuracy, as the main bottleneck of deploying LLMs is the memory bandwidth and WoQ usually preserves more accuracy. Experiments on Qwen-7B, a large-scale LLM, show that we can obtain accurate quantized models with minimal loss of quality.

-For more detailed information, check `WOQ INT4 <llm/int4_weight_only_quantization.html>`_.
+For more detailed information, check `WoQ INT4 <llm/int4_weight_only_quantization.html>`_.

 .. toctree::
    :hidden:
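A quick back-of-envelope calculation, not taken from the document, shows why weight-only INT4 helps with the memory-bandwidth bottleneck mentioned above; the 7B parameter count and the group size of 128 are illustrative assumptions.

```python
params = 7e9                            # e.g. a 7B-parameter model
fp16_bytes = params * 2                 # 16-bit weights
int4_bytes = params * 0.5               # 4-bit weights
scales_bytes = (params / 128) * 2       # one fp16 scale per group of 128 weights

print(f"fp16 weights: {fp16_bytes / 2**30:.1f} GiB")
print(f"int4 weights: {(int4_bytes + scales_bytes) / 2**30:.1f} GiB (including scales)")
# ~13.0 GiB vs ~3.4 GiB: roughly 3.9x less data to stream from memory per generated token.
```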

docs/tutorials/llm/int4_weight_only_quantization.md

Lines changed: 21 additions & 15 deletions
@@ -13,14 +13,22 @@ To overcome this issue, we propose quantization methods that reduce the size and

 | Support Device | RTN* | AWQ* | TEQ* | GPTQ* | AutoRound* | Data type of quantized weight |
 |:--------------:|:----------:|:----------:|:----------:|:----:|:----:|:------------------------:|
-| GPU | &#10004; | stay tuned* | stay tuned* | stay tuned* | stay tuned* | int4_fullrange |
+| GPU | &#10004; | &#10004; | stay tuned* | &#10004; | stay tuned* | int4_fullrange |

-| Model | Datatype | Platform | Device | Algorithm |
-|:--------------:|:----------:|:----------:|:----------:|:----------:|
-| Qwen-7B | INT4 | Intel® Data Center GPU Max Series and Intel® Arc™ A-Series Graphics | Intel® GPU | RTN |
-| GPT-J-6B | INT4 | Intel® Data Center GPU Max Series and Intel® Arc™ A-Series Graphics | Intel® GPU | RTN |
+| Model | Datatype | Device | Algorithm |
+|:--------------:|:----------:|:----------:|:----------:|
+| Llama3-8B | INT4 | Intel® iGPU and dGPU | RTN, GPTQ |
+| Phi3-mini | INT4 | Intel® iGPU and dGPU | RTN, GPTQ |
+| GPT-J-6B | INT4 | Intel® iGPU and dGPU | RTN, GPTQ |
+| Qwen2-7B | INT4 | Intel® iGPU and dGPU | RTN, GPTQ |
+| GLM-4-9b-chat | INT4 | Intel® iGPU and dGPU | RTN, GPTQ |

-> Note: RTN algorithm is supported by Intel® Extension for PyTorch\*. For other algorithms, we mark as 'stay tuned' and highly recommend you waiting for the availability of the INT4 models on the HuggingFace Model Hub, since the LLM quantization procedure is significantly constrained by the machine's host memory and computation capabilities.
+Validation Platforms
+* Intel® Data Center GPU Max Series
+* Intel® Arc™ A-Series Graphics
+* Intel® Core™ Ultra series
+
+> Note: For algorithms marked as 'stay tuned', we highly recommend waiting for the availability of the INT4 models on the HuggingFace Model Hub, since the LLM quantization procedure is significantly constrained by the machine's host memory and computation capabilities.

 **RTN**[[1]](#1): Rounding to Nearest (RTN) is an intuitively simple method that rounds values to the nearest integer. It boasts simplicity, requiring no additional datasets, and offers fast quantization. Besides, it can easily be applied to other datatypes like NF4 (non-uniform). Typically, it performs well on configurations such as W4G32 or W8, but worse than advanced algorithms at lower precision levels.
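To make the RTN description concrete, here is a small, self-contained sketch of W4G32-style round-to-nearest quantization in plain PyTorch. It illustrates the idea only; it is not the int4_fullrange scheme or the kernel Intel® Extension for PyTorch* actually uses, and the symmetric [-8, 7] mapping and group size are assumptions of the sketch.

```python
import torch

def rtn_quantize_w4(w: torch.Tensor, group_size: int = 32):
    """Round-to-nearest INT4 with one scale per group of `group_size` weights."""
    out_features, in_features = w.shape
    wg = w.reshape(out_features, in_features // group_size, group_size)
    scale = wg.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(wg / scale), -8, 7).to(torch.int8)  # 4-bit signed range
    return q, scale

def rtn_dequantize(q, scale, shape):
    return (q.float() * scale).reshape(shape)

w = torch.randn(4096, 4096)
q, scale = rtn_quantize_w4(w)
w_hat = rtn_dequantize(q, scale, w.shape)
print("mean absolute quantization error:", (w - w_hat).abs().mean().item())
```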

@@ -29,7 +37,7 @@ To overcome this issue, we propose quantization methods that reduce the size and
 **TEQ**[[3]](#3): To our knowledge, it is the first trainable equivalent transformation method (submitted for peer review in 202306). However, it requires more memory than other methods as model-wise loss is used and the equivalent transformation imposes certain requirements on model architecture.


-**GPTQ**[[4]](#4): GPTQ is a widely adopted method based on the Optimal Brain Surgeon. It quantizes weight block by block and fine-tunes the remaining unquantized ones to mitigate quantization errors. Occasionally, Non-positive semidefinite matrices may occur, necessitating adjustments to hyperparameters.
+**GPTQ**[[4]](#4): GPTQ is a widely adopted method based on the Optimal Brain Surgeon. It quantizes weights block by block and fine-tunes the remaining unquantized ones to mitigate quantization errors. Occasionally, non-positive semidefinite matrices may occur, necessitating adjustments to hyperparameters.

 **AutoRound**[[5]](#5): AutoRound utilizes sign gradient descent to optimize rounding values and minmax values of weights within just 200 steps, showcasing impressive performance compared to recent methods like GPTQ/AWQ. Additionally, it offers hyperparameter tuning compatibility to further enhance performance. However, due to its reliance on gradient backpropagation, it is currently not well suited for backends like ONNX.

@@ -91,15 +99,15 @@ In Weight-Only Quantization INT4 case, when using `AutoModelForCausalLM.from_pre
                                              scale_dtype=convert_dtype_str2torch("fp16"))
 ```

-When running on Intel® GPU, it will replace the linear in the model with `WeightOnlyQuantizedLinear`. After that, the model linear weight loaded by `ipex.llm.optimize` is in INT4 format, and it contains not only weight and bias information, but also scales, zero_points, and blocksize information. When optimizing transformers at the front end, Intel® Extension for PyTorch\* will use `WeightOnlyQuantizedLinear` to initialize these information in the model if they are present, otherwise, it will use `IPEXTransformerLinear` to initialize the linear parameters in the model.
+When running on Intel® GPU, it will replace the linear modules in the model with `WeightOnlyQuantizedLinear`. After that, the model linear weight loaded by `ipex.llm.optimize` is in INT4 format, and it contains not only weight and bias information, but also scales, zero_points, and blocksize information. When optimizing transformers at the front end, Intel® Extension for PyTorch\* will use `WeightOnlyQuantizedLinear` to initialize this information in the model if it is present; otherwise, it will use `IPEXTransformerLinear` to initialize the linear parameters in the model.


 ### Weight-Only Quantization Runtime
-On Intel® GPU, after using `ipex.llm.optimize`, Intel® Extension for PyTorch\* will automatically replace the original attention module with `IPEXTransformerAttnOptimizedInt4` and the original mlp module with `IPEXTransformerMLPOptimizedInt4` in the model.
+On Intel® GPU, after using `ipex.llm.optimize`, Intel® Extension for PyTorch\* will automatically replace the original attention module with `IPEXTransformerAttnOptimizedInt4` and the original MLP module with `IPEXTransformerMLPOptimizedInt4` in the model.

 The major changes between `IPEXTransformerAttnOptimizedInt4` for the INT4 scenario and `ipex.llm.optimize` for the FP16 scenario include: replacing the linear used to calculate qkv with `torch.ops.torch_ipex.mm_qkv_out_int4` and out_linear with `torch.ops.torch_ipex.mm_bias_int4`.

-The major changes between `IPEXTransformerMLPOptimizedInt4` for INT4 scenario and `ipex.llm.optimize` for FP16 scenario include: replace the linear used in mlp with `torch.ops.torch_ipex.mm_bias_int4`, if activation is used in the mlp module, then correspondingly, it will be replaced with our fused linear+activation kernel, such as `torch.ops.torch_ipex.mm_silu_mul_int4`.
+The major changes between `IPEXTransformerMLPOptimizedInt4` for the INT4 scenario and `ipex.llm.optimize` for the FP16 scenario include: replacing the linear used in the MLP with `torch.ops.torch_ipex.mm_bias_int4`; if an activation is used in the MLP module, it will correspondingly be replaced with our fused linear+activation kernel, such as `torch.ops.torch_ipex.mm_silu_mul_int4`.

 ### Weight-Only Quantization Linear Dispatch
 As explained before, after applying `ipex.llm.optimize`, the linear kernel that Intel® Extension for PyTorch* has registered to substitute the original linear will be used in the model.
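A quick way to confirm the module replacement described above is to walk the module tree after `ipex.llm.optimize` has run with a WoQ config, as in the usage example later in this document. This is a sketch, not part of the patch; the filter strings simply match the class names mentioned in the text.

```python
# `qmodel` is the model returned by ipex.llm.optimize(...) in the usage example below.
for name, module in qmodel.named_modules():
    cls = module.__class__.__name__
    if "Int4" in cls or "WeightOnlyQuantized" in cls:
        print(f"{name}: {cls}")
# Expected hits include IPEXTransformerAttnOptimizedInt4,
# IPEXTransformerMLPOptimizedInt4 and WeightOnlyQuantizedLinear.
```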
@@ -142,13 +150,11 @@ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
 prompt = "Once upon a time, there existed a little girl,"
 inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

-# woq_quantization_config = WeightOnlyQuantConfig(compute_dtype="fp16", weight_dtype="int4_fullrange", scale_dtype="fp16", group_size=64)
-# qmodel = AutoModelForCausalLM.from_pretrained(model_name, device_map="xpu", quantization_config=woq_quantization_config, trust_remote_code=True)
-
-qmodel = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map="xpu", trust_remote_code=True)
+woq_quantization_config = WeightOnlyQuantConfig(compute_dtype="fp16", weight_dtype="int4_fullrange", scale_dtype="fp16", group_size=128)
+qmodel = AutoModelForCausalLM.from_pretrained(model_name, device_map="xpu", quantization_config=woq_quantization_config, trust_remote_code=True)

 # optimize the model with Intel® Extension for PyTorch*, it will improve performance.
-qmodel = ipex.llm.optimize(qmodel, inplace=True, dtype=torch.float16, woq=True, device="xpu")
+qmodel = ipex.llm.optimize(qmodel, inplace=True, dtype=torch.float16, quantization_config=woq_quantization_config, device="xpu")

 output = qmodel.generate(inputs)
 ```
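As a small follow-up to the updated example, not part of the patch itself, the generated token IDs can be turned back into text with the tokenizer the example already creates:

```python
# Decode the ids returned by qmodel.generate(inputs) in the example above.
text = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print(text)
```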
