All above workloads are validated on Intel® Data Center Max 1550 GPU.
The WoQ (Weight Only Quantization) INT4 workloads are also partially validated on Intel® Core™ Ultra series (Lunar Lake) with Intel® Arc™ Graphics. Refer to the Weight Only Quantization INT4 section.
*Note*: The above verified models (including other models in the same model family, like "meta-llama/Llama-2-7b-hf" from the Llama family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp16). For other LLM families, we are working to cover those optimizations, which will expand the model list above.
LLM fine-tuning on Intel® Data Center Max 1550 GPU
Check `LLM best known practice <https://github.com/intel/intel-extension-for-pytorch/tree/release/xpu/2.5.10/examples/gpu/llm>`_ for instructions to install/setup environment and example scripts.
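For orientation, a minimal sketch of what a single fine-tuning step on an XPU device can look like is shown below; the model id, data, and hyperparameters are placeholders, and the linked best known practice scripts remain the supported reference:

.. code-block:: python

   import torch
   import intel_extension_for_pytorch as ipex
   from transformers import AutoModelForCausalLM, AutoTokenizer

   model_id = "gpt2"  # placeholder; substitute the LLM being fine-tuned
   tokenizer = AutoTokenizer.from_pretrained(model_id)
   model = AutoModelForCausalLM.from_pretrained(model_id).to("xpu").train()

   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
   # Prepare model and optimizer for BF16 training on an Intel GPU (XPU) device
   model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

   batch = tokenizer("An example fine-tuning sample.", return_tensors="pt").to("xpu")
   loss = model(**batch, labels=batch["input_ids"]).loss
   loss.backward()
   optimizer.step()
   optimizer.zero_grad()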
Optimization Methodologies
--------------------------
Large Language Models (LLMs) have shown remarkable performance in various natural language processing tasks.
However, deploying them on devices with limited resources is challenging due to their high computational and memory requirements.
To overcome this issue, we propose quantization methods that reduce the size and complexity of LLMs. Unlike `normal quantization <https://github.com/intel/intel-extension-for-transformers/blob/main/docs/quantization.md>`_, such as w8a8, which quantizes both weights and activations, we focus on Weight-Only Quantization (WoQ), which statically quantizes only the weights. WoQ offers a better trade-off between efficiency and accuracy, since the main bottleneck of deploying LLMs is memory bandwidth and WoQ usually preserves more accuracy. Experiments on Qwen-7B, a large-scale LLM, show that we can obtain accurate quantized models with minimal loss of quality.
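To put the bandwidth argument in numbers: a 7B-parameter model occupies roughly 14 GB of weight storage in FP16 but only about 3.5 GB in INT4 (plus a small overhead for scales and zero points), so a memory-bound decoding step has to stream roughly a quarter of the bytes per generated token.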
For more detailed information, check `WoQ INT4 <llm/int4_weight_only_quantization.html>`_.
> Note: The RTN algorithm is supported by Intel® Extension for PyTorch\*. For other algorithms, we mark them as 'stay tuned' and highly recommend waiting for the availability of the INT4 models on the HuggingFace Model Hub, since the LLM quantization procedure is significantly constrained by the machine's host memory and computation capabilities.
Validation Platforms
* Intel® Data Center GPU Max Series
* Intel® Arc™ A-Series Graphics
* Intel® Core™ Ultra series
**RTN**[[1]](#1): Rounding to Nearest (RTN) is an intuitively simple method that rounds values to the nearest integer. It boasts simplicity, requiring no additional datasets, and offers fast quantization. Besides, it can easily be applied to other data types like NF4 (non-uniform). Typically, it performs well on configurations such as W4G32 or W8, but worse than advanced algorithms at lower precision levels.
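As a rough illustration of the idea (a toy sketch, not the quantization kernel used by Intel® Extension for PyTorch\*), a W4G32 configuration amounts to asymmetric 4-bit rounding with one scale and zero point per group of 32 weights:

```python
import torch

def rtn_w4g32(w: torch.Tensor, group_size: int = 32):
    """Toy RTN quantizer: asymmetric 4-bit, one scale/zero point per group of 32 weights."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    wg = w.reshape(out_f, in_f // group_size, group_size)
    wmin = wg.min(dim=-1, keepdim=True).values
    wmax = wg.max(dim=-1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / 15            # unsigned 4-bit range 0..15
    zero_point = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(wg / scale) + zero_point, 0, 15)
    w_hat = ((q - zero_point) * scale).reshape(out_f, in_f)
    return q.to(torch.uint8), scale, zero_point, w_hat

w = torch.randn(128, 256)
q, scale, zp, w_hat = rtn_w4g32(w)
print(q.dtype, (w - w_hat).abs().max())                   # per-group rounding error stays small
```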
**TEQ**[[3]](#3): To our knowledge, it is the first trainable equivalent transformation method (submitted for peer review in June 2023). However, it requires more memory than other methods, as a model-wise loss is used and the equivalent transformation imposes certain requirements on the model architecture.
**GPTQ**[[4]](#4): GPTQ is a widely adopted method based on the Optimal Brain Surgeon. It quantizes weights block by block and updates the remaining unquantized ones to mitigate quantization errors. Occasionally, non-positive semidefinite matrices may occur, necessitating adjustments to hyperparameters.
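A heavily simplified sketch of that error-compensation idea (column by column, dense inverse Hessian, no blocking, Cholesky factorization, or grouping; illustrative only, not the library's implementation):

```python
import torch

def gptq_like(w: torch.Tensor, x: torch.Tensor, bits: int = 4, damp: float = 0.01):
    """Quantize columns of w one at a time and push each column's quantization error
    onto the not-yet-quantized columns, weighted by the inverse Hessian H = x^T x."""
    h = x.t() @ x
    h += damp * torch.diag(h).mean() * torch.eye(h.shape[0])   # damping for stability
    h_inv = torch.linalg.inv(h)
    w = w.clone()
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax   # per-row scale
    q = torch.zeros_like(w)
    for i in range(w.shape[1]):
        q[:, i] = torch.clamp(torch.round(w[:, i] / scale[:, 0]), -qmax - 1, qmax) * scale[:, 0]
        err = ((w[:, i] - q[:, i]) / h_inv[i, i]).unsqueeze(1)
        w[:, i + 1:] -= err @ h_inv[i, i + 1:].unsqueeze(0)    # OBS-style compensation
    return q

w_q = gptq_like(torch.randn(64, 32), torch.randn(256, 32))     # weights, calibration data
```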
**AutoRound**[[5]](#5): AutoRound utilizes sign gradient descent to optimize the rounding values and the min-max values of weights within just 200 steps, showcasing impressive performance compared to recent methods like GPTQ/AWQ. Additionally, it offers hyperparameter tuning compatibility to further enhance performance. However, due to its reliance on gradient backpropagation, it is currently not well suited to backends like ONNX.
In the Weight-Only Quantization INT4 case, when using `AutoModelForCausalLM.from_pretrained` to load the model, a weight-only quantization configuration is supplied, with settings such as `scale_dtype=convert_dtype_str2torch("fp16")`.
When running on Intel® GPU, it will replace the linear modules in the model with `WeightOnlyQuantizedLinear`. After that, the model's linear weights loaded by `ipex.llm.optimize` are in INT4 format, and they contain not only weight and bias information, but also scales, zero_points, and blocksize information. When optimizing transformers at the front end, Intel® Extension for PyTorch\* will use `WeightOnlyQuantizedLinear` to initialize this information in the model if it is present; otherwise, it will use `IPEXTransformerLinear` to initialize the linear parameters in the model.
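As a quick sanity check (a sketch assuming `model` is already the output of the INT4 `ipex.llm.optimize` flow described above; the metadata attribute names are probed defensively since they are internal details):

```python
# Inspect which linear layers were swapped to the INT4 path and what
# quantization metadata they carry. `model` is assumed to be an already
# optimized INT4 model as described above.
for name, module in model.named_modules():
    if type(module).__name__ == "WeightOnlyQuantizedLinear":
        present = [a for a in ("scales", "zero_points", "blocksize") if hasattr(module, a)]
        print(f"{name}: WeightOnlyQuantizedLinear, metadata: {present}")
```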
### Weight-Only Quantization Runtime
On Intel® GPU, after using `ipex.llm.optimize`, Intel® Extension for PyTorch\* will automatically replace the original attention module with `IPEXTransformerAttnOptimizedInt4` and the original MLP module with `IPEXTransformerMLPOptimizedInt4` in the model.
The major changes in `IPEXTransformerAttnOptimizedInt4` for the INT4 scenario, compared with `ipex.llm.optimize` for the FP16 scenario, are that the linear used to calculate qkv is replaced with `torch.ops.torch_ipex.mm_qkv_out_int4` and the out_linear with `torch.ops.torch_ipex.mm_bias_int4`.
The major changes in `IPEXTransformerMLPOptimizedInt4` for the INT4 scenario, compared with `ipex.llm.optimize` for the FP16 scenario, are that the linear used in the MLP is replaced with `torch.ops.torch_ipex.mm_bias_int4`; if an activation is used in the MLP module, it is correspondingly replaced with our fused linear+activation kernel, such as `torch.ops.torch_ipex.mm_silu_mul_int4`.
### Weight-Only Quantization Linear Dispatch
As explained before, after applying `ipex.llm.optimize`, the linear kernel that Intel® Extension for PyTorch\* has registered to substitute the original linear will be used in the model.