@@ -51,13 +51,32 @@ class ModelType:
     qwen_72b_chat = 'qwen-72b-chat'
     qwen_72b_chat_int4 = 'qwen-72b-chat-int4'
     qwen_72b_chat_int8 = 'qwen-72b-chat-int8'
-    # qwen2
-    qwen2_beta_0_5b = 'qwen2-beta-0_5b'
-    qwen2_beta_1_8b = 'qwen2-beta-1_8b'
-    qwen2_beta_4b = 'qwen2-beta-4b'
-    qwen2_beta_7b = 'qwen2-beta-7b'
-    qwen2_beta_14b = 'qwen2-beta-14b'
-    qwen2_beta_72b = 'qwen2-beta-72b'
+    # qwen1.5
+    qwen1half_0_5b = 'qwen1half-0_5b'
+    qwen1half_1_8b = 'qwen1half-1_8b'
+    qwen1half_4b = 'qwen1half-4b'
+    qwen1half_7b = 'qwen1half-7b'
+    qwen1half_14b = 'qwen1half-14b'
+    qwen1half_72b = 'qwen1half-72b'
+    qwen1half_0_5b_chat = 'qwen1half-0_5b-chat'
+    qwen1half_1_8b_chat = 'qwen1half-1_8b-chat'
+    qwen1half_4b_chat = 'qwen1half-4b-chat'
+    qwen1half_7b_chat = 'qwen1half-7b-chat'
+    qwen1half_14b_chat = 'qwen1half-14b-chat'
+    qwen1half_72b_chat = 'qwen1half-72b-chat'
+    # qwen1.5 autogptq
+    qwen1half_0_5b_chat_int8 = 'qwen1half-0_5b-chat-int8'
+    qwen1half_0_5b_chat_int4 = 'qwen1half-0_5b-chat-int4'
+    qwen1half_1_8b_chat_int8 = 'qwen1half-1_8b-chat-int8'
+    qwen1half_1_8b_chat_int4 = 'qwen1half-1_8b-chat-int4'
+    qwen1half_4b_chat_int8 = 'qwen1half-4b-chat-int8'
+    qwen1half_4b_chat_int4 = 'qwen1half-4b-chat-int4'
+    qwen1half_7b_chat_int8 = 'qwen1half-7b-chat-int8'
+    qwen1half_7b_chat_int4 = 'qwen1half-7b-chat-int4'
+    qwen1half_14b_chat_int8 = 'qwen1half-14b-chat-int8'
+    qwen1half_14b_chat_int4 = 'qwen1half-14b-chat-int4'
+    qwen1half_72b_chat_int8 = 'qwen1half-72b-chat-int8'
+    qwen1half_72b_chat_int4 = 'qwen1half-72b-chat-int4'
     # qwen-vl
     qwen_vl = 'qwen-vl'
     qwen_vl_chat = 'qwen-vl-chat'
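
Note on the ModelType block above: each attribute name is the public model-type string with '.' and '-' mapped to '_' (so Qwen1.5-0.5B becomes qwen1half_0_5b = 'qwen1half-0_5b'), and the string value is what users pass on the command line or to the Python API to pick a checkpoint. A small sanity sketch using only the constants added in this hunk; the import is the package's usual public export, assumed rather than introduced here:

    # Hedged sketch, not part of the diff: the constants are plain strings.
    from swift.llm import ModelType

    assert ModelType.qwen1half_0_5b_chat == 'qwen1half-0_5b-chat'
    assert ModelType.qwen1half_72b_chat_int4 == 'qwen1half-72b-chat-int4'
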
@@ -219,7 +238,7 @@ class LoRATM(NamedTuple):
     chatglm = ['query_key_value']
     llama2 = ['q_proj', 'k_proj', 'v_proj']
     qwen = ['c_attn']
-    qwen2 = llama2
+    qwen1half = llama2
     polylm = ['c_attn']
     bloom = ['query_key_value']
     cogagent = [
@@ -694,53 +713,101 @@ def cross_entropy_forward(self, inputs: Tensor,
 
 
 @register_model(
-    ModelType.qwen2_beta_0_5b,
-    'qwen/Qwen2-beta-0_5B',
-    LoRATM.qwen2,
+    ModelType.qwen1half_0_5b,
+    'qwen/Qwen1.5-0.5B',
+    LoRATM.qwen1half,
     TemplateType.default_generation,
     support_flash_attn=True,
     support_vllm=True,
     requires=['transformers>=4.37'])
 @register_model(
-    ModelType.qwen2_beta_1_8b,
-    'qwen/Qwen2-beta-1_8B',
-    LoRATM.qwen2,
+    ModelType.qwen1half_0_5b_chat,
+    'qwen/Qwen1.5-0.5B-Chat',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    support_flash_attn=True,
+    support_vllm=True,
+    requires=['transformers>=4.37'])
+@register_model(
+    ModelType.qwen1half_1_8b,
+    'qwen/Qwen1.5-1.8B',
+    LoRATM.qwen1half,
     TemplateType.default_generation,
     support_flash_attn=True,
     support_vllm=True,
     requires=['transformers>=4.37'])
 @register_model(
-    ModelType.qwen2_beta_4b,
-    'qwen/Qwen2-beta-4B',
-    LoRATM.qwen2,
+    ModelType.qwen1half_1_8b_chat,
+    'qwen/Qwen1.5-1.8B-Chat',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    support_flash_attn=True,
+    support_vllm=True,
+    requires=['transformers>=4.37'])
+@register_model(
+    ModelType.qwen1half_4b,
+    'qwen/Qwen1.5-4B',
+    LoRATM.qwen1half,
     TemplateType.default_generation,
     support_flash_attn=True,
     support_vllm=True,
     requires=['transformers>=4.37'])
 @register_model(
-    ModelType.qwen2_beta_7b,
-    'qwen/Qwen2-beta-7B',
-    LoRATM.qwen2,
+    ModelType.qwen1half_4b_chat,
+    'qwen/Qwen1.5-4B-Chat',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    support_flash_attn=True,
+    support_vllm=True,
+    requires=['transformers>=4.37'])
+@register_model(
+    ModelType.qwen1half_7b,
+    'qwen/Qwen1.5-7B',
+    LoRATM.qwen1half,
     TemplateType.default_generation,
     support_flash_attn=True,
     support_vllm=True,
     requires=['transformers>=4.37'])
 @register_model(
-    ModelType.qwen2_beta_14b,
-    'qwen/Qwen2-beta-14B',
-    LoRATM.qwen2,
+    ModelType.qwen1half_7b_chat,
+    'qwen/Qwen1.5-7B-Chat',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    support_flash_attn=True,
+    support_vllm=True,
+    requires=['transformers>=4.37'])
+@register_model(
+    ModelType.qwen1half_14b,
+    'qwen/Qwen1.5-14B',
+    LoRATM.qwen1half,
     TemplateType.default_generation,
     support_flash_attn=True,
     support_vllm=True,
     requires=['transformers>=4.37'])
 @register_model(
-    ModelType.qwen2_beta_72b,
-    'qwen/Qwen2-beta-72B',
-    LoRATM.qwen2,
+    ModelType.qwen1half_14b_chat,
+    'qwen/Qwen1.5-14B-Chat',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    support_flash_attn=True,
+    support_vllm=True,
+    requires=['transformers>=4.37'])
+@register_model(
+    ModelType.qwen1half_72b,
+    'qwen/Qwen1.5-72B',
+    LoRATM.qwen1half,
     TemplateType.default_generation,
     support_flash_attn=True,
     support_vllm=True,
     requires=['transformers>=4.37'])
+@register_model(
+    ModelType.qwen1half_72b_chat,
+    'qwen/Qwen1.5-72B-Chat',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    support_flash_attn=True,
+    support_vllm=True,
+    requires=['transformers>=4.37'])
 @register_model(
     ModelType.deepseek_coder_1_3b,
     'deepseek-ai/deepseek-coder-1.3b-base',
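
Note on the registrations above: the stacked @register_model decorators all wrap the same loader function further down in this file, so each call records one (model type, ModelScope model id, LoRA target modules, chat template) entry and returns the function unchanged for the next decorator; base checkpoints register TemplateType.default_generation, while the -Chat variants register TemplateType.chatml. A stripped-down sketch of the stacking mechanism follows; everything here except the decorator pattern itself is illustrative, not the project's real implementation:

    # Illustrative only: how stacked registration decorators compose. The real
    # register_model stores richer metadata (requires, support_* flags, etc.).
    REGISTRY = {}

    def register_model(model_type, model_id, *args, **meta):
        def wrapper(loader):
            REGISTRY[model_type] = {'model_id': model_id, 'loader': loader, **meta}
            return loader  # returned unchanged so the next decorator can stack
        return wrapper

    @register_model('qwen1half-7b', 'qwen/Qwen1.5-7B', template='default-generation')
    @register_model('qwen1half-7b-chat', 'qwen/Qwen1.5-7B-Chat', template='chatml')
    def load_qwen(model_dir, **kwargs):
        ...

    assert REGISTRY['qwen1half-7b']['loader'] is load_qwen
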
@@ -997,6 +1064,174 @@ def get_model_tokenizer_with_flash_attn(model_dir: str,
         **kwargs)
 
 
+@register_model(
+    ModelType.qwen1half_0_5b_chat_int4,
+    'qwen/Qwen1.5-0.5B-Chat-GPTQ-Int4',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    requires=['auto_gptq>=0.5', 'transformers>=4.37'],
+    torch_dtype=torch.float16,
+    function_kwargs={'bits': 4},
+    support_flash_attn=True,
+    support_vllm=True)
+@register_model(
+    ModelType.qwen1half_0_5b_chat_int8,
+    'qwen/Qwen1.5-0.5B-Chat-GPTQ-Int8',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    requires=['auto_gptq>=0.5', 'transformers>=4.37'],
+    torch_dtype=torch.float16,
+    function_kwargs={'bits': 8},
+    support_flash_attn=True,
+    support_vllm=True)
+@register_model(
+    ModelType.qwen1half_1_8b_chat_int4,
+    'qwen/Qwen1.5-1.8B-Chat-GPTQ-Int4',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    requires=['auto_gptq>=0.5', 'transformers>=4.37'],
+    torch_dtype=torch.float16,
+    function_kwargs={'bits': 4},
+    support_flash_attn=True,
+    support_vllm=True)
+@register_model(
+    ModelType.qwen1half_1_8b_chat_int8,
+    'qwen/Qwen1.5-1.8B-Chat-GPTQ-Int8',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    requires=['auto_gptq>=0.5', 'transformers>=4.37'],
+    torch_dtype=torch.float16,
+    function_kwargs={'bits': 8},
+    support_flash_attn=True,
+    support_vllm=True)
+@register_model(
+    ModelType.qwen1half_4b_chat_int4,
+    'qwen/Qwen1.5-4B-Chat-GPTQ-Int4',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    requires=['auto_gptq>=0.5', 'transformers>=4.37'],
+    torch_dtype=torch.float16,
+    function_kwargs={'bits': 4},
+    support_flash_attn=True,
+    support_vllm=True)
+@register_model(
+    ModelType.qwen1half_4b_chat_int8,
+    'qwen/Qwen1.5-4B-Chat-GPTQ-Int8',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    requires=['auto_gptq>=0.5', 'transformers>=4.37'],
+    torch_dtype=torch.float16,
+    function_kwargs={'bits': 8},
+    support_flash_attn=True,
+    support_vllm=True)
+@register_model(
+    ModelType.qwen1half_7b_chat_int4,
+    'qwen/Qwen1.5-7B-Chat-GPTQ-Int4',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    requires=['auto_gptq>=0.5', 'transformers>=4.37'],
+    torch_dtype=torch.float16,
+    function_kwargs={'bits': 4},
+    support_flash_attn=True,
+    support_vllm=True)
+@register_model(
+    ModelType.qwen1half_7b_chat_int8,
+    'qwen/Qwen1.5-7B-Chat-GPTQ-Int8',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    requires=['auto_gptq>=0.5', 'transformers>=4.37'],
+    torch_dtype=torch.float16,
+    function_kwargs={'bits': 8},
+    support_flash_attn=True,
+    support_vllm=True)
+@register_model(
+    ModelType.qwen1half_14b_chat_int4,
+    'qwen/Qwen1.5-14B-Chat-GPTQ-Int4',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    requires=['auto_gptq>=0.5', 'transformers>=4.37'],
+    torch_dtype=torch.float16,
+    function_kwargs={'bits': 4},
+    support_flash_attn=True,
+    support_vllm=True)
+@register_model(
+    ModelType.qwen1half_14b_chat_int8,
+    'qwen/Qwen1.5-14B-Chat-GPTQ-Int8',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    requires=['auto_gptq>=0.5', 'transformers>=4.37'],
+    torch_dtype=torch.float16,
+    function_kwargs={'bits': 8},
+    support_flash_attn=True,
+    support_vllm=True)
+@register_model(
+    ModelType.qwen1half_72b_chat_int4,
+    'qwen/Qwen1.5-72B-Chat-GPTQ-Int4',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    requires=['auto_gptq>=0.5', 'transformers>=4.37'],
+    torch_dtype=torch.float16,
+    function_kwargs={'bits': 4},
+    support_flash_attn=True,
+    support_vllm=True)
+@register_model(
+    ModelType.qwen1half_72b_chat_int8,
+    'qwen/Qwen1.5-72B-Chat-GPTQ-Int8',
+    LoRATM.qwen1half,
+    TemplateType.chatml,
+    requires=['auto_gptq>=0.5', 'transformers>=4.37'],
+    torch_dtype=torch.float16,
+    function_kwargs={'bits': 8},
+    support_flash_attn=True,
+    support_vllm=True)
+def get_model_tokenizer_with_flash_attn_intx(model_dir: str,
+                                             torch_dtype: Dtype,
+                                             model_kwargs: Dict[str, Any],
+                                             load_model: bool = True,
+                                             model_config=None,
+                                             **kwargs):
+    if model_config is None:
+        model_config = AutoConfig.from_pretrained(
+            model_dir, trust_remote_code=True)
+    use_flash_attn = kwargs.pop('use_flash_attn', False)
+    if version.parse(transformers.__version__) >= version.parse('4.36'):
+        if use_flash_attn:
+            model_config._attn_implementation = 'flash_attention_2'
+    else:
+        model_config._flash_attn_2_enabled = use_flash_attn
+
+    logger.info('use gptq, ignore bnb arguments')
+    bits = kwargs.pop('bits')
+    if version.parse(transformers.__version__) >= version.parse('4.35'):
+        model_kwargs['quantization_config'] = GPTQConfig(
+            bits=bits, use_exllama=False)
+    else:
+        model_kwargs['quantization_config'] = GPTQConfig(
+            bits=bits, disable_exllama=True)
+
+    # fix quantlinear bug
+    from auto_gptq.nn_modules.qlinear.qlinear_cuda_old import QuantLinear
+    __old_forward = QuantLinear.forward
+
+    def _new_forward(self, x):
+        if not self.training or not self.autogptq_cuda_available:
+            return self.__old_forward(x)
+        # fix sft no grad
+        self.autogptq_cuda_available = False
+        res = self.__old_forward(x)
+        self.autogptq_cuda_available = True
+        return res
+
+    if not hasattr(QuantLinear, '__old_forward'):  # avoid double patching
+        QuantLinear.__old_forward = __old_forward
+        QuantLinear.forward = _new_forward
+    get_qwen_function = kwargs.pop('get_qwen_function',
+                                   get_model_tokenizer_with_flash_attn)
+    model, tokenizer = get_qwen_function(model_dir, torch_dtype, model_kwargs,
+                                         load_model, **kwargs)
+    return model, tokenizer
+
+
 @register_model(
     ModelType.internlm2_math_7b,
     'Shanghai_AI_Laboratory/internlm2-math-base-7b',
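
Note on get_model_tokenizer_with_flash_attn_intx above: for the GPTQ checkpoints it builds a GPTQConfig from the registered function_kwargs={'bits': ...} (with the exllama kernel disabled), keeps the flash-attention handling of the non-quantized loader, and patches auto_gptq's old-CUDA QuantLinear.forward so that the CUDA kernel path is bypassed while self.training is True, which keeps gradients flowing during LoRA fine-tuning. A hedged usage sketch follows; get_model_tokenizer and its keyword arguments are assumed from the rest of the swift.llm package rather than defined in this commit:

    # Illustrative sketch: loading one of the newly registered GPTQ checkpoints.
    import torch
    from swift.llm import get_model_tokenizer  # assumed public helper

    model, tokenizer = get_model_tokenizer(
        'qwen1half-7b-chat-int4',             # ModelType string added in this commit
        torch_dtype=torch.float16,            # the GPTQ registrations pin fp16
        model_kwargs={'device_map': 'auto'},
        use_flash_attn=True)                  # forwarded to the loader's **kwargs
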