
Commit 6e3fa6d

support paligemma2 (#2735)
1 parent a6f820a commit 6e3fa6d

File tree

9 files changed (+65 / -14 lines)

README.md

Lines changed: 3 additions & 3 deletions

```diff
@@ -55,13 +55,13 @@ You can contact us and communicate with us by adding our group:

 ## 📝 Introduction
-🍲 ms-swift is an official framework provided by the ModelScope community for fine-tuning and deploying large language models and multi-modal large models. It currently supports the training (pre-training, fine-tuning, human alignment), inference, evaluation, quantization, and deployment of over 400 large models and 100+ multi-modal large models. These large language models (LLMs) include models such as Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, DeepSeek, Baichuan2, Gemma2, and TeleChat2. The multi-modal LLMs include models such as Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2.5, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL2, Phi3.5-Vision, and GOT-OCR2.
+🍲 ms-swift is an official framework provided by the ModelScope community for fine-tuning and deploying large language models and multi-modal large models. It currently supports the training (pre-training, fine-tuning, human alignment), inference, evaluation, quantization, and deployment of 400+ large models and 150+ multi-modal large models. These large language models (LLMs) include models such as Qwen2.5, Llama3.3, GLM4, Internlm2.5, Yi1.5, Mistral, DeepSeek2.5, Baichuan2, Gemma2, and TeleChat2. The multi-modal LLMs include models such as Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2.5, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL2, Phi3.5-Vision, and GOT-OCR2.

-🍔 In addition, ms-swift gathers the latest training technologies, including LoRA, QLoRA, Llama-Pro, LongLoRA, GaLore, Q-GaLore, LoRA+, LISA, DoRA, FourierFt, ReFT, UnSloth, and Liger. ms-swift supports accelerating the inference, evaluation, and deployment modules using vLLM and LMDeploy. To help researchers and developers fine-tune and apply large models more easily, ms-swift also provides a Gradio-based Web-UI interface and a wealth of best practices.
+🍔 In addition, ms-swift gathers the latest training technologies, including LoRA, QLoRA, Llama-Pro, LongLoRA, GaLore, Q-GaLore, LoRA+, LISA, DoRA, FourierFt, ReFT, UnSloth, and Liger. ms-swift supports acceleration of inference, evaluation, and deployment modules using vLLM and LMDeploy, and supports the quantization of large models and multi-modal large models using technologies such as GPTQ, AWQ, and BNB. To help researchers and developers fine-tune and apply large models more easily, ms-swift also provides a Gradio-based Web-UI interface and a wealth of best practices.

 **Why choose ms-swift?**

-- 🍎 **Model Types**: Supports 400+ large language models and **100+ multi-modal large models** and all-to-all models, **providing a comprehensive solution from training to deployment**.
+- 🍎 **Model Types**: Supports 400+ large language models and **150+ multi-modal large models** and all-to-all models, **providing a comprehensive solution from training to deployment**.
 - **Dataset Types**: Comes with 150+ pre-training, fine-tuning, human alignment, multi-modal datasets, and supports custom datasets.
 - **Hardware Support**: Compatible with CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, etc.
 - 🍊 **Lightweight Training**: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
```
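The quantization claim added above maps onto ms-swift's export entry point. A minimal sketch, assuming the `swift.llm` Python API exposes `export_main`/`ExportArguments` with `quant_method`/`quant_bits` parameters as in the project's documented examples; the model id is illustrative, and all names should be checked against `swift export --help`:

```python
# Hedged sketch: post-training GPTQ quantization with ms-swift.
# Argument names are assumptions based on ms-swift's docs, not verified
# against this exact revision.
from swift.llm import ExportArguments, export_main

export_main(
    ExportArguments(
        model='Qwen/Qwen2.5-7B-Instruct',  # illustrative model id
        quant_method='gptq',               # the README also lists 'awq' and 'bnb'
        quant_bits=4,
        output_dir='qwen2.5-7b-gptq-int4',
    ))
```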

README_CN.md

Lines changed: 3 additions & 3 deletions

```diff
@@ -53,12 +53,12 @@
 <img src="asset/discord_qr.jpg" width="200" height="200"> | <img src="asset/wechat.png" width="200" height="200">

 ## 📝 Introduction
-🍲 ms-swift is the fine-tuning and deployment framework for large models and multi-modal large models provided by the ModelScope community. It now supports the training (pre-training, fine-tuning, human alignment), inference, evaluation, quantization, and deployment of 400+ large models and 100+ multi-modal large models. The LLMs include models such as Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, DeepSeek, Baichuan2, Gemma2, and TeleChat2; the multi-modal LLMs include models such as Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2.5, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL2, Phi3.5-Vision, and GOT-OCR2.
+🍲 ms-swift is the fine-tuning and deployment framework for large models and multi-modal large models provided by the ModelScope community. It now supports the training (pre-training, fine-tuning, human alignment), inference, evaluation, quantization, and deployment of 450+ large models and 150+ multi-modal large models. The large models include Qwen2.5, Llama3.3, GLM4, Internlm2.5, Yi1.5, Mistral, DeepSeek2.5, Baichuan2, Gemma2, and TeleChat2; the multi-modal large models include Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2.5, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL2, Phi3.5-Vision, and GOT-OCR2.

-🍔 Beyond that, ms-swift gathers the latest training technologies, including LoRA, QLoRA, Llama-Pro, LongLoRA, GaLore, Q-GaLore, LoRA+, LISA, DoRA, FourierFt, ReFT, UnSloth, and Liger. ms-swift supports accelerating the inference, evaluation, and deployment modules with vLLM and LMDeploy. To help researchers and developers fine-tune and apply large models more easily, ms-swift also provides a Gradio-based Web-UI and a rich set of best practices.
+🍔 Beyond that, ms-swift gathers the latest training technologies, including LoRA, QLoRA, Llama-Pro, LongLoRA, GaLore, Q-GaLore, LoRA+, LISA, DoRA, FourierFt, ReFT, UnSloth, and Liger. ms-swift supports accelerating the inference, evaluation, and deployment modules with vLLM and LMDeploy, and supports quantizing large models and multi-modal large models with technologies such as GPTQ, AWQ, and BNB. To help researchers and developers fine-tune and apply large models more easily, ms-swift also provides a Gradio-based Web-UI and a rich set of best practices.

 **Why choose ms-swift?**
-- 🍎 **Model Types**: supports the full pipeline from training to deployment for 400+ text-only large models, **100+ multi-modal large models**, and All-to-All omni-modal models
+- 🍎 **Model Types**: supports the full pipeline from training to deployment for 400+ text-only large models, **150+ multi-modal large models**, and All-to-All omni-modal models
 - **Dataset Types**: 150+ built-in datasets for pre-training, fine-tuning, human alignment, multi-modal tasks, and more, with support for custom datasets.
 - **Hardware Support**: CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, etc.
 - 🍊 **Lightweight Training**: supports lightweight fine-tuning methods such as LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, and Liger-Kernel.
```

docs/source/GetStarted/快速开始.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -1,8 +1,8 @@
 # Quick Start

-ms-swift is the training and deployment framework for large models and multi-modal large models provided by the ModelScope community. It now supports the training (pre-training, fine-tuning, human alignment), inference, evaluation, quantization, and deployment of 400+ large models and 100+ multi-modal large models. Model developers can complete every large-model-related task end to end within the ms-swift framework. The main capabilities of ms-swift currently include:
+ms-swift is the training and deployment framework for large models and multi-modal large models provided by the ModelScope community. It now supports the training (pre-training, fine-tuning, human alignment), inference, evaluation, quantization, and deployment of 400+ large models and 150+ multi-modal large models. Model developers can complete every large-model-related task end to end within the ms-swift framework. The main capabilities of ms-swift currently include:

-- 🍎 Model Types: full pipeline from training to deployment for 400+ text-only large models and 100+ multi-modal large models, plus All-to-All omni-modal models.
+- 🍎 Model Types: full pipeline from training to deployment for 400+ text-only large models and 150+ multi-modal large models, plus All-to-All omni-modal models.
 - Dataset Types: 150+ built-in datasets for pre-training, fine-tuning, human alignment, multi-modal tasks, and more, with support for custom datasets.
 - Hardware Support: CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, etc.
 - 🍊 Lightweight Training: supports lightweight fine-tuning methods such as LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, and Liger-Kernel.
```

docs/source/Instruction/支持的模型和数据集.md

Lines changed: 11 additions & 0 deletions

```diff
@@ -603,6 +603,17 @@
 |[AI-ModelScope/paligemma-3b-pt-896](https://modelscope.cn/models/AI-ModelScope/paligemma-3b-pt-896)|paligemma|paligemma|transformers>=4.41|vision|[google/paligemma-3b-pt-896](https://huggingface.co/google/paligemma-3b-pt-896)|
 |[AI-ModelScope/paligemma-3b-mix-224](https://modelscope.cn/models/AI-ModelScope/paligemma-3b-mix-224)|paligemma|paligemma|transformers>=4.41|vision|[google/paligemma-3b-mix-224](https://huggingface.co/google/paligemma-3b-mix-224)|
 |[AI-ModelScope/paligemma-3b-mix-448](https://modelscope.cn/models/AI-ModelScope/paligemma-3b-mix-448)|paligemma|paligemma|transformers>=4.41|vision|[google/paligemma-3b-mix-448](https://huggingface.co/google/paligemma-3b-mix-448)|
+|[AI-ModelScope/paligemma2-3b-pt-224](https://modelscope.cn/models/AI-ModelScope/paligemma2-3b-pt-224)|paligemma|paligemma|transformers>=4.41|vision|[google/paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224)|
+|[AI-ModelScope/paligemma2-3b-pt-448](https://modelscope.cn/models/AI-ModelScope/paligemma2-3b-pt-448)|paligemma|paligemma|transformers>=4.41|vision|[google/paligemma2-3b-pt-448](https://huggingface.co/google/paligemma2-3b-pt-448)|
+|[AI-ModelScope/paligemma2-3b-pt-896](https://modelscope.cn/models/AI-ModelScope/paligemma2-3b-pt-896)|paligemma|paligemma|transformers>=4.41|vision|[google/paligemma2-3b-pt-896](https://huggingface.co/google/paligemma2-3b-pt-896)|
+|[AI-ModelScope/paligemma2-10b-pt-224](https://modelscope.cn/models/AI-ModelScope/paligemma2-10b-pt-224)|paligemma|paligemma|transformers>=4.41|vision|[google/paligemma2-10b-pt-224](https://huggingface.co/google/paligemma2-10b-pt-224)|
+|[AI-ModelScope/paligemma2-10b-pt-448](https://modelscope.cn/models/AI-ModelScope/paligemma2-10b-pt-448)|paligemma|paligemma|transformers>=4.41|vision|[google/paligemma2-10b-pt-448](https://huggingface.co/google/paligemma2-10b-pt-448)|
+|[AI-ModelScope/paligemma2-10b-pt-896](https://modelscope.cn/models/AI-ModelScope/paligemma2-10b-pt-896)|paligemma|paligemma|transformers>=4.41|vision|[google/paligemma2-10b-pt-896](https://huggingface.co/google/paligemma2-10b-pt-896)|
+|[AI-ModelScope/paligemma2-28b-pt-224](https://modelscope.cn/models/AI-ModelScope/paligemma2-28b-pt-224)|paligemma|paligemma|transformers>=4.41|vision|[google/paligemma2-28b-pt-224](https://huggingface.co/google/paligemma2-28b-pt-224)|
+|[AI-ModelScope/paligemma2-28b-pt-448](https://modelscope.cn/models/AI-ModelScope/paligemma2-28b-pt-448)|paligemma|paligemma|transformers>=4.41|vision|[google/paligemma2-28b-pt-448](https://huggingface.co/google/paligemma2-28b-pt-448)|
+|[AI-ModelScope/paligemma2-28b-pt-896](https://modelscope.cn/models/AI-ModelScope/paligemma2-28b-pt-896)|paligemma|paligemma|transformers>=4.41|vision|[google/paligemma2-28b-pt-896](https://huggingface.co/google/paligemma2-28b-pt-896)|
+|[AI-ModelScope/paligemma2-3b-ft-docci-448](https://modelscope.cn/models/AI-ModelScope/paligemma2-3b-ft-docci-448)|paligemma|paligemma|transformers>=4.41|vision|[google/paligemma2-3b-ft-docci-448](https://huggingface.co/google/paligemma2-3b-ft-docci-448)|
+|[AI-ModelScope/paligemma2-10b-ft-docci-448](https://modelscope.cn/models/AI-ModelScope/paligemma2-10b-ft-docci-448)|paligemma|paligemma|transformers>=4.41|vision|[google/paligemma2-10b-ft-docci-448](https://huggingface.co/google/paligemma2-10b-ft-docci-448)|
 |[LLM-Research/Molmo-7B-O-0924](https://modelscope.cn/models/LLM-Research/Molmo-7B-O-0924)|molmo|molmo|transformers>=4.45|vision|[allenai/Molmo-7B-O-0924](https://huggingface.co/allenai/Molmo-7B-O-0924)|
 |[LLM-Research/Molmo-7B-D-0924](https://modelscope.cn/models/LLM-Research/Molmo-7B-D-0924)|molmo|molmo|transformers>=4.45|vision|[allenai/Molmo-7B-D-0924](https://huggingface.co/allenai/Molmo-7B-D-0924)|
 |[LLM-Research/Molmo-72B-0924](https://modelscope.cn/models/LLM-Research/Molmo-72B-0924)|molmo|molmo|transformers>=4.45|vision|[allenai/Molmo-72B-0924](https://huggingface.co/allenai/Molmo-72B-0924)|
```
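The table only lists the new checkpoints, so a quick smoke test helps confirm a row resolves to a working model. A minimal sketch using plain transformers (PaliGemma 2 reuses the `PaliGemmaForConditionalGeneration` class; the repos are gated on Hugging Face, and `google/paligemma2-3b-pt-224` is just one of the rows above):

```python
# Hedged sketch: load one newly registered PaliGemma 2 checkpoint and
# generate a caption.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = 'google/paligemma2-3b-pt-224'
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map='auto')
processor = AutoProcessor.from_pretrained(model_id)

image = Image.new('RGB', (224, 224), 'white')  # placeholder image
inputs = processor(text='caption en', images=image, return_tensors='pt').to(model.device)
# Processors emit float32 pixel values; match the model dtype explicitly
# (this mirrors the template fix later in this commit).
inputs['pixel_values'] = inputs['pixel_values'].to(model.dtype)

out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```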

docs/source_en/GetStarted/Quick-start.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -1,8 +1,8 @@
 # Quick Start

-ms-swift is a comprehensive training and deployment framework for large language models and multimodal large models, provided by the ModelScope Community. It currently supports the training (CPT, SFT, RLHF), inference, evaluation, quantization, and deployment of over 400 LLM and over 100 MLLM. Model developers can fulfill all kinds of needs related to large models in a single platform within the ms-swift framework. The main capabilities of ms-swift include:
+ms-swift is a comprehensive training and deployment framework for large language models and multimodal large models, provided by the ModelScope Community. It currently supports the training (CPT, SFT, RLHF), inference, evaluation, quantization, and deployment of 400+ LLM and 150+ MLLM. Model developers can fulfill all kinds of needs related to large models in a single platform within the ms-swift framework. The main capabilities of ms-swift include:

-- 🍎 Model Types: Supports the full process from training to deployment of over 400 text-based large models and over 100 multimodal large models, including All-to-All all-modality models.
+- 🍎 Model Types: Supports the full process from training to deployment of 400+ text-based large models and 150+ multimodal large models, including All-to-All all-modality models.
 - Dataset Types: Comes with more than 150 pre-built datasets for pre-training, fine-tuning, human alignment, multimodal, and supports custom datasets.
 - Hardware Support: Compatible with CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, and others.
 - 🍊 Lightweight Training: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel, and more.
```
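Those lightweight training methods can be driven from Python as well as the CLI. A minimal sketch of a LoRA run, assuming ms-swift 3.x exposes `sft_main`/`TrainArguments` with these parameter names as in the project's examples; the model id and dataset spec are illustrative, not part of this commit:

```python
# Hedged sketch: LoRA fine-tuning via ms-swift's Python entry point.
# Check `swift sft --help` for the authoritative argument names.
from swift.llm import TrainArguments, sft_main

sft_main(
    TrainArguments(
        model='Qwen/Qwen2.5-7B-Instruct',                   # illustrative model id
        train_type='lora',                                  # one of the methods listed above
        dataset=['AI-ModelScope/alpaca-gpt4-data-en#500'],  # illustrative dataset spec
        lora_rank=8,
        lora_alpha=32,
        output_dir='output',
    ))
```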

docs/source_en/Instruction/Supported-models-and-datasets.md

Lines changed: 11 additions & 0 deletions

The diff is identical to docs/source/Instruction/支持的模型和数据集.md above: the same eleven PaliGemma 2 rows are inserted into the English model table after the existing PaliGemma rows, under the hunk header `@@ -603,6 +603,17 @@ The table below introduces the models integrated with ms-swift:`.

swift/llm/model/model/gemma.py

Lines changed: 17 additions & 0 deletions

```diff
@@ -28,9 +28,26 @@ def get_model_tokenizer_paligemma_vision(model_dir: str,
             Model('AI-ModelScope/paligemma-3b-pt-224', 'google/paligemma-3b-pt-224'),
             Model('AI-ModelScope/paligemma-3b-pt-448', 'google/paligemma-3b-pt-448'),
             Model('AI-ModelScope/paligemma-3b-pt-896', 'google/paligemma-3b-pt-896'),
+        ]),
+        ModelGroup([
             Model('AI-ModelScope/paligemma-3b-mix-224', 'google/paligemma-3b-mix-224'),
             Model('AI-ModelScope/paligemma-3b-mix-448', 'google/paligemma-3b-mix-448'),
         ]),
+        ModelGroup([
+            Model('AI-ModelScope/paligemma2-3b-pt-224', 'google/paligemma2-3b-pt-224'),
+            Model('AI-ModelScope/paligemma2-3b-pt-448', 'google/paligemma2-3b-pt-448'),
+            Model('AI-ModelScope/paligemma2-3b-pt-896', 'google/paligemma2-3b-pt-896'),
+            Model('AI-ModelScope/paligemma2-10b-pt-224', 'google/paligemma2-10b-pt-224'),
+            Model('AI-ModelScope/paligemma2-10b-pt-448', 'google/paligemma2-10b-pt-448'),
+            Model('AI-ModelScope/paligemma2-10b-pt-896', 'google/paligemma2-10b-pt-896'),
+            Model('AI-ModelScope/paligemma2-28b-pt-224', 'google/paligemma2-28b-pt-224'),
+            Model('AI-ModelScope/paligemma2-28b-pt-448', 'google/paligemma2-28b-pt-448'),
+            Model('AI-ModelScope/paligemma2-28b-pt-896', 'google/paligemma2-28b-pt-896'),
+        ]),
+        ModelGroup([
+            Model('AI-ModelScope/paligemma2-3b-ft-docci-448', 'google/paligemma2-3b-ft-docci-448'),
+            Model('AI-ModelScope/paligemma2-10b-ft-docci-448', 'google/paligemma2-10b-ft-docci-448'),
+        ]),
     ],
     TemplateType.paligemma,
     get_model_tokenizer_paligemma_vision,
```
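The diff only splits and extends the model groups, but it also shows the registration pattern: each `ModelGroup` pairs a ModelScope repo id with its Hugging Face mirror, and the groups are bound to a chat template and a loader function. A minimal sketch of registering a further group, with the field order inferred from the diff above; the import paths, `requires` keyword, model ids, and `model_type` are all assumptions for illustration:

```python
# Hedged sketch of the registration pattern used in this file; field order
# and keyword names are inferred from the diff, not verified against the
# real ModelMeta definition.
from swift.llm import Model, ModelGroup, ModelMeta, TemplateType, register_model
from swift.llm.model.model.gemma import get_model_tokenizer_paligemma_vision

register_model(
    ModelMeta(
        'my_paligemma2_variant',               # hypothetical model_type
        [
            ModelGroup([
                # (ModelScope repo id, Hugging Face repo id) — both hypothetical
                Model('my-org/paligemma2-custom-448', 'my-org/paligemma2-custom-448'),
            ]),
        ],
        TemplateType.paligemma,                # reuse the PaliGemma chat template
        get_model_tokenizer_paligemma_vision,  # loader defined in this file
        requires=['transformers>=4.41'],
    ))
```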

swift/llm/template/template/gemma.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -42,7 +42,7 @@ def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
         encoded['token_type_ids'] = [0] * len(encoded['input_ids'])
         if raw_image:
             model_inputs = processor(text=inputs.to_history()['query'], images=raw_image[0], return_tensors='pt')
-            encoded['pixel_values'] = model_inputs['pixel_values']
+            encoded['pixel_values'] = model_inputs['pixel_values'].to(self.config.torch_dtype)
         return encoded
```
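This one-line fix matters because image processors return `pixel_values` as float32, while these checkpoints are typically loaded in bfloat16; forwarding float32 pixels into a lower-precision vision tower can fail with a dtype mismatch. A minimal standalone illustration of the same cast (the model id is illustrative and the repo is gated):

```python
# Hedged sketch of the mismatch the commit fixes: the processor yields
# float32 pixel values regardless of the dtype the model was loaded in.
import torch
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained('google/paligemma2-3b-pt-224')
inputs = processor(text='caption en', images=Image.new('RGB', (224, 224)),
                   return_tensors='pt')
print(inputs['pixel_values'].dtype)  # torch.float32

# The template fix applies the configured model dtype before the forward pass:
inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)
print(inputs['pixel_values'].dtype)  # torch.bfloat16
```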