
Commit 29aac74

hjh0119 and jinghan authored
support mini-internvl (#1032)
--------- Co-authored-by: jinghan <[email protected]>
1 parent df58536

File tree: 10 files changed, +65 lines, -16 lines

README.md

Lines changed: 2 additions & 1 deletion
@@ -47,6 +47,7 @@ SWIFT has rich documentations for users, please check [here](https://github.com/
 SWIFT web-ui is available both on [Huggingface space](https://huggingface.co/spaces/tastelikefeet/swift) and [ModelScope studio](https://www.modelscope.cn/studios/iic/Scalable-lightWeight-Infrastructure-for-Fine-Tuning/summary), please feel free to try!

 ## 🎉 News
+- 2024.05.31: Supports Mini-InternVL model, Use model_type `mini-internvl-chat-2b-v1_5` and `mini-internvl-chat-4b-v1_5` to train.
 - 2024.05.24: Supports Phi3-vision model, Use model_type `phi3-vision-128k-instruct` to train.
 - 2024.05.22: Supports DeepSeek-V2-Lite series models, model_type are `deepseek-v2-lite` and `deepseek-v2-lite-chat`
 - 2024.05.22: Supports TeleChat-12B-v2 model with quantized version, model_type are `telechat-12b-v2` and `telechat-12b-v2-gptq-int4`
@@ -533,7 +534,7 @@ The complete list of supported models and datasets can be found at [Supported Mo
 | Llava | [Llava series models](https://github.com/haotian-liu/LLaVA) | English | 7B-34B | chat model |
 | Llava-Next | [Llava-Next series models](https://github.com/LLaVA-VL/LLaVA-NeXT) | Chinese<br>English | 8B-110B | chat model |
 | mPLUG-Owl | [mPLUG-Owl series models](https://github.com/X-PLUG/mPLUG-Owl) | English | 11B | chat model |
-| InternVL | [InternVL](https://github.com/OpenGVLab/InternVL) | Chinese<br>English | 25.5B<br>including quantized version | chat model |
+| InternVL | [InternVL](https://github.com/OpenGVLab/InternVL) | Chinese<br>English | 2B-25.5B<br>including quantized version | chat model |
 | Llava-llama3 | [xtuner](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) | English | 8B | chat model |
 | Phi3-Vision | Microsoft | English | 4B | chat model |
 | PaliGemma | Google | English | 3B | chat model |
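As a usage sketch (not part of this commit), the new model types plug into the existing training entry points. The snippet below assumes the `SftArguments`/`sft_main` Python API exported by `swift.llm`; the dataset id is a placeholder for any registered multi-modal dataset. The CLI equivalent would pass `--model_type mini-internvl-chat-2b-v1_5` to `swift sft`.

```python
# Hypothetical usage sketch: fine-tune Mini-InternVL via the new model_type values.
from swift.llm import SftArguments, sft_main

if __name__ == '__main__':
    args = SftArguments(
        model_type='mini-internvl-chat-2b-v1_5',  # or 'mini-internvl-chat-4b-v1_5'
        dataset=['coco-en-mini'],                 # placeholder dataset id, swap in your own data
        sft_type='lora')
    output = sft_main(args)
    print(output)
```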

README_CN.md

Lines changed: 2 additions & 1 deletion
@@ -48,6 +48,7 @@ SWIFT has a rich documentation system; if you run into problems, please check [here](https:
 You can try the SWIFT web-ui on [Huggingface space](https://huggingface.co/spaces/tastelikefeet/swift) and [ModelScope studio](https://www.modelscope.cn/studios/iic/Scalable-lightWeight-Infrastructure-for-Fine-Tuning/summary).

 ## 🎉 News
+- 2024.05.31: Supports the Mini-InternVL multi-modal models; use model_type `mini-internvl-chat-2b-v1_5` and `mini-internvl-chat-4b-v1_5` to train.
 - 2024.05.24: Supports the Phi3 multi-modal model; use model_type `phi3-vision-128k-instruct` to train.
 - 2024.05.22: Supports the DeepSeek-V2-Lite series models, model_type `deepseek-v2-lite` and `deepseek-v2-lite-chat`
 - 2024.05.22: Supports the TeleChat-12B-v2 model and its quantized version, model_type `telechat-12b-v2` and `telechat-12b-v2-gptq-int4`
@@ -530,7 +531,7 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \
 | Llava | [Llava series models](https://github.com/haotian-liu/LLaVA) | English | 7B-34B | chat model |
 | Llava-Next | [Llava-Next series models](https://github.com/LLaVA-VL/LLaVA-NeXT) | Chinese<br>English | 8B-110B | chat model |
 | mPLUG-Owl | [mPLUG-Owl series models](https://github.com/X-PLUG/mPLUG-Owl) | English | 11B | chat model |
-| InternVL | [InternVL](https://github.com/OpenGVLab/InternVL) | Chinese<br>English | 25.5B<br>including quantized version | chat model |
+| InternVL | [InternVL](https://github.com/OpenGVLab/InternVL) | Chinese<br>English | 2B-25.5B<br>including quantized version | chat model |
 | Llava-llama3 | [xtuner](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) | English | 8B | chat model |
 | Phi3-Vision | Microsoft | English | 4B | chat model |
 | PaliGemma | Google | English | 3B | chat model |

docs/source/LLM/支持的模型和数据集.md

Lines changed: 2 additions & 0 deletions
@@ -297,6 +297,8 @@
 |internlm-xcomposer2-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-7b/summary)|wqkv|internlm-xcomposer2|&#x2714;|&#x2718;||vision|[internlm/internlm-xcomposer2-7b](https://huggingface.co/internlm/internlm-xcomposer2-7b)|
 |internvl-chat-v1_5|[AI-ModelScope/InternVL-Chat-V1-5](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)|
 |internvl-chat-v1_5-int8|[AI-ModelScope/InternVL-Chat-V1-5-int8](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5-int8](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-int8)|
+|mini-internvl-chat-2b-v1_5|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5)|
+|mini-internvl-chat-4b-v1_5|[OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-4B-V1-5/summary)|qkv_proj|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5)|
 |deepseek-vl-1_3b-chat|[deepseek-ai/deepseek-vl-1.3b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-1.3b-chat/summary)|q_proj, k_proj, v_proj|deepseek-vl|&#x2714;|&#x2718;|attrdict|vision|[deepseek-ai/deepseek-vl-1.3b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-1.3b-chat)|
 |deepseek-vl-7b-chat|[deepseek-ai/deepseek-vl-7b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-chat/summary)|q_proj, k_proj, v_proj|deepseek-vl|&#x2714;|&#x2718;|attrdict|vision|[deepseek-ai/deepseek-vl-7b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)|
 |paligemma-3b-pt-448|[AI-ModelScope/paligemma-3b-pt-448](https://modelscope.cn/models/AI-ModelScope/paligemma-3b-pt-448/summary)|q_proj, k_proj, v_proj|paligemma|&#x2714;|&#x2718;|transformers>=4.41|vision|[google/paligemma-3b-pt-448](https://huggingface.co/google/paligemma-3b-pt-448)|

docs/source/Multi-Modal/index.md

Lines changed: 1 addition & 1 deletion
@@ -14,9 +14,9 @@
 1. [Llava Best Practice](llava最佳实践.md)
 2. [Yi-VL Best Practice](yi-vl最佳实践.md)
 3. [mPLUG-Owl2 Best Practice](mplug-owl2最佳实践.md)
-4. [InternVL-Chat-V1.5 Best Practice](internvl最佳实践.md)


 The entire conversation revolves around one image:
 1. [CogVLM Best Practice](cogvlm最佳实践.md), [CogVLM2 Best Practice](cogvlm2最佳实践.md)
 2. [MiniCPM-V Best Practice](minicpm-v最佳实践.md), [MiniCPM-V-2 Best Practice](minicpm-v-2最佳实践.md), [MiniCPM-V-2.5 Best Practice](minicpm-v-2.5最佳实践.md)
+3. [InternVL-Chat-V1.5 Best Practice](internvl最佳实践.md)

docs/source/Multi-Modal/internvl最佳实践.md

Lines changed: 2 additions & 1 deletion
@@ -20,7 +20,8 @@ pip install Pillow

 Inference with [internvl-chat-v1.5](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary) and [internvl-chat-v1.5-int8](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary)

-The tutorial below uses `internvl-chat-v1.5` as an example; you can change `--model_type internvl-chat-v1_5-int8` to select the int8 version of the model
+The tutorial below uses `internvl-chat-v1.5` as an example; you can change `--model_type internvl-chat-v1_5-int8` to select the int8 version of the model, or use `mini-internvl-chat-2b-v1_5`
+or `mini-internvl-chat-4b-v1_5` to use Mini-InternVL

 **Note**
 - If you want to use a local model file, add the argument `--model_id_or_path /path/to/model`

docs/source_en/LLM/Supported-models-datasets.md

Lines changed: 2 additions & 0 deletions
@@ -297,6 +297,8 @@ The table below introcudes all models supported by SWIFT:
 |internlm-xcomposer2-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-7b/summary)|wqkv|internlm-xcomposer2|&#x2714;|&#x2718;||vision|[internlm/internlm-xcomposer2-7b](https://huggingface.co/internlm/internlm-xcomposer2-7b)|
 |internvl-chat-v1_5|[AI-ModelScope/InternVL-Chat-V1-5](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)|
 |internvl-chat-v1_5-int8|[AI-ModelScope/InternVL-Chat-V1-5-int8](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5-int8](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-int8)|
+|mini-internvl-chat-2b-v1_5|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5)|
+|mini-internvl-chat-4b-v1_5|[OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-4B-V1-5/summary)|qkv_proj|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5)|
 |deepseek-vl-1_3b-chat|[deepseek-ai/deepseek-vl-1.3b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-1.3b-chat/summary)|q_proj, k_proj, v_proj|deepseek-vl|&#x2714;|&#x2718;|attrdict|vision|[deepseek-ai/deepseek-vl-1.3b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-1.3b-chat)|
 |deepseek-vl-7b-chat|[deepseek-ai/deepseek-vl-7b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-chat/summary)|q_proj, k_proj, v_proj|deepseek-vl|&#x2714;|&#x2718;|attrdict|vision|[deepseek-ai/deepseek-vl-7b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)|
 |paligemma-3b-pt-448|[AI-ModelScope/paligemma-3b-pt-448](https://modelscope.cn/models/AI-ModelScope/paligemma-3b-pt-448/summary)|q_proj, k_proj, v_proj|paligemma|&#x2714;|&#x2718;|transformers>=4.41|vision|[google/paligemma-3b-pt-448](https://huggingface.co/google/paligemma-3b-pt-448)|
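Note that the two new rows differ in their LoRA target modules: the 2B variant sits on an InternLM2 language backbone and uses `wqkv`, while the 4B variant sits on Phi-3 and uses `qkv_proj` (this matches the `LoRATM.internlm2` / `LoRATM.phi3` choices in the model.py hunk below). A hedged sketch, assuming the `MODEL_MAPPING` registry populated by `@register_model` is importable from `swift.llm` and stores these per-model fields:

```python
# Hedged sketch: read the registered metadata for the new Mini-InternVL model types.
from swift.llm import MODEL_MAPPING  # assumption: the registry is exported at this path

for model_type in ('mini-internvl-chat-2b-v1_5', 'mini-internvl-chat-4b-v1_5'):
    info = MODEL_MAPPING[model_type]
    # Expected fields per the table: lora_target_modules ('wqkv' vs 'qkv_proj'), requires, hf_model_id.
    print(model_type, info.get('lora_target_modules'), info.get('requires'), info.get('hf_model_id'))
```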

docs/source_en/Multi-Modal/index.md

Lines changed: 1 addition & 1 deletion
@@ -13,9 +13,9 @@ A single round of dialogue can contain multiple images (or no images):
 A single round of dialogue can only contain one image:
 1. [Llava Best Practice](llava-best-practice.md)
 2. [Yi-VL Best Practice](yi-vl-best-practice.md)
-5. [InternVL-Chat-V1.5 Best Practice](internvl-best-practice.md)


 The entire conversation revolves around one image:
 1. [CogVLM Best Practice](cogvlm-best-practice.md), [CogVLM2 Best Practice](cogvlm2-best-practice.md)
 2. [MiniCPM-V Best Practice](minicpm-v-best-practice.md)
+3. [InternVL-Chat-V1.5 Best Practice](internvl-best-practice.md)

docs/source_en/Multi-Modal/internvl-best-practice.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ Inference for [internvl-chat-v1.5](https://www.modelscope.cn/models/AI-ModelScop

 Inference with [internvl-chat-v1.5](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary) and [internvl-chat-v1.5-int8](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary).

-The tutorial below takes `internvl-chat-v1.5` as an example, and you can change to `--model_type internvl-chat-v1_5-int8` to select the INT8 version of the model.
+The tutorial below takes `internvl-chat-v1.5` as an example, and you can change to `--model_type internvl-chat-v1_5-int8` to select the INT8 version of the model. Alternatively, select the Mini-InternVL model by choosing either `mini-internvl-chat-2b-v1_5` or `mini-internvl-chat-4b-v1_5`.

 **Note**
 - If you want to use a local model file, add the argument --model_id_or_path /path/to/model.
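As an inference sketch (not part of this commit), assuming the `InferArguments`/`infer_main` entry points exported by `swift.llm`; passing `model_id_or_path='/path/to/model'` should mirror the `--model_id_or_path` note above if local weights are used:

```python
# Hypothetical usage sketch: interactive inference with a Mini-InternVL model type.
from swift.llm import InferArguments, infer_main

if __name__ == '__main__':
    args = InferArguments(model_type='mini-internvl-chat-2b-v1_5')
    # With no evaluation dataset configured, this is expected to drop into the interactive
    # CLI loop, where you supply a query and an image path or URL per turn.
    infer_main(args)
```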

swift/llm/utils/model.py

Lines changed: 24 additions & 0 deletions
@@ -223,6 +223,8 @@ class ModelType:
     # internvl
     internvl_chat_v1_5 = 'internvl-chat-v1_5'
     internvl_chat_v1_5_int8 = 'internvl-chat-v1_5-int8'
+    mini_internvl_chat_2b_v1_5 = 'mini-internvl-chat-2b-v1_5'
+    mini_internvl_chat_4b_v1_5 = 'mini-internvl-chat-4b-v1_5'
     # deepseek
     deepseek_7b = 'deepseek-7b'
     deepseek_7b_chat = 'deepseek-7b-chat'
@@ -2789,6 +2791,7 @@ def _new_forward(*args, **kwargs):
     TemplateType.internvl,
     requires=['transformers>=4.35', 'timm'],
     support_flash_attn=True,
+    placeholder_tokens=['<IMG_CONTEXT>'],
     tags=['multi-modal', 'vision'],
     hf_model_id='OpenGVLab/InternVL-Chat-V1-5')
 @register_model(
@@ -2798,8 +2801,29 @@ def _new_forward(*args, **kwargs):
     TemplateType.internvl,
     requires=['transformers>=4.35', 'timm'],
     support_flash_attn=True,
+    placeholder_tokens=['<IMG_CONTEXT>'],
     tags=['multi-modal', 'vision'],
     hf_model_id='OpenGVLab/InternVL-Chat-V1-5-int8')
+@register_model(
+    ModelType.mini_internvl_chat_2b_v1_5,
+    'OpenGVLab/Mini-InternVL-Chat-2B-V1-5',
+    LoRATM.internlm2,
+    TemplateType.internvl,
+    requires=['transformers>=4.35', 'timm'],
+    support_flash_attn=True,
+    placeholder_tokens=['<IMG_CONTEXT>'],
+    tags=['multi-modal', 'vision'],
+    hf_model_id='OpenGVLab/Mini-InternVL-Chat-2B-V1-5')
+@register_model(
+    ModelType.mini_internvl_chat_4b_v1_5,
+    'OpenGVLab/Mini-InternVL-Chat-4B-V1-5',
+    LoRATM.phi3,
+    TemplateType.internvl,
+    requires=['transformers>=4.35', 'timm'],
+    support_flash_attn=True,
+    placeholder_tokens=['<IMG_CONTEXT>'],
+    tags=['multi-modal', 'vision'],
+    hf_model_id='OpenGVLab/Mini-InternVL-Chat-4B-V1-5')
 def get_model_tokenizer_internvl(model_dir: str,
                                  torch_dtype: Dtype,
                                  model_kwargs: Dict[str, Any],
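The two new registrations reuse the existing `get_model_tokenizer_internvl` loader and the `internvl` template; they differ only in the checkpoint ids and in the LoRA target modules of the underlying LLM (`LoRATM.internlm2` for the 2B model, `LoRATM.phi3` for the 4B model). Below is a hedged sketch of how a further InternVL-style checkpoint could be registered with the same decorator, written as if it sat next to these entries in `model.py` (so `register_model`, `LoRATM`, `TemplateType`, `Dtype`, and the loader are already in scope); the model_type string and repository ids are hypothetical:

```python
# Hypothetical registration sketch -- 'my-internvl-2b-custom' and the repo ids are made up.
@register_model(
    'my-internvl-2b-custom',                  # hypothetical model_type string
    'my-org/My-InternVL-2B',                  # hypothetical ModelScope model id
    LoRATM.internlm2,                         # LoRA targets follow the underlying LLM (wqkv for InternLM2)
    TemplateType.internvl,                    # reuse the InternVL chat template
    requires=['transformers>=4.35', 'timm'],
    support_flash_attn=True,
    placeholder_tokens=['<IMG_CONTEXT>'],     # token the template later expands into image features
    tags=['multi-modal', 'vision'],
    hf_model_id='my-org/My-InternVL-2B')      # hypothetical Hugging Face mirror
def get_model_tokenizer_internvl(model_dir, torch_dtype, model_kwargs, **kwargs):
    ...  # stands in for the existing loader; remaining parameters and body elided
```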

swift/llm/utils/template.py

Lines changed: 28 additions & 10 deletions
@@ -910,37 +910,55 @@ def get_generate_ids(generate_ids: Tensor, input_token_len: int) -> List[int]:

 class InternvlTemplate(Template):
     system = 'You are an AI assistant whose name is InternLM (书生·浦语).'
-    internvl_query_template = '\n{{QUERY}}<|im_end|><|im_start|>assistant\n'
     num_image_token = 256

     def __init__(self):
-        super().__init__([], ['<|im_start|>user\n{{QUERY}}<|im_end|><|im_start|>assistant\n'], ['<|im_end|>'],
-                         ['<|im_end|>'], self.system, ['<|im_start|>system\n{{SYSTEM}}'])
+        super().__init__(['<s>'], ['<|im_start|>user\n', [-100], '{{QUERY}}<|im_end|><|im_start|>assistant\n'],
+                         ['<|im_end|>'], ['<|im_end|>'], self.system, ['<|im_start|>system\n{{SYSTEM}}'])

     def encode(self, example: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
+        inputs, _ = super().encode(example)
+        if len(inputs) == 0:
+            return inputs, {}
+        input_ids = inputs['input_ids']
+        idx_list = _findall(input_ids, -100)
         pixel_values = None
         if example.get('images') is not None:
             from .vision_utils import load_image
+            labels = inputs['labels']
+            if len(idx_list) >= 2:
+                input_ids = _remove_idx(input_ids, idx_list[1:])
+                if labels is not None:
+                    labels = _remove_idx(labels, idx_list[1:])
+
             images_path = example['images']
             pixel_values = []
             for image_path in images_path:
                 pixel_values.append(load_image(image_path))
             pixel_values = torch.cat(pixel_values, dim=0)
             image_bs = pixel_values.shape[0]
-            if example.get('query') is not None:
-                example['query'] = ('<img>' + '<IMG_CONTEXT>' * self.num_image_token * image_bs + '</img>\n'
-                                    + example['query'])

-        inputs, _ = super().encode(example)
-        inputs.pop('loss_scale', None)
-        if pixel_values is not None:
+            idx = idx_list[0]
+            img_tokens = self.tokenizer.encode('<img>' + '<IMG_CONTEXT>' * self.num_image_token * image_bs + '</img>\n')
+            input_ids = input_ids[:idx] + img_tokens + input_ids[idx + 1:]
+            if labels is not None:
+                labels = labels[:idx] + [-100] * len(img_tokens) + labels[idx + 1:]
+            inputs['input_ids'] = input_ids
+            inputs['labels'] = labels
+
             inputs['pixel_values'] = pixel_values.to(self.model.dtype)
             inputs['image_flags'] = torch.ones(image_bs)
+        else:
+            input_ids = _remove_idx(input_ids, idx_list)
+            if labels is not None:
+                labels = _remove_idx(labels, idx_list)

+        inputs.pop('loss_scale', None)
         return inputs, {}

     def data_collator(self, batch: List[Dict[str, Any]], padding_to: Optional[int] = None) -> Dict[str, Any]:
         res = super().data_collator(batch, padding_to)
+        assert all('pixel_values' in b for b in batch), 'Temporarily, Interval only supports data with images'
         res['pixel_values'] = torch.concat([b['pixel_values'] for b in batch])
         res['image_flags'] = torch.concat([b['image_flags'] for b in batch])
         return res
@@ -955,7 +973,7 @@ def get_generate_ids(generate_ids: Tensor, input_token_len: int) -> List[int]:
     InternvlTemplate(),
     use_model=True,
     lazy_tokenize=True,
-    infer_media_type='round',
+    infer_media_type='dialogue',
     dataloader_num_workers=0,
     dataloader_pin_memory=False)
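To make the new `encode` path easier to follow: the template now emits a single `-100` placeholder in the tokenized prompt, `_findall` locates it, any extra occurrences are dropped, and the placeholder is spliced out in favour of the `<img>` + `<IMG_CONTEXT>` * 256 * image_bs + `</img>` token ids, with the matching label positions masked to `-100` so the image tokens contribute no loss. This pairs with the `infer_media_type` change from `'round'` to `'dialogue'` and with moving InternVL in the docs index into the "entire conversation revolves around one image" group. Below is a standalone sketch of that splice using plain Python lists and toy token ids (no swift or tokenizer dependencies; the helper name and ids are illustrative only):

```python
# Standalone sketch of the splice performed by InternvlTemplate.encode:
# replace one -100 placeholder with the image token ids and mask the matching labels.
NUM_IMAGE_TOKEN = 256  # per image patch group, as in InternvlTemplate

def splice_image_tokens(input_ids, labels, img_tokens, placeholder=-100):
    idx = input_ids.index(placeholder)                      # first (and only kept) placeholder
    new_input_ids = input_ids[:idx] + img_tokens + input_ids[idx + 1:]
    new_labels = None
    if labels is not None:
        new_labels = labels[:idx] + [-100] * len(img_tokens) + labels[idx + 1:]
    return new_input_ids, new_labels

prompt_ids = [1, 101, -100, 102, 103]          # toy ids: 1=<s>, -100=image placeholder, 102/103=response
prompt_labels = [-100, -100, -100, 102, 103]   # only the response tokens are supervised
image_bs = 1                                   # number of image patch groups returned by load_image
img_tokens = [7] + [8] * (NUM_IMAGE_TOKEN * image_bs) + [9]  # toy ids: 7=<img>, 8=<IMG_CONTEXT>, 9=</img>
ids, labels = splice_image_tokens(prompt_ids, prompt_labels, img_tokens)
assert len(ids) == len(prompt_ids) - 1 + len(img_tokens)
assert labels[-2:] == [102, 103]               # response supervision is preserved
assert set(labels[:-2]) == {-100}              # prompt and image positions stay masked
```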
