
Commit cdf4e51

YuzaChongyi authored and tastelikefeet committed
feat(model): support minicpm-v-2 (#699)
(cherry picked from commit 8981182)
1 parent c53587a commit cdf4e51

File tree

4 files changed: +229 −12 lines

Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
# MiniCPM-V-2 Best Practice

## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference](#inference)
- [Fine-tuning](#fine-tuning)
- [Inference After Fine-tuning](#inference-after-fine-tuning)


## Environment Setup
```shell
pip install ms-swift[llm] -U
```

## Inference

Inference with [minicpm-v-2](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2/summary):
```shell
# Experimental environment: A10, 3090, V100, ...
# 10GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type minicpm-v-v2
```

Output: (local image paths and URLs are both supported)
```python
"""
<<< 描述这张图片
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
这张图片展示了一只小猫的特写,它的毛色主要是黑白相间,带有一些浅色条纹,可能是灰色或白色。小猫的眼睛是蓝色的,瞳孔呈黑色,与它的毛色形成鲜明对比。它的耳朵竖立着,尖端是白色的,看起来警觉而好奇。小猫的鼻子是黑色的,嘴巴微微张开,露出牙齿,表明它可能在微笑或发出声音。背景模糊不清,但似乎是柔和的绿色,可能是室内环境,比如房间或房间的一部分。小猫的表情和姿势传达出一种顽皮和可爱的感觉。
--------------------------------------------------
<<< clear
<<< 图中有几只羊?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
这幅图描绘了一群羊在草地上。总共有四只羊,它们都长着白色的毛和棕色的角。这些羊的大小各不相同,其中一只看起来比其他三只要小一些。它们站在一片郁郁葱葱的草地上,背景是起伏的山脉,天空中飘着几朵云。这幅图像的风格是卡通化的,羊的面部表情和身体特征都夸张化了。
--------------------------------------------------
<<< clear
<<< 计算结果是多少
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
计算结果是1452 + 4530 = 5982。
--------------------------------------------------
<<< clear
<<< 根据图片中的内容写首诗
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png
这幅图片描绘了一个宁静的夜晚场景,一艘小船漂浮在宁静的湖面上。船身呈棕色,看起来像是木质结构,船头有桅杆,顶部有一盏灯,可能是为了导航或照明。船身周围散布着一些小火苗,给画面增添了温暖的光芒。湖面反射着星星和灯光,营造出一种宁静而梦幻的氛围。背景中,树木繁茂,呈现出深绿色,暗示着森林或丛林的环境。天空呈现出渐变的粉色和紫色,暗示着日出或日落。整体氛围宁静而略带神秘感。
"""
```

The example images are shown below:

cat:

<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png" width="250" style="display: inline-block;">

animal:

<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png" width="250" style="display: inline-block;">

math:

<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png" width="250" style="display: inline-block;">

poem:

<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">

**Single-Sample Inference**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch

model_type = ModelType.minicpm_v_v2
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']
query = '距离各城市多远?'
response, history = inference(model, template, query, images=images)
print(f'query: {query}')
print(f'response: {response}')

# Streaming
query = '距离最远的城市是哪?'
gen = inference_stream(model, template, query, history, images=images)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
"""
query: 距离最远的城市是哪?
response: 距离最远的城市是广州,距离为293公里。
history: [['距离各城市多远?', ' 马踏到马塔14公里,到阳江62公里,到广州293公里。'], ['距离最远的城市是哪?', ' 距离最远的城市是广州,距离为293公里。']]
"""
```

The example image is shown below:

road:

<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">

## Fine-tuning
Fine-tuning of multimodal large models usually uses a **custom dataset**. Here is a demo that can be run directly:

(By default, LoRA fine-tuning is applied only to the qkv projections of the LLM part. If you want to fine-tune all linear layers, including the vision model part, you can specify `--lora_target_modules ALL`. Full-parameter fine-tuning is also supported; both variants are sketched after the demo below.)
```shell
# Experimental environment: A10, 3090, V100, ...
# 10GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type minicpm-v-v2 \
    --dataset coco-mini-en-2
```
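
The two variants mentioned in the note above could look roughly as follows. This is a sketch rather than part of the original demo: `--lora_target_modules ALL` is the flag named above, while `--sft_type full` is an assumed flag name for full-parameter fine-tuning and should be checked against the swift version you have installed (full-parameter training also needs considerably more GPU memory).
```shell
# LoRA on all linear layers, including the vision part (flag named in the note above)
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type minicpm-v-v2 \
    --dataset coco-mini-en-2 \
    --lora_target_modules ALL

# Full-parameter fine-tuning (assumed flag: --sft_type full)
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type minicpm-v-v2 \
    --dataset coco-mini-en-2 \
    --sft_type full
```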

[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) support json and jsonl formats. The following is an example of a custom dataset:

(Multi-turn dialogue is supported, but the whole dialogue may contain only one image; local paths and URLs are both supported.)

```jsonl
{"query": "55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["image_path"]}
```
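
To train on a file in this format, the dataset path can be passed on the command line. A minimal sketch, assuming the `--custom_train_dataset_path` / `--custom_val_dataset_path` arguments described in the customization doc linked above and hypothetical local files `train.jsonl` and `val.jsonl`:
```shell
# Sketch: fine-tune on local jsonl files in the format shown above.
# train.jsonl / val.jsonl are hypothetical paths; the argument names follow
# the customization doc linked above and may differ across swift versions.
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type minicpm-v-v2 \
    --custom_train_dataset_path train.jsonl \
    --custom_val_dataset_path val.jsonl
```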


## Inference After Fine-tuning
Direct inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/minicpm-v-v2/vx-xxx/checkpoint-xxx \
    --load_dataset_config true
```

**merge-lora** and inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir output/minicpm-v-v2/vx-xxx/checkpoint-xxx \
    --merge_lora true

CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/minicpm-v-v2/vx-xxx/checkpoint-xxx-merged \
    --load_dataset_config true
```

docs/source/Multi-Modal/minicpm-v最佳实践.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -87,7 +87,7 @@ model_type = ModelType.minicpm_v_3b_chat
 template_type = get_default_template_type(model_type)
 print(f'template_type: {template_type}')
 
-model, tokenizer = get_model_tokenizer(model_type, torch.float16,
+model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
                                        model_kwargs={'device_map': 'auto'})
 model.generation_config.max_new_tokens = 256
 template = get_template(template_type, tokenizer)
```

swift/llm/utils/model.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -2859,7 +2859,7 @@ def get_model_tokenizer_minicpm(model_dir: str,
     support_flash_attn=True)
 @register_model(
     ModelType.minicpm_v_v2,
-    'OpenBMB/MiniCPM-V-2.0',
+    'OpenBMB/MiniCPM-V-2',
     LoRATM.llama2,
     TemplateType.minicpm_v,
     support_flash_attn=True)
```

swift/llm/utils/template.py

Lines changed: 66 additions & 10 deletions
```diff
@@ -3,6 +3,7 @@
 from io import BytesIO
 from typing import Any, Dict, List, Literal, Optional, Tuple, Union
 
+import numpy as np
 import requests
 import torch
 import torch.nn.functional as F
@@ -1159,17 +1160,72 @@ def encode(
         inputs, _ = super().encode(example)
         input_ids = inputs['input_ids']
         labels = inputs['labels']
-        idx = input_ids.index(0)
+
+        img_start_idxs = np.where(
+            np.array(input_ids) == self.tokenizer.im_start_id)[0]
+        if len(
+                img_start_idxs
+        ) > 1:  # if multi-round, input_ids have multiple <image><unk></image>\n
+            start = 0
+            new_input_ids = []
+            for idx in img_start_idxs[1:]:
+                new_input_ids = new_input_ids + input_ids[start:idx]
+                start = idx + 4  # skip <image><unk></image>\n
+            new_input_ids = new_input_ids + input_ids[start:]
+            input_ids = new_input_ids
+
+        idx = img_start_idxs[0] + 1  # first <unk>
         config = self.model.config
-        input_ids = (
-            input_ids[:idx] + [self.tokenizer.unk_token_id] * config.query_num
-            + input_ids[idx + 1:])
-        if labels is not None:
-            labels = (
-                labels[:idx] + [-100] * config.query_num + labels[idx + 1:])
-        image_bound = [torch.tensor([[idx, idx + config.query_num]])]
-        pixel_values = self.model.transform(image)[None].to(
-            device=self.model.device)
+        if hasattr(config, 'slice_mode') and config.slice_mode:
+            slice_mode = True
+            assert hasattr(config, 'patch_size')
+            assert hasattr(config, 'max_slice_nums')
+            assert hasattr(config, 'scale_resolution')
+        else:
+            slice_mode = False
+
+        if slice_mode:
+            images, placeholder = self.model.get_slice_image_placeholder(
+                image, self.tokenizer)
+            placeholder_id = self.tokenizer.encode(
+                placeholder, add_special_tokens=False)
+            input_ids = (
+                input_ids[:idx - 1] + placeholder_id + input_ids[idx + 2:])
+            if labels is not None:
+                labels = (
+                    labels[:idx - 1] + [-100] * len(placeholder_id)
+                    + labels[idx + 2:])
+            input_tensor_ids = torch.tensor(input_ids)
+            image_start_idx = torch.where(
+                input_tensor_ids == self.tokenizer.im_start_id)[0]
+            image_start_idx += 1
+            image_end_idx = torch.where(
+                input_tensor_ids == self.tokenizer.im_end_id)[0]
+            valid_image_nums = max(len(image_start_idx), len(image_end_idx))
+            image_bound = [
+                torch.hstack([
+                    image_start_idx[:valid_image_nums].unsqueeze(-1),
+                    image_end_idx[:valid_image_nums].unsqueeze(-1)
+                ])
+            ]
+            pixel_values = [
+                self.model.transform(img).to(device=self.model.device)
+                for img in images
+            ]
+
+        else:
+            input_ids = (
+                input_ids[:idx]
+                + [self.tokenizer.unk_token_id] * config.query_num
+                + input_ids[idx + 1:])
+            if labels is not None:
+                labels = (
+                    labels[:idx] + [-100] * config.query_num
+                    + labels[idx + 1:])
+            image_bound = [torch.tensor([[idx, idx + config.query_num]])]
+            pixel_values = [
+                self.model.transform(image).to(device=self.model.device)
+            ]
         inputs_embeds, _ = self.model.get_vllm_embedding({
             'input_ids':
             torch.tensor(input_ids)[None].to(device=self.model.device),
```
