Commit 00c2eaa

fix docs multimodal; fix pretrain mllm (#2742)
1 parent f913bca commit 00c2eaa

5 files changed, +39 −13 lines changed


docs/source/Customization/自定义数据集.md

Lines changed: 16 additions & 4 deletions

````diff
@@ -34,7 +34,7 @@ query-response format:
 
 ## Recommended Dataset Format
 
-The following gives the recommended dataset format for ms-swift
+The following gives the recommended dataset format for ms-swift, where the system field is optional and defaults to the `default_system` defined in the template
 
 ### Pre-training
 
@@ -69,11 +69,23 @@ query-response format:
 
 ### Multimodal
 
-For multimodal datasets, the format is the same as the tasks above. The difference is the added keys `images`, `videos`, and `audios`, which hold the multimodal resources:
+For multimodal datasets, the format is the same as the tasks above. The difference is the added keys `images`, `videos`, and `audios`, which hold the multimodal resources; the `<image>`, `<video>`, and `<audio>` tags mark the positions where images/videos/audio are inserted. The four examples below show the data formats for plain text and for image, video, and audio data.
+
+Pre-training:
+```
+{"messages": [{"role": "assistant", "content": "The pre-training text goes here"}]}
+{"messages": [{"role": "assistant", "content": "<image> is a puppy, <image> is a kitten"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
+{"messages": [{"role": "assistant", "content": "<audio> says the weather is really nice today"}], "audios": ["/xxx/x.wav"]}
+{"messages": [{"role": "assistant", "content": "<image> is an elephant, <video> is a lion running"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
+```
+
+Fine-tuning:
 ```jsonl
-{"messages": [{"role": "system", "content": "You are a helpful and harmless assistant"}, {"role": "user", "content": "<image>What is in the image, <video>What is in the video"}, {"role": "assistant", "content": "An elephant, a lion"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
+{"messages": [{"role": "user", "content": "<image><image>What is the difference between the two images"}, {"role": "assistant", "content": "The first is a kitten, the second is a puppy"}], "images": ["/xxx/x.jpg", "xxx/x.png"]}
+{"messages": [{"role": "user", "content": "<audio>What does the speech say"}, {"role": "assistant", "content": "The weather is really nice today"}], "audios": ["/xxx/x.mp3"]}
+{"messages": [{"role": "system", "content": "You are a helpful and harmless assistant"}, {"role": "user", "content": "<image>What is in the image, <video>What is in the video"}, {"role": "assistant", "content": "The image shows an elephant, and the video shows a puppy running on the grass"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
 ```
-The `<image>` `<video>` `<audio>` tags mark where images/videos/audio are inserted
+The RLHF data format can follow the format of plain-text large models
 
 #### grounding
 
````
docs/source_en/Customization/Custom-dataset.md

Lines changed: 18 additions & 4 deletions

````diff
@@ -33,7 +33,7 @@ There are three ways to integrate a custom dataset, with increasing control over
 
 ## Recommended Dataset Format
 
-Here is the recommended dataset format for ms-swift:
+The following provides the recommended dataset format for ms-swift, where the system field is optional and defaults to the `default_system` defined in the template.
 
 ### Pre-training
 
@@ -68,11 +68,25 @@ Here is the recommended dataset format for ms-swift:
 
 ### Multimodal
 
-For multimodal datasets, the format is the same as above. The difference is that it includes the keys `images`, `videos`, and `audios`, which represent multimodal resources:
+For multimodal datasets, the format is the same as the tasks mentioned above. The difference is the addition of several keys: `images`, `videos`, and `audios`, which represent multimodal resources. The tags `<image>`, `<video>`, and `<audio>` indicate the positions where images, videos, and audio are inserted, respectively. The four examples provided below demonstrate the data format for pure text, as well as formats that include image, video, and audio data.
+
+
+Pre-training:
+```jsonl
+{"messages": [{"role": "assistant", "content": "Pre-trained text goes here"}]}
+{"messages": [{"role": "assistant", "content": "<image> is a puppy, <image> is a kitten"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
+{"messages": [{"role": "assistant", "content": "<audio> describes how nice the weather is today"}], "audios": ["/xxx/x.wav"]}
+{"messages": [{"role": "assistant", "content": "<image> is an elephant, <video> is a lion running"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
+```
+
+Supervised Fine-tuning:
+
 ```jsonl
-{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "<image> What is in the image? <video> What is in the video?"}, {"role": "assistant", "content": "An elephant and a lion"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
+{"messages": [{"role": "user", "content": "<image><image>What is the difference between the two images?"}, {"role": "assistant", "content": "The first one is a kitten, and the second one is a puppy."}], "images": ["/xxx/x.jpg", "xxx/x.png"]}
+{"messages": [{"role": "user", "content": "<audio>What did the audio say?"}, {"role": "assistant", "content": "The weather is really nice today."}], "audios": ["/xxx/x.mp3"]}
+{"messages": [{"role": "system", "content": "You are a helpful and harmless assistant."}, {"role": "user", "content": "<image>What is in the image, <video>What is in the video?"}, {"role": "assistant", "content": "The image shows an elephant, and the video shows a puppy running on the grass."}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
 ```
-The `<image>`, `<video>`, and `<audio>` tags indicate where to insert images/videos/audios.
+The data format for RLHF can refer to the format used for pure text large models.
 
 #### Grounding
 
````
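As a quick sanity check on the format above (not part of this commit; the helper and file name below are hypothetical), a minimal Python sketch that verifies each JSONL sample has as many `<image>`/`<video>`/`<audio>` tags as entries in the corresponding resource list:

```python
import json

# Hypothetical validator, not from ms-swift: each tag in the messages
# should have a matching entry in the corresponding resource list.
TAG_TO_KEY = {'<image>': 'images', '<video>': 'videos', '<audio>': 'audios'}

def validate_sample(sample: dict) -> None:
    text = ''.join(m.get('content', '') for m in sample['messages'])
    for tag, key in TAG_TO_KEY.items():
        n_tags = text.count(tag)
        n_files = len(sample.get(key, []))
        assert n_tags == n_files, f'{tag}: {n_tags} tags vs {n_files} {key}'

# 'train.jsonl' is an assumed file name.
with open('train.jsonl', encoding='utf-8') as f:
    for line in f:
        validate_sample(json.loads(line))
```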

swift/llm/template/base.py

Lines changed: 1 addition & 0 deletions

````diff
@@ -442,6 +442,7 @@ def _pre_tokenize(self, context_list: List[Context], loss_scale_list: List[float
             idx = getattr(inputs, f'{k}_idx')
             c_list = self.replace_tag(k, idx, inputs)
             setattr(inputs, f'{k}_idx', idx + 1)
+            loss_scale = 0.
             break
         else:
             if context == '<ref-object>':
````
swift/llm/template/template/gemma.py

Lines changed: 1 addition & 1 deletion

````diff
@@ -41,7 +41,7 @@ def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
         else:
             encoded['token_type_ids'] = [0] * len(encoded['input_ids'])
         if raw_image:
-            model_inputs = processor(text=inputs.to_history()['query'], images=raw_image[0], return_tensors='pt')
+            model_inputs = processor(text='<image>' * len(raw_image), images=raw_image, return_tensors='pt')
             encoded['pixel_values'] = model_inputs['pixel_values'].to(self.config.torch_dtype)
         return encoded
 
````
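Before the fix, only the first image (`raw_image[0]`) and the raw query text reached the processor; the fixed call builds one `<image>` placeholder per image and passes the full list, so multi-image pre-training samples encode every image. A rough usage sketch under the same convention (checkpoint name and file paths are assumptions, not from this commit):

```python
from PIL import Image
from transformers import AutoProcessor

# Assumed checkpoint; any PaliGemma-style processor follows the same
# one-'<image>'-placeholder-per-image convention used by the fix.
processor = AutoProcessor.from_pretrained('google/paligemma-3b-pt-224')
images = [Image.open('a.jpg'), Image.open('b.jpg')]  # assumed local files
model_inputs = processor(text='<image>' * len(images), images=images,
                         return_tensors='pt')
print(model_inputs['pixel_values'].shape)  # one pixel tensor per image
```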

tests/test_align/test_template/test_vision.py

Lines changed: 3 additions & 4 deletions

````diff
@@ -189,10 +189,9 @@ def test_paligemma2():
     pt_engine = PtEngine('AI-ModelScope/paligemma2-3b-ft-docci-448', torch_dtype=torch.bfloat16)
     response = _infer_model(pt_engine, messages=[{'role': 'user', 'content': 'caption en'}])
     assert response == (
-        'A close up view of a white kitten with black stripes on its head and body. The kitten is looking straight '
-        'ahead with its light blue eyes. The kitten has a pink nose and mouth. The kitten is sitting on a white '
-        'surface. A white light is shining on the kitten and the white surface. A shadow is being cast underneath '
-        'the kitten and the white surface.')
+        'A close up view of a white and gray kitten with black stripes on its head and face staring forward with '
+        'its light blue eyes. The kitten is sitting on a white surface with a blurry background. '
+        "There is a light shining on the top of the kitten's head and the front of its body.")
 
 
 def test_pixtral():
````