Commit 00c2eaa

fix docs multimodal; fix pretrain mllm (#2742)
1 parent f913bca commit 00c2eaa

5 files changed, +39 −13 lines changed


docs/source/Customization/自定义数据集.md

Lines changed: 16 additions & 4 deletions

````diff
@@ -34,7 +34,7 @@ query-response format:
 
 ## Recommended Dataset Format
 
-The following gives the recommended dataset format for ms-swift
+The following gives the recommended dataset format for ms-swift, where the system field is optional and defaults to the `default_system` defined in the template
 
 ### Pre-training
 
@@ -69,11 +69,23 @@ query-response format:
 
 ### Multimodal
 
-For multimodal datasets, the format is the same as the tasks above. The difference is the added keys `images`, `videos`, and `audios`, which hold the multimodal resources:
+For multimodal datasets, the format is the same as the tasks above. The difference is the added keys `images`, `videos`, and `audios`, which hold the multimodal resources; the `<image>`, `<video>`, and `<audio>` tags mark the positions where images/videos/audio are inserted. The four examples below show the data formats for plain text and for image, video, and audio data.
+
+Pre-training:
+```
+{"messages": [{"role": "assistant", "content": "The pre-training text goes here"}]}
+{"messages": [{"role": "assistant", "content": "<image> is a puppy, <image> is a kitten"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
+{"messages": [{"role": "assistant", "content": "<audio> says the weather is really nice today"}], "audios": ["/xxx/x.wav"]}
+{"messages": [{"role": "assistant", "content": "<image> is an elephant, <video> is a lion running"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
+```
+
+Fine-tuning:
 ```jsonl
-{"messages": [{"role": "system", "content": "You are a helpful and harmless assistant"}, {"role": "user", "content": "<image>What is in the image, <video>What is in the video"}, {"role": "assistant", "content": "An elephant, a lion"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
+{"messages": [{"role": "user", "content": "<image><image>What is the difference between the two images"}, {"role": "assistant", "content": "The first is a kitten, the second is a puppy"}], "images": ["/xxx/x.jpg", "xxx/x.png"]}
+{"messages": [{"role": "user", "content": "<audio>What does the speech say"}, {"role": "assistant", "content": "The weather is really nice today"}], "audios": ["/xxx/x.mp3"]}
+{"messages": [{"role": "system", "content": "You are a helpful and harmless assistant"}, {"role": "user", "content": "<image>What is in the image, <video>What is in the video"}, {"role": "assistant", "content": "The image shows an elephant, and the video shows a puppy running on the grass"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
 ```
-The `<image>` `<video>` `<audio>` tags mark where images/videos/audio are inserted
+The RLHF data format can follow the format of plain-text large models
 
 #### grounding
 
````
docs/source_en/Customization/Custom-dataset.md

Lines changed: 18 additions & 4 deletions

````diff
@@ -33,7 +33,7 @@ There are three ways to integrate a custom dataset, with increasing control over
 
 ## Recommended Dataset Format
 
-Here is the recommended dataset format for ms-swift:
+The following provides the recommended dataset format for ms-swift, where the system field is optional and defaults to the `default_system` defined in the template.
 
 ### Pre-training
 
@@ -68,11 +68,25 @@ Here is the recommended dataset format for ms-swift:
 
 ### Multimodal
 
-For multimodal datasets, the format is the same as above. The difference is that it includes the keys `images`, `videos`, and `audios`, which represent multimodal resources:
+For multimodal datasets, the format is the same as the tasks mentioned above. The difference is the addition of several keys: `images`, `videos`, and `audios`, which represent multimodal resources. The tags `<image>`, `<video>`, and `<audio>` indicate the positions where images, videos, and audio are inserted, respectively. The four examples provided below demonstrate the data format for pure text, as well as formats that include image, video, and audio data.
+
+
+Pre-training:
+```jsonl
+{"messages": [{"role": "assistant", "content": "Pre-trained text goes here"}]}
+{"messages": [{"role": "assistant", "content": "<image> is a puppy, <image> is a kitten"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
+{"messages": [{"role": "assistant", "content": "<audio> describes how nice the weather is today"}], "audios": ["/xxx/x.wav"]}
+{"messages": [{"role": "assistant", "content": "<image> is an elephant, <video> is a lion running"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
+```
+
+Supervised Fine-tuning:
+
 ```jsonl
-{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "<image> What is in the image? <video> What is in the video?"}, {"role": "assistant", "content": "An elephant and a lion"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
+{"messages": [{"role": "user", "content": "<image><image>What is the difference between the two images?"}, {"role": "assistant", "content": "The first one is a kitten, and the second one is a puppy."}], "images": ["/xxx/x.jpg", "xxx/x.png"]}
+{"messages": [{"role": "user", "content": "<audio>What did the audio say?"}, {"role": "assistant", "content": "The weather is really nice today."}], "audios": ["/xxx/x.mp3"]}
+{"messages": [{"role": "system", "content": "You are a helpful and harmless assistant."}, {"role": "user", "content": "<image>What is in the image, <video>What is in the video?"}, {"role": "assistant", "content": "The image shows an elephant, and the video shows a puppy running on the grass."}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
 ```
-The `<image>`, `<video>`, and `<audio>` tags indicate where to insert images/videos/audios.
+The data format for RLHF can refer to the format used for pure text large models.
 
 #### Grounding
 
````
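As a quick sanity check on the format above (not part of this commit; the helper and file name below are hypothetical), a minimal Python sketch that verifies each JSONL sample has as many `<image>`/`<video>`/`<audio>` tags as entries in the corresponding resource list:

```python
import json

# Hypothetical validator, not from ms-swift: each tag in the messages
# should have a matching entry in the corresponding resource list.
TAG_TO_KEY = {'<image>': 'images', '<video>': 'videos', '<audio>': 'audios'}

def validate_sample(sample: dict) -> None:
    text = ''.join(m.get('content', '') for m in sample['messages'])
    for tag, key in TAG_TO_KEY.items():
        n_tags = text.count(tag)
        n_files = len(sample.get(key, []))
        assert n_tags == n_files, f'{tag}: {n_tags} tags vs {n_files} {key}'

# 'train.jsonl' is an assumed file name.
with open('train.jsonl', encoding='utf-8') as f:
    for line in f:
        validate_sample(json.loads(line))
```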

swift/llm/template/base.py

Lines changed: 1 addition & 0 deletions

````diff
@@ -442,6 +442,7 @@ def _pre_tokenize(self, context_list: List[Context], loss_scale_list: List[float
             idx = getattr(inputs, f'{k}_idx')
             c_list = self.replace_tag(k, idx, inputs)
             setattr(inputs, f'{k}_idx', idx + 1)
+            loss_scale = 0.
             break
         else:
             if context == '<ref-object>':
````
swift/llm/template/template/gemma.py

Lines changed: 1 addition & 1 deletion

````diff
@@ -41,7 +41,7 @@ def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
         else:
             encoded['token_type_ids'] = [0] * len(encoded['input_ids'])
         if raw_image:
-            model_inputs = processor(text=inputs.to_history()['query'], images=raw_image[0], return_tensors='pt')
+            model_inputs = processor(text='<image>' * len(raw_image), images=raw_image, return_tensors='pt')
             encoded['pixel_values'] = model_inputs['pixel_values'].to(self.config.torch_dtype)
         return encoded
 
````
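Before the fix, only the first image (`raw_image[0]`) and the raw query text reached the processor; the fixed call builds one `<image>` placeholder per image and passes the full list, so multi-image pre-training samples encode every image. A rough usage sketch under the same convention (checkpoint name and file paths are assumptions, not from this commit):

```python
from PIL import Image
from transformers import AutoProcessor

# Assumed checkpoint; any PaliGemma-style processor follows the same
# one-'<image>'-placeholder-per-image convention used by the fix.
processor = AutoProcessor.from_pretrained('google/paligemma-3b-pt-224')
images = [Image.open('a.jpg'), Image.open('b.jpg')]  # assumed local files
model_inputs = processor(text='<image>' * len(images), images=images,
                         return_tensors='pt')
print(model_inputs['pixel_values'].shape)  # one pixel tensor per image
```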

tests/test_align/test_template/test_vision.py

Lines changed: 3 additions & 4 deletions

````diff
@@ -189,10 +189,9 @@ def test_paligemma2():
     pt_engine = PtEngine('AI-ModelScope/paligemma2-3b-ft-docci-448', torch_dtype=torch.bfloat16)
     response = _infer_model(pt_engine, messages=[{'role': 'user', 'content': 'caption en'}])
     assert response == (
-        'A close up view of a white kitten with black stripes on its head and body. The kitten is looking straight '
-        'ahead with its light blue eyes. The kitten has a pink nose and mouth. The kitten is sitting on a white '
-        'surface. A white light is shining on the kitten and the white surface. A shadow is being cast underneath '
-        'the kitten and the white surface.')
+        'A close up view of a white and gray kitten with black stripes on its head and face staring forward with '
+        'its light blue eyes. The kitten is sitting on a white surface with a blurry background. '
+        "There is a light shining on the top of the kitten's head and the front of its body.")
 
 
 def test_pixtral():
````