
Commit 603a655

Support Latex-OCR dataset (#1810)
1 parent ebc0a90 commit 603a655

10 files changed: +72 −18 lines

README.md

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ You can contact us and communicate with us by adding our group:

 ## 🎉 News
 - 🔥2024.08.22: Support the `reft` tuner from [ReFT](https://github.com/stanfordnlp/pyreft), which is 15×–65× more parameter-efficient than LoRA; use `--sft_type reft` to begin!
-- 2024.08.21: Support for phi3_5-mini-instruct, phi3_5-moe-instruct, and phi3_5-vision-instruct.
+- 🔥2024.08.21: Support for phi3_5-mini-instruct, phi3_5-moe-instruct, and phi3_5-vision-instruct. The best practice for fine-tuning Latex OCR using phi3_5-vision-instruct can be found [here](https://github.com/modelscope/ms-swift/issues/1809).
 - 2024.08.21: Support for idefics3-8b-llama3, llava-onevision-qwen2-0_5b-ov, llava-onevision-qwen2-7b-ov, and llava-onevision-qwen2-72b-ov.
 - 🔥2024.08.20: Support fine-tuning of multimodal large models using DeepSpeed-Zero3.
 - 2024.08.20: Supported models: longwriter-glm4-9b, longwriter-llama3_1-8b. Supported dataset: longwriter-6k.

README_CN.md

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ SWIFT has rich and comprehensive documentation; please check our documentation website:

 ## 🎉 News
 - 🔥2024.08.22: Support [ReFT](https://github.com/stanfordnlp/pyreft); this tuner can match or outperform LoRA with only 1/15–1/65 of its parameters. Use `--sft_type reft` to start training!
-- 2024.08.21: Support phi3_5-mini-instruct, phi3_5-moe-instruct, and phi3_5-vision-instruct.
+- 🔥2024.08.21: Support phi3_5-mini-instruct, phi3_5-moe-instruct, and phi3_5-vision-instruct. The best practice for fine-tuning Latex OCR with phi3_5-vision-instruct can be found [here](https://github.com/modelscope/ms-swift/issues/1809).
 - 2024.08.21: Support idefics3-8b-llama3, llava-onevision-qwen2-0_5b-ov, llava-onevision-qwen2-7b-ov, and llava-onevision-qwen2-72b-ov.
 - 🔥2024.08.20: Support fine-tuning multimodal large models with deepspeed-zero3.
 - 2024.08.20: Supported models: longwriter-glm4-9b, longwriter-llama3_1-8b. Supported dataset: longwriter-6k.

docs/source/LLM/支持的模型和数据集.md

Lines changed: 3 additions & 1 deletion
@@ -510,9 +510,11 @@
 |coco-en-2|[modelscope/coco_2014_caption](https://modelscope.cn/datasets/modelscope/coco_2014_caption/summary)|coco_2014_caption|454617|36.8±2.8, min=32, max=89|chat, multi-modal, vision|-|
 |🔥coco-en-2-mini|[modelscope/coco_2014_caption](https://modelscope.cn/datasets/modelscope/coco_2014_caption/summary)|coco_2014_caption|40504|36.8±2.6, min=32, max=75|chat, multi-modal, vision|-|
 |capcha-images|[AI-ModelScope/captcha-images](https://modelscope.cn/datasets/AI-ModelScope/captcha-images/summary)||8000|31.0±0.0, min=31, max=31|chat, multi-modal, vision|-|
+|latex-ocr-print|[AI-ModelScope/LaTeX_OCR](https://modelscope.cn/datasets/AI-ModelScope/LaTeX_OCR/summary)|full|17918|362.7±34.8, min=294, max=528|chat, ocr, multi-modal, vision|[linxy/LaTeX_OCR](https://huggingface.co/datasets/linxy/LaTeX_OCR)|
+|latex-ocr-handwrite|[AI-ModelScope/LaTeX_OCR](https://modelscope.cn/datasets/AI-ModelScope/LaTeX_OCR/summary)|synthetic_handwrite|95424|375.1±59.4, min=292, max=2115|chat, ocr, multi-modal, vision|[linxy/LaTeX_OCR](https://huggingface.co/datasets/linxy/LaTeX_OCR)|
 |aishell1-zh|[speech_asr/speech_asr_aishell1_trainsets](https://modelscope.cn/datasets/speech_asr/speech_asr_aishell1_trainsets/summary)||141600|152.2±36.8, min=63, max=419|chat, multi-modal, audio|-|
 |🔥aishell1-zh-mini|[speech_asr/speech_asr_aishell1_trainsets](https://modelscope.cn/datasets/speech_asr/speech_asr_aishell1_trainsets/summary)||14526|152.2±35.6, min=74, max=359|chat, multi-modal, audio|-|
-|🔥video-chatgpt|[swift/VideoChatGPT](https://modelscope.cn/datasets/swift/VideoChatGPT/summary)|Generic<br>Temporal<br>Consistency|3206|88.4±48.3, min=32, max=399|chat, multi-modal, video|-|
+|🔥video-chatgpt|[swift/VideoChatGPT](https://modelscope.cn/datasets/swift/VideoChatGPT/summary)|Generic<br>Temporal<br>Consistency|3206|88.4±48.3, min=32, max=399|chat, multi-modal, video|[lmms-lab/VideoChatGPT](https://huggingface.co/datasets/lmms-lab/VideoChatGPT)|
 |hh-rlhf|[AI-ModelScope/hh-rlhf](https://modelscope.cn/datasets/AI-ModelScope/hh-rlhf/summary)|harmless-base<br>helpful-base<br>helpful-online<br>helpful-rejection-sampled|127459|245.4±190.7, min=22, max=1999|rlhf, dpo, pairwise|-|
 |🔥hh-rlhf-cn|[AI-ModelScope/hh_rlhf_cn](https://modelscope.cn/datasets/AI-ModelScope/hh_rlhf_cn/summary)|hh_rlhf<br>harmless_base_cn<br>harmless_base_en<br>helpful_base_cn<br>helpful_base_en|355920|171.2±122.7, min=22, max=3078|rlhf, dpo, pairwise|-|
 |orpo-dpo-mix-40k|[AI-ModelScope/orpo-dpo-mix-40k](https://modelscope.cn/datasets/AI-ModelScope/orpo-dpo-mix-40k/summary)|default|43666|548.3±397.4, min=28, max=8483|dpo, orpo, en, quality|[mlabonne/orpo-dpo-mix-40k](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k)|

docs/source/Multi-Modal/index.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@
 4. [InternVL Series Best Practice](internvl最佳实践.md)
 5. [Deepseek-VL Best Practice](deepseek-vl最佳实践.md)
 6. [Internlm2-Xcomposers Best Practice](internlm-xcomposer2最佳实践.md)
-7. [Phi3-Vision Best Practice](phi3-vision最佳实践.md)
+7. [Phi3-Vision Best Practice](phi3-vision最佳实践.md), [Phi3.5-Vision Best Practice](https://github.com/modelscope/ms-swift/issues/1809).


 A single round of dialogue can contain only one image (it may also contain no image):

docs/source_en/LLM/Supported-models-datasets.md

Lines changed: 3 additions & 1 deletion
@@ -510,9 +510,11 @@ The table below introduces the datasets supported by SWIFT:
 |coco-en-2|[modelscope/coco_2014_caption](https://modelscope.cn/datasets/modelscope/coco_2014_caption/summary)|coco_2014_caption|454617|36.8±2.8, min=32, max=89|chat, multi-modal, vision|-|
 |🔥coco-en-2-mini|[modelscope/coco_2014_caption](https://modelscope.cn/datasets/modelscope/coco_2014_caption/summary)|coco_2014_caption|40504|36.8±2.6, min=32, max=75|chat, multi-modal, vision|-|
 |capcha-images|[AI-ModelScope/captcha-images](https://modelscope.cn/datasets/AI-ModelScope/captcha-images/summary)||8000|31.0±0.0, min=31, max=31|chat, multi-modal, vision|-|
+|latex-ocr-print|[AI-ModelScope/LaTeX_OCR](https://modelscope.cn/datasets/AI-ModelScope/LaTeX_OCR/summary)|full|17918|362.7±34.8, min=294, max=528|chat, ocr, multi-modal, vision|[linxy/LaTeX_OCR](https://huggingface.co/datasets/linxy/LaTeX_OCR)|
+|latex-ocr-handwrite|[AI-ModelScope/LaTeX_OCR](https://modelscope.cn/datasets/AI-ModelScope/LaTeX_OCR/summary)|synthetic_handwrite|95424|375.1±59.4, min=292, max=2115|chat, ocr, multi-modal, vision|[linxy/LaTeX_OCR](https://huggingface.co/datasets/linxy/LaTeX_OCR)|
 |aishell1-zh|[speech_asr/speech_asr_aishell1_trainsets](https://modelscope.cn/datasets/speech_asr/speech_asr_aishell1_trainsets/summary)||141600|152.2±36.8, min=63, max=419|chat, multi-modal, audio|-|
 |🔥aishell1-zh-mini|[speech_asr/speech_asr_aishell1_trainsets](https://modelscope.cn/datasets/speech_asr/speech_asr_aishell1_trainsets/summary)||14526|152.2±35.6, min=74, max=359|chat, multi-modal, audio|-|
-|🔥video-chatgpt|[swift/VideoChatGPT](https://modelscope.cn/datasets/swift/VideoChatGPT/summary)|Generic<br>Temporal<br>Consistency|3206|88.4±48.3, min=32, max=399|chat, multi-modal, video|-|
+|🔥video-chatgpt|[swift/VideoChatGPT](https://modelscope.cn/datasets/swift/VideoChatGPT/summary)|Generic<br>Temporal<br>Consistency|3206|88.4±48.3, min=32, max=399|chat, multi-modal, video|[lmms-lab/VideoChatGPT](https://huggingface.co/datasets/lmms-lab/VideoChatGPT)|
 |hh-rlhf|[AI-ModelScope/hh-rlhf](https://modelscope.cn/datasets/AI-ModelScope/hh-rlhf/summary)|harmless-base<br>helpful-base<br>helpful-online<br>helpful-rejection-sampled|127459|245.4±190.7, min=22, max=1999|rlhf, dpo, pairwise|-|
 |🔥hh-rlhf-cn|[AI-ModelScope/hh_rlhf_cn](https://modelscope.cn/datasets/AI-ModelScope/hh_rlhf_cn/summary)|hh_rlhf<br>harmless_base_cn<br>harmless_base_en<br>helpful_base_cn<br>helpful_base_en|355920|171.2±122.7, min=22, max=3078|rlhf, dpo, pairwise|-|
 |orpo-dpo-mix-40k|[AI-ModelScope/orpo-dpo-mix-40k](https://modelscope.cn/datasets/AI-ModelScope/orpo-dpo-mix-40k/summary)|default|43666|548.3±397.4, min=28, max=8483|dpo, orpo, en, quality|[mlabonne/orpo-dpo-mix-40k](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k)|
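The two new rows reference the same data on ModelScope (AI-ModelScope/LaTeX_OCR, subsets `full` and `synthetic_handwrite`) and on Hugging Face (linxy/LaTeX_OCR). For a quick look at what the rows describe, the Hugging Face mirror can be inspected with the `datasets` library; this is only a sketch, and the `image`/`text` column names are taken from the preprocessor added in swift/llm/utils/dataset.py below.

```python
from datasets import load_dataset

# Load the handwritten subset from the Hugging Face mirror listed in the table.
ds = load_dataset('linxy/LaTeX_OCR', 'synthetic_handwrite', split='train')
print(ds)  # expected columns, per the preprocessor in this commit: 'image', 'text'

sample = ds[0]
print(sample['text'])               # ground-truth LaTeX source of the formula
sample['image'].save('sample.png')  # the corresponding formula image (PIL)
```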

docs/source_en/Multi-Modal/index.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ A single round of dialogue can contain multiple images (or no images):
 4. [InternVL Series Best Practice](internvl-best-practice.md)
 5. [Deepseek-VL Best Practice](deepseek-vl-best-practice.md)
 6. [Internlm2-Xcomposers Best Practice](internlm-xcomposer2-best-practice.md)
-7. [Phi3-Vision Best Practice](phi3-vision-best-practice.md)
+7. [Phi3-Vision Best Practice](phi3-vision-best-practice.md), [Phi3.5-Vision Best Practice](https://github.com/modelscope/ms-swift/issues/1809).


 A single round of dialogue can only contain one image:

swift/llm/utils/client_utils.py

Lines changed: 2 additions & 1 deletion
@@ -100,7 +100,8 @@ def _from_base64(img_base64: Union[str, 'PIL.Image.Image'], tmp_dir: str = 'tmp'
     sha256_hash = hashlib.sha256(img_base64.encode('utf-8')).hexdigest()
     img_path = os.path.join(tmp_dir, f'{sha256_hash}.png')
     image = Image.open(BytesIO(base64.b64decode(img_base64)))
-    image.save(img_path)
+    if not os.path.exists(img_path):
+        image.save(img_path)
     return img_path

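The change guards the disk write so that a payload that was already decoded and saved is not written again; because the file name is the SHA-256 digest of the base64 string, an existing file with that name already holds the same image. Below is a minimal standalone sketch of the same idea, not the ms-swift helper itself; in this variant the decode is also skipped when the cached file exists.

```python
import base64
import hashlib
import os
from io import BytesIO

from PIL import Image


def base64_to_cached_file(img_base64: str, tmp_dir: str = 'tmp') -> str:
    """Decode a base64-encoded image and cache it on disk, keyed by content hash."""
    os.makedirs(tmp_dir, exist_ok=True)
    # Identical payloads hash to identical file names, so the path doubles as a cache key.
    sha256_hash = hashlib.sha256(img_base64.encode('utf-8')).hexdigest()
    img_path = os.path.join(tmp_dir, f'{sha256_hash}.png')
    if not os.path.exists(img_path):
        # Decode and save only when the cached file is missing.
        image = Image.open(BytesIO(base64.b64decode(img_base64)))
        image.save(img_path)
    return img_path
```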
swift/llm/utils/dataset.py

Lines changed: 43 additions & 5 deletions
@@ -156,6 +156,8 @@ class DatasetName:
     coco_en_2 = 'coco-en-2'
     coco_en_2_mini = 'coco-en-2-mini'
     capcha_images = 'capcha-images'
+    latex_ocr_print = 'latex-ocr-print'
+    latex_ocr_handwrite = 'latex-ocr-handwrite'
     # for qwen-audio
     aishell1_zh = 'aishell1-zh'
     aishell1_zh_mini = 'aishell1-zh-mini'
@@ -747,7 +749,10 @@ def _process(d):
         response = d[response_key]
         return {'query': query * len(response), 'response': response, 'images': images}

-    return dataset.map(_process)
+    kwargs = {}
+    if not isinstance(dataset, HfIterableDataset):
+        kwargs['load_from_cache_file'] = dataset_enable_cache
+    return dataset.map(_process, **kwargs)


 register_dataset(
@@ -861,6 +866,7 @@ def _process(d):
     _preprocess_video_chatgpt,
     get_dataset_from_repo,
     split=['test'],
+    hf_dataset_id='lmms-lab/VideoChatGPT',
     tags=['chat', 'multi-modal', 'video', '🔥'])


@@ -1784,7 +1790,7 @@ def preprocess_row(row):
         query = row['question']
         response = row['choices'][row['answer']]
         solution = row['solution']
-        return {'query': query, 'response': f'{solution}\nSo the final answer is:{response}'}
+        return {'query': query, 'response': f'{solution}\nSo the final answer is: {response}'}

     kwargs = {}
     if not isinstance(dataset, HfIterableDataset):
@@ -2028,16 +2034,48 @@ def preprocess(row):
     tags=['chat', 'general', 'multi-round'])


+def _preprocess_latex_ocr_dataset(dataset: DATASET_TYPE) -> DATASET_TYPE:
+    from datasets import Image
+    prompt = 'Using LaTeX to perform OCR on the image.'
+
+    def _process(d):
+        return {'query': prompt, 'response': d['text']}
+
+    kwargs = {}
+    if not isinstance(dataset, HfIterableDataset):
+        kwargs['load_from_cache_file'] = dataset_enable_cache
+    return dataset.map(_process, **kwargs).rename_column('image', 'images')
+
+
+register_dataset(
+    DatasetName.latex_ocr_print,
+    'AI-ModelScope/LaTeX_OCR',
+    ['full'],
+    _preprocess_latex_ocr_dataset,
+    get_dataset_from_repo,
+    split=['validation', 'test'],  # There are some problems in the training dataset.
+    hf_dataset_id='linxy/LaTeX_OCR',
+    tags=['chat', 'ocr', 'multi-modal', 'vision'])
+
+register_dataset(
+    DatasetName.latex_ocr_handwrite,
+    'AI-ModelScope/LaTeX_OCR', ['synthetic_handwrite'],
+    _preprocess_latex_ocr_dataset,
+    get_dataset_from_repo,
+    split=['train', 'validation', 'test'],
+    hf_dataset_id='linxy/LaTeX_OCR',
+    tags=['chat', 'ocr', 'multi-modal', 'vision'])
+
+
 def _preprocess_capcha_images(dataset: DATASET_TYPE) -> DATASET_TYPE:
     from datasets import Image
     query = 'recognize the content.'
-    image_key = 'image'
     response_key = 'solution'

     def _process(d):
-        return {'query': query * len(d[response_key]), 'response': d[response_key], 'images': [d[image_key]]}
+        return {'query': query * len(d[response_key]), 'response': d[response_key]}

-    return dataset.map(_process).cast_column('image', Image(decode=True))
+    return dataset.map(_process).rename_column('image', 'images')


 register_dataset(
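Once these registrations are in place, the new dataset names can be passed to SWIFT like any other dataset. Below is a minimal sketch of a LoRA fine-tuning run on the handwritten subset, modeled on `test_vlm_sft` in tests/custom/test_main.py further down; the `#2000` subsampling suffix follows the `rlaif-v#100` convention used elsewhere in this commit, and the argument values are illustrative rather than recommended settings.

```python
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import sft_main, SftArguments, infer_main, InferArguments

# Fine-tune a vision-language model on the newly registered handwritten LaTeX OCR
# dataset, subsampled to 2000 rows for a quick run.
output = sft_main(
    SftArguments(
        model_type='phi3_5-vision-instruct',
        dataset='latex-ocr-handwrite#2000',
        sft_type='lora'))
last_model_checkpoint = output['last_model_checkpoint']

# Evaluate the tuned checkpoint on the validation set recorded with it.
infer_main(InferArguments(ckpt_dir=last_model_checkpoint, load_dataset_config=True))
```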

swift/llm/utils/model.py

Lines changed: 2 additions & 5 deletions
@@ -4287,7 +4287,7 @@ def _get_new_func(func_name: str):

         @wraps(_old_func)
         def _new_func(self, *args, **kwargs):
-            res = _old_func(self, *args, **kwargs)
+            res = _old_func(getattr(self, submodel_name), *args, **kwargs)
             if func_name == 'forward':
                 device = find_device(args)
                 if device is None:
@@ -4298,12 +4298,9 @@ def _new_func(self, *args, **kwargs):
         return _new_func

     for key in func_list:
-        value = MethodType(_get_new_func(key), submodel)
-        setattr(model, key, value)
+        setattr(model, key, MethodType(_get_new_func(key), model))
         if key == 'generate' and model.device != submodel.device:
             submodel.__class__.device = model.device
-        if key == 'forward' and 'generate' in func_list:
-            setattr(submodel, key, value)


 @register_model(
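The rewritten loop binds each wrapped function to the outer model and resolves the submodel with `getattr(self, submodel_name)` at call time, which is what makes the extra `forward` patch on the submodel unnecessary. A generic, self-contained sketch of this forwarding pattern is shown below; class and attribute names are illustrative, not the ms-swift helper itself.

```python
from functools import wraps
from types import MethodType


def use_submodel_func(model, submodel_name, func_list):
    """Expose selected methods of model.<submodel_name> as methods of `model` itself."""

    def _get_new_func(func_name):
        # Unbound function taken from the submodel's class.
        _old_func = getattr(getattr(model, submodel_name).__class__, func_name)

        @wraps(_old_func)
        def _new_func(self, *args, **kwargs):
            # `self` is the outer model; the submodel is looked up at call time
            # and the original function is invoked on it.
            return _old_func(getattr(self, submodel_name), *args, **kwargs)

        return _new_func

    for key in func_list:
        # Bind the wrapper to the outer model rather than to the submodel.
        setattr(model, key, MethodType(_get_new_func(key), model))


class Submodel:
    def forward(self, x):
        return x * 2


class Wrapper:
    def __init__(self):
        self.model = Submodel()


wrapper = Wrapper()
use_submodel_func(wrapper, 'model', ['forward'])
print(wrapper.forward(3))  # 6, executed by the submodel's forward
```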

tests/custom/test_main.py

Lines changed: 15 additions & 1 deletion
@@ -20,6 +20,7 @@ def test_pt():


 def test_vlm_sft():
+    # lora full
     os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
     from swift.llm import sft_main, SftArguments, infer_main, InferArguments
     model_type = 'phi3_5-vision-instruct'
@@ -45,9 +46,22 @@ def test_llm_sft():
         InferArguments(ckpt_dir=last_model_checkpoint, load_dataset_config=True, merge_lora=True, infer_backend='pt'))


+def test_vlm_dpo():
+    # lora, full, stream
+    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
+    from swift.llm import rlhf_main, RLHFArguments, infer_main, InferArguments
+    model_type = 'internvl2-2b'
+    dataset = 'rlaif-v#100'
+
+    output = rlhf_main(RLHFArguments(model_type=model_type, dataset=dataset, max_length=8192, sft_type='full'))
+    last_model_checkpoint = output['last_model_checkpoint']
+    infer_main(InferArguments(ckpt_dir=last_model_checkpoint, load_dataset_config=True))
+
+
 if __name__ == '__main__':
     # test_eval_llm()
     # test_eval_vlm()
     # test_pt()
-    test_vlm_sft()
+    # test_vlm_sft()
     # test_llm_sft()
+    test_vlm_dpo()
