modelscope
diff --git a/docs/source/LLM/支持的模型和数据集.md b/docs/source/LLM/支持的模型和数据集.md
Lines changed: 3 additions & 3 deletions
diff --git a/docs/source/Multi-Modal/cogvlm2最佳实践.md b/docs/source/Multi-Modal/cogvlm2最佳实践.md
Lines changed: 3 additions & 3 deletions
diff --git a/docs/source/Multi-Modal/index.md b/docs/source/Multi-Modal/index.md
Lines changed: 16 additions & 8 deletions
diff --git a/docs/source/Multi-Modal/internlm-xcomposer2最佳实践.md b/docs/source/Multi-Modal/internlm-xcomposer2最佳实践.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source/Multi-Modal/phi3-vision最佳实践.md b/docs/source/Multi-Modal/phi3-vision最佳实践.md
Lines changed: 199 additions & 0 deletions
diff --git a/docs/source_en/LLM/Supported-models-datasets.md b/docs/source_en/LLM/Supported-models-datasets.md
Lines changed: 3 additions & 3 deletions
diff --git a/docs/source_en/Multi-Modal/cogvlm-best-practice.md b/docs/source_en/Multi-Modal/cogvlm-best-practice.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source_en/Multi-Modal/cogvlm2-best-practice.md b/docs/source_en/Multi-Modal/cogvlm2-best-practice.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source_en/Multi-Modal/deepseek-vl-best-practice.md b/docs/source_en/Multi-Modal/deepseek-vl-best-practice.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source_en/Multi-Modal/index.md b/docs/source_en/Multi-Modal/index.md
Lines changed: 19 additions & 11 deletions
@@ -273,9 +273,9 @@
 |phi2-3b|[AI-ModelScope/phi-2](https://modelscope.cn/models/AI-ModelScope/phi-2/summary)|Wqkv|default-generation|&#x2714;|&#x2714;||coding|[microsoft/phi-2](https://huggingface.co/microsoft/phi-2)|
 |phi3-4b-4k-instruct|[LLM-Research/Phi-3-mini-4k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-mini-4k-instruct/summary)|qkv_proj|phi3|&#x2714;|&#x2718;|transformers>=4.36|general|[microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)|
 |phi3-4b-128k-instruct|[LLM-Research/Phi-3-mini-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-mini-128k-instruct/summary)|qkv_proj|phi3|&#x2714;|&#x2714;|transformers>=4.36|general|[microsoft/Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct)|
-|phi3-vision-128k-instruct|[LLM-Research/Phi-3-vision-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-vision-128k-instruct/summary)|qkv_proj|phi3-vl|&#x2718;|&#x2718;||-|[microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)|
-|phi3_small_128k_instruct|[LLM-Research/Phi-3-small-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-small-128k-instruct/summary)|qkv_proj|phi3|&#x2714;|&#x2714;|transformers>=4.36|general|[microsoft/Phi-3-small-128k-instruct](https://huggingface.co/microsoft/Phi-3-small-128k-instruct)|
-|phi3_medium_128k_instruct|[LLM-Research/Phi-3-medium-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-medium-128k-instruct/summary)|qkv_proj|phi3|&#x2714;|&#x2714;|transformers>=4.36|general|[microsoft/Phi-3-medium-128k-instruct](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct)|
+|phi3-small-128k-instruct|[LLM-Research/Phi-3-small-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-small-128k-instruct/summary)|qkv_proj|phi3|&#x2714;|&#x2714;|transformers>=4.36|general|[microsoft/Phi-3-small-128k-instruct](https://huggingface.co/microsoft/Phi-3-small-128k-instruct)|
+|phi3-medium-128k-instruct|[LLM-Research/Phi-3-medium-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-medium-128k-instruct/summary)|qkv_proj|phi3|&#x2714;|&#x2714;|transformers>=4.36|general|[microsoft/Phi-3-medium-128k-instruct](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct)|
+|phi3-vision-128k-instruct|[LLM-Research/Phi-3-vision-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-vision-128k-instruct/summary)|qkv_proj|phi3-vl|&#x2714;|&#x2718;|transformers>=4.36|multi-modal, vision|[microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)|
 |cogvlm-17b-chat|[ZhipuAI/cogvlm-chat](https://modelscope.cn/models/ZhipuAI/cogvlm-chat/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense|cogvlm|&#x2718;|&#x2718;||multi-modal, vision|[THUDM/cogvlm-chat-hf](https://huggingface.co/THUDM/cogvlm-chat-hf)|
 |cogvlm2-19b-chat|[ZhipuAI/cogvlm2-llama3-chinese-chat-19B](https://modelscope.cn/models/ZhipuAI/cogvlm2-llama3-chinese-chat-19B/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense|cogvlm|&#x2718;|&#x2718;||-|[THUDM/cogvlm2-llama3-chinese-chat-19B](https://huggingface.co/THUDM/cogvlm2-llama3-chinese-chat-19B)|
 |cogvlm2-en-19b-chat|[ZhipuAI/cogvlm2-llama3-chat-19B](https://modelscope.cn/models/ZhipuAI/cogvlm2-llama3-chat-19B/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense|cogvlm|&#x2718;|&#x2718;||-|[THUDM/cogvlm2-llama3-chat-19B](https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B)|
 
@@ -114,14 +114,14 @@ seed_everything(42)
 
 images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']
 query = '距离各城市多远？'
-response, _ = inference(model, template, query, images=images)
+response, history = inference(model, template, query, images=images)
 print(f'query: {query}')
 print(f'response: {response}')
 
 # 流式
 query = '距离最远的城市是哪？'
 images = images
-gen = inference_stream(model, template, query, images=images)
+gen = inference_stream(model, template, query, history, images=images)
 print_idx = 0
 print(f'query: {query}\nresponse: ', end='')
 for response, _ in gen:
@@ -134,7 +134,7 @@ print()
 query: 距离各城市多远？
 response: 距离马踏Mata有14km，距离阳江Yangjiang有62km，距离广州Guangzhou有293km。
 query: 距离最远的城市是哪？
-response: 距离最远的城市是广州Guangzhou。
+response: 距离最远的城市是广州Guangzhou，有293km。
 """
 ```
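
The hunk above changes the streaming call so that the `history` returned by the first turn is passed into the second. The loop that consumes the generator prints only the newly generated suffix on each step, which works only if every yielded `response` is the cumulative text so far (as the `response[print_idx:]` slicing in these docs suggests). A minimal sketch of that pattern follows; `fake_stream` and `collect_deltas` are hypothetical stand-ins written for illustration and are not part of swift:

```python
from typing import Iterator, List, Tuple

History = List[List[str]]


def fake_stream(query: str, history: History) -> Iterator[Tuple[str, History]]:
    """Hypothetical stand-in for a streaming call.

    Yields the cumulative response so far, together with a new history
    list whose last entry is the current [query, partial_response] pair.
    """
    full = 'Guangzhou is the farthest city.'
    for end in range(1, len(full) + 1):
        partial = full[:end]
        yield partial, history + [[query, partial]]


def collect_deltas(gen: Iterator[Tuple[str, History]]):
    """Consume a cumulative stream and keep only the new suffix of each step."""
    print_idx = 0
    deltas: List[str] = []
    response, history = '', []
    for response, history in gen:
        deltas.append(response[print_idx:])  # text added since the last step
        print_idx = len(response)
    return deltas, response, history


# History from a first turn (values here are made up for the sketch).
history = [['How far is it from each city?',
            'Mata 14km, Yangjiang 62km, Guangzhou 293km.']]
deltas, final, new_history = collect_deltas(
    fake_stream('Which city is the farthest?', history))
print(''.join(deltas))
```

Joining the deltas reproduces the final response, and the returned history holds both turns, which is why the second query can refer back to the image from the first.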
 
 
@@ -2,13 +2,21 @@
 
 ### Multi-Modal最佳实践系列
 
+一轮对话可以包含多张图片（或不含图片）:
 1. [Qwen-VL最佳实践](qwen-vl最佳实践.md)
 2. [Qwen-Audio最佳实践](qwen-audio最佳实践.md)
-3. [Llava最佳实践](llava最佳实践.md)
-4. [Deepseek-VL最佳实践](deepseek-vl最佳实践.md)
-5. [Yi-VL最佳实践.md](yi-vl最佳实践.md)
-6. [Internlm2-Xcomposers最佳实践](internlm-xcomposer2最佳实践.md)
-7. [MiniCPM-V最佳实践](minicpm-v最佳实践.md), [MiniCPM-V-2最佳实践](minicpm-v-2最佳实践.md), [MiniCPM-V-2.5最佳实践](minicpm-v-2.5最佳实践.md)
-8. [CogVLM最佳实践](cogvlm最佳实践.md), [CogVLM2最佳实践](cogvlm2最佳实践.md)
-9. [mPLUG-Owl2最佳实践](mplug-owl2最佳实践.md)
-10. [InternVL-Chat-V1.5最佳实践](internvl最佳实践.md)
+3. [Deepseek-VL最佳实践](deepseek-vl最佳实践.md)
+4. [Internlm2-Xcomposers最佳实践](internlm-xcomposer2最佳实践.md)
+5. [Phi3-Vision最佳实践](phi3-vision最佳实践.md)
+
+
+一轮对话只能包含一张图片:
+1. [Llava最佳实践](llava最佳实践.md)
+2. [Yi-VL最佳实践](yi-vl最佳实践.md)
+3. [mPLUG-Owl2最佳实践](mplug-owl2最佳实践.md)
+4. [InternVL-Chat-V1.5最佳实践](internvl最佳实践.md)
+
+
+整个对话围绕一张图片:
+1. [CogVLM最佳实践](cogvlm最佳实践.md), [CogVLM2最佳实践](cogvlm2最佳实践.md)
+2. [MiniCPM-V最佳实践](minicpm-v最佳实践.md), [MiniCPM-V-2最佳实践](minicpm-v-2最佳实践.md), [MiniCPM-V-2.5最佳实践](minicpm-v-2.5最佳实践.md)
@@ -109,7 +109,7 @@ print()
 print(f'history: {history}')
 """
 query: <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>距离各城市多远？
-response:  马鞍山距离阳江62公里，广州距离广州293公里。
+response: 马鞍山距离阳江62公里，广州距离广州293公里。
 query: 距离最远的城市是哪？
 response: 距离最远的城市是广州，距离广州293公里。
 history: [['<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>距离各城市多远？', ' 马鞍山距离阳江62公里，广州距离广州293公里。'], ['距离最远的城市是哪？', ' 距离最远的城市是广州，距离广州293公里。']]
 
@@ -0,0 +1,199 @@
+
+# Phi3-Vision 最佳实践
+
+## 目录
+- [环境准备](#环境准备)
+- [推理](#推理)
+- [微调](#微调)
+- [微调后推理](#微调后推理)
+
+
+## 环境准备
+```shell
+git clone https://github.com/modelscope/swift.git
+cd swift
+pip install -e '.[llm]'
+```
+模型链接:
+- phi3-vision-128k-instruct: [https://modelscope.cn/models/LLM-Research/Phi-3-vision-128k-instruct/summary](https://modelscope.cn/models/LLM-Research/Phi-3-vision-128k-instruct/summary)
+
+
+## 推理
+
+推理 phi3-vision-128k-instruct:
+```shell
+# Experimental environment: A10, 3090, V100, ...
+# 16GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift infer --model_type phi3-vision-128k-instruct
+```
+
+输出: (支持传入本地路径或URL)
+```python
+"""
+<<< Who are you?
+I am Phi, an AI developed by Microsoft to assist with providing information, answering questions, and helping users find solutions to their queries. How can I assist you today?
+--------------------------------------------------
+<<< clear
+<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img><img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png</img>What is the difference between these two pictures?
+The first picture shows a group of four cartoon sheep standing in a field, while the second picture is a close-up of a kitten with a blurred background. The main difference between these two pictures is the subject matter and the setting. The first picture features animals that are typically associated with farm life and agriculture, while the second picture focuses on a domestic animal, a kitten, which is more commonly found in households. Additionally, the first picture has a more peaceful and serene atmosphere, while the second picture has a more intimate and detailed view of the kitten.
+--------------------------------------------------
+<<< clear
+<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img>How many sheep are there in the picture?
+There are four sheep in the picture.
+--------------------------------------------------
+<<< clear
+<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png</img>What is the result of the calculation?
+The result of the calculation 1452 + 45304 is 46756.
+--------------------------------------------------
+<<< clear
+<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png</img>Write a poem based on the content of the picture.
+In the tranquil night, a boat sails,
+Through the darkened river, it sets sail.
+A single candle flickers, casting light,
+Guiding the way through the endless night.
+
+The stars above, like diamonds bright,
+Gleam down upon the boat's gentle flight.
+The moon, a silent guardian in the sky,
+Watches over the boat as it sails by.
+
+The river, a mirror to the night,
+Reflects the boat's journey, a beautiful sight.
+The trees on either side, standing tall,
+Whisper secrets to the boat, one and all.
+
+In the stillness of the night, a sense of peace,
+The boat, the river, the trees, all in their place.
+A moment frozen in time, a scene so serene,
+A journey through the night, a dream so unseen.
+"""
+```
+
+示例图片如下:
+
+cat:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png" width="250" style="display: inline-block;">
+
+animal:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png" width="250" style="display: inline-block;">
+
+math:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png" width="250" style="display: inline-block;">
+
+poem:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
+
+
+**单样本推理**
+
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+from swift.llm import (
+    get_model_tokenizer, get_template, inference, ModelType,
+    get_default_template_type, inference_stream
+)
+from swift.utils import seed_everything
+import torch
+
+model_type = ModelType.phi3_vision_128k_instruct
+template_type = get_default_template_type(model_type)
+print(f'template_type: {template_type}')
+
+model, tokenizer = get_model_tokenizer(model_type, torch.float16,
+                                       model_kwargs={'device_map': 'auto'})
+model.generation_config.max_new_tokens = 256
+template = get_template(template_type, tokenizer)
+seed_everything(42)
+
+query = """<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>How far is it from each city?"""
+response, history = inference(model, template, query)
+print(f'query: {query}')
+print(f'response: {response}')
+
+# 流式
+query = 'Which city is the farthest?'
+gen = inference_stream(model, template, query, history)
+print_idx = 0
+print(f'query: {query}\nresponse: ', end='')
+for response, history in gen:
+    delta = response[print_idx:]
+    print(delta, end='', flush=True)
+    print_idx = len(response)
+print()
+print(f'history: {history}')
+"""
+query: Which city is the farthest?
+response: Guangzhou is the farthest city, located 293km away.
+history: [['<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>How far is it from each city?', 'The distances are as follows: Mata is 14km away, Yangjiang is 62km away, and Guangzhou is 293km away.'], ['Which city is the farthest?', 'Guangzhou is the farthest city, located 293km away.']]
+"""
+```
+
+示例图片如下:
+
+road:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">
+
+
+## 微调
+多模态大模型微调通常使用**自定义数据集**进行微调. 这里展示可直接运行的demo:
+
+(默认只对LLM部分的qkv进行lora微调. 如果你想对所有linear含vision模型部分都进行微调, 可以指定`--lora_target_modules ALL`. 支持全参数微调.)
+```shell
+# Experimental environment: A10, 3090, V100, ...
+# 16GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type phi3-vision-128k-instruct \
+    --dataset coco-en-mini
+```
+
+[自定义数据集](../LLM/自定义与拓展.md#-推荐命令行参数的形式)支持json, jsonl样式, 以下是自定义数据集的例子:
+
+(支持多轮对话, 支持每轮对话含多张图片或不含图片, 支持传入本地路径或URL)
+
+```json
+[
+    {"conversations": [
+        {"from": "user", "value": "<img>img_path</img>11111"},
+        {"from": "assistant", "value": "22222"}
+    ]},
+    {"conversations": [
+        {"from": "user", "value": "<img>img_path</img><img>img_path2</img><img>img_path3</img>aaaaa"},
+        {"from": "assistant", "value": "bbbbb"},
+        {"from": "user", "value": "<img>img_path</img>ccccc"},
+        {"from": "assistant", "value": "ddddd"}
+    ]},
+    {"conversations": [
+        {"from": "user", "value": "AAAAA"},
+        {"from": "assistant", "value": "BBBBB"},
+        {"from": "user", "value": "CCCCC"},
+        {"from": "assistant", "value": "DDDDD"}
+    ]}
+]
+```
+
+
+## 微调后推理
+直接推理:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --ckpt_dir output/phi3-vision-128k-instruct/vx-xxx/checkpoint-xxx \
+    --load_dataset_config true
+```
+
+**merge-lora**并推理:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift export \
+    --ckpt_dir output/phi3-vision-128k-instruct/vx-xxx/checkpoint-xxx \
+    --merge_lora true --safe_serialization false
+
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --ckpt_dir output/phi3-vision-128k-instruct/vx-xxx/checkpoint-xxx-merged \
+    --load_dataset_config true
+```
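
The custom dataset format in the hunk above embeds image references in the user turn as `<img>...</img>` tags, with zero, one, or several images per round. As a reading aid, here is a minimal sketch that splits each round into its image list and remaining query text. The regex-based parsing and the `summarize` helper are assumptions made for illustration; they are not swift's own loader and may differ from how swift handles these tags:

```python
import json
import re
from typing import Dict, List

# Non-greedy match so adjacent tags are captured one by one.
IMG_TAG = re.compile(r'<img>(.*?)</img>')


def summarize(sample: Dict) -> List[Dict]:
    """Split one sample into rounds of (images, query, response)."""
    turns = sample['conversations']
    if len(turns) % 2 != 0:
        raise ValueError('expected user/assistant pairs')
    rounds = []
    for user, assistant in zip(turns[0::2], turns[1::2]):
        if user['from'] != 'user' or assistant['from'] != 'assistant':
            raise ValueError('turns must alternate user/assistant')
        rounds.append({
            'images': IMG_TAG.findall(user['value']),   # paths or URLs
            'query': IMG_TAG.sub('', user['value']),    # text without tags
            'response': assistant['value'],
        })
    return rounds


# Two samples copied from the dataset example in the doc.
dataset = json.loads('''
[
    {"conversations": [
        {"from": "user", "value": "<img>img_path</img><img>img_path2</img><img>img_path3</img>aaaaa"},
        {"from": "assistant", "value": "bbbbb"},
        {"from": "user", "value": "<img>img_path</img>ccccc"},
        {"from": "assistant", "value": "ddddd"}
    ]},
    {"conversations": [
        {"from": "user", "value": "AAAAA"},
        {"from": "assistant", "value": "BBBBB"}
    ]}
]
''')

for sample in dataset:
    for round_ in summarize(sample):
        print(len(round_['images']), round_['query'], '->', round_['response'])
```

A check like this can be run over a dataset file before training to catch unbalanced turns or malformed tags early.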
@@ -273,9 +273,9 @@ The table below introduces all models supported by SWIFT:
 |phi2-3b|[AI-ModelScope/phi-2](https://modelscope.cn/models/AI-ModelScope/phi-2/summary)|Wqkv|default-generation|&#x2714;|&#x2714;||coding|[microsoft/phi-2](https://huggingface.co/microsoft/phi-2)|
 |phi3-4b-4k-instruct|[LLM-Research/Phi-3-mini-4k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-mini-4k-instruct/summary)|qkv_proj|phi3|&#x2714;|&#x2718;|transformers>=4.36|general|[microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)|
 |phi3-4b-128k-instruct|[LLM-Research/Phi-3-mini-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-mini-128k-instruct/summary)|qkv_proj|phi3|&#x2714;|&#x2714;|transformers>=4.36|general|[microsoft/Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct)|
-|phi3-vision-128k-instruct|[LLM-Research/Phi-3-vision-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-vision-128k-instruct/summary)|qkv_proj|phi3-vl|&#x2718;|&#x2718;||-|[microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)|
-|phi3_small_128k_instruct|[LLM-Research/Phi-3-small-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-small-128k-instruct/summary)|qkv_proj|phi3|&#x2714;|&#x2714;|transformers>=4.36|general|[microsoft/Phi-3-small-128k-instruct](https://huggingface.co/microsoft/Phi-3-small-128k-instruct)|
-|phi3_medium_128k_instruct|[LLM-Research/Phi-3-medium-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-medium-128k-instruct/summary)|qkv_proj|phi3|&#x2714;|&#x2714;|transformers>=4.36|general|[microsoft/Phi-3-medium-128k-instruct](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct)|
+|phi3-small-128k-instruct|[LLM-Research/Phi-3-small-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-small-128k-instruct/summary)|qkv_proj|phi3|&#x2714;|&#x2714;|transformers>=4.36|general|[microsoft/Phi-3-small-128k-instruct](https://huggingface.co/microsoft/Phi-3-small-128k-instruct)|
+|phi3-medium-128k-instruct|[LLM-Research/Phi-3-medium-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-medium-128k-instruct/summary)|qkv_proj|phi3|&#x2714;|&#x2714;|transformers>=4.36|general|[microsoft/Phi-3-medium-128k-instruct](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct)|
+|phi3-vision-128k-instruct|[LLM-Research/Phi-3-vision-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-vision-128k-instruct/summary)|qkv_proj|phi3-vl|&#x2714;|&#x2718;|transformers>=4.36|multi-modal, vision|[microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)|
 |cogvlm-17b-chat|[ZhipuAI/cogvlm-chat](https://modelscope.cn/models/ZhipuAI/cogvlm-chat/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense|cogvlm|&#x2718;|&#x2718;||multi-modal, vision|[THUDM/cogvlm-chat-hf](https://huggingface.co/THUDM/cogvlm-chat-hf)|
 |cogvlm2-19b-chat|[ZhipuAI/cogvlm2-llama3-chinese-chat-19B](https://modelscope.cn/models/ZhipuAI/cogvlm2-llama3-chinese-chat-19B/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense|cogvlm|&#x2718;|&#x2718;||-|[THUDM/cogvlm2-llama3-chinese-chat-19B](https://huggingface.co/THUDM/cogvlm2-llama3-chinese-chat-19B)|
 |cogvlm2-en-19b-chat|[ZhipuAI/cogvlm2-llama3-chat-19B](https://modelscope.cn/models/ZhipuAI/cogvlm2-llama3-chat-19B/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense|cogvlm|&#x2718;|&#x2718;||-|[THUDM/cogvlm2-llama3-chat-19B](https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B)|
 
@@ -1,4 +1,4 @@
-# CogVLM Best Practices
+# CogVLM Best Practice
 
 ## Table of Contents
 - [Environment Setup](#environment-setup)
 
@@ -1,4 +1,4 @@
-# CogVLM2 Best Practices
+# CogVLM2 Best Practice
 
 ## Table of Contents
 - [Environment Setup](#environment-setup)
 
@@ -1,4 +1,4 @@
-# Deepseek-VL Best Practices
+# Deepseek-VL Best Practice
 
 ## Table of Contents
 - [Environment Preparation](#environment-preparation)
 
@@ -1,13 +1,21 @@
 ## Multi-Modal Documentation
 
-### Multi-Modal Best Practices
-
-1. [Qwen-VL Best Practices](qwen-vl-best-practice.md)
-2. [Qwen-Audio Best Practices](qwen-audio-best-practice.md)
-3. [Llava Best Practices](llava-best-practice.md)
-4. [Deepseek-VL Best Practices](deepseek-vl-best-practice.md)
-5. [Yi-VL Best Practices.md](yi-vl-best-practice.md)
-6. [Internlm2-Xcomposers Best Practices](internlm-xcomposer2-best-practice.md)
-7. [MiniCPM-V Best Practices](minicpm-v-best-practice.md)
-8. [CogVLM Best Practices](cogvlm-best-practice.md), [CogVLM2 Best Practices](cogvlm2-best-practice.md)
-9. [InternVL-Chat-V1.5 Best Practices](internvl-best-practice.md)
+### Multi-Modal Best Practice
+
+A single round of dialogue can contain multiple images (or no images):
+1. [Qwen-VL Best Practice](qwen-vl-best-practice.md)
+2. [Qwen-Audio Best Practice](qwen-audio-best-practice.md)
+3. [Deepseek-VL Best Practice](deepseek-vl-best-practice.md)
+4. [Internlm2-Xcomposers Best Practice](internlm-xcomposer2-best-practice.md)
+5. [Phi3-Vision Best Practice](phi3-vision-best-practice.md)
+
+
+A single round of dialogue can only contain one image:
+1. [Llava Best Practice](llava-best-practice.md)
+2. [Yi-VL Best Practice](yi-vl-best-practice.md)
+3. [InternVL-Chat-V1.5 Best Practice](internvl-best-practice.md)
+
+
+The entire conversation revolves around one image:
+1. [CogVLM Best Practice](cogvlm-best-practice.md), [CogVLM2 Best Practice](cogvlm2-best-practice.md)
+2. [MiniCPM-V Best Practice](minicpm-v-best-practice.md)