docs/source_en/Multi-Modal/internvl-best-practice.md (4 additions & 3 deletions)
@@ -17,17 +17,20 @@ The following practice takes `internvl-chat-v1_5` as an example, and you can als
**FAQ**
1. **Model shows `The request model does not exist!`**
This issue often arises when attempting to use the mini-internvl or InternVL2 models, as the corresponding models on modelscope are subject to an application process. To resolve this, you need to log in to modelscope and go to the respective model page to apply for download. After approval, you can obtain the model through either of the following methods:
- Use `snapshot_download` to download the model locally (the relevant code is available in the model download section of the model file), and then specify the local model path using `--model_id_or_path`.
- Obtain the SDK token for your account from the [modelscope account homepage](https://www.modelscope.cn/my/myaccesstoken), and specify it using the `--hub_token` parameter or the `MODELSCOPE_API_TOKEN` environment variable. A sketch of both methods follows this list.
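The following is only a rough sketch of the two options, assuming the `modelscope` Python SDK and the swift flags named above; the model ID below is a placeholder, so substitute the gated model you actually applied for.

```shell
# Option 1 (sketch): download the gated model locally with the ModelScope SDK,
# then point swift at the resulting local path. The model ID is a placeholder.
python -c "from modelscope import snapshot_download; print(snapshot_download('OpenGVLab/Mini-InternVL-Chat-2B-V1-5'))"
# ...then pass the printed directory via `--model_id_or_path /path/to/downloaded/model`

# Option 2 (sketch): authenticate with your ModelScope SDK token instead,
# either through the environment variable or the `--hub_token` parameter.
export MODELSCOPE_API_TOKEN=<your-sdk-token>
# or append `--hub_token <your-sdk-token>` to the swift command line
```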
2. **Why is memory distributed unevenly across multiple GPU cards when running models, leading to OOM?**
The auto device map algorithm in transformers is not friendly to multi-modal models, which may result in uneven memory allocation across different GPU cards.
- You can set the maximum memory usage for each card using the `--device_max_memory` parameter; for example, in a four-card environment, you can set `--device_max_memory 15GB 15GB 15GB 15GB`.
- Alternatively, you can explicitly specify the device map using `--device_map_config_path` (see the sketch below).
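For example, assuming the `swift infer` entry point used elsewhere in this guide (the model type here is illustrative), a four-card run capping each card at 15GB might look like this sketch:

```shell
# Sketch: cap per-card memory on a 4-GPU machine so the auto device map
# does not overload a single card; adjust the limits to your hardware.
CUDA_VISIBLE_DEVICES=0,1,2,3 swift infer \
    --model_type internvl-chat-v1_5 \
    --device_max_memory 15GB 15GB 15GB 15GB
# Alternatively, point --device_map_config_path at a file containing an explicit device map.
```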
3. **Differences between the InternVL2 model and its predecessors (InternVL-V1.5 and Mini-InternVL)**
- The InternVL2 model supports multi-turn, multi-image inference and training, meaning multi-turn conversations with images, with text and images interleaved within a single turn. For details, refer to [Custom Dataset](#custom-dataset) and the InternVL2 part of the Inference section. The predecessor models supported multi-turn conversations but could only include images in a single turn.
- The InternVL2 model supports video input. For specific formats, refer to [Custom Dataset](#custom-dataset).
@@ -53,7 +56,6 @@ pip install Pillow
- If your GPU does not support flash attention, use the argument `--use_flash_attn false`. For int8 models, it is necessary to specify `--dtype bf16` during inference, otherwise the output may be garbled.
- The model's configuration specifies a relatively small max_length of 2048, which can be modified by setting `--max_length`.
- Memory consumption can be reduced by using the parameter `--gradient_checkpointing true` (a sketch combining these flags follows this list).
- The InternVL series of models only supports training on datasets that include images.
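Putting the flags above together, a fine-tuning invocation might look roughly like the sketch below; the dataset name is a placeholder, and the exact recipe is given in the training commands of this guide.

```shell
# Sketch: a fine-tuning command combining the flags discussed above.
# --max_length raises the small default of 2048; gradient checkpointing trades
# compute for memory; disable flash attention only if your GPU lacks support.
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type internvl-chat-v1_5 \
    --dataset coco-en-2-mini \
    --max_length 4096 \
    --gradient_checkpointing true \
    --use_flash_attn false
```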
```shell
# Experimental environment: A100
@@ -310,13 +312,12 @@ Supports multi-turn conversations, Images support for local path or URL input, m
In addition to the above data formats, the **InternVL2** model also supports multi-image multi-turn training. It uses the tag `<image>` to indicate the position of images in the conversation. If the tag `<image>` is not present in the dataset, the images are placed at the beginning of the last round's query by default.
```jsonl
{"query": "Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.", "response": "xxxxxxxxx", "history": [["<image> Describe the image", "xxxxxxx"], ["CCCCC", "DDDDD"]], "images": ["image_path1", "image_path2", "image_path3"]}
```
Alternatively, use `<img>image_path</img>` to specify the image path and its position in the conversation inline.
```jsonl
{"query": "Image-1: <img>img_path</img>\n Image-2: <img>img_path2</img>\n Describe the two images in detail.", "response": "xxxxxxxxx", "history": [["<img>img_path3</img> Describe the image", "xxxxxxx"], ["CCCCC", "DDDDD"]], }
0 commit comments