
Commit 0bfa5e7

Update README.md
1 parent 295246d commit 0bfa5e7

File tree

2 files changed: +80, -21 lines

internvl_chat/CONTINUED_FINETUNE.md

Lines changed: 66 additions & 0 deletions

# Continued Fine-tuning

#### 1. Prepare the pre-trained model

Before starting the second fine-tuning, you need to download one of the two pre-trained models we provide: [InternVL-Chat-V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) or [InternVL-Chat-V1.2-Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus).

You can download either of them with the following commands; we recommend the plus version.

```shell
cd pretrained/
# pip install -U huggingface_hub
# download OpenGVLab/InternVL-Chat-Chinese-V1-2
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL-Chat-Chinese-V1-2 --local-dir InternVL-Chat-Chinese-V1-2
# download OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus --local-dir InternVL-Chat-Chinese-V1-2-Plus
```
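Optionally, you can check that the checkpoint directory was fully downloaded. This is just a quick sanity check; the exact file names inside the directory depend on the checkpoint you chose.

```shell
# still inside pretrained/: confirm the weight files are present and check the total size
ls -lh InternVL-Chat-Chinese-V1-2-Plus/
du -sh InternVL-Chat-Chinese-V1-2-Plus/
```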
#### 2. Prepare your custom training data

After downloading the pre-trained model, prepare your customized SFT data by writing a JSON meta file in `internvl_chat/shell/data/`, similar to [this file](./shell/data/data_yi34b_finetune.json).

The format of this JSON file is:
```json
{
  "your-custom-dataset-1": {
    "root": "path/to/the/image/",
    "annotation": "path/to/the/jsonl/annotation",
    "data_augment": false,
    "repeat_time": 1,
    "length": number of samples in your dataset
  },
  ...
}
```
For example:

```json
{
  "sharegpt4v_instruct_gpt4-vision_cap100k": {
    "root": "playground/data/",
    "annotation": "playground/sharegpt4v_instruct_gpt4-vision_cap100k.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 102025
  }
}
```
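The `annotation` field points to a JSONL file whose schema is not spelled out here. As a rough sketch only, each line is commonly a single JSON object in a LLaVA-style conversation format; the field names below (`id`, `image`, `conversations`) and the assumption that `image` is relative to `root` follow that convention rather than anything stated in this guide, so check the referenced example dataset for the exact schema expected by the training code.

```json
{"id": 0, "image": "images/0001.jpg", "conversations": [{"from": "human", "value": "<image>\nDescribe this image in detail."}, {"from": "gpt", "value": "A detailed description of the image ..."}]}
```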
#### 3. Start fine-tuning

You can fine-tune our pre-trained models using this [script (train the full LLM)](./shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune_continue.sh) or this [script (train a LoRA adapter)](./shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune_continue_lora.sh), depending on your available GPU devices.

Before fine-tuning, set `--meta_path` to the path of the JSON file you created in the previous step. The default pre-trained model in these shell scripts is `./pretrained/InternVL-Chat-Chinese-V1-2`; change it to `./pretrained/InternVL-Chat-Chinese-V1-2-Plus` if you want to fine-tune the plus version.
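To locate the lines you may need to edit, you can search the script for these settings. This is only a convenience; the exact variable names inside the script may differ, so treat the command as a way to find them rather than a description of the script's contents. It assumes you run it from the `internvl_chat/` directory, as with the launch commands below.

```shell
# list the lines that reference the meta file and the default pre-trained model path
grep -n -E "meta_path|InternVL-Chat-Chinese-V1-2" \
  shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune_continue.sh
```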
> Note: fine-tuning the full LLM requires 16 A100 80G GPUs, while fine-tuning the LoRA adapter requires 2 A100 80G GPUs.

```sh
# fine-tune the full LLM using 16 GPUs on a SLURM cluster
PARTITION='your partition' GPUS=16 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune_continue.sh
# fine-tune the LoRA adapter using 2 GPUs
CUDA_VISIBLE_DEVICES=0,1 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune_continue_lora.sh
```
If you run into any problems, please let us know, and we will improve this training guide to make it easier to use.

internvl_chat/README.md

Lines changed: 14 additions & 21 deletions
````diff
@@ -140,18 +140,9 @@ The hyperparameters used for fine-tuning are listed in the following table. And,
 | ------------------ | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
 | InternVL-Chat-V1.2 | 40B (full model) | 512 | 1e-5 | 1 | 2048 | 0.05 |
 
-## Continue Fine-tune
+### Continued Fine-tune
 
-You can continue to fine-tune the checkpoint from the previous training process use this [script](./shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune_continue.sh).
-
-Before fine-tuning, you should set the `--meta_path` in to your custom meta file of training data.
-
-```sh
-# using 16 GPUs, fine-tune the full LLM
-PARTITION='your partition' GPUS=16 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune_continue.sh
-# using 2 GPUs, fine-tune the LoRA
-CUDA_VISIBLE_DEVICES=0,1 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune_continue_lora.sh
-```
+See [CONTINUED_FINETUNE.md](CONTINUED_FINETUNE.md).
 
 ## 📊 Evaluation
 
@@ -161,19 +152,19 @@ CUDA_VISIBLE_DEVICES=0,1 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b
 
 | name | model size | MathVista<br>(testmini) | MMB<br>(dev/test) | MMB−CN<br>(dev/test) | MMMU<br>(val/test) | CMMMU<br>(val/test) | MMVP | MME | POPE | Tiny LVLM | SEEDv1<br>(image) | LLaVA Wild | MM−Vet |
 | ------------------------------------------------------------------------------------------- | ---------- | ----------------------- | ----------------- | -------------------- | ---------------------------------------------------------------------------------- | ------------------- | ---- | -------------- | ---- | --------- | ----------------- | ---------- | ------ |
-| [InternVL−Chat−V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | 19B | 34.5 | 76.7&nbsp;/&nbsp;75.4 | 71.9&nbsp;/&nbsp;70.3 | 39.1&nbsp;/&nbsp;35.3 | 34.8&nbsp;/&nbsp;34.0 | 44.7 | 1675.1&nbsp;/&nbsp;348.6 | 87.1 | 343.2 | 73.2 | 73.2 | 46.7 |
-| [InternVL−Chat−V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | 40B | 47.7 | 81.4&nbsp;/&nbsp;82.2 | 79.5&nbsp;/&nbsp;81.2 | 51.6&nbsp;/&nbsp;[46.2](https://eval.ai/web/challenges/challenge-page/2179/leaderboard/5377) | TODO | 56.7 | 1672.1&nbsp;/&nbsp;509.3 | 88.0 | 350.3 | 75.6 | 85.0 | 48.9 |
-| [InternVL−Chat−V1.2−Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus) | 40B | 59.9 | 83.4&nbsp;/&nbsp;83.8 | 81.6&nbsp;/&nbsp;82.0 | 50.3&nbsp;/&nbsp;45.6 | TODO | 58.7 | 1623.6&nbsp;/&nbsp;550.7 | 88.7 | 353.9 | 76.4 | 84.6 | 47.9 |
+| [InternVL−Chat−V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | 19B | 34.5 | 76.7 / 75.4 | 71.9 / 70.3 | 39.1 / 35.3 | 34.8 / 34.0 | 44.7 | 1675.1 / 348.6 | 87.1 | 343.2 | 73.2 | 73.2 | 46.7 |
+| [InternVL−Chat−V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | 40B | 47.7 | 81.4 / 82.2 | 79.5 / 81.2 | 51.6 / [46.2](https://eval.ai/web/challenges/challenge-page/2179/leaderboard/5377) | TODO | 56.7 | 1672.1 / 509.3 | 88.0 | 350.3 | 75.6 | 85.0 | 48.9 |
+| [InternVL−Chat−V1.2−Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus) | 40B | 59.9 | 83.4 / 83.8 | 81.6 / 82.0 | 50.3 / 45.6 | TODO | 58.7 | 1623.6 / 550.7 | 88.7 | 353.9 | 76.4 | 84.6 | 47.9 |
 
 **Image Captioning & Visual Question Answering**
 
 \* Training set observed.
 
 | name | model size | COCO<br>(test) | Flickr30K<br>(test) | NoCaps<br>(val) | VQAv2<br>(testdev) | OKVQA<br>(val) | TextVQA<br>(val) | VizWiz<br>(val/test) | AI2D<br>(test) | GQA<br>(test) | ScienceQA<br>(image) |
 | ------------------------------------------------------------------------------------------- | ---------- | -------------- | ------------------- | --------------- | ------------------ | -------------- | ---------------- | -------------------- | -------------- | ------------- | -------------------- |
-| [InternVL−Chat−V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | 19B | 142.2\* | 85.3 | 120.8 | 80.9\* | 64.1\* | 65.9 | 59.0&nbsp;/&nbsp;57.3 | 72.2\* | 62.5\* | 90.1\* |
-| [InternVL−Chat−V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | 40B | 113.9 | 92.4 | 112.5 | - | 62.5\* | 69.7 | 61.9&nbsp;/&nbsp;60.0 | 77.1\* | 64.0\* | 83.3 |
-| [InternVL−Chat−V1.2−Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus) | 40B | 143.4\* | 90.5 | 125.8 | - | 67.6\* | 71.3\* | 61.3&nbsp;/&nbsp;59.5 | 78.2\* | 66.9\* | 98.1\* |
+| [InternVL−Chat−V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | 19B | 142.2\* | 85.3 | 120.8 | 80.9\* | 64.1\* | 65.9 | 59.0 / 57.3 | 72.2\* | 62.5\* | 90.1\* |
+| [InternVL−Chat−V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | 40B | 113.9 | 92.4 | 112.5 | - | 62.5\* | 69.7 | 61.9 / 60.0 | 77.1\* | 64.0\* | 83.3 |
+| [InternVL−Chat−V1.2−Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus) | 40B | 143.4\* | 90.5 | 125.8 | - | 67.6\* | 71.3\* | 61.3 / 59.5 | 78.2\* | 66.9\* | 98.1\* |
 
 - We found that incorrect images were used for training and testing in `AI2D`, meaning that for problems where `abcLabel` is True, `abc_images` were not utilized. We have now corrected the images used for testing, but the results may still be somewhat lower as a consequence.
 
@@ -189,8 +180,8 @@ CUDA_VISIBLE_DEVICES=0,1 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b
 
 | model | QLLaMA | LLM | res | COCO | Flickr | NoCaps | VQAv2 | GQA | VizWiz | TextVQA | MME | POPE | Download |
 | ------------- | ------ | ------------ | --- | ----- | ------ | ------ | ----- | ---- | ------ | ------- | ------ | ---- | -------- |
-| InternVL−Chat | ✔️ | frozen&nbsp;V−7B | 224 | 141.4 | 89.7 | 120.5 | 72.3 | 57.7 | 44.5 | 42.1 | 1298.5 | 85.2 | TODO |
-| InternVL−Chat | ✔️ | frozen&nbsp;V−13B | 224 | 142.4 | 89.9 | 123.1 | 71.7 | 59.5 | 54.0 | 49.1 | 1317.2 | 85.4 | TODO |
+| InternVL−Chat | ✔️ | frozen V−7B | 224 | 141.4 | 89.7 | 120.5 | 72.3 | 57.7 | 44.5 | 42.1 | 1298.5 | 85.2 | TODO |
+| InternVL−Chat | ✔️ | frozen V−13B | 224 | 142.4 | 89.9 | 123.1 | 71.7 | 59.5 | 54.0 | 49.1 | 1317.2 | 85.4 | TODO |
 | InternVL−Chat | ✔️ | V−13B | 336 | 146.2 | 92.2 | 126.2 | 81.2 | 66.6 | 58.5 | 61.5 | 1586.4 | 87.6 | TODO |
 
 ## ❓ How to Evaluate
@@ -411,8 +402,10 @@ GPUS=8 sh evaluate.sh <checkpoint> caption-coco
 mkdir -p data/flickr30k && cd data/flickr30k
 
 # download images from https://bryanplummer.com/Flickr30kEntities/
-# karpathy split annotations can be downloaded from https://cs.stanford.edu/people/karpathy/deepimagesent/
-# download converted files
+# karpathy split annotations can be downloaded from the following link:
+# https://github.com/mehdidc/retrieval_annotations/releases/download/1.0.0/flickr30k_test_karpathy.txt
+# this file is provided by the clip-benchmark repository.
+# We convert this txt file to json format, download the converted file:
 wget https://github.com/OpenGVLab/InternVL/releases/download/data/flickr30k_test_karpathy.json
 
 cd ../..
````
