
Commit 26601cc

Update README.md & BLOG.md
1 parent 88a146c commit 26601cc

3 files changed: +392 / -6 lines changed

BLOG.md: 376 additions & 5 deletions

# InternVL's Blog

## InternVL-Chat-V1.2

> Date: 2024/02/12<br>
> Developed by: Zhe Chen, Weiyun Wang, Wenhai Wang, Erfei Cui, Zhangwei Gao, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai

We are excited to introduce InternVL-Chat-V1.2. Inspired by [LLaVA-NeXT-34B](https://llava-vl.github.io/blog/2024-01-30-llava-next/), we have also adopted [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as the language model. Below is the pipeline.

<img width="650" alt="image" src="https://github.com/czczup/InternVL-MoE/assets/23737120/9b68aa35-40fd-4e81-9595-d404cbbfc6bd">

From the experimental results, **we've observed that a stronger language model (34B) can better leverage the powerful capabilities of our vision foundation model ([InternViT-6B](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)).**

For better training reproducibility, we follow a minimalist design and data-efficient approach similar to LLaVA-NeXT. To reduce training costs, we provide a pre-trained MLP projector and employ only around 1 million visual instruction tuning samples for SFT. Our model has a total of 40 billion parameters and can be trained within 1.5 days on 32 A100 GPUs. The code, data, and model will be made publicly available.

The hyperparameters used for finetuning are listed in the following table.

| Hyperparameter | Trainable Param | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
| ------------------ | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
| InternVL-Chat-V1.2 | 40B (full model) | 512 | 1e-5 | 1 | 2048 | 0.05 |
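
To make this concrete, the following is a minimal, schematic PyTorch sketch of the LLaVA-style pipeline used here: a ViT vision encoder, an MLP projector, and a causal LLM. The class name, the two-layer projector, and the hidden sizes are illustrative assumptions rather than the released implementation.

```python
# Schematic sketch only: visual tokens from a ViT are projected by an MLP into the
# LLM's embedding space and prepended to the text embeddings. Hidden sizes are assumed.
import torch
import torch.nn as nn


class VisionLanguageChatModel(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 3200, llm_dim: int = 7168):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g., an InternViT-6B-like ViT
        self.projector = nn.Sequential(           # MLP projector (2 layers assumed)
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                            # e.g., a Yi-34B-like causal LM

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        vision_tokens = self.vision_encoder(pixel_values)   # (B, N_patches, vision_dim)
        visual_embeds = self.projector(vision_tokens)        # (B, N_patches, llm_dim)
        # Assumes an HF-style causal LM that accepts `inputs_embeds`.
        return self.llm(inputs_embeds=torch.cat([visual_embeds, text_embeds], dim=1))
```

During SFT of V1.2, the "40B (full model)" row above indicates that the vision encoder, projector, and language model are all trainable.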

## InternVL-Chat-V1.1

> Date: 2024/01/24<br>
> Developed by: Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai

We released [InternVL-Chat-V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1), featuring a structure similar to LLaVA, including a ViT, an MLP projector, and an LLM. In this version, we explored increasing the resolution to 448x448, enhancing OCR capabilities, and improving support for Chinese conversations. Below is an example of the improved capabilities.

<img width="650" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/8529570/0e60912e-c52b-46fa-bd61-5f94a221d1fc">

## InternVL

> Date: 2023/12/12<br>
> Developed by: Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai

### What is InternVL?

We released [InternVL](https://huggingface.co/collections/OpenGVLab/internvl-65b92d6be81c86166ca0dde4), scaling up the ViT to 6B parameters and aligning it with the LLM. It is the largest open-source vision/vision-language foundation model (14B) to date, achieving state-of-the-art results on 32 benchmarks across a wide range of tasks such as visual perception, cross-modal retrieval, and multimodal dialogue.

<img width="950" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/23737120/7cd8c1d5-99e7-4b62-b70a-e73d4838daa8">

### How is InternVL trained?

The training strategy of InternVL consists of three progressive stages: vision-language contrastive training, vision-language generative training, and supervised fine-tuning. These stages effectively leverage public data from diverse sources, ranging from noisy image-text pairs on the web to high-quality caption, VQA, and multi-modal dialogue datasets.

<img width="700" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/8529570/a060ba07-faf7-45db-8a4c-a3a343141569">
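
As a reference for the first stage, here is a minimal PyTorch sketch of a standard CLIP-style image-text contrastive (InfoNCE) objective. It illustrates the general technique; the exact loss, temperature handling, and batching in the InternVL codebase may differ.

```python
# Symmetric image-text contrastive loss (InfoNCE), as used in CLIP-style pretraining.
import torch
import torch.nn.functional as F


def contrastive_loss(image_features: torch.Tensor,
                     text_features: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """image_features, text_features: (batch, dim) embeddings from the two towers."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature   # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs lie on the diagonal; average the image-to-text and text-to-image terms.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```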

### What can InternVL do?

InternVL is a “Swiss Army Knife” model. By flexibly combining the vision encoder and the language middleware, InternVL can support a wide range of vision and vision-language tasks, including the following.

<details>
<summary>Visual Perception (click to expand)</summary>

- Linear-Probe Image Classification [\[see details\]](./classification#-evaluation)

Note: ViT-22B (marked with \*) uses the private JFT-3B dataset.

| method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
| ------------------- | :----: | :---: | :-----: | :---: | :--: | :--: | :-------: |
| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
| ViT-22B\* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
| InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |

- Semantic Segmentation [\[see details\]](./segmentation#-evaluation)

| method | decoder | #param (train/total) | crop size | mIoU |
| --------------------- | :-----: | :------------------: | :-------: | ------------ |
| OpenCLIP-G (frozen) | Linear | 0.3M / 1.8B | 512 | 39.3 |
| ViT-22B (frozen) | Linear | 0.9M / 21.7B | 504 | 34.6 |
| InternViT-6B (frozen) | Linear | 0.5M / 5.9B | 504 | 47.2 (+12.6) |
| ViT-22B (frozen) | UperNet | 0.8B / 22.5B | 504 | 52.7 |
| InternViT-6B (frozen) | UperNet | 0.4B / 6.3B | 504 | 54.9 (+2.2) |
| ViT-22B | UperNet | 22.5B / 22.5B | 504 | 55.3 |
| InternViT-6B | UperNet | 6.3B / 6.3B | 504 | 58.9 (+3.6) |

- Zero-Shot Image Classification [\[see details\]](./clip_benchmark#imagenet-variants-and-objectnet) (a zero-shot inference sketch follows this list)

| method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
| ----------------- | :---: | :--: | :--: | :---: | :-------: | :-------: |
| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
| ViT-22B\* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
| InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |

- Multilingual Zero-Shot Image Classification [\[see details\]](./clip_benchmark#multilingual-imagenet-1k)

EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian

| method | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) |
| ----------------- | :--------: | :--------: | :--------: | :--------: | :--------: |
| Taiyi-CLIP-ViT-H | - | 54.4 | - | - | - |
| WuKong-ViT-L-G | - | 57.5 | - | - | - |
| CN-CLIP-ViT-H | - | 59.6 | - | - | - |
| AltCLIP-ViT-L | 74.5 | 59.6 | - | - | - |
| EVA-02-CLIP-E+ | 82.0 | - | - | - | 41.2 |
| OpenCLIP-XLM-R-H | 77.0 | 55.7 | 53.1 | 37.0 | 56.8 |
| InternVL-C (ours) | 83.2 | 64.5 | 61.5 | 44.9 | 65.7 |

- Zero-Shot Video Classification \[see details\]

| method | #frame | K400 | K600 | K700 |
| ----------------- | :----: | :--: | :--: | :--: |
| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
| InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
| InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |

</details>
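
As referenced in the classification items above, the sketch below shows how a dual-encoder such as InternVL-C is typically used for zero-shot image classification, and how video results can be obtained by mean-pooling frame embeddings. The `encode_image`/`encode_text` callables, the prompt templates, and the pooling strategy are assumptions for illustration, not the repository's API.

```python
# Zero-shot classification with a CLIP-style dual encoder (illustrative sketch).
import torch.nn.functional as F


def zero_shot_classify(encode_image, encode_text, images, class_names,
                       template="a photo of a {}."):
    """Score each image against one text prompt per class; return predicted class indices."""
    text_feats = F.normalize(encode_text([template.format(c) for c in class_names]), dim=-1)
    image_feats = F.normalize(encode_image(images), dim=-1)
    logits = image_feats @ text_feats.t()   # cosine similarity, (num_images, num_classes)
    return logits.argmax(dim=-1)


def zero_shot_classify_video(encode_image, encode_text, frames, class_names):
    """Video variant (assumed): average per-frame embeddings, then score as above.

    frames: (num_videos, num_frames, C, H, W); with num_frames == 1 this reduces to
    single-frame classification, matching the 1-frame rows in the table above.
    """
    n, t = frames.shape[:2]
    frame_feats = encode_image(frames.flatten(0, 1))                       # (n * t, dim)
    video_feats = F.normalize(frame_feats.view(n, t, -1).mean(dim=1), dim=-1)
    text_feats = F.normalize(encode_text([f"a video of a {c}." for c in class_names]), dim=-1)
    return (video_feats @ text_feats.t()).argmax(dim=-1)
```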

<details>
<summary>Cross-Modal Retrieval (click to expand)</summary>

- English Zero-Shot Image-Text Retrieval [\[see details\]](./clip_benchmark#flickr30k--coco) (a Recall@K sketch follows this list)

<table>
  <tr><th rowspan="3">model</th><th colspan="6">Flickr30K</th><th colspan="6">COCO</th><th rowspan="3">avg</th></tr>
  <tr><th colspan="3">image-to-text</th><th colspan="3">text-to-image</th><th colspan="3">image-to-text</th><th colspan="3">text-to-image</th></tr>
  <tr><th>R@1</th><th>R@5</th><th>R@10</th><th>R@1</th><th>R@5</th><th>R@10</th><th>R@1</th><th>R@5</th><th>R@10</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
  <tr><td>OpenCLIP-G</td><td>92.9</td><td>99.3</td><td>99.8</td><td>79.5</td><td>95.0</td><td>97.1</td><td>67.3</td><td>86.9</td><td>92.6</td><td>51.4</td><td>74.9</td><td>83.0</td><td>85.0</td></tr>
  <tr><td>EVA-02-CLIP-E+</td><td>93.9</td><td>99.4</td><td>99.8</td><td>78.8</td><td>94.2</td><td>96.8</td><td>68.8</td><td>87.8</td><td>92.8</td><td>51.1</td><td>75.0</td><td>82.7</td><td>85.1</td></tr>
  <tr><td>EVA-CLIP-8B</td><td>95.6</td><td>99.6</td><td>99.9</td><td>80.8</td><td>95.5</td><td>97.6</td><td>70.3</td><td>89.3</td><td>93.9</td><td>53.0</td><td>76.0</td><td>83.4</td><td>86.2</td></tr>
  <tr><td>InternVL-C (ours)</td><td>94.7</td><td>99.6</td><td>99.9</td><td>81.7</td><td>96.0</td><td>98.2</td><td>70.6</td><td>89.0</td><td>93.5</td><td>54.1</td><td>77.3</td><td>84.6</td><td>86.6</td></tr>
  <tr><td>InternVL-G (ours)</td><td>95.7</td><td>99.7</td><td>99.9</td><td>85.0</td><td>97.0</td><td>98.6</td><td>74.9</td><td>91.3</td><td>95.2</td><td>58.6</td><td>81.3</td><td>88.0</td><td>88.8</td></tr>
</table>

- Chinese Zero-Shot Image-Text Retrieval [\[see details\]](./clip_benchmark#flickr30k-cn--coco-cn)

<table>
  <tr><th rowspan="3">model</th><th colspan="6">Flickr30K-CN</th><th colspan="6">COCO-CN</th><th rowspan="3">avg</th></tr>
  <tr><th colspan="3">image-to-text</th><th colspan="3">text-to-image</th><th colspan="3">image-to-text</th><th colspan="3">text-to-image</th></tr>
  <tr><th>R@1</th><th>R@5</th><th>R@10</th><th>R@1</th><th>R@5</th><th>R@10</th><th>R@1</th><th>R@5</th><th>R@10</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
  <tr><td>CN-CLIP-ViT-H</td><td>81.6</td><td>97.5</td><td>98.8</td><td>71.2</td><td>91.4</td><td>95.5</td><td>63.0</td><td>86.6</td><td>92.9</td><td>69.2</td><td>89.9</td><td>96.1</td><td>86.1</td></tr>
  <tr><td>OpenCLIP-XLM-R-H</td><td>86.1</td><td>97.5</td><td>99.2</td><td>71.0</td><td>90.5</td><td>94.9</td><td>70.0</td><td>91.5</td><td>97.0</td><td>66.1</td><td>90.8</td><td>96.0</td><td>87.6</td></tr>
  <tr><td>InternVL-C (ours)</td><td>90.3</td><td>98.8</td><td>99.7</td><td>75.1</td><td>92.9</td><td>96.4</td><td>68.8</td><td>92.0</td><td>96.7</td><td>68.9</td><td>91.9</td><td>96.5</td><td>89.0</td></tr>
  <tr><td>InternVL-G (ours)</td><td>92.9</td><td>99.4</td><td>99.8</td><td>77.7</td><td>94.8</td><td>97.3</td><td>71.4</td><td>93.9</td><td>97.7</td><td>73.8</td><td>94.4</td><td>98.1</td><td>90.9</td></tr>
</table>

- Multilingual Zero-Shot Image-Text Retrieval on XTD [\[see details\]](./clip_benchmark#xtd)

| method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
| ----------------- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :-----: |
| AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
| OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
| InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
| InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |

</details>
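
As referenced in the retrieval items above, the sketch below shows how Recall@K is typically computed from dual-encoder embeddings. It assumes a single matching caption per image, which simplifies the actual Flickr30K/COCO protocol (five captions per image), so it illustrates the metric rather than reproducing the benchmark numbers.

```python
# Recall@K for image-to-text retrieval with a CLIP-style dual encoder (illustrative sketch).
import torch
import torch.nn.functional as F


def recall_at_k(image_feats: torch.Tensor, text_feats: torch.Tensor, k: int = 1) -> float:
    """Assumes image_feats[i] matches text_feats[i]; both are (N, dim)."""
    sims = F.normalize(image_feats, dim=-1) @ F.normalize(text_feats, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices                                  # (N, k) text indices
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    # An image counts as a hit if its matching caption appears in the top-k ranked texts.
    return (topk == targets).any(dim=-1).float().mean().item()


# Text-to-image retrieval is the same computation with the similarity matrix transposed.
```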

<details>
<summary>Multimodal Dialogue (click to expand)</summary>

- Zero-Shot Image Captioning [\[see details\]](./internvl_g#zero-shot-image-captioning)

| method | COCO | Flickr30K | NoCaps |
| ----------------- | :---: | :-------: | :----: |
| Emu-I | 117.7 | - | - |
| DreamLLM | 115.4 | - | - |
| InternVL-G (ours) | 128.2 | 79.2 | 113.7 |

- Multimodal Benchmarks with Frozen LLM [\[see details\]](./internvl_chat#-evaluation)

| method | visual encoder | glue layer | LLM | res. | COCO | Flickr | NoCaps | VQAv2 | GQA | VizWiz | TextVQA | MME | POPE |
| -------------------- | :------------: | :--------: | :---: | :--: | :---: | :----: | :----: | :---: | :--: | :----: | :-----: | :----: | :--: |
| InstructBLIP | EVA-g | QFormer | V-7B | 224 | - | 82.4 | 123.1 | - | 49.2 | 34.5 | 50.1 | - | - |
| BLIP-2 | EVA-g | QFormer | V-13B | 224 | - | 71.6 | 103.9 | 41.0 | 41.0 | 19.6 | 42.5 | 1293.8 | 85.3 |
| InstructBLIP | EVA-g | QFormer | V-13B | 224 | - | 82.8 | 121.9 | - | 49.5 | 33.4 | 50.7 | 1212.8 | 78.9 |
| InternVL-Chat (ours) | IViT-6B | QLLaMA | V-7B | 224 | 141.4 | 89.7 | 120.5 | 72.3 | 57.7 | 44.5 | 42.1 | 1298.5 | 85.2 |
| InternVL-Chat (ours) | IViT-6B | QLLaMA | V-13B | 224 | 142.4 | 89.9 | 123.1 | 71.7 | 59.5 | 54.0 | 49.1 | 1317.2 | 85.4 |

- Multimodal Benchmarks with Trainable LLM [\[see details\]](./internvl_chat_llava)

| method | vision encoder | LLM | res. | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MMB | MMB<sub>CN</sub> | MMVet |
| -------------------- | :------------: | :---: | :--: | :---: | :--: | :----: | :--: | :-----: | :--: | :----: | :--: | :--------------: | :---: |
| LLaVA-1.5 | CLIP-L-336px | V-7B | 336 | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 58.3 | 30.5 |
| LLaVA-1.5 | CLIP-L-336px | V-13B | 336 | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 35.4 |
| InternVL-Chat (ours) | IViT-6B-224px | V-7B | 336 | 79.3 | 62.9 | 52.5 | 66.2 | 57.0 | 86.4 | 1525.1 | 64.6 | 57.6 | 31.2 |
| InternVL-Chat (ours) | IViT-6B-224px | V-13B | 336 | 80.2 | 63.9 | 54.6 | 70.1 | 58.7 | 87.1 | 1546.9 | 66.5 | 61.9 | 33.7 |
| InternVL-Chat (ours) | IViT-6B-448px | V-13B | 448 | 82.0 | 64.1 | 60.1 | 71.6 | 64.8 | 87.2 | 1579.0 | 68.2 | 64.0 | 36.7 |

- Tiny LVLM [\[see details\]](https://github.com/OpenGVLab/Multi-Modality-Arena/tree/main/tiny_lvlm_evaluation)

| Rank | Model | Version | Score |
| :--: | :---: | :----------------------: | :--------: |
| 🏅️ | **[InternVL](https://github.com/OpenGVLab/InternVL)** | InternVL-Chat | **327.61** |
| 🥈 | **[InternLM-XComposer-VL](https://github.com/InternLM/InternLM-XComposer)** | InternLM-XComposer-VL-7B | **322.51** |
| 🥉 | **[Bard](https://bard.google.com/)** | Bard | **319.59** |
| 4 | [Qwen-VL-Chat](https://github.com/QwenLM/Qwen-VL) | Qwen-VL-Chat | 316.81 |
| 5 | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) | Vicuna-7B | 307.17 |
| 6 | [InstructBLIP](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip) | Vicuna-7B | 300.64 |
| 7 | [InternLM-XComposer](https://github.com/InternLM/InternLM-XComposer) | InternLM-XComposer-7B | 288.89 |
| 8 | [BLIP2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2) | FlanT5xl | 284.72 |
| 9 | [BLIVA](https://github.com/mlpc-ucsd/BLIVA) | Vicuna-7B | 284.17 |
| 10 | [Lynx](https://github.com/bytedance/lynx-llm) | Vicuna-7B | 279.24 |

</details>
