# <img width="60" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/8529570/5aa4cda8-b453-40a0-9336-17012b430ae8"> InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks —— An Open-Source Alternative to ViT-22B

\[[Update Blog](./BLOG.md)\] \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[Quick Start](#quick-start-with-huggingface)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)\]

## News🚀🚀🚀

- `2024/04/18`: InternVL-Chat-V1.5 has been released at [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5), approaching the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista.
- `2024/02/27`: InternVL is accepted by CVPR 2024! 🎉
- `2024/02/24`: InternVL-Chat models have been included in the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit).
- `2024/02/21`: [InternVL-Chat-V1.2-Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) achieves SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our [blog](BLOG.md) for more details.
- `2024/02/12`: InternVL-Chat-V1.2 has been released. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our [blog](BLOG.md) and [SFT data](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets), or try our [demo](https://internvl.opengvlab.com/). The model is now available on [HuggingFace](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2), and both the training/evaluation data and scripts are open-sourced.
- `2024/02/04`: [InternVL-Chat-V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1) achieves 44.67% on [MMVP](https://github.com/tsb0601/MMVP), higher than GPT-4V!
- `2024/01/27`: We release the 448-resolution model, achieving 76.6 on MMBench dev; see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#-evaluation-chinese-models).
- `2024/01/24`: InternVL-Chat-V1.1 is released; it supports Chinese and has stronger OCR capability. See [here](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1) or try our [demo](https://internvl.opengvlab.com/).
- `2024/01/16`: We release our [customized mmcv/mmsegmentation/mmdetection code](https://github.com/OpenGVLab/InternVL-MMDetSeg), integrated with DeepSpeed, which can be used for training large-scale object detection and semantic segmentation models.

## Compared with SOTA VLLMs

<img width="900" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/23737120/519f3edb-951a-4ddb-9ace-31add081faad">

<br>
<br>

<img width="1229" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/23737120/dd43bb52-2fb2-4532-b9b9-c33761437ca9">

## What is InternVL?

InternVL scales up the ViT to _**6B parameters**_ and aligns it with the LLM.

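The alignment includes a CLIP-style contrastive objective (see the paper for the full recipe). As a toy illustration only (the array sizes, names, and noise level below are invented, not InternVL code), contrastive alignment makes each image embedding most similar to the text embedding of its own caption:

```python
import numpy as np

# Toy contrastive-alignment sketch (illustrative only, NOT InternVL code).
# After alignment, the image/text similarity matrix should be dominated
# by its diagonal: each image retrieves its own caption.
rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

batch, dim = 4, 32
image_emb = l2_normalize(rng.normal(size=(batch, dim)))
# Pretend training succeeded: text embeddings sit near their paired images.
text_emb = l2_normalize(image_emb + 0.01 * rng.normal(size=(batch, dim)))

logits = image_emb @ text_emb.T / 0.07  # temperature-scaled cosine similarity
predictions = logits.argmax(axis=1)     # image-to-text retrieval
print(predictions)
```

With well-aligned embeddings, `predictions` recovers the identity pairing `[0, 1, 2, 3]`.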
## Model Zoo

**Vision Large Language Model**

| Model                   | Date       | Download                                                                             | Note                               |
| ----------------------- | ---------- | ------------------------------------------------------------------------------------ | ---------------------------------- |
| InternVL-Chat-V1.5      | 2024.04.18 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)                    | supports 4K images and very strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥new) |
| InternVL-Chat-V1.2-Plus | 2024.02.21 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)               | more SFT data, stronger performance |
| InternVL-Chat-V1.2      | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)                    | scales the LLM up to 34B           |
| InternVL-Chat-V1.1      | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)                    | supports Chinese and stronger OCR  |
| InternVL-Chat-19B-448px | 2024.02.03 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B-448px) | 448 resolution                     |
| InternVL-Chat-19B       | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B)       | English multimodal dialogue        |
| InternVL-Chat-13B       | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B)        | English multimodal dialogue        |

**Vision-Language Foundation Model**

| Model                   | Date       | Download                                                               | Note                             |
| ----------------------- | ---------- | ---------------------------------------------------------------------- | -------------------------------- |
| InternViT-6B-448px-V1.5 | 2024.04.20 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) | dynamic resolution, very strong OCR (🔥new) |
| InternViT-6B-448px-V1.2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) | 448 resolution                   |
| InternViT-6B-448px-V1.0 | 2024.01.30 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0) | 448 resolution                   |
| InternViT-6B-224px      | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-224px)      | vision foundation model          |
| InternVL-14B-224px      | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px)      | vision-language foundation model |
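The 448px checkpoints take 448×448 inputs. Below is a minimal preprocessing sketch assuming CLIP-style ImageNet normalization (an assumption; in practice, load the checkpoint's bundled `CLIPImageProcessor`, as in the quick-start snippets later in this README):

```python
import numpy as np
from PIL import Image

# Assumed ImageNet normalization constants; the authoritative values ship
# with each checkpoint's CLIPImageProcessor config.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image: Image.Image, size: int = 448) -> np.ndarray:
    """Resize to size x size, scale to [0, 1], normalize, return NCHW."""
    image = image.convert("RGB").resize((size, size))
    x = np.asarray(image, dtype=np.float32) / 255.0   # HWC, in [0, 1]
    x = (x - MEAN) / STD                              # channel-wise normalize
    return x.transpose(2, 0, 1)[None]                 # 1 x 3 x H x W

pixel_values = preprocess(Image.new("RGB", (640, 480), "white"))
print(pixel_values.shape)  # (1, 3, 448, 448)
```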

## What can InternVL do?

<details>
<summary>Multimodal Dialogue (click to expand)</summary>

- Zero-Shot Image Captioning [\[see details\]](./internvl_g#zero-shot-image-captioning)

  | method | COCO | Flickr30K | NoCaps |
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer

path = "OpenGVLab/InternVL-Chat-V1-1"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,

from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer

path = "OpenGVLab/InternVL-Chat-V1-1"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
# run the command in the `internvl_chat_llava` folder
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40001 --worker http://localhost:40001 --model-path ./path/to/InternVL-Chat-ViT-6B-Vicuna-13B

# OpenGVLab/InternVL-Chat-V1-1
# run the command in the `internvl_chat` folder
python -m internvl.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40002 --worker http://localhost:40002 --model-path ./path/to/InternVL-Chat-V1-1

# OpenGVLab/InternVL-Chat-V1-2
# run the command in the `internvl_chat` folder
python -m internvl.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40003 --worker http://localhost:40003 --model-path ./path/to/InternVL-Chat-V1-2

# OpenGVLab/InternVL-Chat-V1-2-Plus
# run the command in the `internvl_chat` folder
python -m internvl.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40004 --worker http://localhost:40004 --model-path ./path/to/InternVL-Chat-V1-2-Plus
```
</details>

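The launch commands above start one model worker per checkpoint (ports 40001 to 40004), each registering with the controller on port 10000. A small generic helper (not part of InternVL) can confirm that a worker's port is accepting connections before pointing clients at it:

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Controller first, then the four model workers from the commands above.
for port in (10000, 40001, 40002, 40003, 40004):
    print(port, "up" if port_open("localhost", port) else "down")
```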