@@ -49,7 +49,7 @@ First, download the [annotation files](https://huggingface.co/OpenGVLab/InternVL

Second, download all the images we used.

-- AI2D: [ai2d-all](https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip)
+- AI2D: [ai2d_images](https://drive.google.com/file/d/1dqqa3MnrxMXaU_K9JA6C83je32ibwdOY/view?usp=sharing) (provided by InternLM-XComposer)
- ChartQA: [ChartQA Dataset](https://huggingface.co/datasets/ahmed-masry/ChartQA/resolve/main/ChartQA%20Dataset.zip)
- COCO: [train2017](http://images.cocodataset.org/zips/train2017.zip)
- DocVQA: [train](https://datasets.cvc.uab.es/rrc/DocVQA/train.tar.gz), [val](https://datasets.cvc.uab.es/rrc/DocVQA/val.tar.gz), [test](https://datasets.cvc.uab.es/rrc/DocVQA/test.tar.gz)
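The download step can be scripted. Below is a minimal sketch for the first couple of archives, assuming the `playground/data/` layout shown in the next hunk; the unzip target directories are assumptions, and the remaining datasets in the list follow the same pattern.

```bash
# Minimal sketch: fetch and unpack a couple of the archives listed above.
# Target directories follow the playground/data tree and are assumptions.
mkdir -p playground/data && cd playground/data

# ChartQA
wget 'https://huggingface.co/datasets/ahmed-masry/ChartQA/resolve/main/ChartQA%20Dataset.zip'
unzip 'ChartQA Dataset.zip' -d chartqa

# COCO train2017
mkdir -p coco
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip -d coco

cd ../..
```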
@@ -78,45 +78,46 @@ playground/
├── geoqa+.jsonl
├── synthdog_en.jsonl
├── data
-│   ├── ai2d
-│   │   └── images
-│   ├── chartqa
-│   │   ├── test
-│   │   ├── train
-│   │   └── val
-│   ├── coco
-│   │   └── train2017
-│   ├── docvqa
-│   │   ├── test
-│   │   ├── train
-│   │   └── val
-│   ├── dvqa
-│   │   └── images
-│   ├── gqa
-│   │   └── images
-│   ├── llava
+│   ├── ai2d
+│   │   ├── abc_images
+│   │   └── images
+│   ├── chartqa
+│   │   ├── test
+│   │   ├── train
+│   │   └── val
+│   ├── coco
+│   │   └── train2017
+│   ├── docvqa
+│   │   ├── test
+│   │   ├── train
+│   │   └── val
+│   ├── dvqa
+│   │   └── images
+│   ├── gqa
+│   │   └── images
+│   ├── llava
│   │   └── llava_pretrain
│   │       └── images
-│   ├── ocr_vqa
+│   ├── ocr_vqa
│   │   └── images
-│   ├── sam
+│   ├── sam
│   │   └── images
-│   ├── share_textvqa
+│   ├── share_textvqa
│   │   └── images
-│   ├── synthdog-en
+│   ├── synthdog-en
│   │   └── images
-│   ├── textvqa
+│   ├── textvqa
│   │   └── train_images
-│   ├── vg
+│   ├── vg
│   │   ├── VG_100K
│   │   └── VG_100K_2
-│   ├── web-celebrity
+│   ├── web-celebrity
│   │   └── images
-│   ├── web-landmark
+│   ├── web-landmark
│   │   └── images
-│   ├── wikiart
+│   ├── wikiart
│   │   └── images
-│   ├── geoqa+
+│   ├── geoqa+
│   │   └── images
```
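Once everything is unpacked, a quick sanity check over this tree can catch missing folders early. A sketch (the directory list is abridged from the layout above):

```bash
# Check that the expected image directories exist under playground/data;
# the list is abridged from the tree above, extend as needed.
for d in ai2d/abc_images ai2d/images chartqa/train coco/train2017 \
         docvqa/train dvqa/images gqa/images llava/llava_pretrain/images \
         vg/VG_100K vg/VG_100K_2; do
  [ -d "playground/data/$d" ] || echo "missing: playground/data/$d"
done
```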

@@ -160,19 +161,21 @@ CUDA_VISIBLE_DEVICES=0,1 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b

| name | model size | MathVista<br>(testmini) | MMB<br>(dev/test) | MMB-CN<br>(dev/test) | MMMU<br>(val/test) | CMMMU<br>(val/test) | MMVP | MME | POPE | Tiny LVLM | SEEDv1<br>(image) | LLaVA Wild | MM-Vet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| [InternVL-Chat-V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | 19B | 34.5 | 76.7 / 75.4 | 71.9 / 70.3 | 39.1 / 35.3 | 34.8 / 34.0 | 44.7 | 1675.1 / 348.6 | 87.1 | 343.2 | 73.2 | 73.2 | 46.7 |
-| [InternVL-Chat-V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | 40B | 47.7 | 81.4 / 82.2 | 79.5 / 81.2 | 51.6 / [46.2](https://eval.ai/web/challenges/challenge-page/2179/leaderboard/5377) | TODO | 56.7 | 1672.1 / 509.3 | 88.0 | 350.3 | 75.6 | 85.0 | 48.9 |
-| [InternVL-Chat-V1.2-Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus) | 40B | 59.9 | 83.4 / 83.8 | 81.6 / 82.0 | 50.3 / 45.6 | TODO | 58.7 | 1623.6 / 550.7 | 88.7 | 353.9 | 76.4 | 84.6 | 47.9 |
+| [InternVL-Chat-V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | 19B | 34.5 | 76.7 / 75.4 | 71.9 / 70.3 | 39.1 / 35.3 | 34.8 / 34.0 | 44.7 | 1675.1 / 348.6 | 87.1 | 343.2 | 73.2 | 73.2 | 46.7 |
+| [InternVL-Chat-V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | 40B | 47.7 | 81.4 / 82.2 | 79.5 / 81.2 | 51.6 / [46.2](https://eval.ai/web/challenges/challenge-page/2179/leaderboard/5377) | TODO | 56.7 | 1672.1 / 509.3 | 88.0 | 350.3 | 75.6 | 85.0 | 48.9 |
+| [InternVL-Chat-V1.2-Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus) | 40B | 59.9 | 83.4 / 83.8 | 81.6 / 82.0 | 50.3 / 45.6 | TODO | 58.7 | 1623.6 / 550.7 | 88.7 | 353.9 | 76.4 | 84.6 | 47.9 |

**Image Captioning & Visual Question Answering**

\* Training set observed.

| name | model size | COCO<br>(test) | Flickr30K<br>(test) | NoCaps<br>(val) | VQAv2<br>(testdev) | OKVQA<br>(val) | TextVQA<br>(val) | VizWiz<br>(val/test) | AI2D<br>(test) | GQA<br>(test) | ScienceQA<br>(image) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| [InternVL-Chat-V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | 19B | 142.2\* | 85.3 | 120.8 | 80.9\* | 64.1\* | 65.9 | 59.0 / 57.3 | 70.3\* | 62.5\* | 90.1\* |
-| [InternVL-Chat-V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | 40B | 113.9 | 92.4 | 112.5 | - | 62.5\* | 69.7 | 61.9 / 60.0 | 71.6\* | 64.0\* | 83.3 |
-| [InternVL-Chat-V1.2-Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus) | 40B | 143.4\* | 90.5 | 125.8 | - | 67.6\* | 71.3\* | 61.3 / - | 74.2\* | 66.9\* | 98.1\* |
+| [InternVL-Chat-V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | 19B | 142.2\* | 85.3 | 120.8 | 80.9\* | 64.1\* | 65.9 | 59.0 / 57.3 | 72.2\* | 62.5\* | 90.1\* |
+| [InternVL-Chat-V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | 40B | 113.9 | 92.4 | 112.5 | - | 62.5\* | 69.7 | 61.9 / 60.0 | 77.1\* | 64.0\* | 83.3 |
+| [InternVL-Chat-V1.2-Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus) | 40B | 143.4\* | 90.5 | 125.8 | - | 67.6\* | 71.3\* | 61.3 / - | 78.2\* | 66.9\* | 98.1\* |
+
+- We found that incorrect images were used for training and testing on `AI2D`: for questions where `abcLabel` is True, the `abc_images` were not used. We have now corrected the images used for testing, but the results may still be somewhat lower as a consequence.
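In other words, the image root now depends on the `abcLabel` flag. As a purely hypothetical illustration of that routing (the `abcLabel` and `image` field names are assumptions about the annotation schema, and `jq` must be installed):

```bash
# Hypothetical sketch: route questions with abcLabel=true to ai2d/abc_images,
# everything else to ai2d/images. Field names are assumptions.
jq -r 'if .abcLabel then "ai2d/abc_images/" else "ai2d/images/" end + .image' \
  data/ai2diagram/test.jsonl | head
```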

**Visual Grounding**

│   └── ocrvqa_val.jsonl
├── ai2diagram
│   ├── ai2d/
-│   ├── test.jsonl
-│   └── train.jsonl
+│   │   ├── abc_images/
+│   │   └── images/
+│   └── test.jsonl
├── scienceqa
│   ├── images/
│   ├── problems.json
@@ -756,14 +760,12 @@ GPUS=8 sh evaluate.sh <checkpoint> vqa-ocrvqa-test

```bash
mkdir -p data/ai2diagram && cd data/ai2diagram
-
-# download images
-wget https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip
-
# download converted files
-wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ai2diagram/train.jsonl
-wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ai2diagram/test.jsonl
+wget https://huggingface.co/OpenGVLab/InternVL/raw/main/ai2d_test.jsonl -O test.jsonl

+# download images from Google Drive (provided by InternLM-XComposer)
+# https://drive.google.com/file/d/1dqqa3MnrxMXaU_K9JA6C83je32ibwdOY/view?usp=sharing
+# images should be placed in `data/ai2diagram/ai2d/abc_images` and `data/ai2diagram/ai2d/images`
cd ../..
```
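If the Google Drive step needs to be non-interactive, one option is the `gdown` package; this is a sketch, assuming `gdown` is installed and that the archive unpacks into the `ai2d/abc_images` and `ai2d/images` folders referenced above:

```bash
# Sketch: fetch the Drive file with gdown (pip install gdown).
# The output filename and the archive's internal layout are assumptions.
pip install gdown
gdown 'https://drive.google.com/uc?id=1dqqa3MnrxMXaU_K9JA6C83je32ibwdOY' -O ai2d_images.zip
unzip ai2d_images.zip -d data/ai2diagram/ai2d
```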