
Commit 7725e00

Fix AI2D training & testing & results
1 parent 53f6845 commit 7725e00

2 files changed: +47 -44 lines changed

2 files changed

+47
-44
lines changed

internvl_chat/README.md

Lines changed: 44 additions & 42 deletions
@@ -49,7 +49,7 @@ First, download the [annotation files](https://huggingface.co/OpenGVLab/InternVL

 Second, download all the images we used.

-- AI2D: [ai2d-all](https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip)
+- AI2D: [ai2d_images](https://drive.google.com/file/d/1dqqa3MnrxMXaU_K9JA6C83je32ibwdOY/view?usp=sharing) (provided by InternLM-XComposer)
 - ChartQA: [ChartQA Dataset](https://huggingface.co/datasets/ahmed-masry/ChartQA/resolve/main/ChartQA%20Dataset.zip)
 - COCO: [train2017](http://images.cocodataset.org/zips/train2017.zip)
 - DocVQA: [train](https://datasets.cvc.uab.es/rrc/DocVQA/train.tar.gz), [val](https://datasets.cvc.uab.es/rrc/DocVQA/val.tar.gz), [test](https://datasets.cvc.uab.es/rrc/DocVQA/test.tar.gz)
@@ -78,45 +78,46 @@ playground/
 ├── geoqa+.jsonl
 ├── synthdog_en.jsonl
 ├── data
-│   ├── ai2d
-│ │   └── images
-│   ├── chartqa
-│ │   ├── test
-│ │   ├── train
-│ │   └── val
-│   ├── coco
-│ │   └── train2017
-│   ├── docvqa
-│ │   ├── test
-│ │   ├── train
-│ │   └── val
-│   ├── dvqa
-│ │   └── images
-│   ├── gqa
-│ │   └── images
-│   ├── llava
+│ ├── ai2d
+│ │ ├── abc_images
+│ │ └── images
+│ ├── chartqa
+│ │ ├── test
+│ │ ├── train
+│ │ └── val
+│ ├── coco
+│ │ └── train2017
+│ ├── docvqa
+│ │ ├── test
+│ │ ├── train
+│ │ └── val
+│ ├── dvqa
+│ │ └── images
+│ ├── gqa
+│ │ └── images
+│ ├── llava
 │ │ └── llava_pretrain
 │ │ └── images
-│   ├── ocr_vqa
+│ ├── ocr_vqa
 │ │ └── images
-│   ├── sam
+│ ├── sam
 │ │ └── images
-│   ├── share_textvqa
+│ ├── share_textvqa
 │ │ └── images
-│   ├── synthdog-en
+│ ├── synthdog-en
 │ │ └── images
-│   ├── textvqa
+│ ├── textvqa
 │ │ └── train_images
-│   ├── vg
+│ ├── vg
 │ │ ├── VG_100K
 │ │ └── VG_100K_2
-│   ├── web-celebrity
+│ ├── web-celebrity
 │ │ └── images
-│   ├── web-landmark
+│ ├── web-landmark
 │ │ └── images
-│   ├── wikiart
+│ ├── wikiart
 │ │ └── images
-│   ├── geoqa+
+│ ├── geoqa+
 │ │ └── images
 ```

@@ -160,19 +161,21 @@ CUDA_VISIBLE_DEVICES=0,1 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b

 | name | model size | MathVista<br>(testmini) | MMB<br>(dev/test) | MMB−CN<br>(dev/test) | MMMU<br>(val/test) | CMMMU<br>(val/test) | MMVP | MME | POPE | Tiny LVLM | SEEDv1<br>(image) | LLaVA Wild | MM−Vet |
 | ------------------------------------------------------------------------------------------- | ---------- | ----------------------- | ----------------- | -------------------- | ---------------------------------------------------------------------------------- | ------------------- | ---- | -------------- | ---- | --------- | ----------------- | ---------- | ------ |
-| [InternVL−Chat−V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | 19B | 34.5 | 76.7 / 75.4 | 71.9 / 70.3 | 39.1 / 35.3 | 34.8 / 34.0 | 44.7 | 1675.1 / 348.6 | 87.1 | 343.2 | 73.2 | 73.2 | 46.7 |
-| [InternVL−Chat−V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | 40B | 47.7 | 81.4 / 82.2 | 79.5 / 81.2 | 51.6 / [46.2](https://eval.ai/web/challenges/challenge-page/2179/leaderboard/5377) | TODO | 56.7 | 1672.1 / 509.3 | 88.0 | 350.3 | 75.6 | 85.0 | 48.9 |
-| [InternVL−Chat−V1.2−Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus) | 40B | 59.9 | 83.4 / 83.8 | 81.6 / 82.0 | 50.3 / 45.6 | TODO | 58.7 | 1623.6 / 550.7 | 88.7 | 353.9 | 76.4 | 84.6 | 47.9 |
+| [InternVL−Chat−V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | 19B | 34.5 | 76.7 / 75.4 | 71.9 / 70.3 | 39.1 / 35.3 | 34.8 / 34.0 | 44.7 | 1675.1 / 348.6 | 87.1 | 343.2 | 73.2 | 73.2 | 46.7 |
+| [InternVL−Chat−V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | 40B | 47.7 | 81.4 / 82.2 | 79.5 / 81.2 | 51.6 / [46.2](https://eval.ai/web/challenges/challenge-page/2179/leaderboard/5377) | TODO | 56.7 | 1672.1 / 509.3 | 88.0 | 350.3 | 75.6 | 85.0 | 48.9 |
+| [InternVL−Chat−V1.2−Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus) | 40B | 59.9 | 83.4 / 83.8 | 81.6 / 82.0 | 50.3 / 45.6 | TODO | 58.7 | 1623.6 / 550.7 | 88.7 | 353.9 | 76.4 | 84.6 | 47.9 |

 **Image Captioning & Visual Question Answering**

 \* Training set observed.

 | name | model size | COCO<br>(test) | Flickr30K<br>(test) | NoCaps<br>(val) | VQAv2<br>(testdev) | OKVQA<br>(val) | TextVQA<br>(val) | VizWiz<br>(val/test) | AI2D<br>(test) | GQA<br>(test) | ScienceQA<br>(image) |
 | ------------------------------------------------------------------------------------------- | ---------- | -------------- | ------------------- | --------------- | ------------------ | -------------- | ---------------- | -------------------- | -------------- | ------------- | -------------------- |
-| [InternVL−Chat−V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | 19B | 142.2\* | 85.3 | 120.8 | 80.9\* | 64.1\* | 65.9 | 59.0 / 57.3 | 70.3\* | 62.5\* | 90.1\* |
-| [InternVL−Chat−V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | 40B | 113.9 | 92.4 | 112.5 | - | 62.5\* | 69.7 | 61.9 / 60.0 | 71.6\* | 64.0\* | 83.3 |
-| [InternVL−Chat−V1.2−Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus) | 40B | 143.4\* | 90.5 | 125.8 | - | 67.6\* | 71.3\* | 61.3 / - | 74.2\* | 66.9\* | 98.1\* |
+| [InternVL−Chat−V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | 19B | 142.2\* | 85.3 | 120.8 | 80.9\* | 64.1\* | 65.9 | 59.0 / 57.3 | 72.2\* | 62.5\* | 90.1\* |
+| [InternVL−Chat−V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | 40B | 113.9 | 92.4 | 112.5 | - | 62.5\* | 69.7 | 61.9 / 60.0 | 77.1\* | 64.0\* | 83.3 |
+| [InternVL−Chat−V1.2−Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus) | 40B | 143.4\* | 90.5 | 125.8 | - | 67.6\* | 71.3\* | 61.3 / - | 78.2\* | 66.9\* | 98.1\* |
+
+- We found that incorrect images were used for AI2D training and testing: for problems where `abcLabel` is True, the corresponding `abc_images` were not used. We have now corrected the images used for testing, but the results may still be somewhat lower as a consequence.

 **Visual Grounding**

@@ -298,8 +301,9 @@ data
 │ └── ocrvqa_val.jsonl
 ├── ai2diagram
 │ ├── ai2d/
-│ ├── test.jsonl
-│ └── train.jsonl
+│ │ ├── abc_images/
+│ │ └── images/
+│ └── test.jsonl
 ├── scienceqa
 │ ├── images/
 │ ├── problems.json
@@ -756,14 +760,12 @@ GPUS=8 sh evaluate.sh <checkpoint> vqa-ocrvqa-test

 ```bash
 mkdir -p data/ai2diagram && cd data/ai2diagram
-
-# download images
-wget https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip
-
 # download converted files
-wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ai2diagram/train.jsonl
-wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ai2diagram/test.jsonl
+wget https://huggingface.co/OpenGVLab/InternVL/raw/main/ai2d_test.jsonl -O test.jsonl

+# download images from Google drive (provided by InternLM-XComposer)
+# https://drive.google.com/file/d/1dqqa3MnrxMXaU_K9JA6C83je32ibwdOY/view?usp=sharing
+# images should be placed in `data/ai2diagram/ai2d/abc_images` and `data/ai2diagram/ai2d/images`
 cd ../..
 ```

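The README note above ties this fix to the `abcLabel` flag in the AI2D annotations. For orientation only, here is a minimal sketch (not part of this commit) of how the per-question image directory could be resolved; the directory layout under `data/ai2diagram/ai2d/` comes from the README, while the jsonl field names `image` and `abcLabel` are assumptions about the converted `test.jsonl` schema.

```python
import json
import os

AI2D_ROOT = 'data/ai2diagram/ai2d'


def resolve_ai2d_image(sample: dict) -> str:
    """Pick the diagram file for one AI2D question.

    Questions flagged with abcLabel=True use the relabeled diagrams in
    `abc_images`; all other questions use the original `images` folder.
    (Field names here are assumed, not taken from this commit.)
    """
    subdir = 'abc_images' if sample.get('abcLabel') else 'images'
    return os.path.join(AI2D_ROOT, subdir, os.path.basename(sample['image']))


# Example: walk the converted test file and print the resolved image paths.
with open('data/ai2diagram/test.jsonl') as f:
    for line in f:
        print(resolve_ai2d_image(json.loads(line)))
```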
internvl_chat/eval/vqa/evaluate_vqa.py

Lines changed: 3 additions & 2 deletions
@@ -236,7 +236,8 @@ def __getitem__(self, idx):

         image = Image.open(image).convert('RGB')
         pixel_values = self.transform(image).unsqueeze(0)
-        question = question + ' ' + self.prompt
+        if len(self.prompt) != 0:
+            question = question + ' ' + self.prompt
         return {
             'question_id': question_id,
             'question': question,
@@ -293,7 +294,7 @@ def post_process(response):
 def evaluate_chat_model():
     base_prompt = 'Answer the question using a single word or phrase.'
     vizwiz_prompt = "When the provided information is insufficient, respond with 'Unanswerable'. "
-    ai2d_prompt = 'Please answer the question based on the options mentioned before.'
+    ai2d_prompt = ''
     random.seed(args.seed)
     summaries = []

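Taken together, the two Python changes mean AI2D questions are now sent to the model with no extra suffix: `ai2d_prompt` is empty, and the length check skips the concatenation entirely. A rough illustration of the before/after behavior (the question text below is made up for illustration):

```python
prompt_old = 'Please answer the question based on the options mentioned before.'
prompt_new = ''  # ai2d_prompt after this commit

question = 'Which label points to the stem? A. w B. x C. y D. z'  # illustrative only

# Old behavior: the AI2D suffix was always appended to the question.
old_question = question + ' ' + prompt_old

# New behavior: an empty prompt is skipped, leaving the question untouched.
new_question = question + ' ' + prompt_new if len(prompt_new) != 0 else question

assert new_question == question
```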