Commit b1f25c0 (1 parent: 006eb58)

Add gemma3 and Qwen2.5 VL and sarashina and Refactoring (#123)

* Add test code
* Fix output dir structure
* Add gemma3 inference code
* Add sarashina and generalize model.generate() interface
* Fix sarashina to accept multiple images
* Fix registry name and Add tqdm
* Add test code for mecha-ja
* Add test CI
* Fix JIC-VQA dataset preparation
* Update README
* Fix model test
* Update README
* Fix gemma3 device
* Fix eval_all
* Add tips
* Fix judge prompt

51 files changed (+1213, −728 lines)

.github/workflows/test.yml

Lines changed: 22 additions & 0 deletions
```yaml
name: Test workflow

on:
  push:

jobs:
  uv-example:
    name: python
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Install the project
        run: uv sync --dev

      - name: Run tests
        # For example, using `pytest`
        run: uv run pytest src/eval_mm/metrics/*.py
```

README.md

Lines changed: 45 additions & 49 deletions
```diff
@@ -14,11 +14,9 @@ For details on the data format and the list of supported data, please check [DAT
 
 ## Table of Contents
 
-- [LLM-jp-eval-mm](#llm-jp-eval-mm)
+- [llm-jp-eval-mm](#llm-jp-eval-mm)
 - [Table of Contents](#table-of-contents)
-- [Environment Setup](#environment-setup)
-  - [Install via PyPI](#install-via-pypi)
-  - [Clone the GitHub Repo](#clone-the-github-repo)
+- [Getting Started](#getting-started)
 - [How to Evaluate](#how-to-evaluate)
   - [Running an Evaluation](#running-an-evaluation)
   - [Leaderboard](#leaderboard)
```
````diff
@@ -32,64 +30,41 @@ For details on the data format and the list of supported data, please check [DAT
   - [How to Add Inference Code for a VLM Model](#how-to-add-inference-code-for-a-vlm-model)
   - [How to Add Dependencies](#how-to-add-dependencies)
   - [Formatting and Linting with ruff](#formatting-and-linting-with-ruff)
+  - [Testing](#testing)
   - [How to Release to PyPI](#how-to-release-to-pypi)
   - [How to Update the Website](#how-to-update-the-website)
 - [Acknowledgements](#acknowledgements)
 
-## Environment Setup
+## Getting Started
 
-You can also use this tool via PyPI.
+You can use this tool via GitHub (recommended).
 
-### Install via PyPI
-
-1. Use the `pip` command to include `eval_mm` in the virtual environment where you want to run it:
-
-```bash
-pip install eval_mm
-```
-
-2. This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API. Please create a `.env` file and set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY` if you’re using Azure, or `OPENAI_API_KEY` if you’re using the OpenAI API.
-
-That’s it for environment setup.
-
-If you prefer to clone the repository and use it, please follow the instructions below.
-
-### Clone the GitHub Repo
-
-`eval-mm` uses `uv` to manage virtual environments.
-
-1. Clone the repository and move into it:
-```bash
-git clone git@github.com:llm-jp/llm-jp-eval-mm.git
-cd llm-jp-eval-mm
-```
-
-2. Build the environment with `uv`.
-
-Please install `uv` by referring to the [official doc](https://docs.astral.sh/uv/getting-started/installation/).
+```bash
+git clone git@github.com:llm-jp/llm-jp-eval-mm.git
+cd llm-jp-eval-mm
+uv sync
+```
 
-```bash
-cd llm-jp-eval-mm
-uv sync
-```
+Or you can install it via PyPI.
+```bash
+pip install eval_mm
+```
 
-3. Following the sample [.env.sample](./.env.sample), create a `.env` file and set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`, or `OPENAI_API_KEY`.
+This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API. Please create a `.env` file and set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY` if you’re using Azure, or `OPENAI_API_KEY` if you’re using the OpenAI API.
 
-That’s all you need for the setup.
+That’s it! You’re ready to evaluate your VLM model.
 
 ## How to Evaluate
 
 ### Running an Evaluation
 
-(Currently, the llm-jp-eval-mm repository is private. You can download the `examples` directory from the Source Distribution at [https://pypi.org/project/eval-mm/#files](https://pypi.org/project/eval-mm/#files).)
-
 We provide a sample code `examples/sample.py` for running an evaluation.
 
 Models listed as `examples/{model_name}.py` are supported only in terms of their inference method.
 
 If you want to run an evaluation on a new inference method or a new model, create a similar file referencing existing `examples/{model_name}.py`, and you can run the evaluation in the same way.
 
 For example, if you want to evaluate the `llava-hf/llava-1.5-7b-hf` model on the japanese-heron-bench task, run the following command:
 
 ```bash
 uv sync --group normal
````
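The environment-variable step above can be sketched as a `.env` file like the following. The variable names are the ones the README lists; the values are placeholders, and you should set only the block that matches your provider:

```shell
# .env — set ONE of the two blocks below (values are placeholders)

# Option A: Azure OpenAI
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_KEY=your-azure-key

# Option B: OpenAI API
OPENAI_API_KEY=sk-your-openai-key
```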
````diff
@@ -103,7 +78,7 @@ uv run --group normal python examples/sample.py \
 ```
 
 The evaluation score and output results will be saved in
-`test/{task_id}/evaluation/{model_id}.jsonl` and `test/{task_id}/prediction/{model_id}.jsonl`.
+`test/{task_id}/{model_id}/evaluation.jsonl` and `test/{task_id}/{model_id}/prediction.jsonl`.
 
 If you want to evaluate multiple models on multiple tasks, please check `eval_all.sh`.
````
````diff
@@ -166,19 +141,28 @@ If you add a new group, don’t forget to configure [conflict](https://docs.astr
 ## Benchmark-Specific Required Libraries
 
 - JDocQA
-  For constructing the JDocQA dataset, you need the [pdf2image](https://pypi.org/project/pdf2image/) library. Since pdf2image depends on poppler-utils, please install it with:
 
-  ```bash
-  sudo apt-get install poppler-utils
-  ```
+  To prepare the JDocQA dataset, the [pdf2image](https://pypi.org/project/pdf2image/) library is needed. Since pdf2image depends on poppler-utils, please install it with:
+
+  ```bash
+  sudo apt-get install poppler-utils
+  ```
+
+- JIC-VQA
+
+  JIC-VQA provides only image URLs, so you need to download the images yourself. You can use the following script to prepare the JIC-VQA dataset, including the image download:
+
+  ```bash
+  python scripts/prepare_jic_vqa.py
+  ```
 
 ## License
 
 This repository is licensed under the Apache-2.0 License.
 
 ## Contribution
 
 - If you find any issues or have suggestions, please report them on the Issue tracker.
 - If you add new benchmark tasks, metrics, or VLM model inference code, or if you fix bugs, please send us a Pull Request.
 
 ### How to Add a Benchmark Task
````
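The actual JIC-VQA preparation logic lives in `scripts/prepare_jic_vqa.py`, which is not shown in this commit. As a rough standalone illustration of the download step it performs, a sketch might look like this; the function names and the hash-based caching scheme are assumptions, not the script's real code:

```python
import hashlib
from pathlib import Path
from urllib.request import urlretrieve

def url_to_filename(url: str) -> str:
    """Derive a stable, collision-resistant local filename from an image URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".jpg"

def download_images(urls: list[str], out_dir: str = "images") -> list[Path]:
    """Fetch each URL once, skipping files that already exist on disk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in urls:
        dest = out / url_to_filename(url)
        if not dest.exists():
            urlretrieve(url, dest)  # network call; no retries in this sketch
        paths.append(dest)
    return paths
```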
```diff
@@ -191,7 +175,7 @@ Please reference the code in [src/eval_mm/metrics](https://github.com/llm-jp/llm
 
 ### How to Add Inference Code for a VLM Model
 Inference code for VLM models is defined in the `VLM` class.
-Please reference [examples/base_vlm](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/examples/base_vlm.py) and implement your `VLM` class. You’ll need a `generate()` method to produce output text from images and prompts.
+Please reference [examples/base_vlm](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/examples/base_vlm.py) and implement your `VLM` class. You’ll need a `generate()` method that outputs text given images and a text prompt.
 
 ### How to Add Dependencies
```
````diff
@@ -206,6 +190,18 @@ uv run ruff format src
 uv run ruff check --fix src
 ```
 
+### Testing
+
+You can test the task classes and metric classes with the following command:
+```bash
+bash test.sh
+```
+You can also test each model's inference code with the following command:
+```bash
+bash test_model.sh
+```
+
 ### How to Release to PyPI
 
 ```
````

eval_all.sh

Lines changed: 7 additions & 5 deletions
```diff
@@ -15,16 +15,18 @@ declare -A MODEL_GROUP_MAP=(
   ["Qwen/Qwen2-VL-7B-Instruct"]="normal"
   ["OpenGVLab/InternVL2-26B"]="normal"
   ["Qwen/Qwen2-VL-72B-Instruct"]="normal"
+  ["Qwen/Qwen2.5-VL-7B-Instruct"]="normal"
+  ["Qwen/Qwen2.5-VL-72B-Instruct"]="normal"
   ["gpt-4o-2024-05-13"]="normal"
   ["mistralai/Pixtral-12B-2409"]="pixtral"
   ["llm-jp/llm-jp-3-vila-14b"]="vilaja"
   ["Efficient-Large-Model/VILA1.5-13b"]="vilaja"
   ["SakanaAI/Llama-3-EvoVLM-JP-v2"]="evovlm"
+  ["google/gemma-3-12b-it"]="gemma"
+  ["sbintuitions/sarashina2-vision-8b"]="sarashina"
+  ["sbintuitions/sarashina2-vision-14b"]="sarashina"
 )
 
-model_name="stabilityai/japanese-instructblip-alpha"
-echo "Model group: ${MODEL_GROUP_MAP[$model_name]}"
-
 # Task list
 declare -a task_list=(
   "japanese-heron-bench"
```
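The model-to-group mapping added above is a plain Bash associative array; a lookup works as in this small sketch, which reproduces only two of the map's entries for illustration:

```shell
#!/usr/bin/env bash
# Look up the dependency group for a model id (requires bash 4+).
declare -A MODEL_GROUP_MAP=(
  ["Qwen/Qwen2.5-VL-7B-Instruct"]="normal"
  ["google/gemma-3-12b-it"]="gemma"
)
model="google/gemma-3-12b-it"
group="${MODEL_GROUP_MAP[$model]}"
echo "$group"
```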
```diff
@@ -45,11 +47,11 @@ declare -A METRIC_MAP=(
   ["ja-vlm-bench-in-the-wild"]="llm_as_a_judge,rougel"
   ["ja-vg-vqa-500"]="llm_as_a_judge,rougel"
   ["jmmmu"]="jmmmu"
-  ["ja-multi-image-vqa"]="rougel"
+  ["ja-multi-image-vqa"]="llm_as_a_judge,rougel"
   ["jdocqa"]="jdocqa,llm_as_a_judge"
   ["mmmu"]="mmmu"
   ["llava-bench-in-the-wild"]="llm_as_a_judge,rougel"
-  ["jic-vqa"]="jic-vqa"
+  ["jic-vqa"]="jic_vqa"
   ["mecha-ja"]="mecha-ja"
 )
```

examples/EvoVLM_JP_v1_7B.py

Lines changed: 4 additions & 8 deletions
```diff
@@ -16,13 +16,9 @@ def __init__(self, model_id: str = "SakanaAI/EvoVLM-JP-v1-7B") -> None:
         self.model.to(self.device)
 
     def generate(
-        self, image, text: str, gen_kwargs: GenerationConfig = GenerationConfig()
-    ):
-        text = text.replace("<image>", "")
-        if isinstance(image, list):
-            text = "<image>" * len(image) + f"{text}"
-        else:
-            text = f"<image>{text}"
+        self, images, text: str, gen_kwargs: GenerationConfig = GenerationConfig()
+    ) -> str:
+        text = "<image>" * len(images) + f"{text}"
 
         messages = [
             {
```
```diff
@@ -31,7 +27,7 @@ def generate(
             },
             {"role": "user", "content": text},
         ]
-        inputs = self.processor.image_processor(images=image, return_tensors="pt")
+        inputs = self.processor.image_processor(images=images, return_tensors="pt")
         inputs["input_ids"] = self.processor.tokenizer.apply_chat_template(
             messages, return_tensors="pt"
         )
```

examples/GPT_4o.py

Lines changed: 22 additions & 44 deletions
```diff
@@ -23,53 +23,31 @@ def __init__(self, model_id: str = "gpt-4o-2024-05-13") -> None:
         )
 
     def generate(
-        self, image, text: str, gen_kwargs: GenerationConfig = GenerationConfig()
-    ):
-        if "<image>" in text:
-            text = text.replace("<image>", "")
+        self, images, text: str, gen_kwargs: GenerationConfig = GenerationConfig()
+    ) -> str:
         message = []
-        if isinstance(image, list):
-            image_base64_list = [encode_image_to_base64(img) for img in image]
-            message_base = {
-                "role": "user",
-                "content": [
-                    {
-                        "type": "text",
-                        "text": text,
-                    },
-                ],
-            }
-            for image_base64 in image_base64_list:
-                message_base["content"].append(
-                    {
-                        "type": "image_url",
-                        "image_url": {
-                            "url": f"data:image/jpeg;base64,{image_base64}",
-                            "detail": "auto",
-                        },
-                    }
-                )
-            message = [message_base]
-        else:
-            image_base64 = encode_image_to_base64(image)
-            message = [
+        image_base64_list = [encode_image_to_base64(img) for img in images]
+        message_base = {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": text,
+                },
+            ],
+        }
+        for image_base64 in image_base64_list:
+            message_base["content"].append(
                 {
-                    "role": "user",
-                    "content": [
-                        {
-                            "type": "text",
-                            "text": text,
-                        },
-                        {
-                            "type": "image_url",
-                            "image_url": {
-                                "url": f"data:image/jpeg;base64,{image_base64}",
-                                "detail": "auto",
-                            },
-                        },
-                    ],
+                    "type": "image_url",
+                    "image_url": {
+                        "url": f"data:image/jpeg;base64,{image_base64}",
+                        "detail": "auto",
+                    },
                 }
-            ]
+            )
+        message = [message_base]
+
         try:
             response = self.client.chat.completions.create(
                 model=self.model_id,
```

examples/InternVL2.py

Lines changed: 14 additions & 25 deletions
```diff
@@ -136,36 +136,25 @@ def __init__(self, model_id: str = "OpenGVLab/InternVL2-8B") -> None:
         )
 
     def generate(
-        self, image, text: str, gen_kwargs: GenerationConfig = GenerationConfig()
-    ):
+        self, images, text: str, gen_kwargs: GenerationConfig = GenerationConfig()
+    ) -> str:
         text = text.replace("<image>", "")
         if "<image>" not in text:
-            if isinstance(image, list):
-                image_tokens = ["<image>"] * len(image)
-                image_tokens = " ".join(image_tokens)
-                text = f"{image_tokens}\n{text}"
-            else:
-                text = f"<image>\n{text}"
-        if isinstance(image, list):
-            pixel_values_list = []
-            for img in image:
-                pixel_values = (
-                    load_image(img, max_num=12)
-                    .to(self.model.device)
-                    .to(self.model.dtype)
-                )
-                pixel_values_list.append(pixel_values)
-            num_patches_list = [
-                pixel_values.size(0) for pixel_values in pixel_values_list
-            ]
-            pixel_values = torch.cat(pixel_values_list, dim=0)
-
-        else:
-            num_patches_list = None
+            image_tokens = ["<image>"] * len(images)
+            image_tokens = " ".join(image_tokens)
+            text = f"{image_tokens}\n{text}"
+
+        pixel_values_list = []
+        for img in images:
             pixel_values = (
-                load_image(image, max_num=12).to(self.model.device).to(self.model.dtype)
+                load_image(img, max_num=12).to(self.model.device).to(self.model.dtype)
             )
+            pixel_values_list.append(pixel_values)
+        num_patches_list = [pixel_values.size(0) for pixel_values in pixel_values_list]
+        pixel_values = torch.cat(pixel_values_list, dim=0)
+
         import copy
+
         generation_config = copy.deepcopy(gen_kwargs.__dict__)
         generation_config.pop("use_cache")
```
