
Commit d9d18a8

Refactoring Task and Scorer class (#150)

* Add error message log
* Refactoring sample.py
* Add random_choice option in JMMMU and MMMU tasks
* Add Qwen2.5-VL-32B-Instruct
* Fix to generate json file that is used in github pages
* Add Action for github pages

1 parent 14d32c8 commit d9d18a8

30 files changed: +1441 −630 lines
.github/workflows/deploy-github-pages.yml

Lines changed: 57 additions & 0 deletions

```yaml
name: Deploy GitHub Pages

on:
  push:
    branches:
      - master
    paths:
      - .github/workflows/deploy-github-pages.yml
      - github_pages/**

jobs:
  build:
    runs-on: ubuntu-latest

    defaults:
      run:
        working-directory: github_pages

    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Install Node.js
        uses: actions/setup-node@v4

      - name: Install dependencies
        run: npm install

      - name: Build
        run: npm run build
        env:
          PUBLIC_URL: /llm-jp-eval-mm

      - name: Upload Pages artifact
        uses: actions/upload-pages-artifact@v3
        with:
          path: github_pages/build

  deploy:
    needs: build

    permissions:
      pages: write
      id-token: write

    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}

    runs-on: ubuntu-latest

    steps:
      - name: Deploy to GitHub Pages
        uses: actions/deploy-pages@v4
        id: deployment
```

README.md

Lines changed: 83 additions & 58 deletions
````diff
@@ -3,10 +3,9 @@
 
 [ [**Japanese**](./README_ja.md) | English ]
 
-This tool automatically evaluates Japanese multi-modal large language models across multiple datasets. It offers the following features:
+llm-jp-eval-mm automates the evaluation of multi-modal large language models (VLMs) across various datasets, mainly focusing on Japanese tasks.
 
-- Uses existing Japanese evaluation data and converts it into multi-modal text generation tasks for evaluation.
-- Calculates task-specific evaluation metrics using inference results created by users.
+This tool supports multi-modal text generation tasks and calculates task-specific evaluation metrics based on the inference results provided by users.
 
 ![What llm-jp-eval-mm provides](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/assets/teaser.png)
 
````

````diff
@@ -17,53 +16,54 @@ This tool automatically evaluates Japanese multi-modal large language models acr
 - [Getting Started](#getting-started)
 - [How to Evaluate](#how-to-evaluate)
 - [Running an Evaluation](#running-an-evaluation)
+- [Use llm-jp-eval-mm as a Library](#use-llm-jp-eval-mm-as-a-library)
 - [Leaderboard](#leaderboard)
 - [Supported Tasks](#supported-tasks)
 - [Required Libraries for Each VLM Model Inference](#required-libraries-for-each-vlm-model-inference)
 - [Benchmark-Specific Required Libraries](#benchmark-specific-required-libraries)
 - [Analyze VLMs Prediction](#analyze-vlms-prediction)
-- [License](#license)
 - [Contribution](#contribution)
 - [How to Add a Benchmark Task](#how-to-add-a-benchmark-task)
 - [How to Add a Metric](#how-to-add-a-metric)
 - [How to Add Inference Code for a VLM Model](#how-to-add-inference-code-for-a-vlm-model)
 - [How to Add Dependencies](#how-to-add-dependencies)
 - [Testing](#testing)
-- [Formatting and Linting with ruff](#formatting-and-linting-with-ruff)
-- [How to Release to PyPI](#how-to-release-to-pypi)
-- [How to Update the Website](#how-to-update-the-website)
+- [Formatting and Linting with Ruff](#formatting-and-linting-with-ruff)
+- [Releasing to PyPI](#releasing-to-pypi)
+- [Updating the Website](#updating-the-website)
 - [Acknowledgements](#acknowledgements)
 
 
 ## Getting Started
 
-You can use this tool via GitHub (Recommended).
+You can install llm-jp-eval-mm from GitHub or via PyPI.
 
+- Option 1: Clone from GitHub (Recommended)
 ```bash
 git clone git@github.com:llm-jp/llm-jp-eval-mm.git
 cd llm-jp-eval-mm
 uv sync
 ```
 
-Or you can install it via PyPI.
+- Option 2: Install via PyPI
 ```bash
 pip install eval_mm
 ```
 
-This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API. Please create a `.env` file and set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY` if you’re using Azure, or `OPENAI_API_KEY` if you’re using the OpenAI API.
+This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API.
+You need to configure the API keys in a `.env` file:
+- For Azure: `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`
+- For OpenAI: `OPENAI_API_KEY`
+
+If you're not using the LLM-as-a-judge method, you can set any value in the `.env` file to bypass the error.
 
-That’s it! You’re ready to evaluate your VLM model.
 
````
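The variables named in the new README text can be collected in a minimal `.env` sketch like the following; the values are placeholders, not working credentials, and only the backend you actually use needs to be set:

```
# Azure OpenAI backend
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_KEY=<your-azure-key>

# OpenAI API backend
OPENAI_API_KEY=<your-openai-key>
```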

````diff
 
 ## How to Evaluate
 
 ### Running an Evaluation
 
-We provide a sample code `examples/sample.py` for running an evaluation.
-
-Models listed as `examples/{model_name}.py` are supported only in terms of their inference method.
-
-If you want to run an evaluation on a new inference method or a new model, create a similar file referencing existing `examples/{model_name}.py`, and you can run the evaluation in the same way.
+To evaluate your model on a specific task, we provide an example script: `examples/sample.py`.
 
-For example, if you want to evaluate the `llava-hf/llava-1.5-7b-hf` model on japanese-heron-bench task, run the following command:
+For example, to evaluate the `llava-hf/llava-1.5-7b-hf` model on the japanese-heron-bench task, run:
 
 ```bash
 uv sync --group normal
@@ -72,11 +72,11 @@ uv run --group normal python examples/sample.py \
   --task_id japanese-heron-bench \
   --result_dir result \
   --metrics "heron-bench" \
-  --judge_model "gpt-4o-2024-05-13" \
+  --judge_model "gpt-4o-2024-11-20" \
   --overwrite
 ```
 
-The evaluation score and model outputs will be saved in the `result` directory like below:
+The evaluation results will be saved in the result directory:
 ```
 ├── japanese-heron-bench
 │   ├── llava-hf
````
````diff
@@ -87,27 +87,63 @@ The evaluation score and model outputs will be saved in the `result` directory l
 
 If you want to evaluate multiple models on multiple tasks, please check `eval_all.sh`.
 
+
+### Use llm-jp-eval-mm as a Library
+
+You can also integrate llm-jp-eval-mm into your own code. Here's an example:
+```python
+from PIL import Image
+from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig
+
+class MockVLM:
+    def generate(self, images: list[Image.Image], text: str) -> str:
+        return "宮崎駿"
+
+task = TaskRegistry.load_task("japanese-heron-bench")
+example = task.dataset[0]
+
+input_text = task.doc_to_text(example)
+images = task.doc_to_visual(example)
+reference = task.doc_to_answer(example)
+
+model = MockVLM()
+prediction = model.generate(images, input_text)
+
+scorer = ScorerRegistry.load_scorer(
+    "rougel",
+    ScorerConfig(docs=task.dataset)
+)
+result = scorer.aggregate(scorer.score([reference], [prediction]))
+print(result)
+# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})
+```
+
 ### Leaderboard
 
-You can create a leaderboard.md file by running the following command:
+To generate a leaderboard from your evaluation results, run:
 ```bash
 python scripts/make_leaderboard.py --result_dir result
 ```
 
-Table like below will be created in `leaderboard.md` file.
-| Model | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge |
-| :----------------------- | --------: | ----------: | ------------: |
-| llava-hf/llava-1.5-7b-hf | 36.9038 | 2.7 | 40.7525 |
+This will create a `leaderboard.md` file with your model performance:
+
+| Model | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge |
+| :--------------------------------------- | :-------- | :---------- | :------------ |
+| llm-jp/llm-jp-3-vila-14b | 68.03 | 4.08 | **52.4** |
+| Qwen/Qwen2.5-VL-7B-Instruct | 70.29 | 4.28 | 29.63 |
+| google/gemma-3-27b-it | 69.15 | 4.36 | 30.89 |
+| microsoft/Phi-4-multimodal-instruct | 45.52 | 3.2 | 26.8 |
+| gpt-4o-2024-11-20 | **93.7** | **4.44** | 32.2 |
 
 
 
-Official Leaderboard is [here](https://llm-jp.github.io/llm-jp-eval-mm/)
+The official leaderboard is available [here](https://llm-jp.github.io/llm-jp-eval-mm/)
 
 ## Supported Tasks
 
-Right now, the following benchmark tasks are supported:
+Currently, the following benchmark tasks are supported:
 
-Japanese Task:
+Japanese Tasks:
 - [Japanese Heron Bench](https://huggingface.co/datasets/turing-motors/Japanese-Heron-Bench)
 - [JA-VG-VQA500](https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500)
 - [JA-VLM-Bench-In-the-Wild](https://huggingface.co/datasets/SakanaAI/JA-VLM-Bench-In-the-Wild)
````
````diff
@@ -117,68 +153,60 @@ Japanese Task:
 - [JIC-VQA](https://huggingface.co/datasets/line-corporation/JIC-VQA)
 - [MECHA-ja](https://huggingface.co/datasets/llm-jp/MECHA-ja)
 
-English Task:
+English Tasks:
 - [MMMU](https://huggingface.co/datasets/MMMU/MMMU)
 - [LlaVA-Bench-In-the-Wild](https://huggingface.co/datasets/lmms-lab/llava-bench-in-the-wild)
 
 ## Required Libraries for Each VLM Model Inference
 
-Different models require different libraries.
-In this repository, we use uv’s [Dependency groups](https://docs.astral.sh/uv/concepts/projects/dependencies/#dependency-groups) to manage the libraries needed for each model.
+Each VLM model may have different dependencies.
+To manage these, llm-jp-eval-mm uses uv's dependency groups.
 
-For example, when you use `llm-jp/llm-jp-3-vila-14b`, please specify the `vilaja` group:
+For example, to use llm-jp/llm-jp-3-vila-14b, run:
 ```bash
 uv sync --group vilaja
 uv run --group vilaja python examples/VILA_ja.py
 ```
 
-For other models, please see the `eval_all.sh` script for the required group.
+Refer to eval_all.sh for a full list of model dependencies.
 
 When you add a new group, don’t forget to configure [conflict](https://docs.astral.sh/uv/concepts/projects/config/#conflicting-dependencies).
 
 ## Benchmark-Specific Required Libraries
 
 - JIC-VQA
 
-JIC-VQA only provide the image URL, so you need to download the images from the URL. You can use the following code to prepare the JIC-VQA dataset with the image download.
+For the JIC-VQA dataset, you need to download images from URLs. Use the following script to prepare the dataset:
 
 ```python
 python scripts/prepare_jic_vqa.py
 ```
 
 ## Analyze VLMs Prediction
 
-Let's analyze VLMs prediction!
+Visualize your model’s predictions with the following Streamlit app:
 ```bash
 uv run streamlit run scripts/browse_prediction.py --task_id "japanese-heron-bench" --result_dir "result"
 ```
-You can see the visualization like below.
+You will be able to see the visualized predictions, like this:
 ![Streamlit](./assets/streamlit_visualization.png)
 
 
-## License
-
-This repository is licensed under the Apache-2.0 License.
-
 ## Contribution
 
-- If you find any issues or have suggestions, please report them on the Issue.
-- If you add new benchmark tasks, metrics, or VLM model inference code, or if you fix bugs, please send us a Pull Request.
+We welcome contributions! If you encounter issues, or if you have suggestions or improvements, please open an issue or submit a pull request.
 
 ### How to Add a Benchmark Task
-Tasks are defined in the `Task` class.
-Please reference the code in [src/eval_mm/tasks](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/src/eval_mm/tasks) and implement your `Task` class. You’ll need methods to convert the dataset into a format for input to the VLM model, and methods to calculate the score.
+Refer to the `src/eval_mm/tasks` directory to implement new benchmark tasks.
 
 ### How to Add a Metric
-Metrics are defined in the `Scorer` class.
-Please reference the code in [src/eval_mm/metrics](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/src/eval_mm/metrics) and implement your `Scorer` class. You’ll need to implement a `score()` method for sample-level scoring comparing references and generated outputs, and an `aggregate()` method for population-level metric calculation.
+To add new metrics, implement them in the Scorer class. The code for existing scorers can be found in `src/eval_mm/metrics`.
````
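The README text removed in this hunk spelled out the `Scorer` contract: a `score()` method for sample-level scoring of references against generated outputs, and an `aggregate()` method for the population-level metric. A minimal standalone sketch of that contract follows; it is an illustrative exact-match scorer, not the repository's actual base class in `src/eval_mm/metrics`:

```python
# Hypothetical exact-match scorer illustrating the Scorer contract:
# score() compares each reference with the model output at sample level,
# aggregate() reduces the per-sample scores to one population-level value.
class ExactMatchScorer:
    def score(self, refs: list[str], preds: list[str]) -> list[int]:
        # 1 if the stripped strings match exactly, else 0, per sample.
        return [int(r.strip() == p.strip()) for r, p in zip(refs, preds)]

    def aggregate(self, scores: list[int]) -> float:
        # Mean of the per-sample scores; 0.0 for an empty input.
        return sum(scores) / len(scores) if scores else 0.0

scorer = ExactMatchScorer()
scores = scorer.score(["宮崎駿", "東京"], ["宮崎駿", "大阪"])
print(scorer.aggregate(scores))  # 0.5
```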
````diff
 
 ### How to Add Inference Code for a VLM Model
-Inference code for VLM models is defined in the `VLM` class.
-Please reference [examples/base_vlm](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/examples/base_vlm.py) and implement your `VLM` class. You’ll need a `generate()` method to output text given images and text inputs.
+Implement the inference code for VLM models in the VLM class. For reference, check `examples/base_vlm.py`.
 
 ### How to Add Dependencies
-
+To add a new dependency, run:
 ```
 uv add <package_name>
 uv add --group <group_name> <package_name>
````
````diff
@@ -187,32 +215,29 @@ uv add --group <group_name> <package_name>
 
 ### Testing
 
-You can test task classes and metric classes with the following command:
+Run the following commands to test the task classes and metrics and to test the VLM models:
 ```bash
 bash test.sh
-```
-You can also test each model's inference code with the following command:
-```bash
 bash test_model.sh
 ```
 
-### Formatting and Linting with ruff
+### Formatting and Linting with Ruff
 ```
 uv run ruff format src
 uv run ruff check --fix src
 ```
 
-### How to Release to PyPI
-
+### Releasing to PyPI
+To release a new version to PyPI:
 ```
 git tag -a v0.x.x -m "version 0.x.x"
 git push origin --tags
 ```
-Or you can manually create a new release on GitHub.
 
 
-### How to Update the Website
-Please refer to [github_pages/README.md](./github_pages/README.md).
+### Updating the Website
+For website updates, refer to the [github_pages/README.md](./github_pages/README.md).
+
 
 ## Acknowledgements
 - [Heron](https://github.com/turingmotors/heron): We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
````

assets/teaser.png

52.3 KB

examples/japanese_instructblip_alpha.py

Lines changed: 1 addition & 0 deletions

````diff
@@ -196,6 +196,7 @@ def generate(
         # TODO: white space return problem some times
         response = self.tokenizer.batch_decode(output, skip_special_tokens=True)
         generated_text = response[0].strip()
+        generated_text = generated_text.split("### 応答:")[-1].strip()
         return generated_text
````
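The added line strips the echoed prompt by keeping only the text after the last `### 応答:` ("response") marker. A small standalone demonstration of that one-liner, with a hypothetical helper name, shows it also behaves safely when the marker is absent, since `split` then returns the whole string:

```python
def extract_response(generated_text: str) -> str:
    # Keep only the text after the last "### 応答:" marker;
    # if the marker is absent, split() returns [generated_text],
    # so the full string is preserved.
    return generated_text.split("### 応答:")[-1].strip()

print(extract_response("### 指示: 監督は誰?\n### 応答: 宮崎駿 "))  # 宮崎駿
print(extract_response("宮崎駿"))  # 宮崎駿 (no marker, unchanged)
```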

examples/model_table.py

Lines changed: 1 addition & 0 deletions

````diff
@@ -22,6 +22,7 @@
     "Qwen/Qwen2-VL-72B-Instruct": "Qwen2_VL.VLM",
     "Qwen/Qwen2.5-VL-3B-Instruct": "Qwen2_VL.VLM",
     "Qwen/Qwen2.5-VL-7B-Instruct": "Qwen2_VL.VLM",
+    "Qwen/Qwen2.5-VL-32B-Instruct": "Qwen2_VL.VLM",
     "Qwen/Qwen2.5-VL-72B-Instruct": "Qwen2_VL.VLM",
     "llm-jp/llm-jp-3-vila-14b": "VILA_ja.VLM",
     "stabilityai/japanese-instructblip-alpha": "japanese_instructblip_alpha.VLM",
````
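The table entries map a Hugging Face model ID to a `"module.Class"` string naming the example file that implements its inference. How the repository actually resolves these strings is not shown in this diff; the sketch below, with a hypothetical `get_class_path` helper, only illustrates the mapping convention:

```python
# Sketch: splitting a model-table entry into (module name, class name).
# The real loader in examples/ may differ; MODEL_TABLE here is a
# two-entry excerpt of the mapping shown in the diff.
MODEL_TABLE = {
    "Qwen/Qwen2.5-VL-32B-Instruct": "Qwen2_VL.VLM",
    "llm-jp/llm-jp-3-vila-14b": "VILA_ja.VLM",
}

def get_class_path(model_id: str) -> tuple[str, str]:
    """Split a table entry like 'Qwen2_VL.VLM' into its module and class."""
    module_name, class_name = MODEL_TABLE[model_id].rsplit(".", 1)
    return module_name, class_name

print(get_class_path("Qwen/Qwen2.5-VL-32B-Instruct"))  # ('Qwen2_VL', 'VLM')
```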

0 commit comments
