For details on the data format and the list of supported data, please check [DAT

## Table of Contents

- [llm-jp-eval-mm](#llm-jp-eval-mm)
- [Table of Contents](#table-of-contents)
- [Getting Started](#getting-started)
- [How to Evaluate](#how-to-evaluate)
- [Running an Evaluation](#running-an-evaluation)
- [Leaderboard](#leaderboard)
- [Benchmark-Specific Required Libraries](#benchmark-specific-required-libraries)
- [License](#license)
- [Contribution](#contribution)
- [How to Add a Benchmark Task](#how-to-add-a-benchmark-task)
- [How to Add Inference Code for a VLM Model](#how-to-add-inference-code-for-a-vlm-model)
- [How to Add Dependencies](#how-to-add-dependencies)
- [Formatting and Linting with ruff](#formatting-and-linting-with-ruff)
- [Testing](#testing)
- [How to Release to PyPI](#how-to-release-to-pypi)
- [How to Update the Website](#how-to-update-the-website)
- [Acknowledgements](#acknowledgements)

## Getting Started

You can use this tool via GitHub (recommended).

This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API. Following the sample [.env.sample](./.env.sample), create a `.env` file and set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY` if you’re using Azure, or `OPENAI_API_KEY` if you’re using the OpenAI API.
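For reference, a `.env` file for this setup might look like the following sketch (all values are placeholders; set only the variables for your provider):

```bash
# .env (placeholder values; keep this file out of version control)

# If you use Azure OpenAI:
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_KEY=your-azure-key

# Or, if you use the OpenAI API directly:
OPENAI_API_KEY=your-openai-key
```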

That’s it! You’re ready to evaluate your VLM model.

## How to Evaluate
### Running an Evaluation
We provide a sample code `examples/sample.py` for running an evaluation.
Models listed as `examples/{model_name}.py` are supported only in terms of their inference method.
If you want to run an evaluation on a new inference method or a new model, create a similar file referencing existing `examples/{model_name}.py`, and you can run the evaluation in the same way.

For example, if you want to evaluate the `llava-hf/llava-1.5-7b-hf` model on the japanese-heron-bench task, run the following command:

```bash
uv sync --group normal
uv run --group normal python examples/sample.py \
  ...
```

The evaluation score and output results will be saved in `test/{task_id}/{model_id}/evaluation.jsonl` and `test/{task_id}/{model_id}/prediction.jsonl`.

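Both results files use the JSON Lines format (one JSON object per line). A minimal reader might look like the following sketch; the file path and record keys shown here are illustrative, not part of the tool's documented output schema:

```python
import io
import json

def read_jsonl(fp):
    """Parse a .jsonl stream: one JSON object per non-empty line."""
    return [json.loads(line) for line in fp if line.strip()]

# Illustrative usage with an in-memory stream; in practice you would pass
# an open file handle for test/{task_id}/{model_id}/evaluation.jsonl instead.
records = read_jsonl(io.StringIO('{"score": 0.5}\n{"score": 0.8}\n'))
print(records)  # → [{'score': 0.5}, {'score': 0.8}]
```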
If you want to evaluate multiple models on multiple tasks, please check `eval_all.sh`.
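A multi-model, multi-task sweep like `eval_all.sh` comes down to a nested loop. The sketch below uses hypothetical model and task lists; the real script supplies the full argument list to `examples/sample.py`:

```bash
# Sketch of a models × tasks sweep; identifiers are examples only.
run_all() {
  for model in llava-hf/llava-1.5-7b-hf; do
    for task in japanese-heron-bench; do
      # The real script would invoke examples/sample.py here with the
      # full argument list shown above.
      echo "evaluating ${model} on ${task}"
    done
  done
}
run_all
```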
## Benchmark-Specific Required Libraries
- JDocQA

To prepare the JDocQA dataset, the [pdf2image](https://pypi.org/project/pdf2image/) library is needed. Since pdf2image depends on poppler-utils, please install it with:

```bash
sudo apt-get install poppler-utils
```

- JIC-VQA

JIC-VQA provides only image URLs, so you need to download the images yourself. You can prepare the JIC-VQA dataset, including the image download, with the following command:

```bash
python scripts/prepare_jic_vqa.py
```
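The download step the script performs can be sketched roughly as follows. The helper name and usage are hypothetical; the actual logic lives in `scripts/prepare_jic_vqa.py`:

```python
import urllib.request

def fetch_image(url: str, dest: str) -> None:
    """Download a single image URL to a local file."""
    urllib.request.urlretrieve(url, dest)

# Hypothetical usage for one dataset record:
#   fetch_image(example["url"], f"images/{example['id']}.jpg")
```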
## License
This repository is licensed under the Apache-2.0 License.
## Contribution

- If you find any issues or have suggestions, please report them on the Issues page.
- If you add new benchmark tasks, metrics, or VLM model inference code, or if you fix bugs, please send us a Pull Request.
### How to Add a Benchmark Task
### How to Add Inference Code for a VLM Model
Inference code for VLM models is defined in the `VLM` class.
Please reference [examples/base_vlm](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/examples/base_vlm.py) and implement your `VLM` class. You’ll need a `generate()` method that outputs text given images and a text prompt.
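As a rough illustration, a new `examples/{model_name}.py` entry might look like the sketch below. Only the `generate()` method is part of the interface described above; the class name, constructor, and stubbed body are hypothetical, so check `examples/base_vlm.py` for the actual interface:

```python
# Hypothetical sketch of examples/{model_name}.py; everything except the
# existence of generate() is illustrative.
class MyVLM:
    def __init__(self, model_id: str):
        self.model_id = model_id
        # A real implementation would load the model checkpoint here.

    def generate(self, images: list, text: str) -> str:
        # A real implementation would run inference on the images and prompt;
        # this stub only echoes the prompt so the sketch stays runnable.
        return f"({self.model_id}) response to: {text}"

vlm = MyVLM("llava-hf/llava-1.5-7b-hf")
print(vlm.generate([], "Describe the image."))
```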
### How to Add Dependencies

### Formatting and Linting with ruff

```bash
uv run ruff format src
uv run ruff check --fix src
```
### Testing
You can test task classes and metric classes with the following command:
```bash
bash test.sh
```

You can also test each model’s inference code.