This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API.

You need to configure the API keys in a `.env` file:

- For Azure: `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`
- For OpenAI: `OPENAI_API_KEY`

If you're not using the LLM-as-a-judge method, you can set any value in the `.env` file to bypass the error.
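For example, a `.env` file might look like this (placeholder values only; set just the variables that match your provider):

```shell
# .env -- placeholder values; fill in the variables for your provider
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_KEY=your-azure-key
# or, when using the OpenAI API directly:
OPENAI_API_KEY=your-openai-key
```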
That's it! You're ready to evaluate your VLM model.
## How to Evaluate
### Running an Evaluation
59
63
60
-
We provide a sample code `examples/sample.py` for running an evaluation.
61
-
62
-
Models listed as `examples/{model_name}.py` are supported only in terms of their inference method.
63
-
64
-
If you want to run an evaluation on a new inference method or a new model, create a similar file referencing existing `examples/{model_name}.py`, and you can run the evaluation in the same way.
64
+
To evaluate your model on a specific task, we provide an example script: `examples/sample.py`.
For example, to evaluate the `llava-hf/llava-1.5-7b-hf` model on the japanese-heron-bench task, run:
```bash
uv sync --group normal
uv run --group normal python examples/sample.py \
--task_id japanese-heron-bench \
--result_dir result \
--metrics "heron-bench" \
--judge_model "gpt-4o-2024-11-20" \
--overwrite
```
The evaluation results will be saved in the `result` directory:
```
├── japanese-heron-bench
│   ├── llava-hf
```
If you want to evaluate multiple models on multiple tasks, please check `eval_all.sh`.
### Use llm-jp-eval-mm as a Library

You can also integrate llm-jp-eval-mm into your own code. Here's an example:

```python
from PIL import Image

from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig
# ...
```
## Required Libraries for Each VLM Model Inference
Each VLM model may have different dependencies. To manage these, llm-jp-eval-mm uses uv's [dependency groups](https://docs.astral.sh/uv/concepts/projects/dependencies/#dependency-groups).

For example, to use `llm-jp/llm-jp-3-vila-14b`, specify the `vilaja` group:
```bash
uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py
```
Refer to `eval_all.sh` for a full list of model dependencies.
When you add a new group, don’t forget to configure [conflict](https://docs.astral.sh/uv/concepts/projects/config/#conflicting-dependencies).
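For reference, a conflict between two dependency groups is declared in `pyproject.toml` roughly as follows (the group names here are illustrative; see the linked uv documentation for the exact syntax):

```toml
[tool.uv]
# These dependency groups cannot be installed into the same environment.
conflicts = [
    [
        { group = "normal" },
        { group = "vilaja" },
    ],
]
```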
## Benchmark-Specific Required Libraries
- JIC-VQA
For the JIC-VQA dataset, you need to download the images from their URLs. Use the following script to prepare the dataset:
```bash
python scripts/prepare_jic_vqa.py
```
## Analyze VLM Predictions
Visualize your model's predictions with the following Streamlit app:
```bash
uv run streamlit run scripts/browse_prediction.py -- --task_id "japanese-heron-bench" --result_dir "result"
```
You will be able to see the visualized predictions.
## License

This repository is licensed under the Apache-2.0 License.
## Contribution
We welcome contributions! If you encounter issues or have suggestions or improvements, please open an issue or submit a pull request.
### How to Add a Benchmark Task
Tasks are defined in the `Task` class. Refer to the code in [src/eval_mm/tasks](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/src/eval_mm/tasks) and implement your `Task` class: you'll need methods to convert the dataset into a format for input to the VLM model, and methods to calculate the score.
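As a rough sketch of the shape such a class takes (the method and field names below are illustrative assumptions, not the library's actual API):

```python
# Hypothetical sketch of a new benchmark task. The real Task interface
# lives in src/eval_mm/tasks; all names below are illustrative only.
class MyNewTask:
    def __init__(self) -> None:
        # In practice the dataset would be loaded from Hugging Face, etc.
        self.dataset = [
            {"question": "What is in the image?", "answer": "a cat"},
        ]

    def doc_to_text(self, doc: dict) -> str:
        """Convert a raw dataset record into the prompt text fed to the VLM."""
        return doc["question"]

    def doc_to_answer(self, doc: dict) -> str:
        """Extract the reference answer used for scoring."""
        return doc["answer"]


task = MyNewTask()
print(task.doc_to_text(task.dataset[0]))    # What is in the image?
print(task.doc_to_answer(task.dataset[0]))  # a cat
```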
### How to Add a Metric
Metrics are defined in the `Scorer` class. Refer to the code in [src/eval_mm/metrics](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/src/eval_mm/metrics) and implement your `Scorer` class: you'll need a `score()` method for sample-level scoring that compares references with generated outputs, and an `aggregate()` method for population-level metric calculation.
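As a minimal sketch of that two-method shape, here is a self-contained exact-match scorer (the class name and signatures are illustrative assumptions, not the library's actual interface):

```python
# Hypothetical sketch of a Scorer -- the actual interface is defined in
# src/eval_mm/metrics; names and signatures below are illustrative only.
class ExactMatchScorer:
    @staticmethod
    def score(refs: list[str], preds: list[str]) -> list[int]:
        """Sample-level scores: 1 if the generated output matches the reference."""
        return [int(r.strip() == p.strip()) for r, p in zip(refs, preds)]

    @staticmethod
    def aggregate(scores: list[int]) -> float:
        """Population-level metric: the mean of the sample-level scores."""
        return sum(scores) / len(scores) if scores else 0.0


scores = ExactMatchScorer.score(["a cat", "a dog"], ["a cat", "a bird"])
print(scores)                              # [1, 0]
print(ExactMatchScorer.aggregate(scores))  # 0.5
```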
### How to Add Inference Code for a VLM Model
Inference code for VLM models is defined in the `VLM` class. Refer to [examples/base_vlm.py](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/examples/base_vlm.py) and implement your `VLM` class: you'll need a `generate()` method that outputs text given image and text inputs.
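As a minimal sketch of that `generate()` contract, here is a stub wrapper (the class name and signature are illustrative assumptions; see `examples/base_vlm.py` for the actual base class):

```python
# Hypothetical sketch of a VLM wrapper -- see examples/base_vlm.py for the
# real base class; names and signatures below are illustrative only.
class EchoVLM:
    """A stub 'model' that ignores its image inputs and echoes the prompt."""

    def generate(self, images: list, text: str) -> str:
        # A real implementation would run the model's processor and
        # generation loop here and return the decoded output text.
        return f"[{len(images)} image(s)] {text}"


vlm = EchoVLM()
print(vlm.generate(["<image>"], "Describe the image."))
# [1 image(s)] Describe the image.
```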