Commit b1f25c0 (1 parent: 006eb58)

Add gemma3 and Qwen2.5 VL and sarashina and Refactoring (#123)

* Add test code
* Fix output dir structure
* Add gemma3 inference code
* Add sarashina and generalize model.generate() interface
* Fix sarashina to accept multiple images
* Fix registry name and Add tqdm
* Add test code for mecha-ja
* Add test CI
* Fix JIC-VQA dataset preparation
* Update README
* Fix model test
* Update README
* Fix gemma3 device
* Fix eval_all
* Add tips
* Fix judge prompt

51 files changed (+1213, −728 lines)

.github/workflows/test.yml

Lines changed: 22 additions & 0 deletions
```yaml
name: Test workflow

on:
  push:

jobs:
  uv-example:
    name: python
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Install the project
        run: uv sync --dev

      - name: Run tests
        # For example, using `pytest`
        run: uv run pytest src/eval_mm/metrics/*.py
```

README.md

Lines changed: 45 additions & 49 deletions
```diff
@@ -14,11 +14,9 @@ For details on the data format and the list of supported data, please check [DAT
 
 ## Table of Contents
 
-- [LLM-jp-eval-mm](#llm-jp-eval-mm)
+- [llm-jp-eval-mm](#llm-jp-eval-mm)
 - [Table of Contents](#table-of-contents)
-- [Environment Setup](#environment-setup)
-  - [Install via PyPI](#install-via-pypi)
-  - [Clone the GitHub Repo](#clone-the-github-repo)
+- [Getting Started](#getting-started)
 - [How to Evaluate](#how-to-evaluate)
   - [Running an Evaluation](#running-an-evaluation)
   - [Leaderboard](#leaderboard)
```
````diff
@@ -32,64 +30,41 @@ For details on the data format and the list of supported data, please check [DAT
   - [How to Add Inference Code for a VLM Model](#how-to-add-inference-code-for-a-vlm-model)
   - [How to Add Dependencies](#how-to-add-dependencies)
   - [Formatting and Linting with ruff](#formatting-and-linting-with-ruff)
+  - [Testing](#testing)
   - [How to Release to PyPI](#how-to-release-to-pypi)
   - [How to Update the Website](#how-to-update-the-website)
 - [Acknowledgements](#acknowledgements)
 
-## Environment Setup
+## Getting Started
 
-You can also use this tool via PyPI.
+You can use this tool via GitHub (recommended).
 
-### Install via PyPI
-
-1. Use the `pip` command to include `eval_mm` in the virtual environment where you want to run it:
-
-```bash
-pip install eval_mm
-```
-
-2. This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API. Please create a `.env` file and set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY` if you’re using Azure, or `OPENAI_API_KEY` if you’re using the OpenAI API.
-
-That’s it for environment setup.
-
-If you prefer to clone the repository and use it, please follow the instructions below.
-
-### Clone the GitHub Repo
-
-`eval-mm` uses `uv` to manage virtual environments.
-
-1. Clone the repository and move into it:
-```bash
-git clone git@github.com:llm-jp/llm-jp-eval-mm.git
-cd llm-jp-eval-mm
-```
-
-2. Build the environment with `uv`.
-
-Please install `uv` by referring to the [official doc](https://docs.astral.sh/uv/getting-started/installation/).
+```bash
+git clone git@github.com:llm-jp/llm-jp-eval-mm.git
+cd llm-jp-eval-mm
+uv sync
+```
 
-```bash
-cd llm-jp-eval-mm
-uv sync
-```
+Or you can install it via PyPI.
+```bash
+pip install eval_mm
+```
 
-3. Following the sample [.env.sample](./.env.sample), create a `.env` file and set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`, or `OPENAI_API_KEY`.
+This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API. Please create a `.env` file and set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY` if you’re using Azure, or `OPENAI_API_KEY` if you’re using the OpenAI API.
 
-That’s all you need for the setup.
+That’s it! You’re ready to evaluate your VLM model.
 
 ## How to Evaluate
 
 ### Running an Evaluation
 
-(Currently, the llm-jp-eval-mm repository is private. You can download the `examples` directory from the Source Distribution at [https://pypi.org/project/eval-mm/#files](https://pypi.org/project/eval-mm/#files).)
-
 We provide a sample code `examples/sample.py` for running an evaluation.
 
 Models listed as `examples/{model_name}.py` are supported only in terms of their inference method.
 
 If you want to run an evaluation on a new inference method or a new model, create a similar file referencing existing `examples/{model_name}.py`, and you can run the evaluation in the same way.
 
 For example, if you want to evaluate the `llava-hf/llava-1.5-7b-hf` model on the japanese-heron-bench task, run the following command:
 
 ```bash
 uv sync --group normal
````
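The environment-variable step above can be sketched as a `.env` file like the following. The variable names are the ones the README lists; the values are placeholders, and you should set only the block that matches your provider:

```shell
# .env — set ONE of the two blocks below (values are placeholders)

# Option A: Azure OpenAI
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_KEY=your-azure-key

# Option B: OpenAI API
OPENAI_API_KEY=sk-your-openai-key
```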
````diff
@@ -103,7 +78,7 @@ uv run --group normal python examples/sample.py \
 ```
 
 The evaluation score and output results will be saved in
-`test/{task_id}/evaluation/{model_id}.jsonl` and `test/{task_id}/prediction/{model_id}.jsonl`.
+`test/{task_id}/{model_id}/evaluation.jsonl` and `test/{task_id}/{model_id}/prediction.jsonl`.
 
 If you want to evaluate multiple models on multiple tasks, please check `eval_all.sh`.
````
````diff
@@ -166,19 +141,28 @@ If you add a new group, don’t forget to configure [conflict](https://docs.astr
 ## Benchmark-Specific Required Libraries
 
 - JDocQA
-  For constructing the JDocQA dataset, you need the [pdf2image](https://pypi.org/project/pdf2image/) library. Since pdf2image depends on poppler-utils, please install it with:
 
-  ```bash
-  sudo apt-get install poppler-utils
-  ```
+  To prepare the JDocQA dataset, the [pdf2image](https://pypi.org/project/pdf2image/) library is needed. Since pdf2image depends on poppler-utils, please install it with:
+
+  ```bash
+  sudo apt-get install poppler-utils
+  ```
+
+- JIC-VQA
+
+  JIC-VQA provides only image URLs, so you need to download the images yourself. You can use the following script to prepare the JIC-VQA dataset, including the image download:
+
+  ```bash
+  python scripts/prepare_jic_vqa.py
+  ```
 
 ## License
 
 This repository is licensed under the Apache-2.0 License.
 
 ## Contribution
 
 - If you find any issues or have suggestions, please report them on the Issue tracker.
 - If you add new benchmark tasks, metrics, or VLM model inference code, or if you fix bugs, please send us a Pull Request.
 
 ### How to Add a Benchmark Task
````
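The actual JIC-VQA preparation logic lives in `scripts/prepare_jic_vqa.py`, which is not shown in this commit. As a rough standalone illustration of the download step it performs, a sketch might look like this; the function names and the hash-based caching scheme are assumptions, not the script's real code:

```python
import hashlib
from pathlib import Path
from urllib.request import urlretrieve

def url_to_filename(url: str) -> str:
    """Derive a stable, collision-resistant local filename from an image URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".jpg"

def download_images(urls: list[str], out_dir: str = "images") -> list[Path]:
    """Fetch each URL once, skipping files that already exist on disk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in urls:
        dest = out / url_to_filename(url)
        if not dest.exists():
            urlretrieve(url, dest)  # network call; no retries in this sketch
        paths.append(dest)
    return paths
```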
```diff
@@ -191,7 +175,7 @@ Please reference the code in [src/eval_mm/metrics](https://github.com/llm-jp/llm
 
 ### How to Add Inference Code for a VLM Model
 Inference code for VLM models is defined in the `VLM` class.
-Please reference [examples/base_vlm](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/examples/base_vlm.py) and implement your `VLM` class. You’ll need a `generate()` method to produce output text from images and prompts.
+Please reference [examples/base_vlm](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/examples/base_vlm.py) and implement your `VLM` class. You’ll need a `generate()` method that outputs text given images and a text prompt.
 
 ### How to Add Dependencies
```
````diff
@@ -206,6 +190,18 @@ uv run ruff format src
 uv run ruff check --fix src
 ```
 
+### Testing
+
+You can test the task classes and metric classes with the following command:
+```bash
+bash test.sh
+```
+You can also test each model's inference code with the following command:
+```bash
+bash test_model.sh
+```
+
 ### How to Release to PyPI
 
 ```
````

eval_all.sh

Lines changed: 7 additions & 5 deletions
```diff
@@ -15,16 +15,18 @@ declare -A MODEL_GROUP_MAP=(
   ["Qwen/Qwen2-VL-7B-Instruct"]="normal"
   ["OpenGVLab/InternVL2-26B"]="normal"
   ["Qwen/Qwen2-VL-72B-Instruct"]="normal"
+  ["Qwen/Qwen2.5-VL-7B-Instruct"]="normal"
+  ["Qwen/Qwen2.5-VL-72B-Instruct"]="normal"
   ["gpt-4o-2024-05-13"]="normal"
   ["mistralai/Pixtral-12B-2409"]="pixtral"
   ["llm-jp/llm-jp-3-vila-14b"]="vilaja"
   ["Efficient-Large-Model/VILA1.5-13b"]="vilaja"
   ["SakanaAI/Llama-3-EvoVLM-JP-v2"]="evovlm"
+  ["google/gemma-3-12b-it"]="gemma"
+  ["sbintuitions/sarashina2-vision-8b"]="sarashina"
+  ["sbintuitions/sarashina2-vision-14b"]="sarashina"
 )
 
-model_name="stabilityai/japanese-instructblip-alpha"
-echo "Model group: ${MODEL_GROUP_MAP[$model_name]}"
-
 # Task list
 declare -a task_list=(
   "japanese-heron-bench"
```
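The model-to-group mapping added above is a plain Bash associative array; a lookup works as in this small sketch, which reproduces only two of the map's entries for illustration:

```shell
#!/usr/bin/env bash
# Look up the dependency group for a model id (requires bash 4+).
declare -A MODEL_GROUP_MAP=(
  ["Qwen/Qwen2.5-VL-7B-Instruct"]="normal"
  ["google/gemma-3-12b-it"]="gemma"
)
model="google/gemma-3-12b-it"
group="${MODEL_GROUP_MAP[$model]}"
echo "$group"
```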
```diff
@@ -45,11 +47,11 @@ declare -A METRIC_MAP=(
   ["ja-vlm-bench-in-the-wild"]="llm_as_a_judge,rougel"
   ["ja-vg-vqa-500"]="llm_as_a_judge,rougel"
   ["jmmmu"]="jmmmu"
-  ["ja-multi-image-vqa"]="rougel"
+  ["ja-multi-image-vqa"]="llm_as_a_judge,rougel"
   ["jdocqa"]="jdocqa,llm_as_a_judge"
   ["mmmu"]="mmmu"
   ["llava-bench-in-the-wild"]="llm_as_a_judge,rougel"
-  ["jic-vqa"]="jic-vqa"
+  ["jic-vqa"]="jic_vqa"
   ["mecha-ja"]="mecha-ja"
 )
```

examples/EvoVLM_JP_v1_7B.py

Lines changed: 4 additions & 8 deletions
```diff
@@ -16,13 +16,9 @@ def __init__(self, model_id: str = "SakanaAI/EvoVLM-JP-v1-7B") -> None:
         self.model.to(self.device)
 
     def generate(
-        self, image, text: str, gen_kwargs: GenerationConfig = GenerationConfig()
-    ):
-        text = text.replace("<image>", "")
-        if isinstance(image, list):
-            text = "<image>" * len(image) + f"{text}"
-        else:
-            text = f"<image>{text}"
+        self, images, text: str, gen_kwargs: GenerationConfig = GenerationConfig()
+    ) -> str:
+        text = "<image>" * len(images) + f"{text}"
 
         messages = [
             {
```
```diff
@@ -31,7 +27,7 @@ def generate(
             },
             {"role": "user", "content": text},
         ]
-        inputs = self.processor.image_processor(images=image, return_tensors="pt")
+        inputs = self.processor.image_processor(images=images, return_tensors="pt")
         inputs["input_ids"] = self.processor.tokenizer.apply_chat_template(
             messages, return_tensors="pt"
         )
```

examples/GPT_4o.py

Lines changed: 22 additions & 44 deletions
```diff
@@ -23,53 +23,31 @@ def __init__(self, model_id: str = "gpt-4o-2024-05-13") -> None:
         )
 
     def generate(
-        self, image, text: str, gen_kwargs: GenerationConfig = GenerationConfig()
-    ):
-        if "<image>" in text:
-            text = text.replace("<image>", "")
+        self, images, text: str, gen_kwargs: GenerationConfig = GenerationConfig()
+    ) -> str:
         message = []
-        if isinstance(image, list):
-            image_base64_list = [encode_image_to_base64(img) for img in image]
-            message_base = {
-                "role": "user",
-                "content": [
-                    {
-                        "type": "text",
-                        "text": text,
-                    },
-                ],
-            }
-            for image_base64 in image_base64_list:
-                message_base["content"].append(
-                    {
-                        "type": "image_url",
-                        "image_url": {
-                            "url": f"data:image/jpeg;base64,{image_base64}",
-                            "detail": "auto",
-                        },
-                    }
-                )
-            message = [message_base]
-        else:
-            image_base64 = encode_image_to_base64(image)
-            message = [
+        image_base64_list = [encode_image_to_base64(img) for img in images]
+        message_base = {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": text,
+                },
+            ],
+        }
+        for image_base64 in image_base64_list:
+            message_base["content"].append(
                 {
-                    "role": "user",
-                    "content": [
-                        {
-                            "type": "text",
-                            "text": text,
-                        },
-                        {
-                            "type": "image_url",
-                            "image_url": {
-                                "url": f"data:image/jpeg;base64,{image_base64}",
-                                "detail": "auto",
-                            },
-                        },
-                    ],
+                    "type": "image_url",
+                    "image_url": {
+                        "url": f"data:image/jpeg;base64,{image_base64}",
+                        "detail": "auto",
+                    },
                 }
-            ]
+            )
+        message = [message_base]
+
         try:
             response = self.client.chat.completions.create(
                 model=self.model_id,
```

examples/InternVL2.py

Lines changed: 14 additions & 25 deletions
```diff
@@ -136,36 +136,25 @@ def __init__(self, model_id: str = "OpenGVLab/InternVL2-8B") -> None:
         )
 
     def generate(
-        self, image, text: str, gen_kwargs: GenerationConfig = GenerationConfig()
-    ):
+        self, images, text: str, gen_kwargs: GenerationConfig = GenerationConfig()
+    ) -> str:
         text = text.replace("<image>", "")
         if "<image>" not in text:
-            if isinstance(image, list):
-                image_tokens = ["<image>"] * len(image)
-                image_tokens = " ".join(image_tokens)
-                text = f"{image_tokens}\n{text}"
-            else:
-                text = f"<image>\n{text}"
-        if isinstance(image, list):
-            pixel_values_list = []
-            for img in image:
-                pixel_values = (
-                    load_image(img, max_num=12)
-                    .to(self.model.device)
-                    .to(self.model.dtype)
-                )
-                pixel_values_list.append(pixel_values)
-            num_patches_list = [
-                pixel_values.size(0) for pixel_values in pixel_values_list
-            ]
-            pixel_values = torch.cat(pixel_values_list, dim=0)
-
-        else:
-            num_patches_list = None
+            image_tokens = ["<image>"] * len(images)
+            image_tokens = " ".join(image_tokens)
+            text = f"{image_tokens}\n{text}"
+
+        pixel_values_list = []
+        for img in images:
             pixel_values = (
-                load_image(image, max_num=12).to(self.model.device).to(self.model.dtype)
+                load_image(img, max_num=12).to(self.model.device).to(self.model.dtype)
             )
+            pixel_values_list.append(pixel_values)
+        num_patches_list = [pixel_values.size(0) for pixel_values in pixel_values_list]
+        pixel_values = torch.cat(pixel_values_list, dim=0)
+
         import copy
+
         generation_config = copy.deepcopy(gen_kwargs.__dict__)
         generation_config.pop("use_cache")
```
