
Commit d9d18a8

Refactoring Task and Scorer class (#150)

* Add error message log
* Refactoring sample.py
* Add random_choice option in JMMMU and MMMU tasks
* Add Qwen2.5-VL-32B-Instruct
* Fix to generate json file that is used in github pages
* Add Action for github pages

1 parent 14d32c8 commit d9d18a8

30 files changed: +1441 −630 lines
.github/workflows/deploy-github-pages.yml

Lines changed: 57 additions & 0 deletions

```yaml
name: Deploy GitHub Pages

on:
  push:
    branches:
      - master
    paths:
      - .github/workflows/deploy-github-pages.yml
      - github_pages/**

jobs:
  build:
    runs-on: ubuntu-latest

    defaults:
      run:
        working-directory: github_pages

    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Install Node.js
        uses: actions/setup-node@v4

      - name: Install dependencies
        run: npm install

      - name: Build
        run: npm run build
        env:
          PUBLIC_URL: /llm-jp-eval-mm

      - name: Upload Pages artifact
        uses: actions/upload-pages-artifact@v3
        with:
          path: github_pages/build

  deploy:
    needs: build

    permissions:
      pages: write
      id-token: write

    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}

    runs-on: ubuntu-latest

    steps:
      - name: Deploy to GitHub Pages
        uses: actions/deploy-pages@v4
        id: deployment
```

README.md

Lines changed: 83 additions & 58 deletions
````diff
@@ -3,10 +3,9 @@
 
 [ [**Japanese**](./README_ja.md) | English ]
 
-This tool automatically evaluates Japanese multi-modal large language models across multiple datasets. It offers the following features:
+llm-jp-eval-mm automates the evaluation of multi-modal large language models (VLMs) across various datasets, mainly focusing on Japanese tasks.
 
-- Uses existing Japanese evaluation data and converts it into multi-modal text generation tasks for evaluation.
-- Calculates task-specific evaluation metrics using inference results created by users.
+This tool supports multi-modal text generation tasks and calculates task-specific evaluation metrics based on the inference results provided by users.
 
 ![What llm-jp-eval-mm provides](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/assets/teaser.png)
 
````

````diff
@@ -17,53 +16,54 @@ This tool automatically evaluates Japanese multi-modal large language models acr
 - [Getting Started](#getting-started)
 - [How to Evaluate](#how-to-evaluate)
 - [Running an Evaluation](#running-an-evaluation)
+- [Use llm-jp-eval-mm as a Library](#use-llm-jp-eval-mm-as-a-library)
 - [Leaderboard](#leaderboard)
 - [Supported Tasks](#supported-tasks)
 - [Required Libraries for Each VLM Model Inference](#required-libraries-for-each-vlm-model-inference)
 - [Benchmark-Specific Required Libraries](#benchmark-specific-required-libraries)
 - [Analyze VLMs Prediction](#analyze-vlms-prediction)
-- [License](#license)
 - [Contribution](#contribution)
 - [How to Add a Benchmark Task](#how-to-add-a-benchmark-task)
 - [How to Add a Metric](#how-to-add-a-metric)
 - [How to Add Inference Code for a VLM Model](#how-to-add-inference-code-for-a-vlm-model)
 - [How to Add Dependencies](#how-to-add-dependencies)
 - [Testing](#testing)
-- [Formatting and Linting with ruff](#formatting-and-linting-with-ruff)
-- [How to Release to PyPI](#how-to-release-to-pypi)
-- [How to Update the Website](#how-to-update-the-website)
+- [Formatting and Linting with Ruff](#formatting-and-linting-with-ruff)
+- [Releasing to PyPI](#releasing-to-pypi)
+- [Updating the Website](#updating-the-website)
 - [Acknowledgements](#acknowledgements)
 
 
 ## Getting Started
 
-You can use this tool via GitHub (Recommended).
+You can install llm-jp-eval-mm from GitHub or via PyPI.
 
+- Option 1: Clone from GitHub (Recommended)
 ```bash
 git clone git@github.com:llm-jp/llm-jp-eval-mm.git
 cd llm-jp-eval-mm
 uv sync
 ```
 
-Or you can install it via PyPI.
+- Option 2: Install via PyPI
 ```bash
 pip install eval_mm
 ```
 
-This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API. Please create a `.env` file and set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY` if you’re using Azure, or `OPENAI_API_KEY` if you’re using the OpenAI API.
+This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API.
+You need to configure the API keys in a `.env` file:
+- For Azure: `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`
+- For OpenAI: `OPENAI_API_KEY`
+
+If you're not using the LLM-as-a-judge method, you can set any value in the `.env` file to bypass the error.
 
-That’s it! You’re ready to evaluate your VLM model.
 
````
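The variables named in the new README text can be collected in a minimal `.env` sketch like the following; the values are placeholders, not working credentials, and only the backend you actually use needs to be set:

```
# Azure OpenAI backend
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_KEY=<your-azure-key>

# OpenAI API backend
OPENAI_API_KEY=<your-openai-key>
```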

````diff
 
 ## How to Evaluate
 
 ### Running an Evaluation
 
-We provide a sample code `examples/sample.py` for running an evaluation.
-
-Models listed as `examples/{model_name}.py` are supported only in terms of their inference method.
-
-If you want to run an evaluation on a new inference method or a new model, create a similar file referencing existing `examples/{model_name}.py`, and you can run the evaluation in the same way.
+To evaluate your model on a specific task, we provide an example script: `examples/sample.py`.
 
-For example, if you want to evaluate the `llava-hf/llava-1.5-7b-hf` model on japanese-heron-bench task, run the following command:
+For example, to evaluate the `llava-hf/llava-1.5-7b-hf` model on the japanese-heron-bench task, run:
 
 ```bash
 uv sync --group normal
@@ -72,11 +72,11 @@ uv run --group normal python examples/sample.py \
   --task_id japanese-heron-bench \
   --result_dir result \
   --metrics "heron-bench" \
-  --judge_model "gpt-4o-2024-05-13" \
+  --judge_model "gpt-4o-2024-11-20" \
   --overwrite
 ```
 
-The evaluation score and model outputs will be saved in the `result` directory like below:
+The evaluation results will be saved in the result directory:
 ```
 ├── japanese-heron-bench
 │   ├── llava-hf
````
````diff
@@ -87,27 +87,63 @@ The evaluation score and model outputs will be saved in the `result` directory l
 
 If you want to evaluate multiple models on multiple tasks, please check `eval_all.sh`.
 
+
+### Use llm-jp-eval-mm as a Library
+
+You can also integrate llm-jp-eval-mm into your own code. Here's an example:
+```python
+from PIL import Image
+from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig
+
+class MockVLM:
+    def generate(self, images: list[Image.Image], text: str) -> str:
+        return "宮崎駿"
+
+task = TaskRegistry.load_task("japanese-heron-bench")
+example = task.dataset[0]
+
+input_text = task.doc_to_text(example)
+images = task.doc_to_visual(example)
+reference = task.doc_to_answer(example)
+
+model = MockVLM()
+prediction = model.generate(images, input_text)
+
+scorer = ScorerRegistry.load_scorer(
+    "rougel",
+    ScorerConfig(docs=task.dataset)
+)
+result = scorer.aggregate(scorer.score([reference], [prediction]))
+print(result)
+# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})
+```
+
 ### Leaderboard
 
-You can create a leaderboard.md file by running the following command:
+To generate a leaderboard from your evaluation results, run:
 ```bash
 python scripts/make_leaderboard.py --result_dir result
 ```
 
-Table like below will be created in `leaderboard.md` file.
-| Model | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge |
-| :----------------------- | --------: | ----------: | ------------: |
-| llava-hf/llava-1.5-7b-hf | 36.9038 | 2.7 | 40.7525 |
+This will create a `leaderboard.md` file with your model performance:
+
+| Model | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge |
+| :--------------------------------------- | :-------- | :---------- | :------------ |
+| llm-jp/llm-jp-3-vila-14b | 68.03 | 4.08 | **52.4** |
+| Qwen/Qwen2.5-VL-7B-Instruct | 70.29 | 4.28 | 29.63 |
+| google/gemma-3-27b-it | 69.15 | 4.36 | 30.89 |
+| microsoft/Phi-4-multimodal-instruct | 45.52 | 3.2 | 26.8 |
+| gpt-4o-2024-11-20 | **93.7** | **4.44** | 32.2 |
 
 
 
-Official Leaderboard is [here](https://llm-jp.github.io/llm-jp-eval-mm/)
+The official leaderboard is available [here](https://llm-jp.github.io/llm-jp-eval-mm/)
 
 ## Supported Tasks
 
-Right now, the following benchmark tasks are supported:
+Currently, the following benchmark tasks are supported:
 
-Japanese Task:
+Japanese Tasks:
 - [Japanese Heron Bench](https://huggingface.co/datasets/turing-motors/Japanese-Heron-Bench)
 - [JA-VG-VQA500](https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500)
 - [JA-VLM-Bench-In-the-Wild](https://huggingface.co/datasets/SakanaAI/JA-VLM-Bench-In-the-Wild)
````
````diff
@@ -117,68 +153,60 @@ Japanese Task:
 - [JIC-VQA](https://huggingface.co/datasets/line-corporation/JIC-VQA)
 - [MECHA-ja](https://huggingface.co/datasets/llm-jp/MECHA-ja)
 
-English Task:
+English Tasks:
 - [MMMU](https://huggingface.co/datasets/MMMU/MMMU)
 - [LlaVA-Bench-In-the-Wild](https://huggingface.co/datasets/lmms-lab/llava-bench-in-the-wild)
 
 ## Required Libraries for Each VLM Model Inference
 
-Different models require different libraries.
-In this repository, we use uv’s [Dependency groups](https://docs.astral.sh/uv/concepts/projects/dependencies/#dependency-groups) to manage the libraries needed for each model.
+Each VLM model may have different dependencies.
+To manage these, llm-jp-eval-mm uses uv's dependency groups.
 
-For example, when you use `llm-jp/llm-jp-3-vila-14b`, please specify the `vilaja` group:
+For example, to use llm-jp/llm-jp-3-vila-14b, run:
 ```bash
 uv sync --group vilaja
 uv run --group vilaja python examples/VILA_ja.py
 ```
 
-For other models, please see the `eval_all.sh` script for the required group.
+Refer to eval_all.sh for a full list of model dependencies.
 
 When you add a new group, don’t forget to configure [conflict](https://docs.astral.sh/uv/concepts/projects/config/#conflicting-dependencies).
 
 ## Benchmark-Specific Required Libraries
 
 - JIC-VQA
 
-JIC-VQA only provide the image URL, so you need to download the images from the URL. You can use the following code to prepare the JIC-VQA dataset with the image download.
+For the JIC-VQA dataset, you need to download images from URLs. Use the following script to prepare the dataset:
 
 ```python
 python scripts/prepare_jic_vqa.py
 ```
 
 ## Analyze VLMs Prediction
 
-Let's analyze VLMs prediction!
+Visualize your model’s predictions with the following Streamlit app:
 ```bash
 uv run streamlit run scripts/browse_prediction.py --task_id "japanese-heron-bench" --result_dir "result"
 ```
-You can see the visualization like below.
+You will be able to see the visualized predictions, like this:
 ![Streamlit](./assets/streamlit_visualization.png)
 
 
-## License
-
-This repository is licensed under the Apache-2.0 License.
-
 ## Contribution
 
-- If you find any issues or have suggestions, please report them on the Issue.
-- If you add new benchmark tasks, metrics, or VLM model inference code, or if you fix bugs, please send us a Pull Request.
+We welcome contributions! If you encounter issues, or if you have suggestions or improvements, please open an issue or submit a pull request.
 
 ### How to Add a Benchmark Task
-Tasks are defined in the `Task` class.
-Please reference the code in [src/eval_mm/tasks](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/src/eval_mm/tasks) and implement your `Task` class. You’ll need methods to convert the dataset into a format for input to the VLM model, and methods to calculate the score.
+Refer to the `src/eval_mm/tasks` directory to implement new benchmark tasks.
 
 ### How to Add a Metric
-Metrics are defined in the `Scorer` class.
-Please reference the code in [src/eval_mm/metrics](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/src/eval_mm/metrics) and implement your `Scorer` class. You’ll need to implement a `score()` method for sample-level scoring comparing references and generated outputs, and an `aggregate()` method for population-level metric calculation.
+To add new metrics, implement them in the Scorer class. The code for existing scorers can be found in `src/eval_mm/metrics`.
````
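The README text removed in this hunk spelled out the `Scorer` contract: a `score()` method for sample-level scoring of references against generated outputs, and an `aggregate()` method for the population-level metric. A minimal standalone sketch of that contract follows; it is an illustrative exact-match scorer, not the repository's actual base class in `src/eval_mm/metrics`:

```python
# Hypothetical exact-match scorer illustrating the Scorer contract:
# score() compares each reference with the model output at sample level,
# aggregate() reduces the per-sample scores to one population-level value.
class ExactMatchScorer:
    def score(self, refs: list[str], preds: list[str]) -> list[int]:
        # 1 if the stripped strings match exactly, else 0, per sample.
        return [int(r.strip() == p.strip()) for r, p in zip(refs, preds)]

    def aggregate(self, scores: list[int]) -> float:
        # Mean of the per-sample scores; 0.0 for an empty input.
        return sum(scores) / len(scores) if scores else 0.0

scorer = ExactMatchScorer()
scores = scorer.score(["宮崎駿", "東京"], ["宮崎駿", "大阪"])
print(scorer.aggregate(scores))  # 0.5
```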
````diff
 
 ### How to Add Inference Code for a VLM Model
-Inference code for VLM models is defined in the `VLM` class.
-Please reference [examples/base_vlm](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/examples/base_vlm.py) and implement your `VLM` class. You’ll need a `generate()` method to output text given images and text inputs.
+Implement the inference code for VLM models in the VLM class. For reference, check `examples/base_vlm.py`.
 
 ### How to Add Dependencies
-
+To add a new dependency, run:
 ```
 uv add <package_name>
 uv add --group <group_name> <package_name>
````
````diff
@@ -187,32 +215,29 @@ uv add --group <group_name> <package_name>
 
 ### Testing
 
-You can test task classes and metric classes with the following command:
+Run the following commands to test the task classes and metrics and to test the VLM models:
 ```bash
 bash test.sh
-```
-You can also test each model's inference code with the following command:
-```bash
 bash test_model.sh
 ```
 
-### Formatting and Linting with ruff
+### Formatting and Linting with Ruff
 ```
 uv run ruff format src
 uv run ruff check --fix src
 ```
 
-### How to Release to PyPI
-
+### Releasing to PyPI
+To release a new version to PyPI:
 ```
 git tag -a v0.x.x -m "version 0.x.x"
 git push origin --tags
 ```
-Or you can manually create a new release on GitHub.
 
 
-### How to Update the Website
-Please refer to [github_pages/README.md](./github_pages/README.md).
+### Updating the Website
+For website updates, refer to the [github_pages/README.md](./github_pages/README.md).
+
 
 ## Acknowledgements
 - [Heron](https://github.com/turingmotors/heron): We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
````

assets/teaser.png

52.3 KB

examples/japanese_instructblip_alpha.py

Lines changed: 1 addition & 0 deletions

````diff
@@ -196,6 +196,7 @@ def generate(
         # TODO: white space return problem some times
         response = self.tokenizer.batch_decode(output, skip_special_tokens=True)
         generated_text = response[0].strip()
+        generated_text = generated_text.split("### 応答:")[-1].strip()
         return generated_text
````
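The added line strips the echoed prompt by keeping only the text after the last `### 応答:` ("response") marker. A small standalone demonstration of that one-liner, with a hypothetical helper name, shows it also behaves safely when the marker is absent, since `split` then returns the whole string:

```python
def extract_response(generated_text: str) -> str:
    # Keep only the text after the last "### 応答:" marker;
    # if the marker is absent, split() returns [generated_text],
    # so the full string is preserved.
    return generated_text.split("### 応答:")[-1].strip()

print(extract_response("### 指示: 監督は誰?\n### 応答: 宮崎駿 "))  # 宮崎駿
print(extract_response("宮崎駿"))  # 宮崎駿 (no marker, unchanged)
```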

examples/model_table.py

Lines changed: 1 addition & 0 deletions

````diff
@@ -22,6 +22,7 @@
     "Qwen/Qwen2-VL-72B-Instruct": "Qwen2_VL.VLM",
     "Qwen/Qwen2.5-VL-3B-Instruct": "Qwen2_VL.VLM",
     "Qwen/Qwen2.5-VL-7B-Instruct": "Qwen2_VL.VLM",
+    "Qwen/Qwen2.5-VL-32B-Instruct": "Qwen2_VL.VLM",
     "Qwen/Qwen2.5-VL-72B-Instruct": "Qwen2_VL.VLM",
     "llm-jp/llm-jp-3-vila-14b": "VILA_ja.VLM",
     "stabilityai/japanese-instructblip-alpha": "japanese_instructblip_alpha.VLM",
````
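The table entries map a Hugging Face model ID to a `"module.Class"` string naming the example file that implements its inference. How the repository actually resolves these strings is not shown in this diff; the sketch below, with a hypothetical `get_class_path` helper, only illustrates the mapping convention:

```python
# Sketch: splitting a model-table entry into (module name, class name).
# The real loader in examples/ may differ; MODEL_TABLE here is a
# two-entry excerpt of the mapping shown in the diff.
MODEL_TABLE = {
    "Qwen/Qwen2.5-VL-32B-Instruct": "Qwen2_VL.VLM",
    "llm-jp/llm-jp-3-vila-14b": "VILA_ja.VLM",
}

def get_class_path(model_id: str) -> tuple[str, str]:
    """Split a table entry like 'Qwen2_VL.VLM' into its module and class."""
    module_name, class_name = MODEL_TABLE[model_id].rsplit(".", 1)
    return module_name, class_name

print(get_class_path("Qwen/Qwen2.5-VL-32B-Instruct"))  # ('Qwen2_VL', 'VLM')
```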

0 commit comments
