# Quick Start

Before running the evaluation script, you need to configure the VLMs and correctly set the model paths or API keys. Then, you can use the `run.py` script with relevant arguments to perform inference and evaluation on multiple VLMs and benchmarks.

## Step 0: Installation and Key Setup

### Installation

```bash
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
```

### Setup Keys

To use API models (e.g., GPT-4v, Gemini-Pro-V) for inference, you must set up API keys first.

> **Note:** Some datasets require an LLM as a Judge and have default evaluation models configured (see *Extra Notes*). You also need to configure the corresponding APIs when evaluating these datasets.

You can place the required keys in `$VLMEvalKit/.env` or set them directly as environment variables. If you choose to create a `.env` file, the content should look like this:

```bash
# .env file, place it under $VLMEvalKit

# --- API Keys for Proprietary VLMs ---
# QwenVL APIs
DASHSCOPE_API_KEY=
# Gemini w. Google Cloud Backends
GOOGLE_API_KEY=
# OpenAI API
OPENAI_API_KEY=
OPENAI_API_BASE=
# StepAI API
STEPAI_API_KEY=
# REKA API
REKA_API_KEY=
# GLMV API
GLMV_API_KEY=
# CongRong API
CW_API_BASE=
CW_API_KEY=
# SenseNova API
SENSENOVA_API_KEY=
# Hunyuan-Vision API
HUNYUAN_SECRET_KEY=
HUNYUAN_SECRET_ID=
# LMDeploy API
LMDEPLOY_API_BASE=

# --- Evaluation Specific Settings ---
# You can set an evaluation proxy; API calls generated during the evaluation phase will go through this proxy.
EVAL_PROXY=
# You can also set keys and base URLs dedicated for evaluation by appending the _EVAL suffix:
OPENAI_API_KEY_EVAL=
OPENAI_API_BASE_EVAL=
```

Fill in your keys where applicable. These API keys will be automatically loaded during inference and evaluation.
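
If you prefer not to keep a `.env` file, the same variables can be exported in your shell before launching `run.py`. A minimal sketch (the key values below are placeholders):

```bash
# Equivalent to the .env entries above, set directly in the environment
export OPENAI_API_KEY=sk-xxxx                                        # placeholder key
export OPENAI_API_BASE=https://api.openai.com/v1/chat/completions    # or your own endpoint
export DASHSCOPE_API_KEY=sk-xxxx                                     # only needed for QwenVL APIs
python run.py --data MaScQA --model GPT4o --verbose
```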

---

## Step 1: Configuration

**VLM Configuration:** All VLMs are configured in `vlmeval/config.py`. For some VLMs (e.g., MiniGPT-4, LLaVA-v1-7B), additional configuration is required (setting the code/model weight root directory in the config file).

When evaluating, you should use the model name specified in `supported_VLM` in `vlmeval/config.py`. Ensure you can successfully run inference with the VLM before starting the evaluation.

**Check Command:**

```bash
vlmutil check {MODEL_NAME}
```
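
For example, to check one of the models used later in this guide (the model name is only an illustration; any key from `supported_VLM` works):

```bash
# Confirm that qwen_chat is correctly set up before starting a full evaluation run
vlmutil check qwen_chat
```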

---

## Step 2: Evaluation

We use `run.py` for evaluation. You can use `$VLMEvalKit/run.py` or create a soft link to the script.

### Basic Arguments

* `--data` (list[str]): Set the dataset names supported in VLMEvalKit (refer to `vlmeval/dataset/__init__.py` or use `vlmutil dlist all` to check).
* `--model` (list[str]): Set the VLM names supported in VLMEvalKit (defined in `supported_VLM` in `vlmeval/config.py`).
* `--mode` (str, default `'all'`): Running mode, choices are `['all', 'infer', 'eval']`.
  * `"all"`: Perform both inference and evaluation.
  * `"infer"`: Perform inference only.
  * `"eval"`: Perform evaluation only.
* `--api-nproc` (int, default 4): The number of threads for API calling.
* `--work-dir` (str, default `'.'`): The directory to save the results.
* `--config` (str): Path to a configuration JSON file. This is a more fine-grained configuration method compared to specifying data and model (**Recommended**). See *ConfigSystem* for details.

### Example Commands

You can use `python` or `torchrun` to run the script.

#### 1. Using python
Instantiates only one VLM, which may use multiple GPUs. Recommended for evaluating very large VLMs (e.g., IDEFICS-80B-Instruct).

```bash
# Inference and Evaluation on MaScQA and ChemBench using IDEFICS-80B-Instruct
python run.py --data MaScQA ChemBench --model idefics_80b_instruct --verbose

# Inference only on MaScQA and ChemBench using IDEFICS-80B-Instruct
python run.py --data MaScQA ChemBench --model idefics_80b_instruct --verbose --mode infer
```

#### 2. Using torchrun
Instantiates one VLM instance per GPU. This speeds up inference but is only suitable for VLMs that consume less GPU memory.

```bash
# Inference and Eval on MaScQA and ChemBench using IDEFICS-9B-Instruct, Qwen-VL-Chat, mPLUG-Owl2
# On a node with 8 GPUs
torchrun --nproc-per-node=8 run.py --data MaScQA ChemBench --model idefics_9b_instruct qwen_chat mPLUG-Owl2 --verbose

# On MaScQA using Qwen-VL-Chat. On a node with 2 GPUs
torchrun --nproc-per-node=2 run.py --data MaScQA --model qwen_chat --verbose
```

#### 3. API Model Evaluation

```bash
# Inference and Eval on SFE using GPT-4o
# Set API concurrency to 32. Requires OpenAI base URL and Key.
# Note: SFE evaluation requires OpenAI configuration by default.
python run.py --data SFE --model GPT4o --verbose --api-nproc 32
```

#### 4. Using Config File

```bash
# Evaluate using a config file. Do not use --data and --model in this case.
python run.py --config config.json
```
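
For reference, a minimal `config.json` could look like the sketch below. The authoritative schema is described in *ConfigSystem*; the `class` names, model version, and dataset entry here are illustrative assumptions rather than a definitive template:

```bash
# Write an illustrative config.json and evaluate with it
# (field values are assumptions; consult the ConfigSystem doc for the exact schema)
cat > config.json << 'EOF'
{
    "model": {
        "GPT4o": {
            "class": "GPT4V",
            "model": "gpt-4o-2024-08-06"
        }
    },
    "data": {
        "MaScQA": {
            "class": "ImageMCQDataset",
            "dataset": "MaScQA"
        }
    }
}
EOF
python run.py --config config.json --verbose
```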

**Results:** Evaluation results will be printed as logs. Additionally, result files will be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending in `.csv` contain the evaluation metrics.
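
For example, with the default `--work-dir '.'` and the GPT-4o run above, the metric files land under a per-model folder (the path is illustrative):

```bash
# Locate the metric files produced for the GPT4o run
find ./GPT4o -name "*.csv"
```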

---

## Extra Settings

### Additional Arguments

* `--judge` (str): Specify the evaluation model for datasets that require model-based evaluation.
  * If not specified, the configured default model will be used.
  * The model can be a VLM supported in VLMEvalKit or a custom model.
* `--judge-args` (str): Arguments for the judge model, in JSON string format (see the combined example after this list).
  * You can pass parameters like `temperature` and `max_tokens` when specifying the judge via `--judge`.
  * The specific args depend on the model initialization class (e.g., `scieval.api.gpt.OpenAIWrapper`).
  * You can specify the instantiation class via the `class` argument (e.g., `OpenAIWrapper` or `Claude_Wrapper`).
  * You can also specify the model attribute here, but it has lower priority than the model specified by `--judge`.
  * *Some datasets require unique evaluation parameter settings; see Extra Notes below.*
* `--reuse` (bool, default `false`): Reuse previous results.
* `--ignore` (bool, default `false`):
  * By default (`false`), when loading old inference results, if failed items (exceptions) are found, the program will rerun them.
  * If set to `true`, failed items will be ignored, and only successful ones will be evaluated.
* `--fail-fast` (bool, default `false`):
  * If enabled, the program will stop immediately upon encountering an exception during inference, instead of writing the exception to the result file.
  * Effective only for API inference.
* `--ignore-patterns` (list[str]):
  * Used together with `--fail-fast`.
  * Scenario: you enabled fail-fast but want to ignore specific non-fatal errors (e.g., "content policy violation").
  * Set this to a list of string patterns. Exceptions containing these patterns will be recorded as results instead of crashing the program.
  * *Some common safety policy violation patterns are configured by default.*
* `--stream` (bool, default `false`):
  * Enable streaming output for the model.
  * Effective only for API inference.
  * Highly recommended for slow-responding models to prevent HTTP connection timeouts.
  * *Tip:* When using `--config`, you can also configure this per model in the config file, which takes precedence over the command line.
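
As a combined illustration of these flags (the model, dataset, and judge names are only examples, and the judge-args fields assume an OpenAI-compatible judge wrapper):

```bash
# Re-run a previous evaluation, reusing finished items and using a custom judge
python run.py --data MaScQA --model GPT4o --reuse --verbose \
    --judge gpt-4o-mini \
    --judge-args '{"class": "OpenAIWrapper", "temperature": 0, "max_tokens": 2048}'
```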

---

## Extra Notes

### Special Dataset Configurations

Some datasets have specific requirements during evaluation:

* **Clima_QA:**
  * Does not calculate the FA score by default.
  * Enable it via `--judge-args '{"use_fa": true}'` (see the example after this list).
  * Requires an LLM for evaluation (the default is GPT-4).
  * You can specify the judge model via `--judge`, but it must follow the OpenAI format and have its base URL/key configured.
* **PHYSICS:**
  * Uses a model-based evaluation flow that is not compatible with the framework's standard model access.
  * If you want to use a model other than the default GPT-4o, you must specify `base_url` and `api_key` separately (defaults to `OPENAI_API_KEY` and `OPENAI_API_BASE` in the environment).
* **AstroVisBench:**
  * **Environment:** Download the dependencies following the official guide and set the `AstroVisBench_Env` environment variable.
  * **Python Env:** Due to complex dependencies, it is recommended to create a separate environment, install the VLMEvalKit dependencies, and then install the official dependencies to avoid conflicts.
  * **Concurrency:** Default concurrency is 4. It can be changed via `--judge-args '{"max_workers": <nums>}'`.
  * **Judge Model:** Requires Claude 3.5 Sonnet for evaluation. Ensure `ANTHROPIC_API_KEY` is set.
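
For instance, a Clima_QA run with FA scoring enabled and an explicit OpenAI-format judge might look like the sketch below (the dataset key and judge name are illustrative; check `vlmutil dlist all` for the exact dataset name):

```bash
# Clima_QA with FA scoring turned on and a custom OpenAI-format judge
python run.py --data Clima_QA --model GPT4o --verbose \
    --judge gpt-4o --judge-args '{"use_fa": true}'
```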

### Default Judge Models

The following datasets use specific models as default Judges:

| Dataset Name | Default Judge | Note |
| :--- | :--- | :--- |
| **SFE** | `gpt-4o-1120` | |
| **EarthSE** | `gpt-4o-1120` | |
| **ResearchbenchGenerate** | `gpt-4o-mini` | |
| **TRQA** | `chatgpt-0125` | Used only if rule parsing fails |
| **MaScQA** | `chatgpt-0125` | Used only if rule parsing fails |

---

## FAQ

### Building Input Prompt: `build_prompt()`

If the model output for a benchmark does not match expectations, it might be due to incorrect prompt construction.

In VLMEvalKit, each dataset class has a `build_prompt()` function. For example, `ImageMCQDataset.build_prompt()` combines hint, question, and options into a standard format:

```text
HINT
QUESTION
Options:
A. Option A
B. Option B
···
Please select the correct answer from the options above.
```

VLMEvalKit also supports **Model-Level custom prompt building** via `model.build_prompt()`.
* **Priority:** `model.build_prompt()` overrides `dataset.build_prompt()`.

**Custom `use_custom_prompt()`:**
You can define `model.use_custom_prompt()` to decide when to use the model-specific prompt logic:

```python
def use_custom_prompt(self, dataset: str) -> bool:
    from vlmeval.dataset import DATASET_TYPE
    from vlmeval.smp import listinstr
    dataset_type = DATASET_TYPE(dataset, default=None)

    # Respect the model-level switch first
    if not self._use_custom_prompt:
        return False
    # Use the custom prompt for specific datasets ...
    if listinstr(['MMVet'], dataset):
        return True
    # ... and for all multiple-choice datasets
    if dataset_type == 'MCQ':
        return True
    return False
```

### Model Splitting & GPU Allocation

VLMEvalKit supports automatic GPU resource division for the `lmdeploy` or `transformers` backends.

* **Python:** Defaults to all visible GPUs. Use `CUDA_VISIBLE_DEVICES` to restrict.
* **Torchrun:**
  * GPUs per instance = $N_{GPU} // N_{PROC}$.
  * $N_{PROC}$: Process count from `--nproc-per-node`.
  * $N_{GPU}$: Count of GPUs in `CUDA_VISIBLE_DEVICES` (or all visible GPUs if unset).

**Example (8 GPU Node):**

```bash
# 2 instances, 4 GPUs each
torchrun --nproc-per-node=2 run.py --data MaScQA --model InternVL3-78B

# 1 instance, 8 GPUs
python run.py --data MaScQA --model InternVL3-78B

# 3 instances, 2 GPUs each (GPUs 0 and 7 unused)
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6 torchrun --nproc-per-node=3 run.py --data MaScQA --model InternVL3-38B
```

> **Note:** This does not apply to the `vllm` backend. For `vllm`, use the python command, which uses all visible GPUs by default.

### Deploying Local LLM as Judge

You can use LMDeploy to serve a local LLM as a judge replacement for OpenAI.

**1. Install**
```bash
pip install lmdeploy openai
```

**2. Serve (e.g., internlm2-chat-1.8b)**
```bash
lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333
```

**3. Get Model ID**
```python
from openai import OpenAI
client = OpenAI(api_key='sk-123456', base_url="http://0.0.0.0:23333/v1")
print(client.models.list().data[0].id)
```

**4. Configure Env (in .env)**
```bash
OPENAI_API_KEY=sk-123456
OPENAI_API_BASE=http://0.0.0.0:23333/v1/chat/completions
LOCAL_LLM=<model_ID_you_got>
```

**5. Run Evaluation**
Execute `run.py` as normal.
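
With the `.env` entries above in place, judge calls are routed to the locally served LLM instead of OpenAI; for example (the dataset and model names are illustrative):

```bash
# Evaluation now uses the local LMDeploy server as the judge backend
python run.py --data MaScQA --model qwen_chat --verbose
```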