Commit 3cc915c ("updates readme"), 1 parent 0234203

7 files changed: +616 / -174 lines

docs/en/Quickstart.md: 299 additions, 0 deletions
# Quick Start

Before running the evaluation script, you need to configure the VLMs and correctly set the model paths or API keys. Then, you can use the `run.py` script with relevant arguments to perform inference and evaluation on multiple VLMs and benchmarks.

## Step 0: Installation and Key Setup

### Installation

```bash
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
```
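
As a quick sanity check that the editable install is importable (the printed path will vary with your environment):

```bash
python -c "import vlmeval; print(vlmeval.__file__)"
```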

### Setup Keys

To use API models (e.g., GPT-4V, Gemini-Pro-V) for inference, you must set up API keys first.

> **Note:** Some datasets require an LLM as a Judge and have default evaluation models configured (see *Extra Notes*). You also need to configure the corresponding APIs when evaluating these datasets.

You can place the required keys in `$VLMEvalKit/.env` or set them directly as environment variables. If you choose to create a `.env` file, the content should look like this:

```bash
# .env file, place it under $VLMEvalKit

# --- API Keys for Proprietary VLMs ---
# QwenVL APIs
DASHSCOPE_API_KEY=
# Gemini w. Google Cloud Backends
GOOGLE_API_KEY=
# OpenAI API
OPENAI_API_KEY=
OPENAI_API_BASE=
# StepAI API
STEPAI_API_KEY=
# REKA API
REKA_API_KEY=
# GLMV API
GLMV_API_KEY=
# CongRong API
CW_API_BASE=
CW_API_KEY=
# SenseNova API
SENSENOVA_API_KEY=
# Hunyuan-Vision API
HUNYUAN_SECRET_KEY=
HUNYUAN_SECRET_ID=
# LMDeploy API
LMDEPLOY_API_BASE=

# --- Evaluation Specific Settings ---
# You can set an evaluation proxy; API calls generated during the evaluation phase will go through this proxy.
EVAL_PROXY=
# You can also set keys and base URLs dedicated for evaluation by appending the _EVAL suffix:
OPENAI_API_KEY_EVAL=
OPENAI_API_BASE_EVAL=
```
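
Alternatively, if you prefer to set environment variables directly instead of using a `.env` file, export the same names in your shell (the values below are placeholders):

```bash
export OPENAI_API_KEY=sk-xxxx
export OPENAI_API_BASE=https://api.openai.com/v1/chat/completions
export GOOGLE_API_KEY=xxxx
```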

Fill in your keys where applicable. These API keys will be automatically loaded during inference and evaluation.

---

## Step 1: Configuration

**VLM Configuration:** All VLMs are configured in `vlmeval/config.py`. For some VLMs (e.g., MiniGPT-4, LLaVA-v1-7B), additional configuration is required (setting the code/model weight root directory in the config file).

When evaluating, you should use the model name specified in `supported_VLM` in `vlmeval/config.py`. Ensure you can successfully run inference with the VLM before starting the evaluation.

**Check Command:**

```bash
vlmutil check {MODEL_NAME}
```
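
For example, to check one of the open-source models used later in this guide:

```bash
vlmutil check qwen_chat
```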

---

## Step 2: Evaluation

We use `run.py` for evaluation. You can use `$VLMEvalKit/run.py` or create a soft link to the script.

### Basic Arguments

* `--data` (list[str]): Set the dataset names supported in VLMEvalKit (refer to `vlmeval/dataset/__init__.py` or use `vlmutil dlist all` to check).
* `--model` (list[str]): Set the VLM names supported in VLMEvalKit (defined in `supported_VLM` in `vlmeval/config.py`).
* `--mode` (str, default `'all'`): Running mode, choices are `['all', 'infer', 'eval']`.
    * `"all"`: Perform both inference and evaluation.
    * `"infer"`: Perform inference only.
    * `"eval"`: Perform evaluation only.
* `--api-nproc` (int, default 4): The number of threads for API calling.
* `--work-dir` (str, default `'.'`): The directory to save the results.
* `--config` (str): Path to a configuration JSON file. This is a more fine-grained configuration method than specifying data and model directly (**recommended**). See *ConfigSystem* for details.

### Example Commands

You can use `python` or `torchrun` to run the script.

#### 1. Using python

Instantiates only one VLM, which may use multiple GPUs. Recommended for evaluating very large VLMs (e.g., IDEFICS-80B-Instruct).

```bash
# Inference and Evaluation on MaScQA and ChemBench using IDEFICS-80B-Instruct
python run.py --data MaScQA ChemBench --model idefics_80b_instruct --verbose

# Inference only on MaScQA and ChemBench using IDEFICS-80B-Instruct
python run.py --data MaScQA ChemBench --model idefics_80b_instruct --verbose --mode infer
```

#### 2. Using torchrun

Instantiates one VLM instance per GPU. This speeds up inference but is only suitable for VLMs that consume less GPU memory.

```bash
# Inference and Eval on MaScQA and ChemBench using IDEFICS-9B-Instruct, Qwen-VL-Chat, mPLUG-Owl2
# On a node with 8 GPUs
torchrun --nproc-per-node=8 run.py --data MaScQA ChemBench --model idefics_9b_instruct qwen_chat mPLUG-Owl2 --verbose

# On MaScQA using Qwen-VL-Chat. On a node with 2 GPUs
torchrun --nproc-per-node=2 run.py --data MaScQA --model qwen_chat --verbose
```

#### 3. API Model Evaluation

```bash
# Inference and Eval on SFE using GPT-4o
# Set API concurrency to 32. Requires OpenAI base URL and Key.
# Note: SFE evaluation requires OpenAI configuration by default.
python run.py --data SFE --model GPT4o --verbose --api-nproc 32
```

#### 4. Using Config File

```bash
# Evaluate using a config file. Do not use --data or --model in this case.
python run.py --config config.json
```
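
For reference, a minimal sketch of what `config.json` might contain. The exact schema is documented in *ConfigSystem*; the `class` names and fields below are illustrative assumptions only:

```json
{
    "model": {
        "GPT4o": {
            "class": "GPT4V",
            "model": "gpt-4o"
        }
    },
    "data": {
        "MaScQA": {
            "class": "ImageMCQDataset",
            "dataset": "MaScQA"
        }
    }
}
```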

**Results:** Evaluation results will be printed as logs. Additionally, result files will be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending in `.csv` contain the evaluation metrics.

---

## Extra Settings

### Additional Arguments

* `--judge` (str): Specify the evaluation model for datasets that require model-based evaluation.
    * If not specified, the configured default model will be used.
    * The model can be a VLM supported in VLMEvalKit or a custom model.
* `--judge-args` (str): Arguments for the judge model, passed as a JSON string (see the example command after this list).
    * You can pass parameters like `temperature` and `max_tokens` when specifying the judge via `--judge`.
    * The specific arguments depend on the model initialization class (e.g., `scieval.api.gpt.OpenAIWrapper`).
    * You can specify the instantiation class via the `class` argument (e.g., `OpenAIWrapper` or `Claude_Wrapper`).
    * You can also specify the model attribute here, but it has lower priority than the model specified by `--judge`.
    * *Some datasets require unique evaluation parameter settings; see Extra Notes below.*
* `--reuse` (bool, default `false`): Reuse previous results.
* `--ignore` (bool, default `false`):
    * By default (`false`), when loading old inference results, if failed items (exceptions) are found, the program will rerun them.
    * If set to `true`, failed items will be ignored, and only successful ones will be evaluated.
* `--fail-fast` (bool, default `false`):
    * If enabled, the program will stop immediately upon encountering an exception during inference, instead of writing the exception to the result file.
    * Effective only for API inference.
* `--ignore-patterns` (list[str]):
    * Used with `--fail-fast`.
    * Scenario: You enabled fail-fast but want to ignore specific non-fatal errors (e.g., "content policy violation").
    * Set this to a list of string patterns. Exceptions containing these patterns will be recorded as results instead of crashing the program.
    * *Some common safety policy violation patterns are configured by default.*
* `--stream` (bool, default `false`):
    * Enable streaming output for the model.
    * Effective only for API inference.
    * Highly recommended for slow-responding models to prevent HTTP connection timeouts.
    * *Tip:* When using `--config`, you can also configure this per model in the config file, which takes precedence over the command line.
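
As an illustration, a command that combines several of these flags; the judge model name and generation parameters below are placeholders, not configured defaults:

```bash
python run.py --data SFE --model GPT4o --verbose \
    --judge gpt-4o-mini \
    --judge-args '{"temperature": 0, "max_tokens": 2048}' \
    --fail-fast --ignore-patterns "content policy violation" \
    --reuse
```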

---

## Extra Notes

### Special Dataset Configurations

Some datasets have specific requirements during evaluation:

* **Clima_QA:**
    * Does not calculate the FA score by default; enable it via `--judge-args '{"use_fa": true}'` (see the example command after this list).
    * Requires an LLM for evaluation (GPT-4 by default).
    * You can specify the judge model via `--judge`, but it must follow the OpenAI format and have a base URL/key configured.
* **PHYSICS:**
    * Uses a model-based evaluation that is incompatible with the framework's standard model access.
    * If you want to use a model other than the default GPT-4o, you must specify `base_url` and `api_key` separately (they default to `OPENAI_API_KEY` and `OPENAI_API_BASE` from the environment).
* **AstroVisBench:**
    * **Environment:** You must download the dependencies following the official guide and set the `AstroVisBench_Env` environment variable.
    * **Python Env:** Due to complex dependencies, it is recommended to create a separate environment, install the VLMEvalKit dependencies, and then install the official dependencies to avoid conflicts.
    * **Concurrency:** The default concurrency is 4. It can be changed via `--judge-args '{"max_workers": <nums>}'`.
    * **Judge Model:** Requires Claude 3.5 Sonnet for evaluation. Ensure `ANTHROPIC_API_KEY` is set.
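
For example, a command for Clima_QA with the FA score enabled (assuming the dataset identifier matches the name above; the judge defaults to GPT-4, so the OpenAI key and base URL must be configured):

```bash
python run.py --data Clima_QA --model GPT4o --verbose --judge-args '{"use_fa": true}'
```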

### Default Judge Models

The following datasets use specific models as default Judges:

| Dataset Name | Default Judge | Note |
| :--- | :--- | :--- |
| **SFE** | `gpt-4o-1120` | |
| **EarthSE** | `gpt-4o-1120` | |
| **ResearchbenchGenerate** | `gpt-4o-mini` | |
| **TRQA** | `chatgpt-0125` | Used only if rule parsing fails |
| **MaScQA** | `chatgpt-0125` | Used only if rule parsing fails |

---

## FAQ

### Building Input Prompt: `build_prompt()`

If the model output for a benchmark does not match expectations, it might be due to incorrect prompt construction.

In VLMEvalKit, each dataset class has a `build_prompt()` function. For example, `ImageMCQDataset.build_prompt()` combines hint, question, and options into a standard format:

```text
HINT
QUESTION
Options:
A. Option A
B. Option B
···
Please select the correct answer from the options above.
```

VLMEvalKit also supports **Model-Level custom prompt building** via `model.build_prompt()`.

* **Priority:** `model.build_prompt()` overrides `dataset.build_prompt()`.

**Custom `use_custom_prompt()`:**

You can define `model.use_custom_prompt()` to decide when to use the model-specific prompt logic:

```python
def use_custom_prompt(self, dataset: str) -> bool:
    from vlmeval.dataset import DATASET_TYPE
    from vlmeval.smp import listinstr
    dataset_type = DATASET_TYPE(dataset, default=None)

    # Custom prompts are disabled for this model instance
    if not self._use_custom_prompt:
        return False
    # Use the model-specific prompt for MMVet ...
    if listinstr(['MMVet'], dataset):
        return True
    # ... and for all multiple-choice (MCQ) datasets
    if dataset_type == 'MCQ':
        return True
    return False
```
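
For completeness, a minimal sketch of a matching model-level `build_prompt()`. It assumes the interleaved message format (dicts with `type` and `value`) used by VLMEvalKit datasets and the `dump_image` helper; the exact fields of `line` (e.g., `question`) depend on the dataset TSV:

```python
def build_prompt(self, line, dataset=None):
    # Only called when use_custom_prompt(dataset) returned True
    assert self.use_custom_prompt(dataset)
    # Dump the image(s) of this sample to local files (assumed helper on the base model class)
    tgt_path = self.dump_image(line, dataset)
    # Assemble a simple text prompt; the 'question' field is an assumption about the TSV schema
    prompt = line['question'] + '\nAnswer the question directly.'
    # Interleaved message list: images first, then the text prompt
    msgs = [dict(type='image', value=p) for p in tgt_path]
    msgs.append(dict(type='text', value=prompt))
    return msgs
```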

### Model Splitting & GPU Allocation

VLMEvalKit supports automatic GPU resource division for the `lmdeploy` and `transformers` backends.

* **Python:** Defaults to all visible GPUs. Use `CUDA_VISIBLE_DEVICES` to restrict.
* **Torchrun:**
    * GPUs per instance = $\lfloor N_{GPU} / N_{PROC} \rfloor$.
    * $N_{PROC}$: Process count from `--nproc-per-node`.
    * $N_{GPU}$: Count of GPUs in `CUDA_VISIBLE_DEVICES` (or all GPUs if unset).

**Example (8-GPU Node):**

```bash
# 2 instances, 4 GPUs each
torchrun --nproc-per-node=2 run.py --data MaScQA --model InternVL3-78B

# 1 instance, 8 GPUs
python run.py --data MaScQA --model InternVL3-78B

# 3 instances, 2 GPUs each (GPUs 0 and 7 unused)
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6 torchrun --nproc-per-node=3 run.py --data MaScQA --model InternVL3-38B
```

> **Note:** This does not apply to the `vllm` backend. For `vllm`, use the python command, which uses all visible GPUs by default.

### Deploying Local LLM as Judge

You can use LMDeploy to serve a local LLM as a judge replacement for OpenAI.

**1. Install**

```bash
pip install lmdeploy openai
```

**2. Serve (e.g., internlm2-chat-1.8b)**

```bash
lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333
```

**3. Get Model ID**

```python
from openai import OpenAI
client = OpenAI(api_key='sk-123456', base_url="http://0.0.0.0:23333/v1")
print(client.models.list().data[0].id)
```

**4. Configure Env (in .env)**

```bash
OPENAI_API_KEY=sk-123456
OPENAI_API_BASE=http://0.0.0.0:23333/v1/chat/completions
LOCAL_LLM=<model_ID_you_got>
```

**5. Run Evaluation**

Execute `run.py` as normal.
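
With `LOCAL_LLM`, `OPENAI_API_KEY`, and `OPENAI_API_BASE` pointing at the local server, datasets that use an OpenAI-style judge should route their judge calls there. For example, reusing a command from earlier:

```bash
python run.py --data MaScQA --model qwen_chat --verbose
```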
