
Commit ddaa1d5

refactor inference core, unify error handling, and enhance judge flexibility
Major changes to the inference architecture and evaluation pipeline (illustrative sketches of the key mechanisms follow the list):

1. Core Architecture & Error Handling:
   - Centralized retry and exception-handling logic in `BaseAPI` and `BaseModel`.
   - Implemented a fail-fast mechanism that exits immediately on critical errors (OOM, auth failures).
   - Introduced `ignore-patterns` (the "Green Light" mechanism) to gracefully handle and record specific errors (e.g., policy violations, content filtering) as valid responses.
   - Cleaned up `generate_inner` across all API wrappers by removing redundant try-except blocks and loops.

2. Streaming & Performance:
   - Added configurable `--stream` support for the major API wrappers (OpenAI, Claude, Gemini, etc.).
   - Implemented `image_mem` support in wrappers to allow zero-IO, in-memory Base64 image passing, bypassing temporary file creation.

3. Judge & Model Initialization:
   - Refactored `build_judge_model` to support dynamic class loading, config-based routing, and fallback to OpenAI-compatible protocols.
   - Unified model-initialization logic to respect `Config > CLI` priority for parameters such as `retry`, `verbose`, and `stream`.

4. Utilities & Environment:
   - Implemented a lazy-loading proxy for heavy-dependency datasets (e.g., AstroVisBench) to resolve import hell.
   - Added an environment-variable isolation context: `_EVAL`-suffixed variables (e.g., `OPENAI_API_KEY_EVAL`) override their base settings during evaluation only.
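The error handling in item 1 boils down to one retry loop that sorts exceptions into three buckets: abort, record, or retry. A minimal sketch, assuming illustrative names (`FailFastError`, `FAIL_FAST_PATTERNS`, `IGNORE_PATTERNS`) rather than the commit's actual identifiers or pattern lists:

```python
import re
import time


class FailFastError(Exception):
    """Critical failure (OOM, auth) that should abort the whole run."""


class BaseAPI:
    # Pattern lists are hypothetical; the real ones live in the wrappers/config.
    FAIL_FAST_PATTERNS = [r'out of memory', r'401', r'invalid api key']
    IGNORE_PATTERNS = [r'content filter', r'policy violation']  # "Green Light"

    def __init__(self, retry: int = 3, wait: float = 3.0):
        self.retry = retry
        self.wait = wait

    def generate_inner(self, message):
        # Subclasses implement the raw API call; no try-except needed here.
        raise NotImplementedError

    def generate(self, message):
        for _ in range(self.retry):
            try:
                return self.generate_inner(message)
            except Exception as err:
                text = str(err).lower()
                if any(re.search(p, text) for p in self.FAIL_FAST_PATTERNS):
                    raise FailFastError(text)   # fail fast: exit immediately
                if any(re.search(p, text) for p in self.IGNORE_PATTERNS):
                    return f'[BLOCKED] {text}'  # record as a valid response
                time.sleep(self.wait)           # transient: back off, retry
        return 'Failed to obtain answer via API.'
```

Because classification happens once in `generate`, subclass `generate_inner` implementations no longer need their own try-except loops.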
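For item 2's `--stream` flag, a hedged sketch of how an OpenAI-compatible wrapper can return the same type whether or not streaming is enabled. Only the OpenAI Python SDK calls here are standard; the `chat` helper itself is hypothetical:

```python
from openai import OpenAI


def chat(client: OpenAI, model: str, messages: list, stream: bool = False) -> str:
    if not stream:
        resp = client.chat.completions.create(model=model, messages=messages)
        return resp.choices[0].message.content
    # Streaming: accumulate deltas so callers see the same str return type.
    parts = []
    for chunk in client.chat.completions.create(
            model=model, messages=messages, stream=True):
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return ''.join(parts)
```

Streaming keeps long generations from thinking models alive past gateway read timeouts, which is presumably why it is configurable per wrapper.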
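The `image_mem` path skips the write-to-temp-file round trip by encoding images to Base64 straight from memory. A minimal sketch (the helper name and placement are assumptions; only the flag name `image_mem` comes from the commit message):

```python
import base64
import io

from PIL import Image


def encode_image_in_memory(img: Image.Image, fmt: str = 'PNG') -> str:
    """Encode a PIL image to Base64 without creating a temporary file."""
    buf = io.BytesIO()
    img.save(buf, format=fmt)
    return base64.b64encode(buf.getvalue()).decode('utf-8')


# Usage: embed directly into an OpenAI-style image_url payload, e.g.
# url = f'data:image/png;base64,{encode_image_in_memory(img)}'
```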
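A sketch of the `build_judge_model` routing described in item 3: an explicit class in the config wins, anything else falls back to an OpenAI-compatible wrapper, and config values override CLI flags. The config keys and the `OpenAIWrapper` import location are assumptions, not the commit's confirmed API:

```python
import importlib


def build_judge_model(cfg: dict, **cli_kwargs):
    # Config > CLI: values from the config file override CLI-provided ones.
    for key in ('retry', 'verbose', 'stream'):
        if key in cfg:
            cli_kwargs[key] = cfg[key]
    if 'class' in cfg:
        # Dynamic class loading: the config names the module and class to use.
        module = importlib.import_module(cfg.get('module', 'scieval.api'))
        return getattr(module, cfg['class'])(**cli_kwargs)
    # Fallback: treat the judge as an OpenAI-compatible endpoint.
    from scieval.api import OpenAIWrapper  # assumed wrapper location
    return OpenAIWrapper(model=cfg['model'], api_base=cfg.get('api_base'),
                         **cli_kwargs)
```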
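Item 4's two utilities are small enough to sketch together: a lazy-import proxy so datasets with heavy dependencies (such as AstroVisBench) do not drag their imports in at package-load time, and a context manager that temporarily promotes `*_EVAL` environment variables over their base names. All names below are illustrative:

```python
import contextlib
import importlib
import os


class LazyModule:
    """Proxy that defers importing a heavy module until first attribute access."""

    def __init__(self, name: str):
        self._name = name
        self._mod = None

    def __getattr__(self, attr):
        if self._mod is None:
            self._mod = importlib.import_module(self._name)
        return getattr(self._mod, attr)


@contextlib.contextmanager
def eval_env():
    """Inside the block, e.g. OPENAI_API_KEY_EVAL shadows OPENAI_API_KEY."""
    saved = {}
    for key, value in list(os.environ.items()):
        if key.endswith('_EVAL') and len(key) > len('_EVAL'):
            base = key[:-len('_EVAL')]
            saved[base] = os.environ.get(base)
            os.environ[base] = value
    try:
        yield
    finally:  # restore the original values after evaluation
        for base, old in saved.items():
            if old is None:
                os.environ.pop(base, None)
            else:
                os.environ[base] = old
```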
1 parent 4bcce23 commit ddaa1d5

698 files changed: +1561, -3164 lines


.gitignore

Lines changed: 1 addition & 1 deletion
@@ -196,7 +196,7 @@ GPT4o_MINI/
 #apple.jpg
 #assets/LOGO.png
 #api_list.txt
-#vlmeval/gemini_tmp.py
+#scieval/gemini_tmp.py
 #run.sh
 #run_g.sh
 #tmp/

README.md

Lines changed: 3 additions & 2 deletions
@@ -36,7 +36,7 @@ English | [简体中文](/docs/zh-CN/README_zh-CN.md) | [日本語](/docs/ja/REA
 - **[2025-07-07]** Supported [**SeePhys**](https://seephys.github.io/), which is a full spectrum multimodal benchmark for evaluating physics reasoning across different knowledge levels. thanks to [**Quinn777**](https://github.com/Quinn777) 🔥🔥🔥
 - **[2025-07-02]** Supported [**OvisU1**](https://huggingface.co/AIDC-AI/Ovis-U1-3B), thanks to [**liyang-7**](https://github.com/liyang-7) 🔥🔥🔥
 - **[2025-06-16]** Supported [**PhyX**](https://phyx-bench.github.io/), a benchmark aiming to assess capacity for physics-grounded reasoning in visual scenarios. 🔥🔥🔥
-- **[2025-05-24]** To facilitate faster evaluations for large-scale or thinking models, **VLMEvalKit supports multi-node distributed inference** using **LMDeploy** (supports *InternVL Series, QwenVL Series, LLaMa4*) or **VLLM**(supports *QwenVL Series, LLaMa4*). You can activate this feature by adding the ```use_lmdeploy``` or ```use_vllm``` flag to your custom model configuration in [config.py](vlmeval/config.py) . Leverage these tools to significantly speed up your evaluation workflows 🔥🔥🔥
+- **[2025-05-24]** To facilitate faster evaluations for large-scale or thinking models, **VLMEvalKit supports multi-node distributed inference** using **LMDeploy** (supports *InternVL Series, QwenVL Series, LLaMa4*) or **VLLM**(supports *QwenVL Series, LLaMa4*). You can activate this feature by adding the ```use_lmdeploy``` or ```use_vllm``` flag to your custom model configuration in [config.py](scieval/config.py) . Leverage these tools to significantly speed up your evaluation workflows 🔥🔥🔥
 - **[2025-05-24]** Supported Models: **InternVL3 Series, Gemini-2.5-Pro, Kimi-VL, LLaMA4, NVILA, Qwen2.5-Omni, Phi4, SmolVLM2, Grok, SAIL-VL-1.5, WeThink-Qwen2.5VL-7B, Bailingmm, VLM-R1, Taichu-VLR**. Supported Benchmarks: **HLE-Bench, MMVP, MM-AlignBench, Creation-MMBench, MM-IFEval, OmniDocBench, OCR-Reasoning, EMMA, ChaXiv, MedXpertQA, Physics, MSEarthMCQ, MicroBench, MMSci, VGRP-Bench, wildDoc, TDBench, VisuLogic, CVBench, LEGO-Puzzles, Video-MMLU, QBench-Video, MME-CoT, VLM2Bench, VMCBench, MOAT, Spatial457 Benchmark**. Please refer to [**VLMEvalKit Features**](https://aicarrier.feishu.cn/wiki/Qp7wwSzQ9iK1Y6kNUJVcr6zTnPe?table=tblsdEpLieDoCxtb) for more details. Thanks to all contributors 🔥🔥🔥
 - **[2025-02-20]** Supported Models: **InternVL2.5 Series, Qwen2.5VL Series, QVQ-72B, Doubao-VL, Janus-Pro-7B, MiniCPM-o-2.6, InternVL2-MPO, LLaVA-CoT, Hunyuan-Standard-Vision, Ovis2, Valley, SAIL-VL, Ross, Long-VITA, EMU3, SmolVLM**. Supported Benchmarks: **MMMU-Pro, WeMath, 3DSRBench, LogicVista, VL-RewardBench, CC-OCR, CG-Bench, CMMMU, WorldSense**. Thanks to all contributors 🔥🔥🔥
 - **[2024-12-11]** Supported [**NaturalBench**](https://huggingface.co/datasets/BaiqiL/NaturalBench), a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with simple questions about natural imagery.
@@ -87,7 +87,8 @@ Note that some VLMs may not be able to run under certain flash-attention version
 
 ```python
 # Demo
-from vlmeval.config import supported_VLM
+from scieval.config import supported_VLM
+
 model = supported_VLM['idefics_9b_instruct']()
 # Forward Single Image
 ret = model.generate(['assets/apple.jpg', 'What is in this image?'])

docs/en/Quickstart.md

Lines changed: 2 additions & 2 deletions
@@ -96,7 +96,7 @@ torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
 # When running with `python`, only one VLM instance is instantiated, and it might use multiple GPUs (depending on its default behavior).
 # That is recommended for evaluating very large VLMs (like IDEFICS-80B-Instruct).
 
-# IDEFICS2-8B on MMBench-Video, with 8 frames as inputs and vanilla evaluation. On a node with 8 GPUs. MMBench_Video_8frame_nopack is a defined dataset setting in `vlmeval/dataset/video_dataset_config.py`.
+# IDEFICS2-8B on MMBench-Video, with 8 frames as inputs and vanilla evaluation. On a node with 8 GPUs. MMBench_Video_8frame_nopack is a defined dataset setting in `scieval/dataset/video_dataset_config.py`.
 torchrun --nproc-per-node=8 run.py --data MMBench_Video_8frame_nopack --model idefics2_8
 # GPT-4o (API model) on MMBench-Video, with 1 frame per second as inputs and pack evaluation (all questions of a video in a single query).
 python run.py --data MMBench_Video_1fps_pack --model GPT4o
@@ -131,7 +131,7 @@ Some models, such as Qwen2VL and InternVL, define extensive prompt-building meth
 
 ```python
 def use_custom_prompt(self, dataset: str) -> bool:
-    from vlmeval.dataset import DATASET_TYPE, DATASET_MODALITY
+    from scieval.dataset import DATASET_TYPE, DATASET_MODALITY
     dataset_type = DATASET_TYPE(dataset, default=None)
     if not self._use_custom_prompt:
         return False

docs/en/conf.py

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@
 author = 'VLMEvalKit Authors'
 
 # The full version, including alpha/beta/rc tags
-version_file = '../../vlmeval/__init__.py'
+version_file = '../../scieval/__init__.py'
 
 
 def get_version():

docs/ja/README_ja.md

Lines changed: 2 additions & 1 deletion
@@ -47,7 +47,8 @@ PS: 日本語の README には最新のアップデートがすべて含まれ
 
 ```python
 # デモ
-from vlmeval.config import supported_VLM
+from scieval.config import supported_VLM
+
 model = supported_VLM['idefics_9b_instruct']()
 # 単一画像のフォワード
 ret = model.generate(['assets/apple.jpg', 'この画像には何がありますか?'])

docs/zh-CN/Quickstart.md

Lines changed: 1 addition & 1 deletion
@@ -95,7 +95,7 @@ torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
 # 使用 `python` 运行时,只实例化一个 VLM,并且它可能使用多个 GPU。
 # 这推荐用于评估参数量非常大的 VLMs(如 IDEFICS-80B-Instruct)。
 
-# 在 MMBench-Video 上评测 IDEFCIS2-8B, 视频采样 8 帧作为输入,不采用 pack 模式评测. MMBench_Video_8frame_nopack 是一个定义在 `vlmeval/dataset/video_dataset_config.py` 的数据集设定.
+# 在 MMBench-Video 上评测 IDEFCIS2-8B, 视频采样 8 帧作为输入,不采用 pack 模式评测. MMBench_Video_8frame_nopack 是一个定义在 `scieval/dataset/video_dataset_config.py` 的数据集设定.
 torchrun --nproc-per-node=8 run.py --data MMBench_Video_8frame_nopack --model idefics2_8
 # 在 MMBench-Video 上评测 GPT-4o (API 模型), 视频采样每秒一帧作为输入,采用 pack 模式评测
 python run.py --data MMBench_Video_1fps_pack --model GPT4o

docs/zh-CN/README_zh-CN.md

Lines changed: 2 additions & 1 deletion
@@ -65,7 +65,8 @@
 **如何测试一个 VLM 是否可以正常运行:**
 
 ```python
-from vlmeval.config import supported_VLM
+from scieval.config import supported_VLM
+
 model = supported_VLM['idefics_9b_instruct']()
 # 前向单张图片
 ret = model.generate(['assets/apple.jpg', 'What is in this image?'])

docs/zh-CN/conf.py

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@
 author = 'VLMEvalKit Authors'
 
 # The full version, including alpha/beta/rc tags
-version_file = '../../vlmeval/__init__.py'
+version_file = '../../scieval/__init__.py'
 
 
 def get_version():

requirements.txt

Lines changed: 14 additions & 0 deletions
@@ -38,3 +38,17 @@ transformers
 typing_extensions
 validators
 xlsxwriter
+datasets
+## clima_qa
+bert_score
+tensorflow-hub
+scikit-learn
+## CMPhysBench
+wrapt_timeout_decorator
+latex2sympy2-extended
+## PHYSICS
+pylatexenc
+math-verify
+# wrapt_timeout_decorator
+## chemBench
+loguru
