Commit 978eb7f

Title: Add Benchmark from "Vision-Language Models Can’t See the Obvious" (ICCV 2025) (#744)
* add salbench tasks
* Apply pre-commit formatting
* remove duplicates
* 1. Optimize salbench utils
  2. Recover qwen2.5vl example
  3. Pre-commit
1 parent 83a1f57 commit 978eb7f

16 files changed: +364 -9 lines changed

.pre-commit-config.yaml

Lines changed: 2 additions & 2 deletions
@@ -1,12 +1,12 @@
 repos:
   - repo: https://github.com/psf/black
-    rev: 23.12.1
+    rev: 25.1.0
     hooks:
       - id: black
         language_version: python3
         args: ["--line-length=240"]
   - repo: https://github.com/PyCQA/isort
-    rev: 5.13.2
+    rev: 6.0.1
     hooks:
       - id: isort
         language_version: python3

docs/current_tasks.md

Lines changed: 7 additions & 0 deletions
@@ -159,6 +159,13 @@
 - WildVision 0617(wildvision_0617)
 - WildVision 0630 (wildvision_0630)
 - [SeedBench 2 Plus](https://huggingface.co/datasets/AILab-CVC/SEED-Bench-2-plus) (seedbench_2_plus)
+- [SalBench](https://salbench.github.io/)
+  - p3
+  - p3_box
+  - p3_box_img
+  - o3
+  - o3_box
+  - o3_box_img

 ## 2. Multi-image tasks:

examples/models/qwen25vl.sh

Lines changed: 1 addition & 1 deletion
@@ -15,4 +15,4 @@ accelerate launch --num_processes=8 --main_process_port=12346 -m lmms_eval \
     --model qwen2_5_vl \
     --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_pixels=12845056,attn_implementation=flash_attention_2,interleave_visuals=False \
     --tasks mme \
-    --batch_size 1
+    --batch_size 1

lmms_eval/api/samplers.py

Lines changed: 1 addition & 3 deletions
@@ -37,9 +37,7 @@ def get_context(self, doc, num_fewshot):
     + (
         str(self.doc_to_target(doc)[0])
         if type(self.doc_to_target(doc)) is list
-        else self.doc_to_target(doc)
-        if (self.config.doc_to_choice is None or type(self.doc_to_target(doc)) is str)
-        else str(self.doc_to_choice(doc)[self.doc_to_target(doc)])
+        else self.doc_to_target(doc) if (self.config.doc_to_choice is None or type(self.doc_to_target(doc)) is str) else str(self.doc_to_choice(doc)[self.doc_to_target(doc)])
     )
     for doc in selected_docs
 ]
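
Review note: the samplers.py hunk only re-wraps a chained conditional expression, so behavior should be unchanged. The sketch below restates that expression with plain values to make the branch order explicit. `format_target`, `format_target_one_line`, `target`, and `choices` are hypothetical stand-ins for `self.doc_to_target(doc)`, `self.doc_to_choice(doc)`, and the `config.doc_to_choice is None` check; they are not part of lmms_eval.

```python
def format_target(target, choices=None):
    """Spell out the chained conditional from the hunk as explicit branches.

    `target` stands in for self.doc_to_target(doc). `choices` stands in for
    self.doc_to_choice(doc); None models config.doc_to_choice being unset.
    Both names are hypothetical and exist only for this illustration.
    """
    if type(target) is list:
        # A list target contributes its first element, as a string.
        return str(target[0])
    if choices is None or type(target) is str:
        # No choice list, or the target is already text: pass it through.
        return target
    # Otherwise the target is an index into the choice list.
    return str(choices[target])


def format_target_one_line(target, choices=None):
    # The same logic as one chained expression, the shape black produced.
    # Conditional expressions group right to left: A if c1 else (B if c2 else C).
    return str(target[0]) if type(target) is list else target if (choices is None or type(target) is str) else str(choices[target])


examples = [(["a", "b"], None), ("yes", ["x", "y"]), (1, ["x", "y"])]
for target, choices in examples:
    assert format_target(target, choices) == format_target_one_line(target, choices)
    print(format_target(target, choices))
```

Both functions agree on every example, which is what one would expect from a formatting-only change.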

lmms_eval/models/mplug_owl_video/configuration_mplug_owl.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" MplugOwl model configuration """
+"""MplugOwl model configuration"""
 import copy
 import os
 from typing import Union

lmms_eval/models/mplug_owl_video/modeling_mplug_owl.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" PyTorch MplugOwl model. """
+"""PyTorch MplugOwl model."""

 import math
 from typing import Any, Optional, Tuple, Union

lmms_eval/tasks/librispeech/cn_tn.py

Lines changed: 6 additions & 1 deletion
@@ -41,7 +41,12 @@

 FILLER_CHARS = ["呃", "啊"]

-ER_WHITELIST = "(儿女|儿子|儿孙|女儿|儿媳|妻儿|" "胎儿|婴儿|新生儿|婴幼儿|幼儿|少儿|小儿|儿歌|儿童|儿科|托儿所|孤儿|" "儿戏|儿化|台儿庄|鹿儿岛|正儿八经|吊儿郎当|生儿育女|托儿带女|养儿防老|痴儿呆女|" "佳儿佳妇|儿怜兽扰|儿无常父|儿不嫌母丑|儿行千里母担忧|儿大不由爷|苏乞儿)"
+ER_WHITELIST = (
+    "(儿女|儿子|儿孙|女儿|儿媳|妻儿|"
+    "胎儿|婴儿|新生儿|婴幼儿|幼儿|少儿|小儿|儿歌|儿童|儿科|托儿所|孤儿|"
+    "儿戏|儿化|台儿庄|鹿儿岛|正儿八经|吊儿郎当|生儿育女|托儿带女|养儿防老|痴儿呆女|"
+    "佳儿佳妇|儿怜兽扰|儿无常父|儿不嫌母丑|儿行千里母担忧|儿大不由爷|苏乞儿)"
+)
 ER_WHITELIST_PATTERN = re.compile(ER_WHITELIST)

 # 中文数字系统类型 (Chinese numeral system types)
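
Review note: the cn_tn.py hunk relies on Python joining adjacent string literals at compile time, so wrapping them in parentheses should produce the same regex source as the original one-line form. A minimal sketch with a shortened whitelist; `ONE_LINE` and `WRAPPED` are illustrative names, and the full pattern in the hunk is not reproduced here.

```python
import re

# Adjacent string literals are concatenated by the compiler, so both
# spellings below build exactly the same string.
ONE_LINE = "(儿女|儿子|" "胎儿|婴儿)"
WRAPPED = (
    "(儿女|儿子|"
    "胎儿|婴儿)"
)

assert ONE_LINE == WRAPPED == "(儿女|儿子|胎儿|婴儿)"

# The compiled pattern therefore matches the same inputs either way.
pattern = re.compile(WRAPPED)
print(bool(pattern.search("新生婴儿")))  # True: contains the whitelisted word 婴儿
print(bool(pattern.search("花儿")))  # False: 花儿 is not in the shortened whitelist
```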
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+dataset_path: salbench-vlm/salbench
+dataset_kwargs:
+  token: True
+
+test_split: test
+output_type: generate_until
+doc_to_visual: !function utils.p3o3_doc_to_visual
+doc_to_text: !function utils.p3o3_doc_to_text
+doc_to_target: "answer"
+generation_kwargs:
+  max_new_tokens: 128
+  # temperature: 0
+  # top_p: 0
+  # num_beams: 1
+  # do_sample: false
+
+process_results: !function utils.o3_process_results
+metric_list:
+  - metric: exact_match
+    aggregation: !function utils.aggregate_per_sample_score
+    higher_is_better: true
+  - metric: sample_precision
+    aggregation: !function utils.aggregate_per_sample_score
+    higher_is_better: true
+  - metric: sample_recall
+    aggregation: !function utils.aggregate_per_sample_score
+    higher_is_better: true
+  - metric: sample_f1
+    aggregation: !function utils.aggregate_per_sample_score
+    higher_is_better: true
+
+  - metric: all_cat_precision
+    aggregation: !function utils.p3_aggregate_all_category_precision
+    higher_is_better: true
+  - metric: all_cat_recall
+    aggregation: !function utils.p3_aggregate_all_category_recall
+    higher_is_better: true
+  - metric: all_cat_f1
+    aggregation: !function utils.p3_aggregate_all_category_f1
+    higher_is_better: true
+
+  - metric: orientation_precision
+    aggregation: !function utils.aggregate_per_category_precision
+    higher_is_better: true
+  - metric: orientation_recall
+    aggregation: !function utils.aggregate_per_category_recall
+    higher_is_better: true
+  - metric: orientation_f1
+    aggregation: !function utils.aggregate_per_category_f1
+    higher_is_better: true
+
+  - metric: color_precision
+    aggregation: !function utils.aggregate_per_category_precision
+    higher_is_better: true
+  - metric: color_recall
+    aggregation: !function utils.aggregate_per_category_recall
+    higher_is_better: true
+  - metric: color_f1
+    aggregation: !function utils.aggregate_per_category_f1
+    higher_is_better: true
+
+  - metric: size_precision
+    aggregation: !function utils.aggregate_per_category_precision
+    higher_is_better: true
+  - metric: size_recall
+    aggregation: !function utils.aggregate_per_category_recall
+    higher_is_better: true
+  - metric: size_f1
+    aggregation: !function utils.aggregate_per_category_f1
+    higher_is_better: true
+
+metadata:
+  - version: 0.0
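
Review note: the first four metrics in this config (`exact_match`, `sample_precision`, `sample_recall`, `sample_f1`) share one aggregation hook, which suggests each sample yields its own score that is then averaged. The commit's `utils.py` is not shown here, so the following is only a sketch of one plausible reading, assuming each answer is a set of labels. `per_sample_scores` and `mean_aggregate` are hypothetical names, not the functions referenced in the YAML.

```python
def per_sample_scores(predicted, expected):
    """Score one sample from its predicted and expected label sets.

    Assumption: answers are multi-label sets such as {"color", "size"}.
    """
    predicted, expected = set(predicted), set(expected)
    true_pos = len(predicted & expected)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(expected) if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "exact_match": float(predicted == expected),
        "sample_precision": precision,
        "sample_recall": recall,
        "sample_f1": f1,
    }


def mean_aggregate(values):
    """Average a list of per-sample values; an empty list scores 0.0."""
    return sum(values) / len(values) if values else 0.0


scores = [
    per_sample_scores({"color"}, {"color"}),
    per_sample_scores({"color", "size"}, {"size"}),
]
print(mean_aggregate([s["exact_match"] for s in scores]))  # 0.5
print(mean_aggregate([s["sample_precision"] for s in scores]))  # 0.75
```

Under this reading, a single mean-style aggregator can serve all four per-sample metrics, which would explain why the config reuses one function for them.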
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+dataset_path: salbench-vlm/salbench
+dataset_kwargs:
+  token: True
+
+test_split: test
+output_type: generate_until
+doc_to_visual: !function utils.p3o3_doc_to_visual
+doc_to_text: !function utils.p3o3_doc_to_text
+doc_to_target: "answer"
+generation_kwargs:
+  max_new_tokens: 128
+  # temperature: 0
+  # top_p: 0
+  # num_beams: 1
+  # do_sample: false
+
+process_results: !function utils.p3_process_results
+metric_list:
+  - metric: exact_match
+    aggregation: !function utils.aggregate_per_sample_score
+    higher_is_better: true
+  - metric: sample_precision
+    aggregation: !function utils.aggregate_per_sample_score
+    higher_is_better: true
+  - metric: sample_recall
+    aggregation: !function utils.aggregate_per_sample_score
+    higher_is_better: true
+  - metric: sample_f1
+    aggregation: !function utils.aggregate_per_sample_score
+    higher_is_better: true
+
+  - metric: all_cat_precision
+    aggregation: !function utils.p3_aggregate_all_category_precision
+    higher_is_better: true
+  - metric: all_cat_recall
+    aggregation: !function utils.p3_aggregate_all_category_recall
+    higher_is_better: true
+  - metric: all_cat_f1
+    aggregation: !function utils.p3_aggregate_all_category_f1
+    higher_is_better: true
+
+  - metric: orientation_precision
+    aggregation: !function utils.aggregate_per_category_precision
+    higher_is_better: true
+  - metric: orientation_recall
+    aggregation: !function utils.aggregate_per_category_recall
+    higher_is_better: true
+  - metric: orientation_f1
+    aggregation: !function utils.aggregate_per_category_f1
+    higher_is_better: true
+
+  - metric: color_precision
+    aggregation: !function utils.aggregate_per_category_precision
+    higher_is_better: true
+  - metric: color_recall
+    aggregation: !function utils.aggregate_per_category_recall
+    higher_is_better: true
+  - metric: color_f1
+    aggregation: !function utils.aggregate_per_category_f1
+    higher_is_better: true
+
+  - metric: size_precision
+    aggregation: !function utils.aggregate_per_category_precision
+    higher_is_better: true
+  - metric: size_recall
+    aggregation: !function utils.aggregate_per_category_recall
+    higher_is_better: true
+  - metric: size_f1
+    aggregation: !function utils.aggregate_per_category_f1
+    higher_is_better: true
+
+metadata:
+  - version: 0.0
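
Review note: the `orientation_*`, `color_*`, and `size_*` metrics all point at the same three per-category aggregators, so the category is presumably carried in the per-sample result rather than in the function name. Since the commit's `utils.py` is not visible in this view, the sketch below only illustrates the standard precision/recall/F1 computation for one category. `category_counts`, `precision_recall_f1`, and the sample layout are assumptions made for illustration.

```python
CATEGORIES = ("orientation", "color", "size")


def category_counts(samples, category):
    """Count true positives, false positives, and false negatives.

    Each sample is a hypothetical (predicted_labels, expected_labels) pair
    of sets; the category counts as positive when it appears in a set.
    """
    tp = fp = fn = 0
    for predicted, expected in samples:
        in_pred = category in predicted
        in_gold = category in expected
        tp += int(in_pred and in_gold)
        fp += int(in_pred and not in_gold)
        fn += int(in_gold and not in_pred)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Standard definitions, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


samples = [
    ({"color"}, {"color"}),
    ({"color", "size"}, {"size"}),
    ({"orientation"}, {"size"}),
]
for category in CATEGORIES:
    print(category, precision_recall_f1(*category_counts(samples, category)))
```

Pooling counts across samples before dividing is one design choice (a micro-style average per category); averaging per-sample ratios instead would give different numbers, and the source does not say which the benchmark uses.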

lmms_eval/tasks/salbench/o3.yaml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+dataset_name: O3
+task: "o3"
+include: _o3_default
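
Review note: `o3.yaml` holds only three keys and defers everything else to `_o3_default` through `include`. The sketch below shows one way such an include could resolve, assuming a shallow merge in which the including file's keys override the base. `resolve_include` and `base_configs` are hypothetical, the base values are borrowed from the shared config style in this commit, and the harness's real loader may behave differently.

```python
def resolve_include(task_config, base_configs):
    """Merge a task config over the base config named by its `include` key.

    `base_configs` is a hypothetical mapping from include name to an
    already-parsed dict. Keys in the task config win over base keys.
    """
    config = dict(task_config)
    include_name = config.pop("include", None)
    if include_name is None:
        return config
    # Resolve the base first so that chained includes also work.
    base = resolve_include(base_configs[include_name], base_configs)
    return {**base, **config}


base_configs = {
    "_o3_default": {
        "dataset_path": "salbench-vlm/salbench",
        "test_split": "test",
        "output_type": "generate_until",
    }
}
o3 = {"dataset_name": "O3", "task": "o3", "include": "_o3_default"}

merged = resolve_include(o3, base_configs)
print(merged["task"], merged["dataset_name"], merged["dataset_path"])
# o3 O3 salbench-vlm/salbench
```

Keeping the shared settings in one base file means each task variant listed in docs/current_tasks.md only needs to state what differs, such as `dataset_name` and `task`.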
