Conversation

@oscarqjh (Contributor) commented Jan 7, 2026

Before you open a pull-request, please check if a similar issue already exists or has been closed before.

When you open a pull-request, please be sure to include the following

  • A descriptive title: [xxx] XXXX
  • A detailed description

If you encounter lint warnings, you can use the following commands to reformat the code.

pip install pre-commit
pre-commit install
pre-commit run --all-files

Ask for review

Once you are comfortable with your PR, feel free to @-mention one of the contributors to request a review:

General: @Luodian @kcz358 @pufanyi
Audio: @pbcong @ngquangtrung57

Thank you for your contributions!

@oscarqjh (Contributor, Author) commented Jan 7, 2026

@pufanyi

@Luodian (Contributor) commented Jan 7, 2026

@claude

@claude (bot) commented Jan 7, 2026

Claude finished @Luodian's task in 4m 46s


PR Review: VSIBench debiased & pruned

Tasks

  • Read repository guidelines (CLAUDE.md)
  • Review changed files and refactoring pattern
  • Analyze code structure against codebase standards
  • Provide comprehensive feedback

Summary

This PR refactors the VSIBench task configuration to follow the standard task variant pattern used throughout the lmms-eval codebase. The refactoring extracts common configuration into a _default_template_yaml file and creates two new task variants: vsibench_debiased and vsibench_pruned.

Pattern Compliance ✅

The refactoring follows the exact same pattern used by other tasks in the codebase:

Similar Examples:

  • egothink: Has _default_template_yaml with 12 variants (activity, affordance, assistance, etc.)
  • refcoco: Has _default_template_bbox_yaml with multiple split variants (val, test, testA, testB)
  • where2place: Has _default_template_yaml included by variant files
  • open_asr: Has multiple dataset variants (tedlium, ami, librispeech, etc.)

Code Structure Review

Based on the file changes and comparing with the current vsibench.yaml:

Current Structure (41 lines):

  • All configuration in a single file
  • No variants available

New Structure (Expected):

vsibench/
├── _default_template_yaml        # ~35 lines of common config
├── vsibench.yaml                 # ~5 lines (task + include)
├── vsibench_debiased.yaml        # ~8 lines (task + split + include)
└── vsibench_pruned.yaml          # ~8 lines (task + split + include)

This is excellent refactoring that:

  1. ✅ Eliminates duplication (DRY principle from CLAUDE.md)
  2. ✅ Makes adding future variants trivial
  3. ✅ Follows established codebase patterns
  4. ✅ Maintains backward compatibility for the base vsibench task

Key Review Points

1. File Naming Convention ✅

  • _default_template_yaml (no .yaml extension) - Matches pattern from other tasks
  • Variant files use descriptive suffixes (_debiased, _pruned)

2. Expected Content Structure

The _default_template_yaml should contain:

  • dataset_path: nyu-visionx/VSI-Bench
  • dataset_kwargs (token, cache_dir, video)
  • output_type: generate_until
  • All function references (!function utils.*)
  • generation_kwargs
  • process_results and metric_list
  • lmms_eval_specific_kwargs
  • metadata
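Putting those bullets together, the shared template would plausibly look like the skeleton below. This is a sketch, not the actual file contents: the specific function names (utils.vsibench_doc_to_visual, etc.), the metric name, and the generation settings are illustrative assumptions; only the dataset path, the dataset_kwargs keys, and the output type are taken from the review above.

```yaml
# _default_template_yaml -- shared config included by all vsibench variants (sketch)
dataset_path: nyu-visionx/VSI-Bench
dataset_kwargs:
  token: True
  video: True
  cache_dir: vsibench          # utils.py reads this key at load time
output_type: generate_until
# function names below are illustrative placeholders
doc_to_visual: !function utils.vsibench_doc_to_visual
doc_to_text: !function utils.vsibench_doc_to_text
process_results: !function utils.vsibench_process_results
generation_kwargs:
  max_new_tokens: 16
metric_list:
  - metric: vsibench_score
    aggregation: !function utils.vsibench_aggregate_results
    higher_is_better: true
```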

The variant files should contain:

  • task: vsibench_debiased or task: vsibench_pruned
  • test_split: (likely "debiased" or "pruned" to select dataset subset)
  • include: _default_template_yaml
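Under that structure, each variant file could be as small as the following sketch. The test_split value is an unconfirmed assumption (see the split-naming question in point 4):

```yaml
# vsibench_debiased.yaml -- sketch; the actual split/config name needs confirmation
include: _default_template_yaml
task: vsibench_debiased
test_split: debiased
```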

3. Potential Issues to Check

⚠️ Important: utils.py Line 37 Hardcoded Reference

In lmms_eval/tasks/vsibench/utils.py:37, there's a hardcoded reference:

with open(Path(__file__).parent / "vsibench.yaml", "r") as f:

This will break if the common configuration is moved to _default_template_yaml. The code needs to be updated to:

# Option 1: Read from the default template
with open(Path(__file__).parent / "_default_template_yaml", "r") as f:

# Option 2: Keep reading vsibench.yaml but ensure it has dataset_kwargs
# (if vsibench.yaml still includes dataset_kwargs or inherits it)

Action Required: Fix the hardcoded vsibench.yaml reference.
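A third option is to make the lookup defensive rather than hardcoded. The helper below is a hypothetical sketch (resolve_config_path is not a function in the repo): it prefers the shared template when present and falls back to the task yaml, so utils.py keeps working whichever file ends up holding dataset_kwargs.

```python
from pathlib import Path


def resolve_config_path(task_dir: Path) -> Path:
    """Return the first existing task config, preferring the shared template.

    Hypothetical helper -- the real utils.py hardcodes vsibench.yaml.
    """
    for name in ("_default_template_yaml", "vsibench.yaml"):
        candidate = task_dir / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"no vsibench config found in {task_dir}")
```

utils.py would then call `resolve_config_path(Path(__file__).parent)` instead of naming vsibench.yaml directly.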

4. Dataset Split Naming

Question for PR Author (@oscarqjh):

  • What are the actual test_split values for debiased and pruned?
  • Are these named splits in the HuggingFace dataset (nyu-visionx/VSI-Bench)?
  • Or are they dataset_name values (similar to how open_asr uses dataset_name)?

The pattern could be either:

# Pattern A: Using test_split (like refcoco)
test_split: debiased
# Pattern B: Using dataset_name (like open_asr, egothink)
dataset_name: debiased
test_split: test

Please clarify which pattern is used for the VSI-Bench dataset structure.

5. Pre-commit Checks

From CLAUDE.md, please ensure you've run:

uv run ruff format .
uv run ruff check . --fix

Also verify:

  • ✅ Line length ≤ 88 chars
  • ✅ YAML formatting (Prettier)
  • ✅ No syntax errors in YAML files

Recommendations

  1. Fix the utils.py hardcoded path (see Potential Issue 3 above) - This is critical
  2. Add a comment in _default_template_yaml explaining it's a shared template
  3. Consider adding a group file like _vsibench.yaml to list all variants (optional, see refcoco pattern):
    group: vsibench
    task:
      - vsibench
      - vsibench_debiased
      - vsibench_pruned

Testing Checklist

Before merging, please verify:

  • utils.py can correctly read dataset_kwargs.cache_dir with the new structure
  • All three tasks load correctly: vsibench, vsibench_debiased, vsibench_pruned
  • The debiased and pruned splits exist in the HuggingFace dataset
  • Run the launch command with each variant to ensure they work:
    python -m lmms_eval --model <model> --tasks vsibench --limit 2
    python -m lmms_eval --model <model> --tasks vsibench_debiased --limit 2
    python -m lmms_eval --model <model> --tasks vsibench_pruned --limit 2

Conclusion

This is a solid refactoring that follows best practices and established patterns in the codebase. The main blocker is the hardcoded vsibench.yaml reference in utils.py which must be addressed.

Once the utils.py issue is fixed and the variant split names are clarified, this PR will be ready to merge.


Note: I couldn't directly view the PR diff due to permission constraints, so this review is based on:

  • The file change summary (+56/-36 across 4 files)
  • Current vsibench.yaml content
  • Comparison with similar patterns in the codebase (egothink, refcoco, where2place, open_asr)

If I've misunderstood any aspect of the implementation, please let me know!


@pufanyi pufanyi self-requested a review January 7, 2026 14:06
@kcz358 kcz358 merged commit e5d743e into EvolvingLMMs-Lab:main Jan 8, 2026
2 checks passed