Conversation

@oscarqjh (Contributor) commented Jan 7, 2026

Before you open a pull-request, please check if a similar issue already exists or has been closed before.

When you open a pull-request, please be sure to include the following

  • A descriptive title: [xxx] XXXX
  • A detailed description

If you encounter lint warnings, you can use the following commands to reformat the code.

pip install pre-commit
pre-commit install
pre-commit run --all-files

Ask for review

Once you are comfortable with your PR, feel free to @-mention one of the contributors to request a review:

General: @Luodian @kcz358 @pufanyi
Audio: @pbcong @ngquangtrung57

Thank you for your contributions!

@oscarqjh (Contributor, Author) commented Jan 7, 2026

@pufanyi

@Luodian (Contributor) commented Jan 7, 2026

@claude

@claude (bot) commented Jan 7, 2026

Claude finished @Luodian's task in 4m 46s


PR Review: VSIBench debiased & pruned

Tasks

  • Read repository guidelines (CLAUDE.md)
  • Review changed files and refactoring pattern
  • Analyze code structure against codebase standards
  • Provide comprehensive feedback

Summary

This PR refactors the VSIBench task configuration to follow the standard task variant pattern used throughout the lmms-eval codebase. The refactoring extracts common configuration into a _default_template_yaml file and creates two new task variants: vsibench_debiased and vsibench_pruned.

Pattern Compliance ✅

The refactoring follows the exact same pattern used by other tasks in the codebase:

Similar Examples:

  • egothink: Has _default_template_yaml with 12 variants (activity, affordance, assistance, etc.)
  • refcoco: Has _default_template_bbox_yaml with multiple split variants (val, test, testA, testB)
  • where2place: Has _default_template_yaml included by variant files
  • open_asr: Has multiple dataset variants (tedlium, ami, librispeech, etc.)

Code Structure Review

Based on the file changes and comparing with the current vsibench.yaml:

Current Structure (41 lines):

  • All configuration in a single file
  • No variants available

New Structure (Expected):

vsibench/
├── _default_template_yaml        # ~35 lines of common config
├── vsibench.yaml                 # ~5 lines (task + include)
├── vsibench_debiased.yaml        # ~8 lines (task + split + include)
└── vsibench_pruned.yaml          # ~8 lines (task + split + include)

This is excellent refactoring that:

  1. ✅ Eliminates duplication (DRY principle from CLAUDE.md)
  2. ✅ Makes adding future variants trivial
  3. ✅ Follows established codebase patterns
  4. ✅ Maintains backward compatibility for the base vsibench task

Key Review Points

1. File Naming Convention ✅

  • _default_template_yaml (no .yaml extension) - Matches pattern from other tasks
  • Variant files use descriptive suffixes (_debiased, _pruned)

2. Expected Content Structure

The _default_template_yaml should contain:

  • dataset_path: nyu-visionx/VSI-Bench
  • dataset_kwargs (token, cache_dir, video)
  • output_type: generate_until
  • All function references (!function utils.*)
  • generation_kwargs
  • process_results and metric_list
  • lmms_eval_specific_kwargs
  • metadata
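Putting those bullets together, the shared template would plausibly look like the skeleton below. This is a sketch, not the actual file contents: the specific function names (utils.vsibench_doc_to_visual, etc.), the metric name, and the generation settings are illustrative assumptions; only the dataset path, the dataset_kwargs keys, and the output type are taken from the review above.

```yaml
# _default_template_yaml -- shared config included by all vsibench variants (sketch)
dataset_path: nyu-visionx/VSI-Bench
dataset_kwargs:
  token: True
  video: True
  cache_dir: vsibench          # utils.py reads this key at load time
output_type: generate_until
# function names below are illustrative placeholders
doc_to_visual: !function utils.vsibench_doc_to_visual
doc_to_text: !function utils.vsibench_doc_to_text
process_results: !function utils.vsibench_process_results
generation_kwargs:
  max_new_tokens: 16
metric_list:
  - metric: vsibench_score
    aggregation: !function utils.vsibench_aggregate_results
    higher_is_better: true
```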

The variant files should contain:

  • task: vsibench_debiased or task: vsibench_pruned
  • test_split: (likely "debiased" or "pruned" to select dataset subset)
  • include: _default_template_yaml
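Under that structure, each variant file could be as small as the following sketch. The test_split value is an unconfirmed assumption (see the split-naming question in point 4):

```yaml
# vsibench_debiased.yaml -- sketch; the actual split/config name needs confirmation
include: _default_template_yaml
task: vsibench_debiased
test_split: debiased
```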

3. Potential Issues to Check

⚠️ Important: utils.py Line 37 Hardcoded Reference

In lmms_eval/tasks/vsibench/utils.py:37, there's a hardcoded reference:

with open(Path(__file__).parent / "vsibench.yaml", "r") as f:

This will break if the common configuration is moved to _default_template_yaml. The code needs to be updated to:

# Option 1: Read from the default template
with open(Path(__file__).parent / "_default_template_yaml", "r") as f:

# Option 2: Keep reading vsibench.yaml but ensure it has dataset_kwargs
# (if vsibench.yaml still includes dataset_kwargs or inherits it)

Action Required: Fix the hardcoded vsibench.yaml reference.
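A third option is to make the lookup defensive rather than hardcoded. The helper below is a hypothetical sketch (resolve_config_path is not a function in the repo): it prefers the shared template when present and falls back to the task yaml, so utils.py keeps working whichever file ends up holding dataset_kwargs.

```python
from pathlib import Path


def resolve_config_path(task_dir: Path) -> Path:
    """Return the first existing task config, preferring the shared template.

    Hypothetical helper -- the real utils.py hardcodes vsibench.yaml.
    """
    for name in ("_default_template_yaml", "vsibench.yaml"):
        candidate = task_dir / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"no vsibench config found in {task_dir}")
```

utils.py would then call `resolve_config_path(Path(__file__).parent)` instead of naming vsibench.yaml directly.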

4. Dataset Split Naming

Question for PR Author (@oscarqjh):

  • What are the actual test_split values for debiased and pruned?
  • Are these named splits in the HuggingFace dataset (nyu-visionx/VSI-Bench)?
  • Or are they dataset_name values (similar to how open_asr uses dataset_name)?

The pattern could be either:

# Pattern A: Using test_split (like refcoco)
test_split: debiased
# Pattern B: Using dataset_name (like open_asr, egothink)
dataset_name: debiased
test_split: test

Please clarify which pattern is used for the VSI-Bench dataset structure.

5. Pre-commit Checks

From CLAUDE.md, please ensure you've run:

uv run ruff format .
uv run ruff check . --fix

Also verify:

  • ✅ Line length ≤ 88 chars
  • ✅ YAML formatting (Prettier)
  • ✅ No syntax errors in YAML files

Recommendations

  1. Fix the utils.py hardcoded path (see Potential Issue 3 above) - This is critical
  2. Add a comment in _default_template_yaml explaining it's a shared template
  3. Consider adding a group file like _vsibench.yaml to list all variants (optional, see refcoco pattern):
    group: vsibench
    task:
      - vsibench
      - vsibench_debiased
      - vsibench_pruned

Testing Checklist

Before merging, please verify:

  • utils.py can correctly read dataset_kwargs.cache_dir with the new structure
  • All three tasks load correctly: vsibench, vsibench_debiased, vsibench_pruned
  • The debiased and pruned splits exist in the HuggingFace dataset
  • Run the launch command with each variant to ensure they work:
    python -m lmms_eval --model <model> --tasks vsibench --limit 2
    python -m lmms_eval --model <model> --tasks vsibench_debiased --limit 2
    python -m lmms_eval --model <model> --tasks vsibench_pruned --limit 2

Conclusion

This is a solid refactoring that follows best practices and established patterns in the codebase. The main blocker is the hardcoded vsibench.yaml reference in utils.py which must be addressed.

Once the utils.py issue is fixed and the variant split names are clarified, this PR will be ready to merge.


Note: I couldn't directly view the PR diff due to permission constraints, so this review is based on:

  • The file change summary (+56/-36 across 4 files)
  • Current vsibench.yaml content
  • Comparison with similar patterns in the codebase (egothink, refcoco, where2place, open_asr)

If I've misunderstood any aspect of the implementation, please let me know!


@pufanyi pufanyi self-requested a review January 7, 2026 14:06
@kcz358 kcz358 merged commit e5d743e into EvolvingLMMs-Lab:main Jan 8, 2026
2 checks passed