
Conversation

@dunghuynhandy (Contributor) commented Jul 9, 2025

Hi EvolvingLMMs-Lab,

This PR adds our benchmark from the paper:

"Vision-Language Models Can’t See the Obvious"
Accepted at ICCV 2025
arXiv:2507.04741

The benchmark evaluates vision-language models on their ability to recognize basic and intuitive visual relationships — a capability that current state-of-the-art models often struggle with.

We believe this dataset will be a valuable addition to lmms-eval, offering a new perspective on commonsense grounding.
Please let us know if any changes or additions are needed. We're happy to adjust to align with the repo's standards.

Best Regards,
Ngoc Dung Huynh

Summary by CodeRabbit

  • New Features

    • Added support for the new "SalBench" dataset group with six tasks: p3, p3_box, p3_box_img, o3, o3_box, and o3_box_img.
    • Introduced detailed evaluation metrics for each new SalBench task, including exact match, precision, recall, and F1 scores at both sample and category levels.
    • Added utility functions for processing and evaluating results specific to the SalBench datasets.
  • Documentation

    • Updated documentation to include the new "SalBench" dataset group and its task subsets.

@coderabbitai bot (Contributor) commented Jul 9, 2025

Walkthrough

The changes introduce a new "SalBench" dataset group with six task variants to the documentation and add corresponding YAML configuration files for each task under the evaluation framework. A utility module is also added to support data conversion, result processing, and metric aggregation for these tasks. Additionally, pre-commit hooks are updated, minor formatting fixes are applied to some modules, and a regex pattern is reformatted without functional changes.

Changes

File(s) Change Summary
docs/current_tasks.md Added "SalBench" dataset group with six new subsets to the list of current image tasks.
lmms_eval/tasks/salbench/p3.yaml
p3_box.yaml
p3_box_img.yaml
o3.yaml
o3_box.yaml
o3_box_img.yaml
Added YAML configuration files for six SalBench tasks, specifying dataset parameters, generation settings, metrics, and metadata.
lmms_eval/tasks/salbench/_p3_default
_o3_default
Added default configuration files defining dataset paths, generation parameters, result processing, and detailed metrics for p3 and o3 tasks.
lmms_eval/tasks/salbench/utils.py New utility module with functions for document conversion, result processing, and metric aggregation for SalBench tasks.
.pre-commit-config.yaml Updated Black and isort hooks to newer versions; added "black" profile argument to isort.
lmms_eval/api/samplers.py Refactored nested conditional expression in list comprehension for better formatting without logic change.
lmms_eval/models/mplug_owl_video/configuration_mplug_owl.py Removed leading/trailing spaces from module docstring.
lmms_eval/models/mplug_owl_video/modeling_mplug_owl.py Removed leading space from module docstring.
lmms_eval/tasks/librispeech/cn_tn.py Reformatted ER_WHITELIST regex pattern string into multi-line for readability without changing functionality.
examples/models/qwen25vl.sh Removed trailing newline at end of file; no functional changes made.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant SalBench Task Config (YAML)
    participant Evaluation Framework
    participant Utils Module

    User->>Evaluation Framework: Selects SalBench task (e.g., p3, o3_box_img)
    Evaluation Framework->>SalBench Task Config (YAML): Loads task configuration
    Evaluation Framework->>Utils Module: Converts document to visual/text formats
    Evaluation Framework->>Model: Sends input (image + prompt)
    Model-->>Evaluation Framework: Returns prediction
    Evaluation Framework->>Utils Module: Processes results (metrics calculation)
    Utils Module-->>Evaluation Framework: Returns aggregated metrics
    Evaluation Framework-->>User: Presents evaluation results

Suggested reviewers

  • Luodian

Poem

In the meadow of tasks, new seeds are sown,
With SalBench’s puzzles, our dataset has grown.
YAMLs sprout metrics, so precise and neat,
Utilities nibble data, making results complete.
Six new adventures for bunnies to try—
Hop, hop, hooray, let evaluations fly!
🥕🐇


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 8

🧹 Nitpick comments (4)
docs/current_tasks.md (1)

162-169: Add the task-name parentheses for consistency with the rest of the list

Every other entry follows the pattern
- [DatasetName](url) (task_name).
The newly-added SalBench block omits the parentheses, so downstream users can’t copy-paste the task id directly.

-- [SalBench](https://salbench.github.io/)
-  - p3
+- [SalBench](https://salbench.github.io/) (salbench)
+  - p3

(Feel free to add the six subset ids after the parent dataset in the same way MME, RefCOCO, etc. are documented.)

lmms_eval/tasks/salbench/p3_box.yaml (1)

71-73: metadata should be a mapping, not a single-element list

Other task YAMLs define:

metadata:
  version: 0.0

Serialising as a list (- version: 0.0) may break tools that expect a dict.
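
For illustration, here is a quick check of how the two serializations parse differently (a minimal sketch assuming PyYAML is available); any consumer that indexes metadata["version"] would fail on the list form:

```python
import yaml  # PyYAML, assumed available

# Mapping form: metadata parses as a dict, so metadata["version"] works.
as_mapping = yaml.safe_load("metadata:\n  version: 0.0")
print(as_mapping)  # {'metadata': {'version': 0.0}}

# List form: metadata parses as a list of one-key dicts,
# so metadata["version"] raises TypeError in consumers expecting a dict.
as_list = yaml.safe_load("metadata:\n  - version: 0.0")
print(as_list)  # {'metadata': [{'version': 0.0}]}
```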

lmms_eval/tasks/salbench/o3.yaml (1)

122-123: Convert metadata to dict for consistency

See earlier remark.

lmms_eval/tasks/salbench/p3_box_img.yaml (1)

72-73: metadata formatting

Use a mapping instead of a list for metadata for parity with other task files.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e24a7d8 and ce63834.

📒 Files selected for processing (8)
  • docs/current_tasks.md (1 hunks)
  • lmms_eval/tasks/salbench/o3.yaml (1 hunks)
  • lmms_eval/tasks/salbench/o3_box.yaml (1 hunks)
  • lmms_eval/tasks/salbench/o3_box_img.yaml (1 hunks)
  • lmms_eval/tasks/salbench/p3.yaml (1 hunks)
  • lmms_eval/tasks/salbench/p3_box.yaml (1 hunks)
  • lmms_eval/tasks/salbench/p3_box_img.yaml (1 hunks)
  • lmms_eval/tasks/salbench/utils.py (1 hunks)
🔇 Additional comments (5)
lmms_eval/tasks/salbench/p3_box.yaml (1)

8-9: Verify the dotted path used in !function directives

utils.p3o3_doc_to_visual assumes that the module is imported as lmms_eval.tasks.salbench.utils with an alias utils.
Most existing YAMLs in the repo use the fully-qualified path (tasks.salbench.utils.p3o3_doc_to_visual). Mis-resolution will surface only at runtime.

Please confirm the loader can locate these functions or update the path:

-doc_to_visual: !function utils.p3o3_doc_to_visual
-doc_to_text:   !function utils.p3o3_doc_to_text
+doc_to_visual: !function tasks.salbench.utils.p3o3_doc_to_visual
+doc_to_text:   !function tasks.salbench.utils.p3o3_doc_to_text
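
For context, a minimal sketch of how a !function YAML tag is typically resolved from a dotted path. This is an illustrative assumption, not lmms-eval's actual loader (which may resolve short names like utils relative to the task directory); the _function_constructor helper is hypothetical:

```python
import importlib

import yaml


def _function_constructor(loader, node):
    # Resolve a dotted "package.module.attribute" string to the attribute itself.
    dotted = loader.construct_scalar(node)
    module_path, attr_name = dotted.rsplit(".", 1)
    # Fails unless this exact module name is importable in the current environment.
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


yaml.SafeLoader.add_constructor("!function", _function_constructor)

# Under this scheme, "utils.p3o3_doc_to_visual" only resolves if a top-level module
# named "utils" is importable; a fully-qualified path avoids that ambiguity.
config = yaml.safe_load("doc_to_visual: !function lmms_eval.tasks.salbench.utils.p3o3_doc_to_visual")
```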
lmms_eval/tasks/salbench/o3_box.yaml (1)

8-9: Same loader-path concern as in p3_box.yaml

See previous comment about fully-qualified module paths.

lmms_eval/tasks/salbench/o3.yaml (1)

8-9: Confirm function module path resolution

Same note about utils.p3o3_doc_to_visual / utils.p3o3_doc_to_text.

lmms_eval/tasks/salbench/p3_box_img.yaml (1)

8-9: Check function path – potential runtime import error

Ensure the loader resolves utils.p3o3_doc_to_visual and utils.p3o3_doc_to_text correctly, or switch to the fully-qualified path.

lmms_eval/tasks/salbench/p3.yaml (1)

1-74: Configuration looks well-structured and comprehensive.

The YAML configuration follows standard patterns for evaluation tasks with proper dataset setup, generation parameters, and comprehensive metric definitions. The function references to the utils module are consistent and the metric hierarchy (sample-level, all-category, per-category) provides thorough evaluation coverage.

@Luodian (Contributor) commented Jul 9, 2025

@dunghuynhandy

Hi, thanks! I can see there are some suggestions from our codebase management agent. Could you take a look and fix them? (You may pick some to fix, no need to address all of them.)

@kcz358 (Collaborator) left a comment

Most of this looks good to merge to me. Can you run a `pre-commit run --all-files` command to fix the formatting of the code? Then I will merge this PR. Thanks!

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
.pre-commit-config.yaml (1)

9-13: isort 6 is a major release—run the hook locally once and commit autofixes

Version 6 rewrites the default section definitions and may reshuffle existing imports the first time it runs. Execute pre-commit run --all-files isort locally, inspect the diff, and commit the changes so CI doesn’t churn on every run.

lmms_eval/api/samplers.py (1)

40-40: Consider refactoring for improved readability and performance.

While the logic appears correct, this single-line conditional expression is difficult to read and maintain. The expression also calls self.doc_to_target(doc) multiple times, which could be inefficient.

Consider extracting the target value and breaking down the conditional logic:

-                        else self.doc_to_target(doc) if (self.config.doc_to_choice is None or type(self.doc_to_target(doc)) is str) else str(self.doc_to_choice(doc)[self.doc_to_target(doc)])
+                        else self._format_target(doc)

And add a helper method to the class:

def _format_target(self, doc):
    target = self.doc_to_target(doc)
    if self.config.doc_to_choice is None or type(target) is str:
        return target
    else:
        return str(self.doc_to_choice(doc)[target])

This approach improves readability, avoids repeated method calls, and makes the logic easier to test and maintain.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ce63834 and 1c1f254.

📒 Files selected for processing (6)
  • .pre-commit-config.yaml (1 hunks)
  • lmms_eval/api/samplers.py (1 hunks)
  • lmms_eval/models/mplug_owl_video/configuration_mplug_owl.py (1 hunks)
  • lmms_eval/models/mplug_owl_video/modeling_mplug_owl.py (1 hunks)
  • lmms_eval/tasks/librispeech/cn_tn.py (1 hunks)
  • lmms_eval/tasks/salbench/utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (3)
  • lmms_eval/models/mplug_owl_video/configuration_mplug_owl.py
  • lmms_eval/models/mplug_owl_video/modeling_mplug_owl.py
  • lmms_eval/tasks/librispeech/cn_tn.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • lmms_eval/tasks/salbench/utils.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
lmms_eval/api/samplers.py (1)
lmms_eval/api/task.py (4)
  • doc_to_target (378-379)
  • doc_to_target (1330-1365)
  • config (291-293)
  • doc_to_choice (1385-1405)
🔇 Additional comments (1)
.pre-commit-config.yaml (1)

3-7: Confirm Black 25.1.0 is safe for your Python version & CI images

Black 25.x drops support for Python < 3.8 and slightly tightens some formatting rules (e.g. numeric literal normalization). Double-check that the lowest Python version declared in pyproject.toml, your Docker images, and all CI jobs is ≥ 3.8; otherwise the hook will fail.

@dunghuynhandy (Contributor, Author) commented

Dear @kcz358,
I ran `pre-commit run --all-files` to fix the formatting.

@kcz358 (Collaborator) commented Jul 14, 2025

Thanks @dunghuynhandy, do you think any of the comments from the bot are useful and need fixing? I think most of its suggestions are about reusing common code so that the code is simpler.

@dunghuynhandy (Contributor, Author) commented

Thanks! I will try to make my code simpler based on the suggestions.

@Luodian (Contributor) commented Jul 14, 2025

Agree, I think we should merge some of the redundant code.

@dunghuynhandy (Contributor, Author) commented

Dear @Luodian and @kcz358,
I refactored the code by removing some redundant parts.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
lmms_eval/tasks/salbench/o3_box.yaml (1)

1-3: Trailing newline + re-check duplicate metrics in _o3_default.

  1. Add the missing LF as in the other task files.
  2. Past reviews flagged duplicated pattern_* metrics inside _o3_default. Verify they were removed; otherwise evaluations will crash with ValueError for duplicate keys.

[tag applies to both points]

lmms_eval/tasks/salbench/o3_box_img.yaml (1)

1-3: Trailing newline + duplicated metric check.

Same two points as o3_box.yaml:

 dataset_name: O3_box_img
 task: "o3_box_img"
 include: _o3_default
+
lmms_eval/tasks/salbench/utils.py (2)

31-75: Use global constants instead of hardcoded categories.

The function defines categories locally instead of using the global P3_CATEGORIES constant defined at the top of the file.

 def p3_process_results(doc, results):
     pred = {x.strip() for x in results[0].lower().strip("[").strip("]").split(",")}
     gt_ans = {x.strip() for x in doc["answer"].lower().split(",")}

     exact_match = int(pred == gt_ans)

     # Per sample
     matches = pred.intersection(gt_ans)
     # how many retrieved categories are relevant
     precision = len(matches) / (len(pred) + 1e-8)
     # how many applicable categories are retrieved
     recall = len(matches) / len(gt_ans)
     f1 = 2 * (precision * recall) / (precision + recall + 1e-8)

-    categories = [
-        "orientation",
-        "color",
-        "size",
-    ]
+    categories = P3_CATEGORIES
     cat_preds = {}

78-138: Use global constants and eliminate code duplication.

This function duplicates most of the logic from p3_process_results and uses hardcoded categories instead of the global O3_CATEGORIES constant.

Consider extracting the common logic into a shared helper function as suggested in previous reviews, and use the global constant:

 def o3_process_results(doc, results):
     pred = {x.strip() for x in results[0].lower().strip("[").strip("]").split(",")}
     gt_ans = {x.strip() for x in doc["answer"].lower().split(",")}

     exact_match = int(pred == gt_ans)

     # Per sample
     matches = pred.intersection(gt_ans)
     # how many retrieved categories are relevant
     precision = len(matches) / (len(pred) + 1e-8)
     # how many applicable categories are retrieved
     recall = len(matches) / len(gt_ans)
     f1 = 2 * (precision * recall) / (precision + recall + 1e-8)

-    categories = [
-        "orientation",
-        "color",
-        "focus",
-        "shape",
-        "size",
-        "location",
-        "pattern",
-    ]
+    categories = O3_CATEGORIES
     cat_preds = {}
🧹 Nitpick comments (5)
lmms_eval/tasks/salbench/p3.yaml (1)

1-3: Add missing newline at end-of-file (YAMLlint error).

All newly-added SalBench task YAMLs lack a trailing newline, which YAML-lint flags (new-line-at-end-of-file).
Add a single LF to keep tooling quiet and stay consistent with the rest of the repo.

 dataset_name: P3
 task: "p3"
 include: _p3_default
+
lmms_eval/tasks/salbench/p3_box.yaml (1)

1-3: Same newline issue as above.

Please append a trailing LF here as well.

 dataset_name: P3_box
 task: "p3_box"
 include: _p3_default
+
lmms_eval/tasks/salbench/p3_box_img.yaml (1)

1-3: Trailing newline.

Just add the missing LF.

 dataset_name: P3_box_img
 task: "p3_box_img"
 include: _p3_default
+
lmms_eval/tasks/salbench/o3.yaml (1)

3-3: Add missing newline at end of file.

YAML files should end with a newline character for better compatibility.

-include: _o3_default
+include: _o3_default
+
lmms_eval/tasks/salbench/utils.py (1)

145-148: Use ternary operator for cleaner code.

The if-else block can be simplified using a ternary operator.

-    if not pred:
-        pred = ""
-    else:
-        pred = pred[0]
+    pred = "" if not pred else pred[0]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c1f254 and c542634.

📒 Files selected for processing (10)
  • examples/models/qwen25vl.sh (1 hunks)
  • lmms_eval/tasks/salbench/_o3_default (1 hunks)
  • lmms_eval/tasks/salbench/_p3_default (1 hunks)
  • lmms_eval/tasks/salbench/o3.yaml (1 hunks)
  • lmms_eval/tasks/salbench/o3_box.yaml (1 hunks)
  • lmms_eval/tasks/salbench/o3_box_img.yaml (1 hunks)
  • lmms_eval/tasks/salbench/p3.yaml (1 hunks)
  • lmms_eval/tasks/salbench/p3_box.yaml (1 hunks)
  • lmms_eval/tasks/salbench/p3_box_img.yaml (1 hunks)
  • lmms_eval/tasks/salbench/utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • lmms_eval/tasks/salbench/_p3_default
🧰 Additional context used
🪛 YAMLlint (1.37.1)
lmms_eval/tasks/salbench/o3.yaml

[error] 3-3: no new line character at the end of file

(new-line-at-end-of-file)

lmms_eval/tasks/salbench/o3_box.yaml

[error] 3-3: no new line character at the end of file

(new-line-at-end-of-file)

lmms_eval/tasks/salbench/o3_box_img.yaml

[error] 3-3: no new line character at the end of file

(new-line-at-end-of-file)

lmms_eval/tasks/salbench/p3.yaml

[error] 3-3: no new line character at the end of file

(new-line-at-end-of-file)

lmms_eval/tasks/salbench/p3_box.yaml

[error] 3-3: no new line character at the end of file

(new-line-at-end-of-file)

lmms_eval/tasks/salbench/p3_box_img.yaml

[error] 3-3: no new line character at the end of file

(new-line-at-end-of-file)

🪛 Ruff (0.11.9)
lmms_eval/tasks/salbench/utils.py

145-148: Use ternary operator pred = "" if not pred else pred[0] instead of if-else-block

Replace if-else-block with pred = "" if not pred else pred[0]

(SIM108)

🔇 Additional comments (3)
examples/models/qwen25vl.sh (1)

14-18: LGTM! Configuration changes look appropriate.

The switch to the 3B model with reduced process count and simplified arguments aligns well with evaluating the new p3 task.

lmms_eval/tasks/salbench/_o3_default (1)

1-73: Well-structured configuration with comprehensive metrics.

The O3 default configuration properly defines all necessary parameters including dataset path, generation settings, document processing functions, and a thorough set of evaluation metrics for different categories.

lmms_eval/tasks/salbench/utils.py (1)

6-28: LGTM! Robust implementation with proper error handling.

The functions now include proper error handling for image conversion and input validation for text processing as requested in previous reviews.
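
For readers without the diff open, a rough sketch of the pattern the bot is describing. The names and structure are assumptions reconstructed from this review, not the exact code in utils.py; in particular the doc keys ("image", "question"), the prompt_kwargs parameter, and the "pre_prompt"/"post_prompt" keys are hypothetical:

```python
def p3o3_doc_to_visual(doc):
    # Hypothetical sketch: accept a single PIL image or a list of images,
    # convert each to RGB, and skip items that fail conversion rather than crashing.
    images = doc["image"] if isinstance(doc["image"], list) else [doc["image"]]
    visuals = []
    for img in images:
        try:
            visuals.append(img.convert("RGB"))
        except Exception as err:
            print(f"Skipping image that failed RGB conversion: {err}")
    return visuals


def p3o3_doc_to_text(doc, prompt_kwargs=None):
    # Hypothetical sketch: validate the required key and optionally wrap the
    # question with pre/post prompts supplied by the task config.
    if "question" not in doc:
        raise KeyError("Expected a 'question' field in the document")
    kwargs = prompt_kwargs or {}
    pre = kwargs.get("pre_prompt", "")
    post = kwargs.get("post_prompt", "")
    return f"{pre}{doc['question']}{post}"
```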

@kcz358 (Collaborator) left a comment

A few comments

  • You can try to reuse some of the code between the process-results functions
  • Pre-commit version changes: cc @Luodian to determine whether this change is necessary
  • The changes to the example for qwen25_vl: maybe you can add this as a README in your task folder if you want to demonstrate how to run your benchmarks

Thanks!

Collaborator commented:

I think there is no need to change this example file.

Collaborator commented:
cc @Luodian, do we want to keep this pre-commit version or is it okay to upgrade?

Comment on lines 31 to 139
def p3_process_results(doc, results):
    pred = {x.strip() for x in results[0].lower().strip("[").strip("]").split(",")}
    gt_ans = {x.strip() for x in doc["answer"].lower().split(",")}

    exact_match = int(pred == gt_ans)

    # Per sample
    matches = pred.intersection(gt_ans)
    # how many retrieved categories are relevant
    precision = len(matches) / (len(pred) + 1e-8)
    # how many applicable categories are retrieved
    recall = len(matches) / len(gt_ans)
    f1 = 2 * (precision * recall) / (precision + recall + 1e-8)

    categories = [
        "orientation",
        "color",
        "size",
    ]
    cat_preds = {}
    for cat in categories:
        cat_preds[cat] = {}
        cat_preds[cat]["true_pos"] = int(cat in pred and cat in gt_ans)
        cat_preds[cat]["false_pos"] = int(cat in pred and cat not in gt_ans)
        cat_preds[cat]["true_neg"] = int(cat not in pred and cat not in gt_ans)
        cat_preds[cat]["false_neg"] = int(cat not in pred and cat in gt_ans)

    return {
        "exact_match": {"score": exact_match},
        "sample_precision": {"score": precision},
        "sample_recall": {"score": recall},
        "sample_f1": {"score": f1},
        "all_cat_precision": {"id": doc["image_id"], "pred": cat_preds},
        "all_cat_recall": {"id": doc["image_id"], "pred": cat_preds},
        "all_cat_f1": {"id": doc["image_id"], "pred": cat_preds},
        "orientation_precision": {"pred": cat_preds["orientation"]},
        "orientation_recall": {"pred": cat_preds["orientation"]},
        "orientation_f1": {"pred": cat_preds["orientation"]},
        "color_precision": {"pred": cat_preds["color"]},
        "color_recall": {"pred": cat_preds["color"]},
        "color_f1": {"pred": cat_preds["color"]},
        "size_precision": {"pred": cat_preds["size"]},
        "size_recall": {"pred": cat_preds["size"]},
        "size_f1": {"pred": cat_preds["size"]},
    }


def o3_process_results(doc, results):
    pred = {x.strip() for x in results[0].lower().strip("[").strip("]").split(",")}
    gt_ans = {x.strip() for x in doc["answer"].lower().split(",")}

    exact_match = int(pred == gt_ans)

    # Per sample
    matches = pred.intersection(gt_ans)
    # how many retrieved categories are relevant
    precision = len(matches) / (len(pred) + 1e-8)
    # how many applicable categories are retrieved
    recall = len(matches) / len(gt_ans)
    f1 = 2 * (precision * recall) / (precision + recall + 1e-8)

    categories = [
        "orientation",
        "color",
        "focus",
        "shape",
        "size",
        "location",
        "pattern",
    ]
    cat_preds = {}
    for cat in categories:
        cat_preds[cat] = {}
        cat_preds[cat]["true_pos"] = int(cat in pred and cat in gt_ans)
        cat_preds[cat]["false_pos"] = int(cat in pred and cat not in gt_ans)
        cat_preds[cat]["true_neg"] = int(cat not in pred and cat not in gt_ans)
        cat_preds[cat]["false_neg"] = int(cat not in pred and cat in gt_ans)

    return {
        "exact_match": {"score": exact_match},
        "sample_precision": {"score": precision},
        "sample_recall": {"score": recall},
        "sample_f1": {"score": f1},
        "all_cat_precision": {"id": doc["image_id"], "pred": cat_preds},
        "all_cat_recall": {"id": doc["image_id"], "pred": cat_preds},
        "all_cat_f1": {"id": doc["image_id"], "pred": cat_preds},
        "orientation_precision": {"pred": cat_preds["orientation"]},
        "orientation_recall": {"pred": cat_preds["orientation"]},
        "orientation_f1": {"pred": cat_preds["orientation"]},
        "color_precision": {"pred": cat_preds["color"]},
        "color_recall": {"pred": cat_preds["color"]},
        "color_f1": {"pred": cat_preds["color"]},
        "focus_precision": {"pred": cat_preds["focus"]},
        "focus_recall": {"pred": cat_preds["focus"]},
        "focus_f1": {"pred": cat_preds["focus"]},
        "shape_precision": {"pred": cat_preds["shape"]},
        "shape_recall": {"pred": cat_preds["shape"]},
        "shape_f1": {"pred": cat_preds["shape"]},
        "size_precision": {"pred": cat_preds["size"]},
        "size_recall": {"pred": cat_preds["size"]},
        "size_f1": {"pred": cat_preds["size"]},
        "location_precision": {"pred": cat_preds["location"]},
        "location_recall": {"pred": cat_preds["location"]},
        "location_f1": {"pred": cat_preds["location"]},
        "pattern_precision": {"pred": cat_preds["pattern"]},
        "pattern_recall": {"pred": cat_preds["pattern"]},
        "pattern_f1": {"pred": cat_preds["pattern"]},
    }

Collaborator commented:
I think these two are not reusing any code?

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
lmms_eval/tasks/salbench/utils.py (1)

81-84: Consider using ternary operator for more concise code.

The if-else block can be simplified using a ternary operator as suggested by static analysis.

-    if not pred:
-        pred = ""
-    else:
-        pred = pred[0]
+    pred = "" if not pred else pred[0]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c542634 and 37e9ec1.

📒 Files selected for processing (2)
  • examples/models/qwen25vl.sh (1 hunks)
  • lmms_eval/tasks/salbench/utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • examples/models/qwen25vl.sh
🧰 Additional context used
🪛 Ruff (0.12.2)
lmms_eval/tasks/salbench/utils.py

81-84: Use ternary operator pred = "" if not pred else pred[0] instead of if-else-block

Replace if-else-block with pred = "" if not pred else pred[0]

(SIM108)

🔇 Additional comments (5)
lmms_eval/tasks/salbench/utils.py (5)

2-3: Well-implemented constants that eliminate code duplication.

Good solution to extract the category definitions into module-level constants, addressing the previous review feedback about reducing duplicated category lists across functions.


6-16: Robust image processing with proper error handling.

The function correctly handles both single images and image lists, with appropriate error handling for conversion failures. The implementation addresses previous review concerns effectively.


19-28: Well-implemented text processing with proper validation.

The function includes appropriate validation for required keys and handles optional prompt_kwargs correctly. The error handling addresses previous review concerns about robustness.


69-74: Excellent refactoring that eliminates code duplication.

The wrapper functions provide a clean interface while reusing the common process_results logic. This successfully addresses the previous review feedback about code duplication.
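
Concretely, the wrapper pattern being praised here looks roughly like this. It is a sketch reconstructed from the review, not copied from the file; the category values are taken from the lists shown earlier in this thread, and process_results is the shared helper quoted further below:

```python
P3_CATEGORIES = ["orientation", "color", "size"]
O3_CATEGORIES = ["orientation", "color", "focus", "shape", "size", "location", "pattern"]


def p3_process_results(doc, results):
    # Thin wrapper: all shared scoring logic lives in process_results.
    return process_results(doc, results, P3_CATEGORIES)


def o3_process_results(doc, results):
    return process_results(doc, results, O3_CATEGORIES)
```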


102-181: Well-structured aggregation functions with good separation of concerns.

The aggregation functions are cleanly implemented with proper use of the defined constants. The helper functions provide good separation of concerns and the epsilon values (1e-8) appropriately handle division by zero scenarios.
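
As an illustration of the aggregation shape being reviewed, here is a minimal per-category F1 aggregator over the true_pos/false_pos/false_neg counts emitted by process_results. The function name aggregate_category_f1 is an assumption for this sketch, not necessarily the helper defined in utils.py:

```python
def aggregate_category_f1(results):
    # Each result carries {"pred": {"true_pos": 0/1, "false_pos": 0/1,
    # "true_neg": 0/1, "false_neg": 0/1}} for one category on one sample.
    tp = sum(r["pred"]["true_pos"] for r in results)
    fp = sum(r["pred"]["false_pos"] for r in results)
    fn = sum(r["pred"]["false_neg"] for r in results)
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    return 2 * precision * recall / (precision + recall + 1e-8)
```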

Comment on lines +94 to +99
def aggregate_per_sample_score(results):
    total_score = 0
    for result in results:
        total_score += result["score"]
    avg_score = total_score / len(results)
    return avg_score
@coderabbitai bot (Contributor) commented:

🛠️ Refactor suggestion

Add validation for empty results list.

The function should handle the case where the results list is empty to avoid division by zero.

 def aggregate_per_sample_score(results):
+    if not results:
+        return 0.0
+    
     total_score = 0
     for result in results:
         total_score += result["score"]
     avg_score = total_score / len(results)
     return avg_score

Comment on lines +31 to +66
def process_results(doc, results, categories):
    pred = {x.strip() for x in results[0].lower().strip("[").strip("]").split(",")}
    gt_ans = {x.strip() for x in doc["answer"].lower().split(",")}

    exact_match = int(pred == gt_ans)
    matches = pred.intersection(gt_ans)
    precision = len(matches) / (len(pred) + 1e-8)
    recall = len(matches) / len(gt_ans)
    f1 = 2 * (precision * recall) / (precision + recall + 1e-8)

    cat_preds = {}
    for cat in categories:
        cat_preds[cat] = {
            "true_pos": int(cat in pred and cat in gt_ans),
            "false_pos": int(cat in pred and cat not in gt_ans),
            "true_neg": int(cat not in pred and cat not in gt_ans),
            "false_neg": int(cat not in pred and cat in gt_ans),
        }

    output = {
        "exact_match": {"score": exact_match},
        "sample_precision": {"score": precision},
        "sample_recall": {"score": recall},
        "sample_f1": {"score": f1},
        "all_cat_precision": {"id": doc["image_id"], "pred": cat_preds},
        "all_cat_recall": {"id": doc["image_id"], "pred": cat_preds},
        "all_cat_f1": {"id": doc["image_id"], "pred": cat_preds},
    }

    # Add per-category metrics
    for cat in categories:
        output[f"{cat}_precision"] = {"pred": cat_preds[cat]}
        output[f"{cat}_recall"] = {"pred": cat_preds[cat]}
        output[f"{cat}_f1"] = {"pred": cat_preds[cat]}

    return output
@coderabbitai bot (Contributor) commented:

🛠️ Refactor suggestion

Add input validation for results parameter.

The function should validate that the results list is not empty before accessing results[0].

 def process_results(doc, results, categories):
+    if not results:
+        raise ValueError("Results list cannot be empty")
+    
     pred = {x.strip() for x in results[0].lower().strip("[").strip("]").split(",")}
     gt_ans = {x.strip() for x in doc["answer"].lower().split(",")}

@dunghuynhandy (Contributor, Author) commented

Hi @kcz358 and @Luodian,
Thank you for your comments. I have addressed your concerns!

@Luodian merged commit 978eb7f into EvolvingLMMs-Lab:main on Jul 16, 2025 (2 checks passed).