Title: Add Benchmark from "Vision-Language Models Can’t See the Obvious" (ICCV 2025) #744
Conversation
Walkthrough

The changes introduce a new "SalBench" dataset group with six task variants to the documentation and add corresponding YAML configuration files for each task under the evaluation framework. A utility module is also added to support data conversion, result processing, and metric aggregation for these tasks. Additionally, pre-commit hooks are updated, minor formatting fixes are applied to some modules, and a regex pattern is reformatted without functional changes.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Config as SalBench Task Config (YAML)
    participant Framework as Evaluation Framework
    participant Utils as Utils Module
    participant Model
    User->>Framework: Selects SalBench task (e.g., p3, o3_box_img)
    Framework->>Config: Loads task configuration
    Framework->>Utils: Converts document to visual/text formats
    Framework->>Model: Sends input (image + prompt)
    Model-->>Framework: Returns prediction
    Framework->>Utils: Processes results (metrics calculation)
    Utils-->>Framework: Returns aggregated metrics
    Framework-->>User: Presents evaluation results
```
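For readers who prefer code to a diagram, the flow above can be sketched roughly as follows. This is illustrative pseudocode only: `run_model` and `load_task_config` are hypothetical placeholders, and the exact lmms-eval call signatures may differ.

```python
# Illustrative sketch of the SalBench evaluation flow described above.
# `run_model` and `load_task_config` are hypothetical placeholders, not lmms-eval APIs.
from lmms_eval.tasks.salbench import utils


def evaluate_salbench(task_name, docs, run_model, load_task_config):
    config = load_task_config(task_name)        # generation kwargs, metric list, etc. (not used further here)
    scores = []
    for doc in docs:
        images = utils.p3o3_doc_to_visual(doc)  # image(s) for the sample
        prompt = utils.p3o3_doc_to_text(doc)    # text prompt for the sample
        prediction = run_model(images, prompt)  # model returns a comma-separated label string
        scores.append(utils.p3_process_results(doc, [prediction]))
    # Sample-level metrics such as sample_f1 are then averaged across documents.
    return utils.aggregate_per_sample_score([s["sample_f1"] for s in scores])
```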
Actionable comments posted: 8
🧹 Nitpick comments (4)
docs/current_tasks.md (1)
162-169: Add the task-name parentheses for consistency with the rest of the list

Every other entry follows the pattern `- [DatasetName](url) (task_name)`. The newly-added SalBench block omits the parentheses, so downstream users can't copy-paste the task id directly.

```diff
-- [SalBench](https://salbench.github.io/)
-  - p3
+- [SalBench](https://salbench.github.io/) (salbench)
+  - p3
```

(Feel free to add the six subset ids after the parent dataset in the same way MME, RefCOCO, etc. are documented.)
lmms_eval/tasks/salbench/p3_box.yaml (1)
71-73: `metadata` should be a mapping, not a single-element list

Other task YAMLs define:

```yaml
metadata:
  version: 0.0
```

Serialising as a list (`- version: 0.0`) may break tools that expect a dict.

lmms_eval/tasks/salbench/o3.yaml (1)
122-123: Convert `metadata` to dict for consistency

See earlier remark.
lmms_eval/tasks/salbench/p3_box_img.yaml (1)
72-73: `metadata` formatting

Use a mapping instead of a list for `metadata`, for parity with other task files.
📜 Review details
📒 Files selected for processing (8)
- docs/current_tasks.md (1 hunks)
- lmms_eval/tasks/salbench/o3.yaml (1 hunks)
- lmms_eval/tasks/salbench/o3_box.yaml (1 hunks)
- lmms_eval/tasks/salbench/o3_box_img.yaml (1 hunks)
- lmms_eval/tasks/salbench/p3.yaml (1 hunks)
- lmms_eval/tasks/salbench/p3_box.yaml (1 hunks)
- lmms_eval/tasks/salbench/p3_box_img.yaml (1 hunks)
- lmms_eval/tasks/salbench/utils.py (1 hunks)
🔇 Additional comments (5)
lmms_eval/tasks/salbench/p3_box.yaml (1)
8-9: Verify the dotted path used in `!function` directives

`utils.p3o3_doc_to_visual` assumes that the module is imported as `lmms_eval.tasks.salbench.utils` with an alias `utils`. Most existing YAMLs in the repo use the fully-qualified path (`tasks.salbench.utils.p3o3_doc_to_visual`). Mis-resolution will surface only at runtime. Please confirm the loader can locate these functions or update the path:

```diff
-doc_to_visual: !function utils.p3o3_doc_to_visual
-doc_to_text: !function utils.p3o3_doc_to_text
+doc_to_visual: !function tasks.salbench.utils.p3o3_doc_to_visual
+doc_to_text: !function tasks.salbench.utils.p3o3_doc_to_text
```
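For context on why the dotted path matters: a `!function` tag is typically resolved by importing a module and looking up an attribute, so a wrong path only fails when the task is loaded. The snippet below is an illustrative resolver, not the actual lmms-eval loader; the helper name `resolve_function` is made up.

```python
# Illustrative only: how a dotted "module.attr" string could be turned into a callable.
# This is NOT the lmms-eval implementation; it just shows why the path must be importable.
import importlib


def resolve_function(dotted_path: str):
    module_path, _, attr = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)  # ImportError if the module path is wrong
    return getattr(module, attr)                   # AttributeError if the attribute is wrong
```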
lmms_eval/tasks/salbench/o3_box.yaml (1)

8-9: Same loader-path concern as in `p3_box.yaml`

See previous comment about fully-qualified module paths.
lmms_eval/tasks/salbench/o3.yaml (1)
8-9: Confirm function module path resolution

Same note about `utils.p3o3_doc_to_visual` / `utils.p3o3_doc_to_text`.

lmms_eval/tasks/salbench/p3_box_img.yaml (1)
8-9: Check function path – potential runtime import error

Ensure the loader resolves `utils.p3o3_doc_to_visual` and `utils.p3o3_doc_to_text` correctly, or switch to the fully-qualified path.

lmms_eval/tasks/salbench/p3.yaml (1)
1-74: Configuration looks well-structured and comprehensive.

The YAML configuration follows standard patterns for evaluation tasks with proper dataset setup, generation parameters, and comprehensive metric definitions. The function references to the utils module are consistent, and the metric hierarchy (sample-level, all-category, per-category) provides thorough evaluation coverage.
Hi, thanks! I can see there are some suggestions provided by our codebase management agent. Could you take a look and fix them? (You may pick some to fix, no need to address all of them.)
kcz358
left a comment
Most of the parts look good to merge to me. Can you run a `pre-commit run --all-files` command to fix the format of the code? Then I would merge this PR. Thanks!
Actionable comments posted: 0
🧹 Nitpick comments (2)
.pre-commit-config.yaml (1)
9-13: isort 6 is a major release; run the hook locally once and commit autofixes

Version 6 rewrites the default section definitions and may reshuffle existing imports the first time it runs. Execute `pre-commit run --all-files isort` locally, inspect the diff, and commit the changes so CI doesn't churn on every run.

lmms_eval/api/samplers.py (1)
40-40: Consider refactoring for improved readability and performance.

While the logic appears correct, this single-line conditional expression is difficult to read and maintain. The expression also calls `self.doc_to_target(doc)` multiple times, which could be inefficient. Consider extracting the target value and breaking down the conditional logic:

```diff
-            else self.doc_to_target(doc) if (self.config.doc_to_choice is None or type(self.doc_to_target(doc)) is str) else str(self.doc_to_choice(doc)[self.doc_to_target(doc)])
+            else self._format_target(doc)
```

And add a helper method to the class:

```python
def _format_target(self, doc):
    target = self.doc_to_target(doc)
    if self.config.doc_to_choice is None or type(target) is str:
        return target
    else:
        return str(self.doc_to_choice(doc)[target])
```

This approach improves readability, avoids repeated method calls, and makes the logic easier to test and maintain.
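A toy illustration of the proposed helper's behaviour, using stand-in classes rather than the real sampler objects:

```python
# Toy demonstration of the suggested _format_target helper; the classes below are
# stand-ins for illustration, not lmms-eval objects.
class ToyConfig:
    doc_to_choice = object()  # pretend choices are configured


class ToySampler:
    config = ToyConfig()

    def doc_to_target(self, doc):
        return doc["target"]  # may be an int index into the choices

    def doc_to_choice(self, doc):
        return doc["choices"]

    def _format_target(self, doc):
        target = self.doc_to_target(doc)  # called once, then reused
        if self.config.doc_to_choice is None or type(target) is str:
            return target
        return str(self.doc_to_choice(doc)[target])


doc = {"target": 1, "choices": ["yes", "no"]}
print(ToySampler()._format_target(doc))  # -> "no"
```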
📜 Review details
📒 Files selected for processing (6)
- .pre-commit-config.yaml (1 hunks)
- lmms_eval/api/samplers.py (1 hunks)
- lmms_eval/models/mplug_owl_video/configuration_mplug_owl.py (1 hunks)
- lmms_eval/models/mplug_owl_video/modeling_mplug_owl.py (1 hunks)
- lmms_eval/tasks/librispeech/cn_tn.py (1 hunks)
- lmms_eval/tasks/salbench/utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (3)
- lmms_eval/models/mplug_owl_video/configuration_mplug_owl.py
- lmms_eval/models/mplug_owl_video/modeling_mplug_owl.py
- lmms_eval/tasks/librispeech/cn_tn.py
🚧 Files skipped from review as they are similar to previous changes (1)
- lmms_eval/tasks/salbench/utils.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
lmms_eval/api/samplers.py (1)
lmms_eval/api/task.py (4)
`doc_to_target` (378-379), `doc_to_target` (1330-1365), `config` (291-293), `doc_to_choice` (1385-1405)
🔇 Additional comments (1)
.pre-commit-config.yaml (1)
3-7: Confirm Black 25.1.0 is safe for your Python version & CI images

Black 25.x drops support for Python < 3.8 and slightly tightens some formatting rules (e.g. numeric literal normalization). Double-check that the lowest Python version declared in `pyproject.toml`, your Docker images, and all CI jobs is ≥ 3.8; otherwise the hook will fail.
Dear @kcz358,
Thanks @dunghuynhandy, do you think any of the comments from the bot are useful and need to be fixed? I think most of its suggestions are about reusing common code so that the code is simpler.
Thanks! I will try to make my code simpler based on the suggestions.
Agreed, I think we should merge some of the redundant code.
Actionable comments posted: 0
♻️ Duplicate comments (4)
lmms_eval/tasks/salbench/o3_box.yaml (1)
1-3: Trailing newline + re-check duplicate metrics in `_o3_default`.

- Add the missing LF as in the other task files.
- Past reviews flagged duplicated `pattern_*` metrics inside `_o3_default`. Verify they were removed; otherwise evaluations will crash with `ValueError` for duplicate keys.

[tag applies to both points]
lmms_eval/tasks/salbench/o3_box_img.yaml (1)
1-3: Trailing newline + duplicated metric check.

Same two points as `o3_box.yaml`:

```diff
 dataset_name: O3_box_img
 task: "o3_box_img"
 include: _o3_default
+
```

lmms_eval/tasks/salbench/utils.py (2)
31-75: Use global constants instead of hardcoded categories.

The function defines categories locally instead of using the global `P3_CATEGORIES` constant defined at the top of the file.

```diff
 def p3_process_results(doc, results):
     pred = {x.strip() for x in results[0].lower().strip("[").strip("]").split(",")}
     gt_ans = {x.strip() for x in doc["answer"].lower().split(",")}

     exact_match = int(pred == gt_ans)

     # Per sample
     matches = pred.intersection(gt_ans)
     # how many retrieved categories are relevant
     precision = len(matches) / (len(pred) + 1e-8)
     # how many applicable categories are retrieved
     recall = len(matches) / len(gt_ans)
     f1 = 2 * (precision * recall) / (precision + recall + 1e-8)

-    categories = [
-        "orientation",
-        "color",
-        "size",
-    ]
+    categories = P3_CATEGORIES
     cat_preds = {}
```
78-138: Use global constants and eliminate code duplication.

This function duplicates most of the logic from `p3_process_results` and uses hardcoded categories instead of the global `O3_CATEGORIES` constant. Consider extracting the common logic into a shared helper function as suggested in previous reviews, and use the global constant:

```diff
 def o3_process_results(doc, results):
     pred = {x.strip() for x in results[0].lower().strip("[").strip("]").split(",")}
     gt_ans = {x.strip() for x in doc["answer"].lower().split(",")}

     exact_match = int(pred == gt_ans)

     # Per sample
     matches = pred.intersection(gt_ans)
     # how many retrieved categories are relevant
     precision = len(matches) / (len(pred) + 1e-8)
     # how many applicable categories are retrieved
     recall = len(matches) / len(gt_ans)
     f1 = 2 * (precision * recall) / (precision + recall + 1e-8)

-    categories = [
-        "orientation",
-        "color",
-        "focus",
-        "shape",
-        "size",
-        "location",
-        "pattern",
-    ]
+    categories = O3_CATEGORIES
     cat_preds = {}
```
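One way the de-duplication could look (and roughly what the later revision of `utils.py` does, judging by the follow-up review): keep a single shared `process_results(doc, results, categories)` helper and make the task-specific functions thin wrappers around it. This is a sketch only; it assumes the shared helper shown later in this conversation.

```python
# Sketch of the de-duplicated layout; relies on the shared
# process_results(doc, results, categories) helper shown later in this review thread.
P3_CATEGORIES = ["orientation", "color", "size"]
O3_CATEGORIES = ["orientation", "color", "focus", "shape", "size", "location", "pattern"]


def p3_process_results(doc, results):
    return process_results(doc, results, P3_CATEGORIES)


def o3_process_results(doc, results):
    return process_results(doc, results, O3_CATEGORIES)
```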
🧹 Nitpick comments (5)
lmms_eval/tasks/salbench/p3.yaml (1)
1-3: Add missing newline at end-of-file (YAMLlint error).

All newly-added SalBench task YAMLs lack a trailing newline, which YAML-lint flags (`new-line-at-end-of-file`). Add a single LF to keep tooling quiet and stay consistent with the rest of the repo.

```diff
 dataset_name: P3
 task: "p3"
 include: _p3_default
+
```

lmms_eval/tasks/salbench/p3_box.yaml (1)
1-3: Same newline issue as above.

Please append a trailing LF here as well.

```diff
 dataset_name: P3_box
 task: "p3_box"
 include: _p3_default
+
```

lmms_eval/tasks/salbench/p3_box_img.yaml (1)
1-3: Trailing newline.

Just add the missing LF.

```diff
 dataset_name: P3_box_img
 task: "p3_box_img"
 include: _p3_default
+
```

lmms_eval/tasks/salbench/o3.yaml (1)
3-3: Add missing newline at end of file.

YAML files should end with a newline character for better compatibility.

```diff
-include: _o3_default
+include: _o3_default
+
```

lmms_eval/tasks/salbench/utils.py (1)
145-148: Use ternary operator for cleaner code.

The if-else block can be simplified using a ternary operator.

```diff
-    if not pred:
-        pred = ""
-    else:
-        pred = pred[0]
+    pred = "" if not pred else pred[0]
```
📜 Review details
📒 Files selected for processing (10)
- examples/models/qwen25vl.sh (1 hunks)
- lmms_eval/tasks/salbench/_o3_default (1 hunks)
- lmms_eval/tasks/salbench/_p3_default (1 hunks)
- lmms_eval/tasks/salbench/o3.yaml (1 hunks)
- lmms_eval/tasks/salbench/o3_box.yaml (1 hunks)
- lmms_eval/tasks/salbench/o3_box_img.yaml (1 hunks)
- lmms_eval/tasks/salbench/p3.yaml (1 hunks)
- lmms_eval/tasks/salbench/p3_box.yaml (1 hunks)
- lmms_eval/tasks/salbench/p3_box_img.yaml (1 hunks)
- lmms_eval/tasks/salbench/utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- lmms_eval/tasks/salbench/_p3_default
🧰 Additional context used
🪛 YAMLlint (1.37.1)
lmms_eval/tasks/salbench/o3.yaml
[error] 3-3: no new line character at the end of file
(new-line-at-end-of-file)
lmms_eval/tasks/salbench/o3_box.yaml
[error] 3-3: no new line character at the end of file
(new-line-at-end-of-file)
lmms_eval/tasks/salbench/o3_box_img.yaml
[error] 3-3: no new line character at the end of file
(new-line-at-end-of-file)
lmms_eval/tasks/salbench/p3.yaml
[error] 3-3: no new line character at the end of file
(new-line-at-end-of-file)
lmms_eval/tasks/salbench/p3_box.yaml
[error] 3-3: no new line character at the end of file
(new-line-at-end-of-file)
lmms_eval/tasks/salbench/p3_box_img.yaml
[error] 3-3: no new line character at the end of file
(new-line-at-end-of-file)
🪛 Ruff (0.11.9)
lmms_eval/tasks/salbench/utils.py
145-148: Use ternary operator pred = "" if not pred else pred[0] instead of if-else-block
Replace if-else-block with pred = "" if not pred else pred[0]
(SIM108)
🔇 Additional comments (3)
examples/models/qwen25vl.sh (1)
14-18: LGTM! Configuration changes look appropriate.

The switch to the 3B model with reduced process count and simplified arguments aligns well with evaluating the new p3 task.
lmms_eval/tasks/salbench/_o3_default (1)
1-73: Well-structured configuration with comprehensive metrics.

The O3 default configuration properly defines all necessary parameters including dataset path, generation settings, document processing functions, and a thorough set of evaluation metrics for different categories.
lmms_eval/tasks/salbench/utils.py (1)
6-28: LGTM! Robust implementation with proper error handling.

The functions now include proper error handling for image conversion and input validation for text processing, as requested in previous reviews.
kcz358
left a comment
A few comments:

- Can you try to reuse some of the code between the process-result functions?
- Pre-commit version changes: cc @Luodian to determine whether this change is necessary.
- Changes to the examples for qwen25_vl: maybe you can add this as a README in your task folder if you want to demonstrate how to run your benchmarks.

Thanks!
I think there is no need to change this example file.
cc @Luodian, do we want to keep this pre-commit version or is it okay to upgrade?
lmms_eval/tasks/salbench/utils.py (Outdated)
```python
def p3_process_results(doc, results):
    pred = {x.strip() for x in results[0].lower().strip("[").strip("]").split(",")}
    gt_ans = {x.strip() for x in doc["answer"].lower().split(",")}

    exact_match = int(pred == gt_ans)

    # Per sample
    matches = pred.intersection(gt_ans)
    # how many retrieved categories are relevant
    precision = len(matches) / (len(pred) + 1e-8)
    # how many applicable categories are retrieved
    recall = len(matches) / len(gt_ans)
    f1 = 2 * (precision * recall) / (precision + recall + 1e-8)

    categories = [
        "orientation",
        "color",
        "size",
    ]
    cat_preds = {}
    for cat in categories:
        cat_preds[cat] = {}
        cat_preds[cat]["true_pos"] = int(cat in pred and cat in gt_ans)
        cat_preds[cat]["false_pos"] = int(cat in pred and cat not in gt_ans)
        cat_preds[cat]["true_neg"] = int(cat not in pred and cat not in gt_ans)
        cat_preds[cat]["false_neg"] = int(cat not in pred and cat in gt_ans)

    return {
        "exact_match": {"score": exact_match},
        "sample_precision": {"score": precision},
        "sample_recall": {"score": recall},
        "sample_f1": {"score": f1},
        "all_cat_precision": {"id": doc["image_id"], "pred": cat_preds},
        "all_cat_recall": {"id": doc["image_id"], "pred": cat_preds},
        "all_cat_f1": {"id": doc["image_id"], "pred": cat_preds},
        "orientation_precision": {"pred": cat_preds["orientation"]},
        "orientation_recall": {"pred": cat_preds["orientation"]},
        "orientation_f1": {"pred": cat_preds["orientation"]},
        "color_precision": {"pred": cat_preds["color"]},
        "color_recall": {"pred": cat_preds["color"]},
        "color_f1": {"pred": cat_preds["color"]},
        "size_precision": {"pred": cat_preds["size"]},
        "size_recall": {"pred": cat_preds["size"]},
        "size_f1": {"pred": cat_preds["size"]},
    }


def o3_process_results(doc, results):
    pred = {x.strip() for x in results[0].lower().strip("[").strip("]").split(",")}
    gt_ans = {x.strip() for x in doc["answer"].lower().split(",")}

    exact_match = int(pred == gt_ans)

    # Per sample
    matches = pred.intersection(gt_ans)
    # how many retrieved categories are relevant
    precision = len(matches) / (len(pred) + 1e-8)
    # how many applicable categories are retrieved
    recall = len(matches) / len(gt_ans)
    f1 = 2 * (precision * recall) / (precision + recall + 1e-8)

    categories = [
        "orientation",
        "color",
        "focus",
        "shape",
        "size",
        "location",
        "pattern",
    ]
    cat_preds = {}
    for cat in categories:
        cat_preds[cat] = {}
        cat_preds[cat]["true_pos"] = int(cat in pred and cat in gt_ans)
        cat_preds[cat]["false_pos"] = int(cat in pred and cat not in gt_ans)
        cat_preds[cat]["true_neg"] = int(cat not in pred and cat not in gt_ans)
        cat_preds[cat]["false_neg"] = int(cat not in pred and cat in gt_ans)

    return {
        "exact_match": {"score": exact_match},
        "sample_precision": {"score": precision},
        "sample_recall": {"score": recall},
        "sample_f1": {"score": f1},
        "all_cat_precision": {"id": doc["image_id"], "pred": cat_preds},
        "all_cat_recall": {"id": doc["image_id"], "pred": cat_preds},
        "all_cat_f1": {"id": doc["image_id"], "pred": cat_preds},
        "orientation_precision": {"pred": cat_preds["orientation"]},
        "orientation_recall": {"pred": cat_preds["orientation"]},
        "orientation_f1": {"pred": cat_preds["orientation"]},
        "color_precision": {"pred": cat_preds["color"]},
        "color_recall": {"pred": cat_preds["color"]},
        "color_f1": {"pred": cat_preds["color"]},
        "focus_precision": {"pred": cat_preds["focus"]},
        "focus_recall": {"pred": cat_preds["focus"]},
        "focus_f1": {"pred": cat_preds["focus"]},
        "shape_precision": {"pred": cat_preds["shape"]},
        "shape_recall": {"pred": cat_preds["shape"]},
        "shape_f1": {"pred": cat_preds["shape"]},
        "size_precision": {"pred": cat_preds["size"]},
        "size_recall": {"pred": cat_preds["size"]},
        "size_f1": {"pred": cat_preds["size"]},
        "location_precision": {"pred": cat_preds["location"]},
        "location_recall": {"pred": cat_preds["location"]},
        "location_f1": {"pred": cat_preds["location"]},
        "pattern_precision": {"pred": cat_preds["pattern"]},
        "pattern_recall": {"pred": cat_preds["pattern"]},
        "pattern_f1": {"pred": cat_preds["pattern"]},
    }
```
I think these two are not reusing any code?
Actionable comments posted: 2
🧹 Nitpick comments (1)
lmms_eval/tasks/salbench/utils.py (1)
81-84: Consider using ternary operator for more concise code.

The if-else block can be simplified using a ternary operator as suggested by static analysis.

```diff
-    if not pred:
-        pred = ""
-    else:
-        pred = pred[0]
+    pred = "" if not pred else pred[0]
```
📜 Review details
📒 Files selected for processing (2)
- examples/models/qwen25vl.sh (1 hunks)
- lmms_eval/tasks/salbench/utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- examples/models/qwen25vl.sh
🧰 Additional context used
🪛 Ruff (0.12.2)
lmms_eval/tasks/salbench/utils.py
81-84: Use ternary operator pred = "" if not pred else pred[0] instead of if-else-block
Replace if-else-block with pred = "" if not pred else pred[0]
(SIM108)
🔇 Additional comments (5)
lmms_eval/tasks/salbench/utils.py (5)
2-3: Well-implemented constants that eliminate code duplication.

Good solution to extract the category definitions into module-level constants, addressing the previous review feedback about reducing duplicated category lists across functions.
6-16: Robust image processing with proper error handling.

The function correctly handles both single images and image lists, with appropriate error handling for conversion failures. The implementation addresses previous review concerns effectively.
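For readers without the file open, a rough sketch of what such a converter can look like. The `doc["image"]` key and the exact fallback behaviour are assumptions for illustration, not a copy of the merged `utils.py`.

```python
def doc_to_visual_sketch(doc):
    # Rough sketch only: assumes doc["image"] holds a PIL.Image or a list of them.
    # Field name and fallback behaviour are guesses, not the merged utils.py code.
    images = doc.get("image")
    if images is None:
        return []
    if not isinstance(images, list):
        images = [images]
    converted = []
    for img in images:
        try:
            converted.append(img.convert("RGB"))  # normalise mode for the model
        except Exception as exc:                  # skip images that fail to convert
            print(f"Failed to convert image: {exc}")
    return converted
```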
19-28: Well-implemented text processing with proper validation.

The function includes appropriate validation for required keys and handles optional prompt_kwargs correctly. The error handling addresses previous review concerns about robustness.
69-74: Excellent refactoring that eliminates code duplication.

The wrapper functions provide a clean interface while reusing the common `process_results` logic. This successfully addresses the previous review feedback about code duplication.
102-181: Well-structured aggregation functions with good separation of concerns.

The aggregation functions are cleanly implemented with proper use of the defined constants. The helper functions provide good separation of concerns, and the epsilon values (1e-8) appropriately handle division-by-zero scenarios.
```python
def aggregate_per_sample_score(results):
    total_score = 0
    for result in results:
        total_score += result["score"]
    avg_score = total_score / len(results)
    return avg_score
```
🛠️ Refactor suggestion
Add validation for empty results list.
The function should handle the case where the results list is empty to avoid division by zero.
```diff
 def aggregate_per_sample_score(results):
+    if not results:
+        return 0.0
+
     total_score = 0
     for result in results:
         total_score += result["score"]
     avg_score = total_score / len(results)
     return avg_score
```

📝 Committable suggestion
```python
def aggregate_per_sample_score(results):
    if not results:
        return 0.0

    total_score = 0
    for result in results:
        total_score += result["score"]
    avg_score = total_score / len(results)
    return avg_score
```
🤖 Prompt for AI Agents
In lmms_eval/tasks/salbench/utils.py around lines 94 to 99, the function
aggregate_per_sample_score does not check if the results list is empty, which
can cause a division by zero error. Add a validation at the start of the
function to check if results is empty and handle this case appropriately, for
example by returning 0 or None before performing the division.
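The per-category aggregators referenced above (lines 102-181 of `utils.py`) are not shown in this diff view. Below is a minimal sketch of how such an aggregator can be built from the `true_pos`/`false_pos`/`false_neg` counters emitted by `process_results`, reusing the same 1e-8 epsilon; the function name is made up for illustration and is not the actual implementation.

```python
# Illustrative aggregator over the per-category counters emitted by process_results;
# not a verbatim copy of the utils.py implementation.
def aggregate_category_f1_sketch(results):
    # `results` is a list of {"pred": {"true_pos": 0/1, "false_pos": 0/1,
    #                                  "true_neg": 0/1, "false_neg": 0/1}} entries.
    tp = sum(r["pred"]["true_pos"] for r in results)
    fp = sum(r["pred"]["false_pos"] for r in results)
    fn = sum(r["pred"]["false_neg"] for r in results)
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    return 2 * precision * recall / (precision + recall + 1e-8)
```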
```python
def process_results(doc, results, categories):
    pred = {x.strip() for x in results[0].lower().strip("[").strip("]").split(",")}
    gt_ans = {x.strip() for x in doc["answer"].lower().split(",")}

    exact_match = int(pred == gt_ans)
    matches = pred.intersection(gt_ans)
    precision = len(matches) / (len(pred) + 1e-8)
    recall = len(matches) / len(gt_ans)
    f1 = 2 * (precision * recall) / (precision + recall + 1e-8)

    cat_preds = {}
    for cat in categories:
        cat_preds[cat] = {
            "true_pos": int(cat in pred and cat in gt_ans),
            "false_pos": int(cat in pred and cat not in gt_ans),
            "true_neg": int(cat not in pred and cat not in gt_ans),
            "false_neg": int(cat not in pred and cat in gt_ans),
        }

    output = {
        "exact_match": {"score": exact_match},
        "sample_precision": {"score": precision},
        "sample_recall": {"score": recall},
        "sample_f1": {"score": f1},
        "all_cat_precision": {"id": doc["image_id"], "pred": cat_preds},
        "all_cat_recall": {"id": doc["image_id"], "pred": cat_preds},
        "all_cat_f1": {"id": doc["image_id"], "pred": cat_preds},
    }

    # Add per-category metrics
    for cat in categories:
        output[f"{cat}_precision"] = {"pred": cat_preds[cat]}
        output[f"{cat}_recall"] = {"pred": cat_preds[cat]}
        output[f"{cat}_f1"] = {"pred": cat_preds[cat]}

    return output
```
🛠️ Refactor suggestion
Add input validation for results parameter.
The function should validate that the results list is not empty before accessing `results[0]`.
```diff
 def process_results(doc, results, categories):
+    if not results:
+        raise ValueError("Results list cannot be empty")
+
     pred = {x.strip() for x in results[0].lower().strip("[").strip("]").split(",")}
     gt_ans = {x.strip() for x in doc["answer"].lower().split(",")}
```

📝 Committable suggestion
```python
def process_results(doc, results, categories):
    if not results:
        raise ValueError("Results list cannot be empty")

    pred = {x.strip() for x in results[0].lower().strip("[").strip("]").split(",")}
    gt_ans = {x.strip() for x in doc["answer"].lower().split(",")}

    exact_match = int(pred == gt_ans)
    matches = pred.intersection(gt_ans)
    precision = len(matches) / (len(pred) + 1e-8)
    recall = len(matches) / len(gt_ans)
    f1 = 2 * (precision * recall) / (precision + recall + 1e-8)

    cat_preds = {}
    for cat in categories:
        cat_preds[cat] = {
            "true_pos": int(cat in pred and cat in gt_ans),
            "false_pos": int(cat in pred and cat not in gt_ans),
            "true_neg": int(cat not in pred and cat not in gt_ans),
            "false_neg": int(cat not in pred and cat in gt_ans),
        }

    output = {
        "exact_match": {"score": exact_match},
        "sample_precision": {"score": precision},
        "sample_recall": {"score": recall},
        "sample_f1": {"score": f1},
        "all_cat_precision": {"id": doc["image_id"], "pred": cat_preds},
        "all_cat_recall": {"id": doc["image_id"], "pred": cat_preds},
        "all_cat_f1": {"id": doc["image_id"], "pred": cat_preds},
    }

    # Add per-category metrics
    for cat in categories:
        output[f"{cat}_precision"] = {"pred": cat_preds[cat]}
        output[f"{cat}_recall"] = {"pred": cat_preds[cat]}
        output[f"{cat}_f1"] = {"pred": cat_preds[cat]}

    return output
```
🤖 Prompt for AI Agents
In lmms_eval/tasks/salbench/utils.py around lines 31 to 66, the function
process_results accesses results[0] without checking if the results list is
empty, which can cause an IndexError. Add a validation at the start of the
function to check if results is not empty; if it is empty, handle the case
appropriately, such as returning an empty output or raising a clear exception
before proceeding.
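As a quick sanity check of the shared helper, here is a toy invocation using the `process_results` shown above; the `image_id` and `answer` values are fabricated for illustration.

```python
# Toy usage of the process_results helper defined above; the doc values are made up.
doc = {"image_id": "sample_0001", "answer": "color, size"}
model_output = ["[color, orientation]"]

scores = process_results(doc, model_output, ["orientation", "color", "size"])
print(scores["exact_match"]["score"])                 # 0   (predicted set != ground-truth set)
print(round(scores["sample_precision"]["score"], 2))  # 0.5 (one of two predicted labels is correct)
print(scores["color_precision"]["pred"])              # {'true_pos': 1, 'false_pos': 0, 'true_neg': 0, 'false_neg': 0}
```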
Hi EvolvingLMMs-Lab,
This PR adds our benchmark from the paper:
"Vision-Language Models Can’t See the Obvious"
Accepted at ICCV 2025
arXiv:2507.04741
The benchmark evaluates vision-language models on their ability to recognize basic and intuitive visual relationships — a capability that current state-of-the-art models often struggle with.
We believe this dataset will be a valuable addition to lmms-eval, offering a new perspective on commonsense grounding.
Please let us know if any changes or additions are needed. We're happy to adjust to align with the repo's standards.
Best Regards,
Ngoc Dung Huynh