
[Fix] update kontext_bench, refcoco#1229

Merged
Yunnglin merged 3 commits into main from fix/kontext_bench
Mar 16, 2026

Conversation

@Yunnglin
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings on March 16, 2026 07:56
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on improving the robustness of data processing for the kontext_bench dataset and enhancing the user experience of performance reports. It ensures that image instructions are always available for kontext_bench and adds a dynamic category column to reports, making them more informative and adaptable to various evaluation scenarios.

Highlights

  • Robustness for kontext_bench dataset: Improved the handling of the kontext_bench dataset by providing a default instruction 'Describe the image' if the 'instruction' key is missing, preventing potential errors during data processing.
  • Enhanced Report Generation with Category Column: Introduced a new 'Category' column in the performance reports, which is dynamically displayed only when there are multiple unique categories or when categories are present, improving the clarity and detail of subset scores.
  • Test Case Update: Updated an existing performance test to use the kontext_bench dataset, aligning testing with recent dataset changes or additions.
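The fallback described in the first highlight can be sketched as follows. This is a minimal illustration, not the actual kontext_bench plugin code; `build_prompt` is a hypothetical helper name.

```python
# Hypothetical sketch of the 'instruction' fallback; the real
# kontext_bench plugin code may differ.
def build_prompt(item: dict) -> str:
    # Use get() so a missing 'instruction' key falls back to a
    # generic default instead of raising KeyError.
    return item.get('instruction', 'Describe the image')

print(build_prompt({'instruction': 'Make the sky purple'}))  # Make the sky purple
print(build_prompt({}))                                      # Describe the image
```

Switching from `item['instruction']` to `item.get(...)` is what makes the plugin robust to dataset rows that omit the field.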


Changelog
  • evalscope/perf/plugin/datasets/kontext_bench.py
    • Changed 'instruction' key access from direct indexing to get() with a default value.
  • evalscope/report/renderer.py
    • Added a show_category flag to model sections, defaulting to False when no metrics are present.
    • Included 'category' in subset_rows dictionaries, derived from cat.name.
    • Implemented logic to determine show_category based on the uniqueness and presence of categories in subset_rows.
  • evalscope/report/template/js/i18n_eval.js
    • Added 'col.category' translation keys for both English and Chinese.
  • evalscope/report/template/report.html.j2
    • Conditionally rendered the 'Category' table header based on msec.show_category.
    • Conditionally rendered the 'Category' data cells within the table body based on msec.show_category.
  • tests/perf/test_perf.py
    • Updated the dataset parameter in test_run_perf_vl from 'flickr8k' to 'kontext_bench'.
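The `show_category` decision described for `evalscope/report/renderer.py` could look roughly like this. This is a hedged sketch based on the changelog wording ("uniqueness and presence of categories"); the function name and exact rule are assumptions, and the real renderer logic may differ in detail.

```python
# Hedged sketch of the show_category decision; names and the exact
# rule are assumptions based on the changelog description.
def should_show_category(subset_rows: list[dict]) -> bool:
    # Collect the distinct, non-empty category names across rows.
    categories = {row.get('category') for row in subset_rows}
    categories.discard(None)
    categories.discard('')
    # Hide the column when every row shares one (or no) category,
    # since it would add no information to the table.
    return len(categories) > 1

rows = [{'category': 'vision'}, {'category': 'text'}]
print(should_show_category(rows))  # True
```

Gating the column this way keeps single-category reports compact while still surfacing categories when they actually differentiate subsets.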

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a 'Category' column to the subset scores table in the HTML report, complete with the necessary backend, template, and internationalization changes. The logic to conditionally display this column is a good addition for improving report clarity. The PR also includes a robustness fix in evalscope/perf/plugin/datasets/kontext_bench.py to handle a missing 'instruction' field. The changes are well implemented across the affected files. I have one suggestion to refine the logic for showing the category column, so that redundant information is hidden and the report stays cleaner.

Contributor

Copilot AI left a comment


Pull request overview

Updates the perf dataset configuration and enhances the HTML evaluation report to optionally display metric categories (with i18n support), while making the kontext_bench dataset plugin more robust to missing instructions.

Changes:

  • Switch perf VL test dataset from flickr8k to kontext_bench.
  • Add optional “Category” column to the report subset-score table (renderer + template + i18n strings).
  • Add a fallback instruction string in the kontext_bench dataset plugin when instruction is missing.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file:
  • tests/perf/test_perf.py: Updates the perf VL test to use kontext_bench.
  • evalscope/report/template/report.html.j2: Conditionally renders a Category column in the subset table.
  • evalscope/report/template/js/i18n_eval.js: Adds col.category translation keys (EN/ZH).
  • evalscope/report/renderer.py: Populates a per-row category and computes whether to show the Category column.
  • evalscope/perf/plugin/datasets/kontext_bench.py: Adds a default instruction when one is absent from a dataset item.
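The i18n change in `i18n_eval.js` amounts to registering a `col.category` string per language. A Python mirror of that idea, with a lookup that falls back to English, is sketched below; the `col.category` key comes from the changelog, while the Chinese string and the `t` helper are assumptions for illustration.

```python
# Illustrative Python mirror of the i18n_eval.js additions. The key
# 'col.category' is from the changelog; the Chinese translation and
# the lookup helper are assumed for this sketch.
I18N = {
    'en': {'col.category': 'Category'},
    'zh': {'col.category': '类别'},
}

def t(lang: str, key: str) -> str:
    # Fall back to English, then to the raw key, when a translation
    # is missing for the requested language.
    return I18N.get(lang, {}).get(key) or I18N['en'].get(key, key)

print(t('zh', 'col.category'))  # 类别
print(t('en', 'col.category'))  # Category
```

Pairing every new column with a translation key is what lets the template render the same header in both locales.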


@Yunnglin Yunnglin merged commit ce1dd98 into main Mar 16, 2026
3 checks passed
@Yunnglin Yunnglin deleted the fix/kontext_bench branch March 16, 2026 08:59
