
[Fix] update kontext_bench, refcoco#1229

Merged
Yunnglin merged 3 commits into main from fix/kontext_bench
Mar 16, 2026

Conversation

@Yunnglin
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings on March 16, 2026 07:56
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on improving the robustness of data processing for the kontext_bench dataset and enhancing the user experience of performance reports. It ensures that image instructions are always available for kontext_bench and adds a dynamic category column to reports, making them more informative and adaptable to various evaluation scenarios.

Highlights

  • Robustness for kontext_bench dataset: Improved the handling of the kontext_bench dataset by providing a default instruction 'Describe the image' if the 'instruction' key is missing, preventing potential errors during data processing.
  • Enhanced Report Generation with Category Column: Introduced a new 'Category' column in the performance reports, which is dynamically displayed only when there are multiple unique categories or when categories are present, improving the clarity and detail of subset scores.
  • Test Case Update: Updated an existing performance test to use the kontext_bench dataset, aligning testing with recent dataset changes or additions.
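The fallback described in the first highlight can be sketched as follows. This is a minimal illustration, not the actual kontext_bench plugin code; `build_prompt` is a hypothetical helper name.

```python
# Hypothetical sketch of the 'instruction' fallback; the real
# kontext_bench plugin code may differ.
def build_prompt(item: dict) -> str:
    # Use get() so a missing 'instruction' key falls back to a
    # generic default instead of raising KeyError.
    return item.get('instruction', 'Describe the image')

print(build_prompt({'instruction': 'Make the sky purple'}))  # Make the sky purple
print(build_prompt({}))                                      # Describe the image
```

Switching from `item['instruction']` to `item.get(...)` is what makes the plugin robust to dataset rows that omit the field.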


Changelog
  • evalscope/perf/plugin/datasets/kontext_bench.py
    • Changed 'instruction' key access from direct indexing to get() with a default value.
  • evalscope/report/renderer.py
    • Added a show_category flag to model sections, defaulting to False when no metrics are present.
    • Included 'category' in subset_rows dictionaries, derived from cat.name.
    • Implemented logic to determine show_category based on the uniqueness and presence of categories in subset_rows.
  • evalscope/report/template/js/i18n_eval.js
    • Added 'col.category' translation keys for both English and Chinese.
  • evalscope/report/template/report.html.j2
    • Conditionally rendered the 'Category' table header based on msec.show_category.
    • Conditionally rendered the 'Category' data cells within the table body based on msec.show_category.
  • tests/perf/test_perf.py
    • Updated the dataset parameter in test_run_perf_vl from 'flickr8k' to 'kontext_bench'.
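The `show_category` decision described for `evalscope/report/renderer.py` could look roughly like this. This is a hedged sketch based on the changelog wording ("uniqueness and presence of categories"); the function name and exact rule are assumptions, and the real renderer logic may differ in detail.

```python
# Hedged sketch of the show_category decision; names and the exact
# rule are assumptions based on the changelog description.
def should_show_category(subset_rows: list[dict]) -> bool:
    # Collect the distinct, non-empty category names across rows.
    categories = {row.get('category') for row in subset_rows}
    categories.discard(None)
    categories.discard('')
    # Hide the column when every row shares one (or no) category,
    # since it would add no information to the table.
    return len(categories) > 1

rows = [{'category': 'vision'}, {'category': 'text'}]
print(should_show_category(rows))  # True
```

Gating the column this way keeps single-category reports compact while still surfacing categories when they actually differentiate subsets.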

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a 'Category' column to the subset scores table in the HTML report, complete with the necessary backend, template, and internationalization changes. The logic to conditionally display this column is a good addition for improving report clarity. The PR also includes a robustness fix in evalscope/perf/plugin/datasets/kontext_bench.py to handle a missing 'instruction' field. The changes are well implemented across the affected files. I have one suggestion to refine the logic for showing the category column, so that redundant information is hidden and the report stays cleaner.

Contributor

Copilot AI left a comment


Pull request overview

Updates the perf dataset configuration and enhances the HTML evaluation report to optionally display metric categories (with i18n support), while making the kontext_bench dataset plugin more robust to missing instructions.

Changes:

  • Switch perf VL test dataset from flickr8k to kontext_bench.
  • Add optional “Category” column to the report subset-score table (renderer + template + i18n strings).
  • Add a fallback instruction string in the kontext_bench dataset plugin when instruction is missing.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file:
  • tests/perf/test_perf.py: Updates the perf VL test to use kontext_bench.
  • evalscope/report/template/report.html.j2: Conditionally renders a Category column in the subset table.
  • evalscope/report/template/js/i18n_eval.js: Adds col.category translation keys (EN/ZH).
  • evalscope/report/renderer.py: Populates a per-row category and computes whether to show the Category column.
  • evalscope/perf/plugin/datasets/kontext_bench.py: Adds a default instruction when one is absent from a dataset item.
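The i18n change in `i18n_eval.js` amounts to registering a `col.category` string per language. A Python mirror of that idea, with a lookup that falls back to English, is sketched below; the `col.category` key comes from the changelog, while the Chinese string and the `t` helper are assumptions for illustration.

```python
# Illustrative Python mirror of the i18n_eval.js additions. The key
# 'col.category' is from the changelog; the Chinese translation and
# the lookup helper are assumed for this sketch.
I18N = {
    'en': {'col.category': 'Category'},
    'zh': {'col.category': '类别'},
}

def t(lang: str, key: str) -> str:
    # Fall back to English, then to the raw key, when a translation
    # is missing for the requested language.
    return I18N.get(lang, {}).get(key) or I18N['en'].get(key, key)

print(t('zh', 'col.category'))  # 类别
print(t('en', 'col.category'))  # Category
```

Pairing every new column with a translation key is what lets the template render the same header in both locales.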


@Yunnglin Yunnglin merged commit ce1dd98 into main Mar 16, 2026
3 checks passed
@Yunnglin Yunnglin deleted the fix/kontext_bench branch March 16, 2026 08:59
