chore: Fixup integration of accuracy runs (part of #4) #95

Merged
arekay-nv merged 9 commits into main from arekay/chores_benchmark_integration on Jan 15, 2026
Conversation

@arekay-nv (Collaborator) commented Jan 13, 2026

What does this PR do?

Addresses review comments from the previous PR on integrating accuracy datasets into the benchmark command (part of #4).

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
@arekay-nv arekay-nv requested a review from a team as a code owner January 13, 2026 03:55
Copilot AI review requested due to automatic review settings January 13, 2026 03:55
github-actions bot commented Jan 13, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions bot requested a review from nvzhihanj January 13, 2026 03:56
@gemini-code-assist

Summary of Changes

Hello @arekay-nv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request makes accuracy evaluation in the benchmark command more configurable. It introduces a dedicated AccuracyConfiguration dataclass and a dynamic Scorer registration system, so that scoring methods can be selected through configuration rather than hardcoded, and it adds per-dataset control over how many times a dataset is repeated during accuracy runs.

Highlights

  • Refactored Accuracy Evaluation: Introduced a new AccuracyConfiguration dataclass and a Scorer registry for more flexible and extensible accuracy evaluation.
  • Dynamic Scorer Selection: Enabled dynamic selection of scoring methods (e.g., PassAt1Scorer) based on configuration, replacing hardcoded usage.
  • Configurable Dataset Repeats: Added num_repeats to accuracy dataset configurations, allowing users to specify how many times a dataset should be repeated for evaluation.
  • YAML Configuration Updates: Updated example YAML files to include the new num_repeats parameter and a modified report directory name.
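
The dataclass refactor and the num_repeats option from the highlights above can be pictured with a minimal sketch. The field names (`dataset_name`, `scorer_name`, `num_repeats`), the `"pass_at_1"` key, and the `expand_samples` helper are assumptions made for illustration; the real AccuracyConfiguration definition is not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccuracyConfiguration:
    """Hypothetical stand-in for the dataclass introduced by the PR."""

    dataset_name: str
    scorer_name: str = "pass_at_1"  # assumed lookup key, not confirmed by the PR
    num_repeats: int = 1  # assumed default, mirroring the schema default of 1

    def __post_init__(self) -> None:
        # Reject values that would produce an empty evaluation run.
        if self.num_repeats < 1:
            raise ValueError("num_repeats must be >= 1")


def expand_samples(samples: list[str], config: AccuracyConfiguration) -> list[str]:
    """Repeat the whole dataset num_repeats times, preserving sample order."""
    return list(samples) * config.num_repeats


config = AccuracyConfiguration(dataset_name="demo", num_repeats=3)
repeated = expand_samples(["q1", "q2"], config)
```

Named fields make call sites self-describing in a way a positional tuple does not, which is likely the motivation for the refactor, though the PR page does not say so explicitly.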


Copilot AI left a comment

Pull request overview

This PR addresses feedback from a previous pull request by refactoring the integration of accuracy datasets in the benchmark command. The changes introduce a registration system for scorers, add support for configurable dataset repeats, and improve code organization through the use of a dataclass.

Changes:

  • Added a scorer registration system using __init_subclass__ to enable dynamic scorer lookup by name
  • Introduced num_repeats configuration parameter for accuracy evaluation datasets
  • Refactored accuracy evaluation configuration using a dedicated AccuracyConfiguration dataclass
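
A registration system built on `__init_subclass__` might look roughly like the sketch below. The `PREDEFINED` name comes from the file summary in this review, but its shape as a name-to-class dict, the `scorer_name` class keyword, the `get` method, and the toy exact-match scoring are all assumptions; the actual scoring.py is not reproduced here.

```python
from __future__ import annotations


class Scorer:
    """Base class that records every named subclass in a shared registry."""

    # Assumed shape: maps a scorer name to its class.
    PREDEFINED: dict[str, type[Scorer]] = {}

    def __init_subclass__(cls, scorer_name: str | None = None, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        if scorer_name is not None:
            if scorer_name in Scorer.PREDEFINED:
                raise ValueError(f"duplicate scorer name: {scorer_name}")
            Scorer.PREDEFINED[scorer_name] = cls

    @classmethod
    def get(cls, name: str) -> type[Scorer]:
        """Look up a registered scorer class by name."""
        try:
            return cls.PREDEFINED[name]
        except KeyError:
            known = sorted(cls.PREDEFINED)
            raise KeyError(f"unknown scorer {name!r}; known: {known}") from None

    def score(self, predictions: list[str], references: list[str]) -> float:
        raise NotImplementedError


class PassAt1Scorer(Scorer, scorer_name="pass_at_1"):
    """Toy exact-match scorer; the real PassAt1Scorer logic is not shown here."""

    def score(self, predictions: list[str], references: list[str]) -> float:
        if not references:
            return 0.0
        hits = sum(p == r for p, r in zip(predictions, references))
        return hits / len(references)


scorer_cls = Scorer.get("pass_at_1")
result = scorer_cls().score(["a", "b"], ["a", "c"])
```

Because registration happens when the subclass is defined, adding a new scorer in this sketch requires no edits to a central lookup table; the configuration only needs to carry the scorer's name.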

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/inference_endpoint/evaluation/scoring.py | Added scorer registration system with PREDEFINED class variable and lookup methods, updated PassAt1Scorer to use registration |
| src/inference_endpoint/dataset_manager/factory.py | Added **kwargs support to pass through additional parameters like num_repeats to data loaders |
| src/inference_endpoint/config/schema.py | Added num_repeats field to AccuracyConfig with default value of 1 |
| src/inference_endpoint/commands/benchmark.py | Replaced tuple-based accuracy configuration with AccuracyConfiguration dataclass, improved variable naming from dataset_id to dataset_name |
| examples/04_GPTOSS120B_Example/sglang_gptoss_120b_example.yaml | Added num_repeats values to example accuracy configurations and updated report directory path |

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the integration of accuracy datasets in the benchmark command, addressing comments from a previous PR. The changes are a significant improvement, introducing a dataclass for accuracy configuration and a factory pattern for scorers, which makes the code more readable, modular, and extensible. The logic for handling multiple accuracy datasets is now much cleaner. I've identified a couple of issues in src/inference_endpoint/dataset_manager/factory.py. One is a bug where num_repeats is ignored for non-predefined datasets, and the other is a confusing type hint that affects maintainability. My detailed comments are below.
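
The num_repeats issue described in this review can be illustrated with a small sketch of a factory that forwards extra keyword arguments on both code paths. Everything here (`ListLoader`, `PREDEFINED_SAMPLES`, `create_loader`) is hypothetical and only shows the general pattern; it is not the code in factory.py.

```python
from typing import Any, Optional


class ListLoader:
    """Hypothetical loader that repeats its samples num_repeats times."""

    def __init__(self, samples: list[str], num_repeats: int = 1) -> None:
        self.samples = list(samples) * num_repeats


# Hypothetical table of predefined datasets, keyed by name.
PREDEFINED_SAMPLES: dict[str, list[str]] = {"demo": ["q1", "q2"]}


def create_loader(
    name: str, samples: Optional[list[str]] = None, **kwargs: Any
) -> ListLoader:
    """Build a loader, forwarding extra options such as num_repeats."""
    if name in PREDEFINED_SAMPLES:
        return ListLoader(PREDEFINED_SAMPLES[name], **kwargs)
    if samples is None:
        raise ValueError(f"no samples supplied for non-predefined dataset {name!r}")
    # Omitting **kwargs on this branch would silently drop num_repeats,
    # which is the kind of bug the review describes.
    return ListLoader(samples, **kwargs)


predefined = create_loader("demo", num_repeats=3)
custom = create_loader("my_dataset", samples=["a", "b", "c"], num_repeats=2)
```

Forwarding the same `**kwargs` on every branch keeps predefined and user-supplied datasets consistent, so an option set in the configuration cannot be honored on one path and ignored on the other.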

@nvzhihanj nvzhihanj changed the title from "chore: Fixup integration of accuracy runs" to "chore: Fixup integration of accuracy runs (part of #4)" Jan 13, 2026
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Copilot AI review requested due to automatic review settings January 14, 2026 15:36
Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.


Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Copilot AI review requested due to automatic review settings January 14, 2026 20:55
Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.


@arekay-nv arekay-nv force-pushed the arekay/chores_benchmark_integration branch from 64176d7 to e1d2c5c January 14, 2026 20:59
Copilot AI review requested due to automatic review settings January 14, 2026 21:06
@arekay-nv arekay-nv force-pushed the arekay/chores_benchmark_integration branch from e1d2c5c to e7d070b January 14, 2026 21:06
Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.


Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
@arekay-nv arekay-nv force-pushed the arekay/chores_benchmark_integration branch from e7d070b to fc5efa1 January 14, 2026 21:09
Copilot AI review requested due to automatic review settings January 14, 2026 22:35
Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.


Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
@arekay-nv arekay-nv force-pushed the arekay/chores_benchmark_integration branch from b645ca7 to a365b9a January 14, 2026 22:38
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Copilot AI review requested due to automatic review settings January 14, 2026 22:52
Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.


Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Copilot AI review requested due to automatic review settings January 15, 2026 04:30
Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.


@arekay-nv arekay-nv merged commit d53ad87 into main Jan 15, 2026
4 checks passed
@arekay-nv arekay-nv deleted the arekay/chores_benchmark_integration branch January 15, 2026 04:38
@github-actions github-actions bot locked and limited conversation to collaborators Jan 15, 2026
3 participants