chore: Fixup integration of accuracy runs (part of #4)#95
Conversation
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
Summary of ChangesHello @arekay-nv, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the flexibility and configurability of accuracy evaluations within the benchmark command. By introducing a dedicated Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Pull request overview
This PR addresses feedback from a previous pull request by refactoring the integration of accuracy datasets in the benchmark command. The changes introduce a registration system for scorers, add support for configurable dataset repeats, and improve code organization through the use of a dataclass.
Changes:
- Added a scorer registration system using
__init_subclass__to enable dynamic scorer lookup by name - Introduced
num_repeatsconfiguration parameter for accuracy evaluation datasets - Refactored accuracy evaluation configuration using a dedicated
AccuracyConfigurationdataclass
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/inference_endpoint/evaluation/scoring.py | Added scorer registration system with PREDEFINED class variable and lookup methods, updated PassAt1Scorer to use registration |
| src/inference_endpoint/dataset_manager/factory.py | Added **kwargs support to pass through additional parameters like num_repeats to data loaders |
| src/inference_endpoint/config/schema.py | Added num_repeats field to AccuracyConfig with default value of 1 |
| src/inference_endpoint/commands/benchmark.py | Replaced tuple-based accuracy configuration with AccuracyConfiguration dataclass, improved variable naming from dataset_id to dataset_name |
| examples/04_GPTOSS120B_Example/sglang_gptoss_120b_example.yaml | Added num_repeats values to example accuracy configurations and updated report directory path |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Code Review
This pull request refactors the integration of accuracy datasets in the benchmark command, addressing comments from a previous PR. The changes are a significant improvement, introducing a dataclass for accuracy configuration and a factory pattern for scorers, which makes the code more readable, modular, and extensible. The logic for handling multiple accuracy datasets is now much cleaner. I've identified a couple of issues in src/inference_endpoint/dataset_manager/factory.py. One is a bug where num_repeats is ignored for non-predefined datasets, and the other is a confusing type hint that affects maintainability. My detailed comments are below.
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/inference_endpoint/dataset_manager/predefined/livecodebench/__init__.py
Outdated
Show resolved
Hide resolved
src/inference_endpoint/dataset_manager/predefined/livecodebench/__init__.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/inference_endpoint/dataset_manager/predefined/livecodebench/__init__.py
Outdated
Show resolved
Hide resolved
64176d7 to
e1d2c5c
Compare
e1d2c5c to
e7d070b
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/inference_endpoint/dataset_manager/predefined/livecodebench/__init__.py
Outdated
Show resolved
Hide resolved
src/inference_endpoint/dataset_manager/predefined/gpqa/__init__.py
Outdated
Show resolved
Hide resolved
src/inference_endpoint/dataset_manager/predefined/aime25/__init__.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
e7d070b to
fc5efa1
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
b645ca7 to
a365b9a
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
What does this PR do?
Addresses comments in previous PR on integration of accuracy datasets in benchmark command.
#4
Type of change
Related issues
Testing
Checklist