push-fix #253
Conversation
@mergify rebase
Signed-off-by: Oleg Silkin <[email protected]>
✅ Branch has been successfully rebased
Branch updated: 515a0ba → d96e395
@sourcery-ai review
Reviewer's Guide

This PR fixes the 0% benchmark failure by switching to an explicit None check in get_score_by_metric, and extends the evaluate_best_checkpoint CLI: the main command is renamed, task filtering is added, output formatting is enriched, and two new subcommands are introduced for standalone evaluation and best-checkpoint discovery.

Sequence diagram for the new 'evaluate' CLI command

    sequenceDiagram
      actor User
      participant CLI as Typer CLI
      participant Evaluator as LeaderboardV2Evaluator
      participant FileSystem
      User->>CLI: Run 'evaluate' with input_dir and optional tasks
      CLI->>FileSystem: Check input_dir exists and is directory
      CLI->>Evaluator: Instantiate with input_dir, num_gpus, eval_config
      CLI->>Evaluator: Set tasks (if provided)
      CLI->>Evaluator: Call run()
      Evaluator-->>CLI: Return result
      CLI->>FileSystem: Write leaderboard_results.json
      CLI->>User: Print formatted results
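A rough sketch of what this flow could look like as a Typer command is below. The evaluator import path, the constructor arguments, and the hardcoded `num_gpus=8` are assumptions inferred from the sequence diagram and review comments, not the actual `scripts/evaluate_best_checkpoint.py` code.

```python
# Hedged sketch of the 'evaluate' flow above; constructor arguments and the
# evaluator import path are assumptions taken from the sequence diagram.
import json
from pathlib import Path
from typing import List, Optional

import typer

app = typer.Typer()


@app.command()
def evaluate(
    input_dir: Path,
    tasks: Optional[List[str]] = typer.Option(None, help="Subset of leaderboard tasks to run"),
):
    """Run a standalone leaderboard evaluation on a single checkpoint directory."""
    if not input_dir.exists() or not input_dir.is_dir():
        typer.echo(f"{input_dir} is not a directory")
        raise typer.Exit(code=1)

    # Assumed import location for the evaluator used by this script.
    from instructlab.eval.leaderboard import LeaderboardV2Evaluator

    evaluator = LeaderboardV2Evaluator(str(input_dir), num_gpus=8)
    if tasks:
        evaluator.tasks = tasks
    result = evaluator.run()

    (input_dir / "leaderboard_results.json").write_text(json.dumps(result, indent=2))
    if "leaderboard_musr" in result:
        print(f"MUSR: {result['leaderboard_musr']['score'] * 100:.2f}%")


if __name__ == "__main__":
    app()
```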
Sequence diagram for the new 'find_best' CLI command

    sequenceDiagram
      actor User
      participant CLI as Typer CLI
      participant FileSystem
      User->>CLI: Run 'find_best' with input_dir and show_all
      CLI->>FileSystem: Check input_dir exists and is directory
      CLI->>FileSystem: Find all leaderboard_results.json files
      loop For each result file
        CLI->>FileSystem: Read and parse JSON
        CLI->>CLI: Track best score and checkpoint
      end
      CLI->>User: Print best checkpoint and/or all results
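The corresponding command could look roughly like the sketch below; the `rglob` pattern and the `"overall_score"` ranking key are assumptions, and the real command may read a different field from each results file.

```python
# Hedged sketch of the 'find_best' flow above; the "overall_score" key and the
# directory layout are assumptions rather than the actual result schema.
import json
from pathlib import Path

import typer

app = typer.Typer()


@app.command()
def find_best(input_dir: Path, show_all: bool = False):
    """Locate the checkpoint with the highest leaderboard score under input_dir."""
    if not input_dir.exists() or not input_dir.is_dir():
        typer.echo(f"{input_dir} is not a directory")
        raise typer.Exit(code=1)

    best_score, best_checkpoint = float("-inf"), None
    for result_file in input_dir.rglob("leaderboard_results.json"):
        result = json.loads(result_file.read_text())
        score = result.get("overall_score", 0.0)  # assumed field name
        if show_all:
            typer.echo(f"{result_file.parent}: {score * 100:.2f}%")
        if score > best_score:
            best_score, best_checkpoint = score, result_file.parent

    if best_checkpoint is None:
        typer.echo("No leaderboard_results.json files found")
        raise typer.Exit(code=1)
    typer.echo(f"Best checkpoint: {best_checkpoint} ({best_score * 100:.2f}%)")
```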
Class diagram for updated CLI commands in evaluate_best_checkpoint.py

    classDiagram
      class LeaderboardV2Evaluator {
        +tasks: list[str]
        +run()
      }
      class TyperApp {
        +best_checkpoint(input_dir, output_file, tasks)
        +evaluate(input_dir, tasks)
        +find_best(input_dir, show_all)
      }
      TyperApp --> LeaderboardV2Evaluator : uses
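For readers without the repo open, the surface the CLI relies on amounts to something like the stub below. It only mirrors the class diagram; the constructor parameters and the result layout are assumptions, not the real implementation in src/instructlab/eval/leaderboard.py.

```python
# Interface stub mirroring the class diagram above; illustrative only.
from typing import Optional


class LeaderboardV2EvaluatorStub:
    def __init__(self, input_dir: str, num_gpus: int = 8, eval_config: Optional[dict] = None):
        self.input_dir = input_dir
        self.num_gpus = num_gpus
        self.eval_config = eval_config
        self.tasks: list[str] = []  # CLI commands may override this before run()

    def run(self) -> dict:
        # The real evaluator runs the leaderboard benchmarks and returns a dict
        # of per-task results, e.g. {"leaderboard_musr": {"score": 0.42}, ...}.
        raise NotImplementedError
```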
Hey @RobotSail - I've reviewed your changes - here's some feedback:
- Factor out the repeated metric-printing logic into a shared helper to avoid duplicating all those `if "leaderboard_*" in result` blocks across commands.
- Add a descriptive `help=` string to the new `tasks` option and update the docstring for the `best_checkpoint` command (formerly `main`) to accurately describe its behavior.
- Consider making `num_gpus` a configurable CLI option instead of hardcoding it to 8 to give users more flexibility.
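To make the first and third points concrete, here is one possible shape for a shared helper. Only the `leaderboard_musr` key and its `['score']` layout appear in this PR's snippets; the other task keys follow Open LLM Leaderboard v2 naming and are assumptions about this repo's result dict.

```python
# Hypothetical shared helper; all keys except leaderboard_musr are assumed.
def print_leaderboard_results(result: dict) -> None:
    """Print every known leaderboard metric present in the result dict."""
    labels = {
        "leaderboard_musr": "MUSR",
        "leaderboard_bbh": "BBH",
        "leaderboard_gpqa": "GPQA",
        "leaderboard_ifeval": "IFEval",
        "leaderboard_math_hard": "MATH-Hard",
        "leaderboard_mmlu_pro": "MMLU-Pro",
    }
    for key, label in labels.items():
        if key in result:
            print(f"{label}: {result[key]['score'] * 100:.2f}%")
```

The third bullet could be handled similarly on each command, e.g. `num_gpus: int = typer.Option(8, help="Number of GPUs to use")` instead of the hardcoded value.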
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Factor out the repeated metric‐printing logic into a shared helper to avoid duplicating all those `if "leaderboard_*" in result` blocks across commands.
- Add a descriptive `help=` string to the new `tasks` option and update the docstring for the `best_checkpoint` command (formerly `main`) to accurately describe its behavior.
- Consider making `num_gpus` a configurable CLI option instead of hardcoding it to 8 to give users more flexibility.
## Individual Comments
### Comment 1
<location> `scripts/evaluate_best_checkpoint.py:158` </location>
<code_context>
+ if "leaderboard_musr" in result:
+ print(f"MUSR: {result['leaderboard_musr']['score'] * 100:.2f}%")
+
+ output_file = input_dir / "leaderboard_results.json"
+ output_file.write_text(json.dumps(result, indent=2))
+
</code_context>
<issue_to_address>
Writing output file directly to input_dir may overwrite existing results.
Consider checking if 'leaderboard_results.json' already exists before writing, or allow users to specify a custom output path to avoid accidental overwrites.
</issue_to_address>
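A minimal way to address this, assuming hypothetical `output_file` and `force` parameters that are not in the current script, might look like:

```python
# Hypothetical overwrite guard; the output_file/force parameter names are
# illustrative and not part of the existing command.
import json
from pathlib import Path
from typing import Optional

import typer


def write_results(result: dict, input_dir: Path,
                  output_file: Optional[Path] = None, force: bool = False) -> Path:
    target = output_file or (input_dir / "leaderboard_results.json")
    if target.exists() and not force:
        typer.echo(f"{target} already exists; pass --output-file or --force to overwrite")
        raise typer.Exit(code=1)
    target.write_text(json.dumps(result, indent=2))
    return target
```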
### Comment 2
<location> `src/instructlab/eval/leaderboard.py:254` </location>
<code_context>
extracted_value = value
break
- if not extracted_value:
- if alias := score_dict.get("alias", None):
+ if extracted_value is None:
+ if alias := score_dict.get("alias", "[no-alias]"):
</code_context>
<issue_to_address>
Changing the check from 'not extracted_value' to 'extracted_value is None' improves correctness.
This prevents valid falsy values from being misclassified as missing, improving edge case handling.
</issue_to_address>
scripts/evaluate_best_checkpoint.py
| if "leaderboard_musr" in result: | ||
| print(f"MUSR: {result['leaderboard_musr']['score'] * 100:.2f}%") | ||
|
|
||
| output_file = input_dir / "leaderboard_results.json" |
issue (bug_risk): Writing output file directly to input_dir may overwrite existing results.
Consider checking if 'leaderboard_results.json' already exists before writing, or allow users to specify a custom output path to avoid accidental overwrites.
src/instructlab/eval/leaderboard.py

    if not extracted_value:
        if alias := score_dict.get("alias", None):
suggestion: Changing the check from 'not extracted_value' to 'extracted_value is None' improves correctness.
This prevents valid falsy values from being misclassified as missing, improving edge case handling.
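To make the edge case concrete, here is a simplified, hypothetical version of the lookup. It is not the real get_score_by_metric from src/instructlab/eval/leaderboard.py, and the key matching is a stand-in, but it shows why `not extracted_value` misfires on a 0% score.

```python
# Simplified illustration only; the real function's lookup and error handling
# differ, but the truthiness pitfall is the same.
def get_score_by_metric(score_dict: dict, metric: str):
    extracted_value = None
    for key, value in score_dict.items():
        if key.startswith(metric):  # stand-in for the real matching logic
            extracted_value = value
            break

    # Old check: `if not extracted_value:` also fired on a legitimate 0.0
    # (a 0% benchmark result), so the whole evaluation failed.
    if extracted_value is None:
        alias = score_dict.get("alias", "[no-alias]")
        raise ValueError(f"no value found for metric {metric!r} (alias: {alias})")
    return extracted_value


# A 0% score is now returned instead of being treated as missing.
assert get_score_by_metric({"acc": 0.0, "alias": "leaderboard_bbh"}, "acc") == 0.0
```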
scripts/evaluate_best_checkpoint.py

    @app.command()
    def find_best(
issue (code-quality): Low code quality found in find_best - 19% (low-code-quality)
Explanation
The quality score for this function is below the quality threshold of 25%. This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into their own functions. This is the most important thing you can do - ideally a function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts sits together within the function rather than being scattered.
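Applied to the find_best sketch shown earlier, that advice could look roughly like this; the helper name and the "overall_score" field remain assumptions rather than the actual implementation.

```python
# Hypothetical refactor: a guard clause for the directory check and the
# per-file scoring extracted into its own helper.
import json
from pathlib import Path
from typing import Iterator, Tuple

import typer


def iter_checkpoint_scores(input_dir: Path) -> Iterator[Tuple[Path, float]]:
    """Yield (checkpoint_dir, score) for every leaderboard_results.json found."""
    for result_file in input_dir.rglob("leaderboard_results.json"):
        result = json.loads(result_file.read_text())
        yield result_file.parent, result.get("overall_score", 0.0)  # assumed key


def find_best(input_dir: Path, show_all: bool = False) -> None:
    if not input_dir.is_dir():  # guard clause: fail fast instead of nesting
        raise typer.Exit(code=1)

    scores = list(iter_checkpoint_scores(input_dir))
    if not scores:
        raise typer.Exit(code=1)

    if show_all:
        for checkpoint, score in scores:
            typer.echo(f"{checkpoint}: {score * 100:.2f}%")
    best_checkpoint, best_score = max(scores, key=lambda pair: pair[1])
    typer.echo(f"Best checkpoint: {best_checkpoint} ({best_score * 100:.2f}%)")
```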
Branch updated: b609626 → ffe9c94
This PR fixes an issue where, if we get 0% on a particular benchmark, the entire evaluation fails due to how the null check is currently being performed.
Signed-off-by: Oleg Silkin <[email protected]>
Summary by Sourcery
Improve the evaluation CLI by adding task selection support and new commands for standalone evaluation and best-checkpoint discovery, refining output formatting, and correcting a zero-score null-check bug.
New Features:
- A tasks option for selecting which leaderboard tasks to run.
- An evaluate command for standalone evaluation of a checkpoint directory.
- A find_best command for discovering the best checkpoint from existing results.
Bug Fixes:
- Use an explicit None check in get_score_by_metric so a 0% benchmark score no longer fails the entire evaluation.
Enhancements:
- Rename the main command to best_checkpoint and enrich the formatted output of evaluation results.