
Conversation

@RobotSail (Member) commented May 13, 2025

This PR fixes an issue where a 0% score on a particular benchmark causes the entire evaluation to fail, due to how the null check is currently performed.

Signed-off-by: Oleg Silkin <[email protected]>
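
For context, a minimal sketch of the failure mode (the dict layout and metric key below are illustrative, not the actual `get_score_by_metric` code): `0.0` is falsy in Python, so a truthiness check treats a legitimate 0% score as missing, while an explicit `is None` check does not.

```python
# Minimal sketch of the falsy-zero pitfall; keys and structure are illustrative.
def pick_score(score_dict: dict) -> float:
    extracted_value = None
    for key, value in score_dict.items():
        if key.startswith("acc"):  # hypothetical metric key
            extracted_value = value
            break

    # Buggy form: `not 0.0` is True, so a legitimate 0% score hits the error path.
    # if not extracted_value: ...

    # Fixed form: only a genuinely missing value is treated as an error.
    if extracted_value is None:
        raise ValueError(f"no score found in {score_dict}")
    return extracted_value


print(pick_score({"acc,none": 0.0}))  # prints 0.0 instead of raising
```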

Summary by Sourcery

Improve the evaluation CLI by adding tasks support and new commands for single-directory evaluation and best-checkpoint discovery, refining output formatting, and correcting a zero-score null-check bug

New Features:

  • Add 'find_best' subcommand to locate and compare leaderboard results across checkpoints

Bug Fixes:

  • Fix null check in get_score_by_metric to correctly handle zero scores

Enhancements:

  • Enhance CLI output formatting with percent scores, best-checkpoint labeling, and rich printing

@mergify mergify bot added the ci-failure label May 13, 2025
@RobotSail (Member, Author):

@mergify rebase

Signed-off-by: Oleg Silkin <[email protected]>
mergify bot (Contributor) commented Jun 2, 2025

rebase

✅ Branch has been successfully rebased

@RobotSail (Member, Author):

@sourcery-ai review

sourcery-ai bot commented Jul 8, 2025

Reviewer's Guide

This PR fixes the 0% benchmark failure by switching to an explicit None check in get_score_by_metric, and extends the evaluate_best_checkpoint CLI—renaming the main command, adding task filtering, enriching output formatting, and introducing two new subcommands for standalone evaluation and best‐checkpoint discovery.

Sequence diagram for the new 'evaluate' CLI command

```mermaid
sequenceDiagram
    actor User
    participant CLI as Typer CLI
    participant Evaluator as LeaderboardV2Evaluator
    participant FileSystem
    User->>CLI: Run 'evaluate' with input_dir and optional tasks
    CLI->>FileSystem: Check input_dir exists and is directory
    CLI->>Evaluator: Instantiate with input_dir, num_gpus, eval_config
    CLI->>Evaluator: Set tasks (if provided)
    CLI->>Evaluator: Call run()
    Evaluator-->>CLI: Return result
    CLI->>FileSystem: Write leaderboard_results.json
    CLI->>User: Print formatted results
```
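
A rough sketch of a Typer command matching this flow; the evaluator's import path and constructor arguments, the comma-separated `--tasks` format, and the `overall_score` field are assumptions rather than the actual script:

```python
# Sketch only: evaluator kwargs, import path, and result schema are assumed.
import json
from pathlib import Path
from typing import Annotated, Optional

import typer
from rich import print

from instructlab.eval.leaderboard import LeaderboardV2Evaluator  # path inferred from this repo

app = typer.Typer()


@app.command()
def evaluate(
    input_dir: Path,
    tasks: Annotated[Optional[str], typer.Option(help="Comma-separated leaderboard tasks")] = None,
):
    """Run a single leaderboard evaluation over one checkpoint directory."""
    if not input_dir.exists() or not input_dir.is_dir():
        print(f"[red]{input_dir} is not a directory[/red]")
        raise typer.Exit(code=1)

    evaluator = LeaderboardV2Evaluator(model_path=str(input_dir), num_gpus=8)  # kwargs assumed
    if tasks:
        evaluator.tasks = tasks.split(",")
    result = evaluator.run()

    (input_dir / "leaderboard_results.json").write_text(json.dumps(result, indent=2))
    print(f"[bold]Overall:[/bold] {result['overall_score'] * 100:.2f}%")  # field name assumed
```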

Sequence diagram for the new 'find_best' CLI command

```mermaid
sequenceDiagram
    actor User
    participant CLI as Typer CLI
    participant FileSystem
    User->>CLI: Run 'find_best' with input_dir and show_all
    CLI->>FileSystem: Check input_dir exists and is directory
    CLI->>FileSystem: Find all leaderboard_results.json files
    loop For each result file
        CLI->>FileSystem: Read and parse JSON
        CLI->>CLI: Track best score and checkpoint
    end
    CLI->>User: Print best checkpoint and/or all results
```
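
A minimal sketch of the scan-and-compare loop shown above (the `overall_score` field name in the results JSON is an assumption):

```python
# Sketch of the find_best flow; the results-file schema is assumed.
import json
from pathlib import Path


def find_best_checkpoint(input_dir: Path) -> tuple[Path, float]:
    """Return the checkpoint directory with the highest overall score."""
    best_dir, best_score = None, float("-inf")
    for result_file in sorted(input_dir.rglob("leaderboard_results.json")):
        data = json.loads(result_file.read_text())
        score = data.get("overall_score")
        if score is not None and score > best_score:
            best_dir, best_score = result_file.parent, score
    if best_dir is None:
        raise FileNotFoundError(f"no leaderboard_results.json found under {input_dir}")
    return best_dir, best_score
```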

Class diagram for updated CLI commands in evaluate_best_checkpoint.py

```mermaid
classDiagram
    class LeaderboardV2Evaluator {
        +tasks: list[str]
        +run()
    }
    class TyperApp {
        +best_checkpoint(input_dir, output_file, tasks)
        +evaluate(input_dir, tasks)
        +find_best(input_dir, show_all)
    }
    TyperApp --> LeaderboardV2Evaluator : uses
```

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Fix null-check logic to handle zero scores correctly | Replaced falsy check with explicit `is None` guard for `extracted_value`; updated alias fallback to a default `[no-alias]` when none is provided | `src/instructlab/eval/leaderboard.py` |
| Refactor primary CLI entrypoint and add task filtering | Renamed `main` command to `best_checkpoint`; added an Annotated `tasks` option and propagated tasks to the evaluator | `scripts/evaluate_best_checkpoint.py` |
| Enhance CLI output with rich styling and per-metric percentages | Enumerated sorted checkpoints and highlighted the best with bold/green labels; displayed overall and individual metric scores as percentages; integrated rich print styling for headers and colored emphasis | `scripts/evaluate_best_checkpoint.py` |
| Add standalone `evaluate` and `find_best` subcommands | Implemented `evaluate` to run a single-directory evaluation and dump JSON results; implemented `find_best` to scan subdirectories for `leaderboard_results.json` and compare scores; added input-dir existence and type checks with proper exit on failure | `scripts/evaluate_best_checkpoint.py` |

@sourcery-ai bot left a comment

Hey @RobotSail - I've reviewed your changes - here's some feedback:

  • Factor out the repeated metric‐printing logic into a shared helper to avoid duplicating all those `if "leaderboard_*" in result` blocks across commands (a possible shape for such a helper is sketched after this list).
  • Add a descriptive help= string to the new tasks option and update the docstring for the best_checkpoint command (formerly main) to accurately describe its behavior.
  • Consider making num_gpus a configurable CLI option instead of hardcoding it to 8 to give users more flexibility.
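
A possible shape for the shared helper suggested in the first point above, as a sketch only: the set of `leaderboard_*` keys and the nested `score` field are inferred from the printed excerpts further down, and the exact benchmark list is illustrative.

```python
# Sketch of a shared metric-printing helper; the benchmark keys beyond
# "leaderboard_musr" and the nested "score" field are assumptions.
LEADERBOARD_LABELS = {
    "leaderboard_bbh": "BBH",
    "leaderboard_gpqa": "GPQA",
    "leaderboard_ifeval": "IFEval",
    "leaderboard_math_hard": "MATH-Hard",
    "leaderboard_mmlu_pro": "MMLU-Pro",
    "leaderboard_musr": "MUSR",
}


def print_metric_scores(result: dict) -> None:
    """Print each leaderboard metric present in the result as a percentage."""
    for key, label in LEADERBOARD_LABELS.items():
        if key in result:
            print(f"{label}: {result[key]['score'] * 100:.2f}%")
```
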
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Factor out the repeated metric‐printing logic into a shared helper to avoid duplicating all those `if "leaderboard_*" in result` blocks across commands.
- Add a descriptive `help=` string to the new `tasks` option and update the docstring for the `best_checkpoint` command (formerly `main`) to accurately describe its behavior.
- Consider making `num_gpus` a configurable CLI option instead of hardcoding it to 8 to give users more flexibility.

## Individual Comments

### Comment 1
<location> `scripts/evaluate_best_checkpoint.py:158` </location>
<code_context>
+    if "leaderboard_musr" in result:
+        print(f"MUSR: {result['leaderboard_musr']['score'] * 100:.2f}%")
+
+    output_file = input_dir / "leaderboard_results.json"
+    output_file.write_text(json.dumps(result, indent=2))
+
</code_context>

<issue_to_address>
Writing output file directly to input_dir may overwrite existing results.

Consider checking if 'leaderboard_results.json' already exists before writing, or allow users to specify a custom output path to avoid accidental overwrites.
</issue_to_address>

### Comment 2
<location> `src/instructlab/eval/leaderboard.py:254` </location>
<code_context>
             extracted_value = value
             break

-    if not extracted_value:
-        if alias := score_dict.get("alias", None):
+    if extracted_value is None:
+        if alias := score_dict.get("alias", "[no-alias]"):
</code_context>

<issue_to_address>
Changing the check from 'not extracted_value' to 'extracted_value is None' improves correctness.

This prevents valid falsy values from being misclassified as missing, improving edge case handling.
</issue_to_address>


if "leaderboard_musr" in result:
print(f"MUSR: {result['leaderboard_musr']['score'] * 100:.2f}%")

output_file = input_dir / "leaderboard_results.json"

issue (bug_risk): Writing output file directly to input_dir may overwrite existing results.

Consider checking if 'leaderboard_results.json' already exists before writing, or allow users to specify a custom output path to avoid accidental overwrites.
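
One way the overwrite risk could be handled, as a sketch (the `--output-file` and `--force` options here are hypothetical, not part of the current script):

```python
# Hypothetical overwrite guard; the output_file / force options are illustrative.
import json
from pathlib import Path
from typing import Optional

import typer


def write_results(
    result: dict,
    input_dir: Path,
    output_file: Optional[Path] = None,
    force: bool = False,
) -> None:
    target = output_file or (input_dir / "leaderboard_results.json")
    if target.exists() and not force:
        typer.echo(f"{target} already exists; pass --force or choose a different --output-file", err=True)
        raise typer.Exit(code=1)
    target.write_text(json.dumps(result, indent=2))
```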

Comment on lines -254 to -255
```python
    if not extracted_value:
        if alias := score_dict.get("alias", None):
```

suggestion: Changing the check from 'not extracted_value' to 'extracted_value is None' improves correctness.

This prevents valid falsy values from being misclassified as missing, improving edge case handling.



```python
@app.command()
def find_best(
```

issue (code-quality): Low code quality found in find_best - 19% (low-code-quality)


Explanation: The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.
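
As a concrete example of the first two suggestions, one small extraction that would shorten `find_best` and flatten its nesting (the `overall_score` field name is an assumption):

```python
# Sketch only: isolate per-file parsing in a helper so find_best just compares scores.
import json
from pathlib import Path
from typing import Optional


def load_overall_score(result_file: Path) -> Optional[float]:
    """Return the overall score from a results file, or None if it cannot be read."""
    try:
        data = json.loads(result_file.read_text())
    except (OSError, json.JSONDecodeError):
        return None  # guard clause: skip unreadable files instead of nested error handling
    return data.get("overall_score")  # field name assumed
```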

@mergify mergify bot added ci-failure and removed ci-failure labels Jul 8, 2025
@mergify mergify bot removed the ci-failure label Jul 8, 2025
@RobotSail merged commit 34e878c into instructlab:main Jul 8, 2025
15 checks passed