
Conversation

@dchichkov (Collaborator) commented Nov 26, 2025

This adds an initial Reward Bench 2 metrics implementation. It can't be merged yet; it depends on `datasets>=4`.

- Update refs logic
- Simplify metrics code for the Ties case
- Merging ties into preference scoring

@Kipok (Collaborator) left a comment


Thanks @dchichkov! Could you please recreate this from a branch (I've sent you an invite)? And could you also add the new benchmark to the docs, with a reference command and expected results for any model?




```python
def compute_score(metrics: dict):
```

This needs to be updated? It currently looks like a copy-paste from aai, as far as I can see.
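For reference, a minimal sketch of what a Reward Bench 2-specific version might look like; the averaging scheme and the metrics-dict layout here are assumptions, not the benchmark's official aggregation:

```python
def compute_score(metrics: dict):
    # Hypothetical aggregation: average per-subset accuracy into a single
    # overall score. The assumption is that `metrics` maps subset names to
    # dicts containing an "accuracy" entry; adjust to the real layout.
    subset_scores = [
        subset_metrics["accuracy"]
        for subset_metrics in metrics.values()
        if isinstance(subset_metrics, dict) and "accuracy" in subset_metrics
    ]
    if subset_scores:
        metrics["overall_score"] = sum(subset_scores) / len(subset_scores)
    return metrics
```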


```python
dataset = load_dataset("allenai/reward-bench-2", split='test')
# select some samples from Ties
#dataset = dataset.filter(lambda x: x["subset"] == "Ties")
```

Debugging comments? Should these be removed?
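If a subset filter is still useful for local debugging, one option is to gate it behind an environment variable instead of leaving commented-out code; a sketch, where the variable name `REWARD_BENCH_DEBUG_SUBSET` is made up:

```python
import os

from datasets import load_dataset

dataset = load_dataset("allenai/reward-bench-2", split="test")

# Optional debug filter, enabled only when the (hypothetical) env var is set,
# e.g. REWARD_BENCH_DEBUG_SUBSET=Ties.
debug_subset = os.environ.get("REWARD_BENCH_DEBUG_SUBSET")
if debug_subset:
    dataset = dataset.filter(lambda x: x["subset"] == debug_subset)
```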

```python
jsonl_file = eval_config.input_file
with open(jsonl_file, "rt", encoding="utf-8") as fin:
    data = [json.loads(line) for line in fin]
with open(jsonl_file, "wt", encoding="utf-8") as fout:
```

It's best to have the computation logic before opening the file for writing. We should really have a shared util for this, but until we do, could you move the logic that computes all samples above, and only then write the file? Otherwise, if e.g. one of the samples happens to not have a "generation" field and line 45 crashes with an error, we will lose all data, since the original input file is being overwritten.
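Concretely, the idea is to finish all computation before the input file is reopened for writing; a sketch, where `compute_sample_metrics` is a hypothetical stand-in for the per-sample logic this file actually runs:

```python
import json

jsonl_file = eval_config.input_file
with open(jsonl_file, "rt", encoding="utf-8") as fin:
    data = [json.loads(line) for line in fin]

# Do all per-sample computation first, so that a crash here (e.g. a sample
# missing the "generation" field) leaves the original input file untouched.
results = [compute_sample_metrics(sample) for sample in data]

# Only once every result is ready, overwrite the file in one pass.
with open(jsonl_file, "wt", encoding="utf-8") as fout:
    for result in results:
        fout.write(json.dumps(result) + "\n")
```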


```python
from nemo_skills.evaluation.metrics.base import BaseMetrics, as_float, as_int

# This is the original reference implementation of Reward Bench Ties scoring.
```

Should we remove this and have a link instead?
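i.e. keep only a pointer to the upstream code, along these lines (the repo URL is real, but the exact file and commit to link should be verified):

```python
# Ties scoring follows the reference implementation from the Reward Bench
# repo: https://github.com/allenai/reward-bench
# Link the specific file and commit instead of copying the code here.
```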

print("Unknown sample type:", sample_type)
continue

print("ref_stats:", ref_stats)

Probably best not to print things; could you use LOG.debug instead?
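E.g., assuming the module sets up a standard module-level logger (the setup below is an assumption, not existing code in this file):

```python
import logging

LOG = logging.getLogger(__name__)

# Debug-level logging instead of prints, so verbosity is controlled by the
# logging configuration rather than always printed.
LOG.debug("Unknown sample type: %s", sample_type)
LOG.debug("ref_stats: %s", ref_stats)
```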
