This adds initial Reward Bench 2 metrics implementation #1055
base: main
Conversation
Commits:
- Update refs logic
- Simplify metrics code for the Ties case
- Merging ties into preference scoring
Kipok left a comment:
Thanks @dchichkov! Could you please recreate this from a branch (I sent you an invite)? And could you also add the new benchmark to the docs with a reference command / expected results for any model?
def compute_score(metrics: dict):
This needs to be updated? Currently it's a copy-paste from aai as far as I can see.
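A minimal sketch of what a Reward Bench 2–specific `compute_score` could look like, assuming the final number is just an average of per-subset accuracies; the metrics dict layout and the `accuracy` key are assumptions, not code from this PR:

```python
# Hypothetical sketch (not the PR's code): average per-subset accuracy
# into one overall score. The assumed metrics layout is
# {"subset_name": {"accuracy": float, ...}, ...}.
def compute_score(metrics: dict) -> float:
    accuracies = [
        subset["accuracy"]
        for subset in metrics.values()
        if isinstance(subset, dict) and "accuracy" in subset
    ]
    return sum(accuracies) / len(accuracies) if accuracies else 0.0
```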
dataset = load_dataset("allenai/reward-bench-2", split='test')
# select some samples from Ties
#dataset = dataset.filter(lambda x: x["subset"] == "Ties")
Debugging comments? Should these be removed?
jsonl_file = eval_config.input_file
with open(jsonl_file, "rt", encoding="utf-8") as fin:
    data = [json.loads(line) for line in fin]
with open(jsonl_file, "wt", encoding="utf-8") as fout:
It's best to have the computation logic before opening the file for writing. We should really have a shared util for this, but until we do, could you move the logic for computing all samples above, and only then write the file? Otherwise, if e.g. one of the samples happens to not have a "generation" field and line 45 crashes with an error, we will lose all data, since the original input file is being overwritten.
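A rough sketch of the suggested ordering, with `eval_config` and a `process_sample` helper as placeholders for whatever this file actually does per sample: read, compute everything in memory, and only then reopen the file for writing, so a crash during processing can't wipe out the original input.

```python
import json

# Hypothetical sketch: do all computation before the file is opened for writing.
jsonl_file = eval_config.input_file  # placeholder config object

with open(jsonl_file, "rt", encoding="utf-8") as fin:
    data = [json.loads(line) for line in fin]

# If a sample is missing the "generation" field, this raises *before*
# the input file has been truncated, so no data is lost.
processed = [process_sample(sample) for sample in data]  # placeholder helper

with open(jsonl_file, "wt", encoding="utf-8") as fout:
    for sample in processed:
        fout.write(json.dumps(sample) + "\n")
```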
from nemo_skills.evaluation.metrics.base import BaseMetrics, as_float, as_int
# This is the original reference implementation of Reward Bench Ties scoring.
should we remove this and have a link instead?
| print("Unknown sample type:", sample_type) | ||
| continue | ||
|
|
||
| print("ref_stats:", ref_stats) |
Probably best not to print things; could use LOG.debug instead?
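A small sketch of the suggested change, assuming a module-level logger (the `LOG` name mirrors the reviewer's suggestion):

```python
import logging

LOG = logging.getLogger(__name__)

# Replace the prints with debug-level logging so the output can be
# silenced or enabled via the logging configuration.
LOG.debug("Unknown sample type: %s", sample_type)
LOG.debug("ref_stats: %s", ref_stats)
```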
This adds an initial Reward Bench 2 metrics implementation. It can't be merged yet; it depends on datasets>=4.