This adds initial Reward Bench 2 metrics implementation #1055
base: main
Conversation
Commits:
- Update refs logic
- Simplify metrics code for the Ties case
- Merging ties into preference scoring
Kipok left a comment:
Thanks @dchichkov! Could you please recreate this from a branch (I sent you an invite)? And could you also add the new benchmark to the docs with a reference command / expected results for any model?
def compute_score(metrics: dict):
This needs to be updated? Currently it's a copy-paste from aai as far as I can see.
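A minimal sketch of what a Reward Bench 2–specific `compute_score` could look like, assuming the final number is just an average of per-subset accuracies; the metrics dict layout and the `accuracy` key are assumptions, not code from this PR:

```python
# Hypothetical sketch (not the PR's code): average per-subset accuracy
# into one overall score. The assumed metrics layout is
# {"subset_name": {"accuracy": float, ...}, ...}.
def compute_score(metrics: dict) -> float:
    accuracies = [
        subset["accuracy"]
        for subset in metrics.values()
        if isinstance(subset, dict) and "accuracy" in subset
    ]
    return sum(accuracies) / len(accuracies) if accuracies else 0.0
```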
dataset = load_dataset("allenai/reward-bench-2", split='test')
# select some samples from Ties
#dataset = dataset.filter(lambda x: x["subset"] == "Ties")
Debugging comments? Should these be removed?
jsonl_file = eval_config.input_file
with open(jsonl_file, "rt", encoding="utf-8") as fin:
    data = [json.loads(line) for line in fin]
with open(jsonl_file, "wt", encoding="utf-8") as fout:
It's best to have the computation logic before opening the file for writing. We should really have a shared util for this, but until we do, could you move the logic for computing all samples above, and only then write the file? Otherwise, if e.g. one of the samples happens to not have a "generation" field and line 45 crashes with an error, we will lose all data, since the original input file is being overwritten.
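A rough sketch of the suggested ordering, with `eval_config` and a `process_sample` helper as placeholders for whatever this file actually does per sample: read, compute everything in memory, and only then reopen the file for writing, so a crash during processing can't wipe out the original input.

```python
import json

# Hypothetical sketch: do all computation before the file is opened for writing.
jsonl_file = eval_config.input_file  # placeholder config object

with open(jsonl_file, "rt", encoding="utf-8") as fin:
    data = [json.loads(line) for line in fin]

# If a sample is missing the "generation" field, this raises *before*
# the input file has been truncated, so no data is lost.
processed = [process_sample(sample) for sample in data]  # placeholder helper

with open(jsonl_file, "wt", encoding="utf-8") as fout:
    for sample in processed:
        fout.write(json.dumps(sample) + "\n")
```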
from nemo_skills.evaluation.metrics.base import BaseMetrics, as_float, as_int
# This is the original reference implementation of Reward Bench Ties scoring.
should we remove this and have a link instead?
| print("Unknown sample type:", sample_type) | ||
| continue | ||
|
|
||
| print("ref_stats:", ref_stats) |
Probably best not to print things; could use LOG.debug instead?
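A small sketch of the suggested change, assuming a module-level logger (the `LOG` name mirrors the reviewer's suggestion):

```python
import logging

LOG = logging.getLogger(__name__)

# Replace the prints with debug-level logging so the output can be
# silenced or enabled via the logging configuration.
LOG.debug("Unknown sample type: %s", sample_type)
LOG.debug("ref_stats: %s", ref_stats)
```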
This adds an initial Reward Bench 2 metrics implementation. It can't be merged yet; it depends on datasets>=4.