Add arena-hard v2 #1205
base: main
Conversation
Signed-off-by: bzantium <[email protected]>
Signed-off-by: Igor Gitman <[email protected]>
Signed-off-by: Igor Gitman <[email protected]>
Signed-off-by: Igor Gitman <[email protected]>
Signed-off-by: Igor Gitman <[email protected]>
Signed-off-by: Igor Gitman <[email protected]>
Signed-off-by: Igor Gitman <[email protected]>
1 file reviewed, 1 comment
```python
            # will fail if category not in category_prompts as this is unexpected
            prompt = self.category_prompts[category]
```
Will raise a KeyError for the hard_prompt category, since category_prompts only contains creative_writing.
Suggested change:

```suggestion
        else:
            # use category-specific prompt if available, otherwise fall back to default
            prompt = self.category_prompts.get(category, self.prompt)
```
Signed-off-by: Igor Gitman <[email protected]>
3 files reviewed, 3 comments
```python
        super().update(predictions)
        self.scores.append([])
        self.agg_mode = f"pass@{len(predictions)}"

        # Track category for per-category scoring (defaults to None for v1 compatibility)
        category = predictions[0].get("category")
        self.categories.append(category)
```
ArenaMetrics.update() now unconditionally appends predictions[0].get("category") to self.categories, but get_metrics() only uses per-category aggregation when len(set(self.categories)) > 1. If a mixed run includes some rows without category (or category=None) alongside real categories, unique_categories becomes {None, "..."} and you’ll emit a category_None bucket in metrics, which is usually unintended and makes downstream consumers brittle.
This shows up when evaluating older/partially-prepared datasets or when a pipeline forgets to propagate category for some items.
A safer approach is to only create per-category buckets for non-None categories (or to require that all entries have a non-null category before enabling breakdown).
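A minimal sketch of the stricter option, assuming the field and helper names that appear in the metric snippets quoted later in this review (`self.categories`, `category_scores`, `get_aggregate_score`, `overall_metrics`):

```python
# Sketch only: enable the per-category breakdown only when every entry carries a real category,
# so mixed v1/v2 runs never emit a "category_None" bucket. Names mirror the quoted snippets.
unique_categories = set(self.categories)
if len(unique_categories) > 1 and None not in unique_categories:
    for category, scores in category_scores.items():
        cat_metrics = {"num_entries": len(scores)}
        cat_metrics.update(get_aggregate_score(scores))
        overall_metrics[f"category_{category}"] = cat_metrics
```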
```python
    def setup_prompt(self):
        if self.cfg.prompt_format == "openai":
            return None
```
setup_prompt() returns None for prompt_format == "openai", but fill_prompt() for openai delegates to super().fill_prompt(...) without any category-specific override. If arena-hard-v2 is run with prompt_format=openai, the creative_writing category will silently use the default judge prompt (and may reintroduce the “judge generates its own answer first” behavior this PR is trying to avoid).
If category-specific judging is required for v2, this likely needs an equivalent branch for the OpenAI prompt path (e.g., selecting different message templates/configs per category) or explicitly disallowing openai format for v2.
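For illustration only, a minimal sketch of such per-category selection for the OpenAI path; the config names and helper below are hypothetical, not the repository's actual API:

```python
# Hypothetical sketch: map judge categories to OpenAI prompt configs, falling back to a default.
# DEFAULT_JUDGE_CONFIG and CATEGORY_JUDGE_CONFIGS are illustrative names, not taken from the PR.
DEFAULT_JUDGE_CONFIG = "judge/arena"
CATEGORY_JUDGE_CONFIGS = {
    "creative_writing": "judge/arena-creative-writing",
}

def select_openai_judge_config(data_point: dict) -> str:
    """Pick the judge prompt config for a data point's category, defaulting when unmapped."""
    category = data_point.get("category")
    return CATEGORY_JUDGE_CONFIGS.get(category, DEFAULT_JUDGE_CONFIG)
```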
```python
    # Create test.jsonl with category-specific baseline answers
    with open(questions_file, "rt", encoding="utf-8") as fin, open(output_file, "wt", encoding="utf-8") as fout:
        for line in fin:
            data = json.loads(line)
            data["question"] = data.pop("prompt")
            category = data["category"]
            data["baseline_answer"] = baseline_answers[data["uid"]][category]
            fout.write(json.dumps(data) + "\n")
```
prepare.py assumes every question uid exists in baseline_answers and that it has an entry for the question’s category (baseline_answers[data["uid"]][category]). If the upstream dataset adds a new category, or if a baseline file is missing/partial, this will raise KeyError and stop dataset preparation.
Given v2 explicitly has multiple baselines by category, it would be safer to fail with a clearer error that prints the missing (uid, category) (or to handle unknown categories explicitly) so users can debug mismatched dataset/baseline versions.
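One hedged way to do that is an up-front validation pass before writing test.jsonl; the helper below is only a sketch, with an assumed name and an assumed `uid -> {category: answer}` layout for `baseline_answers`:

```python
def validate_baselines(questions: list[dict], baseline_answers: dict) -> None:
    """Sketch: fail early with a readable list of every missing (uid, category) pair."""
    missing = [
        (q["uid"], q["category"])
        for q in questions
        if q["category"] not in baseline_answers.get(q["uid"], {})
    ]
    if missing:
        raise ValueError(
            f"Missing baseline answers for {len(missing)} (uid, category) pairs, "
            f"e.g. {missing[:5]}; check that the dataset and baseline versions match."
        )
```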
📝 Walkthrough

This PR introduces arena-hard-v2 benchmark evaluation support, including new data preparation scripts, dataset configuration, per-category metrics tracking in arena metrics, creative writing prompt support in the arena judge, and comprehensive test coverage for the new functionality.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@nemo_skills/inference/eval/arena_judge.py`:
- Around lines 104-110: The code currently does a direct lookup, `self.category_prompts[category]`, which raises a KeyError for categories like "hard_prompt" that are present but unmapped. Change the selection to fall back safely to the default prompt by using a membership check or `dict.get`, so that `prompt = self.category_prompts.get(category, self.prompt)` (or, if you prefer, check `if category in self.category_prompts` and assign accordingly), ensuring `data_point.get("category")`, `self.category_prompts`, and `self.prompt` are used as the referenced symbols.
🧹 Nitpick comments (4)
nemo_skills/evaluation/metrics/arena_metrics.py (1)
103-105: Consider adding `strict=True` to the zip call for defensive programming.

Since `self.scores` and `self.categories` are populated together in `update()`, their lengths should always match. Adding `strict=True` would catch any future bugs where they diverge.

Proposed fix:

```diff
-        for score, category in zip(self.scores, self.categories):
+        for score, category in zip(self.scores, self.categories, strict=True):
```

docs/evaluation/other-benchmarks.md (1)
45-49: Add a language specifier to the fenced code block. The code block at line 45 is missing a language specifier for the results table.

Proposed fix:

````diff
-```
+```text
 ------------------------------------------- arena-hard -------------------------------------------
````

tests/test_arena_metrics.py (1)
31-154: Consider adding arena-hard-v2 to the GPU and slurm test suites. The new per-category scoring logic and category-specific baselines (hard_prompt and creative_writing) represent significant evaluation changes. Based on CONTRIBUTING.md guidance, new benchmarks with complicated evaluation/metrics logic should be added to slurm tests for comprehensive evaluation, and GPU tests should be run in CI.
nemo_skills/dataset/arena-hard-v2/prepare.py (1)
31-38: Returning an empty string silently masks missing assistant messages.

If no assistant message is found, returning `""` will silently propagate an empty baseline answer. Per coding guidelines, errors should surface rather than fail silently. Consider raising an error or removing the fallback entirely to let the `StopIteration` or `KeyError` surface naturally.

♻️ Proposed refactor to let errors surface:

```diff
 def extract_answer_text(data):
     """Extract the answer text from the baseline model's response format."""
     messages = data["messages"]
     for msg in messages:
         if msg["role"] == "assistant":
             content = msg["content"]
             return content["answer"] if isinstance(content, dict) else content
-    return ""
+    raise ValueError(f"No assistant message found in baseline data: {data.get('uid', 'unknown')}")
```

As per coding guidelines: "Follow the Zen of Python principles: prioritize... letting errors surface rather than silently failing".
```python
        # Select the appropriate prompt based on category. If not defined, forcing fall-back to default prompt
        category = data_point.get("category")
        if not category:
            prompt = self.prompt
        else:
            # will fail if category not in category_prompts as this is unexpected
            prompt = self.category_prompts[category]
```
Bug: hard_prompt category will raise KeyError instead of falling back to default prompt.
The current logic falls back to default prompt only when category is falsy (None/empty). However, for categories like hard_prompt that exist but aren't in category_prompts, line 110 will raise a KeyError.
Based on the comment at line 75-76 ("default prompt used for most categories including hard_prompt"), the intent is to fall back to default for unmapped categories.
Proposed fix
```diff
         # Select the appropriate prompt based on category. If not defined, forcing fall-back to default prompt
         category = data_point.get("category")
-        if not category:
-            prompt = self.prompt
-        else:
-            # will fail if category not in category_prompts as this is unexpected
-            prompt = self.category_prompts[category]
+        if category and category in self.category_prompts:
+            prompt = self.category_prompts[category]
+        else:
+            prompt = self.prompt
```

🤖 Prompt for AI Agents
In `@nemo_skills/inference/eval/arena_judge.py`, around lines 104-110: the code currently does a direct lookup, `self.category_prompts[category]`, which raises a KeyError for categories like "hard_prompt" that are present but unmapped. Change the selection to fall back safely to the default prompt by using a membership check or `dict.get`, so that `prompt = self.category_prompts.get(category, self.prompt)` (or, if you prefer, check `if category in self.category_prompts` and assign accordingly), ensuring `data_point.get("category")`, `self.category_prompts`, and `self.prompt` are used as the referenced symbols.
Signed-off-by: Igor Gitman <[email protected]>
3 files reviewed, 3 comments
```python
            data = json.loads(line)
            data["question"] = data.pop("prompt")
            category = data["category"]
            data["baseline_answer"] = baseline_answers[data["uid"]][category]
```
Assumes every uid has a baseline answer for its category; will raise a KeyError if the baseline data is missing or partial.
Suggested change:

```suggestion
            if data["uid"] not in baseline_answers or category not in baseline_answers[data["uid"]]:
                raise ValueError(f"Missing baseline answer for uid={data['uid']}, category={category}")
            data["baseline_answer"] = baseline_answers[data["uid"]][category]
```
```python
            prompt = self.prompt
        else:
            # will fail if category not in category_prompts as this is unexpected
            prompt = self.category_prompts[category]
```
Will raise a KeyError if the category isn't in category_prompts (e.g., an unknown category from the dataset).
Per CONTRIBUTING.md guidelines ("Don't be overly defensive"), let it fail with a clear error; however, the error message should indicate which category is missing.
Suggested change:

```suggestion
            # will fail if category not in category_prompts as this is unexpected
            if category not in self.category_prompts:
                raise KeyError(f"Category '{category}' not found in category_prompts. Available: {list(self.category_prompts.keys())}")
            prompt = self.category_prompts[category]
```
```python
        if len(unique_categories) > 1:
            for category, scores in category_scores.items():
                cat_metrics = {"num_entries": len(scores)}
                cat_metrics.update(get_aggregate_score(scores))
                overall_metrics[f"category_{category}"] = cat_metrics
```
When mixed data includes category=None, this creates a category_None bucket in the metrics, which is brittle for downstream consumers.
If arena-hard-v1 data (no category) is mixed with v2 data (with categories), unique_categories becomes {None, "hard_prompt", "creative_writing"} and you'll emit category_None in the output.
Consider only creating per-category buckets for non-None categories:
Suggested change:

```suggestion
        # If we have multiple categories, compute per-category metrics
        unique_categories = set(self.categories)
        if len(unique_categories) > 1:
            for category, scores in category_scores.items():
                if category is not None:  # Skip None category to avoid brittle category_None buckets
                    cat_metrics = {"num_entries": len(scores)}
                    cat_metrics.update(get_aggregate_score(scores))
                    overall_metrics[f"category_{category}"] = cat_metrics
```
Add the following fixes on top of #1152
Summary by CodeRabbit
Release Notes
New Features
Documentation