Add rubric-based evaluation mode (#13)
Add --eval_mode rubric option for independent per-response scoring on 4
criteria (Instruction Following, Naturalness, Coherence, Accuracy) using
a 1/3/5/7 Likert scale. Prompt adapted from the Tiny Aya tech report
(Appendix B.3, Aryabumi et al., 2025).
Scores are snapped to valid anchor points {1, 3, 5, 7} if the judge
outputs intermediate values, with logging of snapped counts. Composite
score: (mean - 1) / 6, linearly mapping [1, 7] to [0, 1].
Default mode remains winrate (backward compatible).
OpenJury LLM-as-Judge Evaluation

As mentioned in the initial PR post, here is an exemplary comparison between winrate and rubrics for the same models. Main factors to keep in mind:
Winrate (Baseline = Model A)

swap_mode=both (each comparison run twice with positions swapped)
| Dataset | A Wins | B Wins | Ties | Total | Baseline WR | Ours WR |
|---|---|---|---|---|---|---|
| alpaca-eval | 695 | 914 | 1 | 1,610 | 0.43 | 0.57 |
| arena-hard | 550 | 445 | 5 | 1,000 | 0.55 | 0.45 |
| m-arena-hard-EU | 7,052 | 4,937 | 11 | 12,000 | 0.59 | 0.41 |
| Average | | | | | 0.52 | 0.48 |
swap_mode=fixed (Model A presented first)
| Dataset | A Wins | B Wins | Ties | Total | Baseline WR | Ours WR |
|---|---|---|---|---|---|---|
| alpaca-eval | 596 | 208 | 1 | 805 | 0.74 | 0.26 |
| arena-hard | 363 | 136 | 1 | 500 | 0.73 | 0.27 |
| m-arena-hard-EU | 4,069 | 1,922 | 9 | 6,000 | 0.68 | 0.32 |
| Average | | | | | 0.72 | 0.28 |
The gap between fixed and both may indicate that the judge suffers from a position bias on fixed.
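The winrate columns in the tables above can be reproduced from the raw counts. A minimal sketch follows; note that counting each tie as half a win for both sides is an assumption here, and OpenJury's actual tie handling may differ.

```python
def winrate(wins: int, ties: int, total: int) -> float:
    """Winrate with ties counted as half a win for each side (assumption)."""
    return (wins + 0.5 * ties) / total

# alpaca-eval row (swap_mode=both): A wins 695, B wins 914, 1 tie, 1,610 total
baseline_wr = winrate(695, 1, 1610)
ours_wr = winrate(914, 1, 1610)
print(round(baseline_wr, 2), round(ours_wr, 2))  # 0.43 0.57
```

With this convention the per-dataset winrates of the two models sum to 1, which matches every row in the tables.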
Rubric
alpaca-eval
| Criterion | Baseline (A) | Ours (B) | Delta |
|---|---|---|---|
| Instruction Following | 6.36 | 6.45 | +0.09 |
| Naturalness | 6.77 | 6.82 | +0.05 |
| Coherence | 6.63 | 6.71 | +0.08 |
| Accuracy | 6.50 | 6.49 | -0.01 |
| Average (0-1) | 0.927 | 0.936 | +0.009 |
arena-hard
| Criterion | Baseline (A) | Ours (B) | Delta |
|---|---|---|---|
| Instruction Following | 5.91 | 5.50 | -0.41 |
| Naturalness | 6.46 | 6.31 | -0.15 |
| Coherence | 6.29 | 6.01 | -0.28 |
| Accuracy | 5.98 | 5.52 | -0.46 |
| Average (0-1) | 0.860 | 0.806 | -0.054 |
m-arena-hard-EU
| Criterion | Baseline (A) | Ours (B) | Delta |
|---|---|---|---|
| Instruction Following | 4.78 | 4.32 | -0.46 |
| Naturalness | 5.92 | 5.45 | -0.47 |
| Coherence | 5.47 | 4.93 | -0.54 |
| Accuracy | 4.90 | 4.31 | -0.59 |
| Average (0-1) | 0.711 | 0.625 | -0.086 |
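As a sanity check, the "Average (0-1)" rows in these tables follow from the `(mean - 1) / 6` mapping described in the PR summary. A quick sketch, using the m-arena-hard-EU rows above:

```python
def composite(scores: list[float]) -> float:
    """Mean of the four criterion scores, mapped linearly from [1, 7] to [0, 1]."""
    mean = sum(scores) / len(scores)
    return (mean - 1) / 6

# m-arena-hard-EU, Baseline (A): 4.78, 5.92, 5.47, 4.90
print(round(composite([4.78, 5.92, 5.47, 4.90]), 3))  # 0.711
# m-arena-hard-EU, Ours (B): 4.32, 5.45, 4.93, 4.31
print(round(composite([4.32, 5.45, 4.93, 4.31]), 3))  # 0.625
```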
Notes/implications:
- Rubrics are a valuable addition to winrate-based evaluation: winrate with swap_mode=fixed (i.e. Model A always presented first) may inflate Model A's winrate, whereas rubric mode scores each response independently.
- Rubric mode also provides better interpretability, since we can see which specific criteria improve or degrade rather than just a binary win/loss, and it likely suffers less from position bias.
---
Thanks Fabio for the PR, interesting results! Regarding the results on Olmo3, it is quite interesting overall. Do we have any idea why some datasets are better than others? Regarding merging rubrics into OpenJury, we need to discuss it further, given that Bora has been working solely on this and has performed lots of meta-evaluations. Ideally, we should merge features in a way that supports both what you did and what Bora did. PS: if possible, I would appreciate relying less on LLMs for summaries, as they are not super trustworthy. For instance, the summary mentions that "Position bias mitigated by swap_mode=both, but at 2x compute", but this is a bit misleading given that the rubrics also run twice per completion.
---
You're right, thanks for the pointer. I thought I had removed all "2x compute" mentions but that one slipped through. Removing it above! RE the other things you mentioned:
These results use the old checkpoint that was trained with the wrong tokenizer (base instead of think tokenizer, i.e. now including tags) when going from the base to the think stage. For the sake of this PR I still made this comparison, but I would rather focus on comparing the baseline checkpoint's performance between winrate and rubric for now. I will rerun this comparison with the newer reproduced checkpoint (correct tokenizer), which is about to finish. This will likely address your point that thinking should make arena-hard better (note, though, that the baseline is also a think checkpoint).
Yes, I'm currently looking into that.
Sure! I'd suggest doing this sooner rather than later: using non-LLM-judge tasks is one way to cover breadth, but using our judge tasks with rubrics is another valuable way to do this, I guess?
---
Hi Fabio, I was also working on rubric evaluations. I have created a PR for it and also changed my pipeline so it accepts a "reference score" as you do. Would love to discuss this further.
---
Hi @kargibora, cool! Will check it out. By "reference score", do you mean anchoring?
---
Exactly. Otherwise it felt that the LLM can create its own "judgement" about what score=5 means when scoring completions independently (as opposed to pairwise).
What is the problem?
OpenJury currently only supports pairwise winrate evaluation, where a judge LLM compares two responses head-to-head and assigns scores. Pairwise evaluation has known limitations ("Tiny Aya: Bridging Scale and Multilingual Depth", Salamanca, Kreutzer, Fadaee et al., https://arxiv.org/abs/2501.10893, 2026).
How do we solve it?
We add a rubric-based evaluation mode (`--eval_mode rubric`) that scores each response independently on 4 criteria using a 1/3/5/7 Likert scale. The evaluation prompt (`rubric-prompt.txt`) is taken from Appendix B.3 of ("Tiny Aya: Bridging Scale and Multilingual Depth", Salamanca, Kreutzer, Fadaee et al., https://arxiv.org/abs/2501.10893, 2026) and adapted with minor changes: `{language}` placeholders replaced with language-agnostic phrasing, `chatbot` -> `response` for consistency, and template variable names adapted to OpenJury conventions.

Criteria (each scored 1, 3, 5, or 7): Instruction Following, Naturalness, Coherence, Accuracy.
Composite score: mean of the 4 criteria, linearly mapped from [1, 7] to [0, 1] via `(mean - 1) / 6`.

Because rubric mode evaluates one response at a time (not pairwise), it sidesteps position bias. The judge prompt explicitly instructs scoring on valid anchor points only (1, 3, 5, 7); if the judge outputs an intermediate value (2, 4, 6), the parser snaps it to the nearest valid anchor and logs the count.
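The snapping and composite logic can be sketched as follows. This is a minimal illustration, not the actual implementation in `openjury/evaluate.py`: in particular, how a tie is broken when a raw score (e.g. 2) is equidistant from two anchors is an assumption here (`min()` keeps the first, i.e. lower, anchor).

```python
VALID_RUBRIC_SCORES = (1, 3, 5, 7)

def snap_score(raw: int) -> tuple[int, bool]:
    """Snap a raw judge score to the nearest valid anchor.

    Returns (snapped score, whether snapping occurred); equidistant raw
    values snap to the lower anchor in this sketch (assumption).
    """
    snapped = min(VALID_RUBRIC_SCORES, key=lambda v: abs(v - raw))
    return snapped, snapped != raw

def composite(scores: list[int]) -> float:
    """Mean of the criterion scores, mapped linearly from [1, 7] to [0, 1]."""
    mean = sum(scores) / len(scores)
    return (mean - 1) / 6

print(snap_score(4))                       # (3, True)
print(round(composite([7, 5, 7, 5]), 3))   # 0.833
```

With this mapping, all-1 scores give a composite of 0.0 and all-7 scores give exactly 1.0.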
Default mode remains `winrate` for backward compatibility.

Example output (OLMo3-7B Think SFT, alpaca-eval, Qwen3-30B-A3B judge):
Changes
- `openjury/evaluate.py`: add `RUBRIC_CRITERIA`, `VALID_RUBRIC_SCORES`, `RubricScore` parser (with snap-to-valid logic), `RubricAnnotation` dataclass, `load_rubric_prompts()`, `annotate_rubric()`
- `openjury/generate_and_evaluate.py`: add `--eval_mode` CLI argument to `CliArgs`, branch `main()` on eval mode, add `print_rubric_results()`, add `"eval_mode"` field to winrate results JSON for consistency
- `openjury/prompts/rubric-prompt.txt`: rubric user prompt with 4 criteria, adapted from Tiny Aya Appendix B.3
- `openjury/prompts/rubric-system-prompt.txt`: rubric system prompt
- `tests/test_rubric.py`: 12 tests covering JSON parsing, snapping, composites, edge cases, and end-to-end with Dummy models

Testing
- New tests pass (`pytest tests/test_rubric.py`)
- Full suite (`pytest tests/`): same failures as before, all pre-existing