
Add rubric-based evaluation mode #13

Open
ferreirafabio wants to merge 1 commit into OpenEuroLLM:main from ferreirafabio:feat/rubric-evaluation

Conversation


@ferreirafabio ferreirafabio commented Feb 22, 2026

What is the problem?

OpenJury currently only supports pairwise winrate evaluation, where a judge LLM compares two responses head-to-head and assigns scores. Pairwise evaluation has known limitations ("Tiny Aya: Bridging Scale and Multilingual Depth", Salamanca, Kreutzer, Fadaee et al., https://arxiv.org/abs/2501.10893, 2025):

  • High variance: small style changes can flip binary preference labels, causing large swings in average win rate on small eval sets
  • No absolute quality signal: a model might get 90% win rate just because the competitor fails entirely, inflating results without reflecting actual quality
  • No interpretability: no insight into which quality dimensions (accuracy, fluency, etc.) drove the judge's decision
  • Position bias: the judge may prefer whichever response appears first or second

How do we solve it?

We add a rubric-based evaluation mode (--eval_mode rubric) that scores each response independently on 4 criteria using a 1/3/5/7 Likert scale. The evaluation prompt (rubric-prompt.txt) is taken from Appendix B.3 of the Tiny Aya report ("Tiny Aya: Bridging Scale and Multilingual Depth", Salamanca, Kreutzer, Fadaee et al., https://arxiv.org/abs/2501.10893, 2025) and adapted with minor changes: {language} placeholders replaced with language-agnostic phrasing, "chatbot" -> "response" for consistency, and template variable names adapted to OpenJury conventions.

Criteria (each scored 1, 3, 5, or 7):

  1. Instruction Following
  2. Naturalness
  3. Coherence
  4. Accuracy

Composite score: mean of the 4 criteria, linearly mapped from [1, 7] to [0, 1] via (mean - 1) / 6.

Because rubric mode evaluates one response at a time rather than pairwise, position bias does not arise by construction. The judge prompt explicitly instructs the judge to score on the valid anchor points only (1, 3, 5, 7); if the judge outputs an intermediate value (2, 4, or 6), the parser snaps it to the nearest valid anchor and logs how often this happened.
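A minimal sketch of the snapping and composite-score logic described above (function names are illustrative, not the actual OpenJury API in evaluate.py):

```python
VALID_RUBRIC_SCORES = (1, 3, 5, 7)

def snap_to_anchor(score: int) -> int:
    """Snap an off-anchor judge score (e.g. 2, 4, 6) to the nearest valid anchor."""
    # Ties (e.g. score=4, equidistant from 3 and 5) resolve to the lower
    # anchor, since min() returns the first minimal element.
    return min(VALID_RUBRIC_SCORES, key=lambda a: abs(a - score))

def composite_score(criteria_scores: list[float]) -> float:
    """Mean of the 4 criterion scores, linearly mapped from [1, 7] to [0, 1]."""
    mean = sum(criteria_scores) / len(criteria_scores)
    return (mean - 1) / 6

print(snap_to_anchor(2))   # 1
print(snap_to_anchor(6))   # 5
# Model A column from the example output below:
print(round(composite_score([5.99, 6.70, 6.41, 6.19]), 3))  # 0.887
```

Feeding in the per-criterion means from the example output reproduces the reported 0-1 averages, which is a quick sanity check on the mapping.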

Default mode remains winrate for backward compatibility.

Example output (OLMo3-7B Think SFT, alpaca-eval, Qwen3-30B-A3B judge):

============================================================
                 RUBRIC EVALUATION RESULTS
Dataset: alpaca-eval
Judge: Qwen/Qwen3-30B-A3B-Instruct-2507
------------------------------------------------------------
  Model A: Olmo-3-7B-Think-SFT (baseline)
  Model B: dolci-think-sft-hf-65k (ours)

  Criterion                    Model A    Model B
  -----------------------------------------------
  Instruction Following           5.99       6.26
  Naturalness                     6.70       6.83
  Coherence                       6.41       6.68
  Accuracy                        6.19       6.39
  -----------------------------------------------
  Average (0-1)                0.887      0.923

  Evaluations: 805 | Parse failures: A=0, B=1
============================================================

Changes

  • openjury/evaluate.py: add RUBRIC_CRITERIA, VALID_RUBRIC_SCORES, RubricScore parser (with snap-to-valid logic), RubricAnnotation dataclass, load_rubric_prompts(), annotate_rubric() function
  • openjury/generate_and_evaluate.py: add --eval_mode CLI argument to CliArgs, branch main() on eval mode, add print_rubric_results(), add "eval_mode" field to winrate results JSON for consistency
  • openjury/prompts/rubric-prompt.txt: rubric user prompt with 4 criteria, adapted from Tiny Aya Appendix B.3
  • openjury/prompts/rubric-system-prompt.txt: rubric system prompt
  • tests/test_rubric.py: 12 tests covering JSON parsing, snapping, composites, edge cases, and end-to-end with Dummy models

Testing

  • All 12 rubric tests pass (pytest tests/test_rubric.py)
  • All existing tests unaffected (pytest tests/ -- same failures as before, all pre-existing)
  • Tested on H200 GPU with OLMo3-7B Think SFT models (baseline vs ours) using Qwen3-30B-A3B as judge on alpaca-eval, arena-hard, and m-arena-hard-EU. Results with Likert-enforced prompt to follow.

Add --eval_mode rubric option for independent per-response scoring on 4
criteria (Instruction Following, Naturalness, Coherence, Accuracy) using
a 1/3/5/7 Likert scale. Prompt adapted from the Tiny Aya tech report
(Appendix B.3, Aryabumi et al., 2025).

Scores are snapped to valid anchor points {1, 3, 5, 7} if the judge
outputs intermediate values, with logging of snapped counts. Composite
score: (mean - 1) / 6, linearly mapping [1, 7] to [0, 1].

Default mode remains winrate (backward compatible).

ferreirafabio commented Feb 23, 2026

OpenJury LLM-as-Judge Evaluation

As mentioned in the initial PR post: here is an example comparison between winrate and rubric modes for the same models. Main factors to keep in mind:

  • Winrate: The judge LLM receives both response A and response B in the same prompt, then decides which is better
  • Rubric: The judge LLM receives only one response at a time and scores it on the 4 criteria. It does this separately for A and B in two independent calls.

Baseline (A): Olmo-3-7B-Think-SFT
Trained (B): dolci-think-sft-hf-65k-config-fix
Judge: Qwen/Qwen3-30B-A3B-Instruct-2507
Config: max_tokens=32768 | truncate_chars=32768 | explanation=true
Evals: 50k (=use all available)


Winrate

swap_mode=both

Each comparison is run twice with swapped order and averaged.

| Dataset | A Wins | B Wins | Ties | Total | Baseline WR | Ours WR |
|---|---|---|---|---|---|---|
| alpaca-eval | 695 | 914 | 1 | 1,610 | 0.43 | 0.57 |
| arena-hard | 550 | 445 | 5 | 1,000 | 0.55 | 0.45 |
| m-arena-hard-EU | 7,052 | 4,937 | 11 | 12,000 | 0.59 | 0.41 |
| **Average** | | | | | 0.52 | 0.48 |
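The win-rate columns in the table above can presumably be reproduced as follows (a sketch; the exact tie handling in OpenJury is an assumption, here counted as half a win each):

```python
def win_rate(wins: int, ties: int, total: int) -> float:
    """Win rate with ties counted as half a win (assumed convention)."""
    return (wins + 0.5 * ties) / total

# alpaca-eval, swap_mode=both: each of the 805 prompts is judged twice
# (A-first and B-first), giving 1,610 comparisons in total.
a_wins, b_wins, ties = 695, 914, 1
total = a_wins + b_wins + ties  # 1610
print(round(win_rate(a_wins, ties, total), 2))  # 0.43 (baseline)
print(round(win_rate(b_wins, ties, total), 2))  # 0.57 (ours)
```

By construction the two win rates sum to 1.0, which matches every row of the table.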

swap_mode=fixed (Model A presented first)

| Dataset | A Wins | B Wins | Ties | Total | Baseline WR | Ours WR |
|---|---|---|---|---|---|---|
| alpaca-eval | 596 | 208 | 1 | 805 | 0.74 | 0.26 |
| arena-hard | 363 | 136 | 1 | 500 | 0.73 | 0.27 |
| m-arena-hard-EU | 4,069 | 1,922 | 9 | 6,000 | 0.68 | 0.32 |
| **Average** | | | | | 0.72 | 0.28 |

The gap between fixed and both suggests that the judge suffers from position bias in fixed mode.


Rubric

alpaca-eval

| Criterion | Baseline (A) | Ours (B) | Delta |
|---|---|---|---|
| Instruction Following | 6.36 | 6.45 | +0.09 |
| Naturalness | 6.77 | 6.82 | +0.05 |
| Coherence | 6.63 | 6.71 | +0.08 |
| Accuracy | 6.50 | 6.49 | -0.01 |
| **Average (0-1)** | 0.927 | 0.936 | +0.009 |

arena-hard

| Criterion | Baseline (A) | Ours (B) | Delta |
|---|---|---|---|
| Instruction Following | 5.91 | 5.50 | -0.41 |
| Naturalness | 6.46 | 6.31 | -0.15 |
| Coherence | 6.29 | 6.01 | -0.28 |
| Accuracy | 5.98 | 5.52 | -0.46 |
| **Average (0-1)** | 0.860 | 0.806 | -0.054 |

m-arena-hard-EU

| Criterion | Baseline (A) | Ours (B) | Delta |
|---|---|---|---|
| Instruction Following | 4.78 | 4.32 | -0.46 |
| Naturalness | 5.92 | 5.45 | -0.47 |
| Coherence | 5.47 | 4.93 | -0.54 |
| Accuracy | 4.90 | 4.31 | -0.59 |
| **Average (0-1)** | 0.711 | 0.625 | -0.086 |

Notes/implications:

  • Rubrics are a valuable addition to winrate-based evaluation; winrate with swap_mode=fixed (i.e. model A always presented first) may inflate model A's win rate
  • Rubric mode scores each response independently, so it likely suffers less from position bias
  • Rubric mode also provides better interpretability: we can see which specific criteria improve or degrade, rather than just a binary win/loss


geoalgo commented Feb 23, 2026

Thanks Fabio for the PR, interesting results!

Regarding the results on Olmo3, it is quite interesting overall.

Do we have any idea why some datasets are better than others?
It is interesting that the reproduction is better at alpaca-eval but quite a bit worse at arena-hard (a 6-12% score shift), in particular given that the think stage should make arena-hard better?
We probably want to evaluate the other non-LLM-judge tasks that are relevant for this stage to assess the gap here?

Regarding the merge of rubrics into OpenJury, we need to discuss it further, given that Bora has been working solely on this and has performed lots of meta-evaluations. Ideally, we should merge features that allow doing both what you did and what Bora did in a compatible way.

PS: if possible, I would appreciate relying less on LLMs for summaries, as they are not super trustworthy. For instance, the summary mentions that "Position bias mitigated by swap_mode=both, but at 2x compute", but this is a bit misleading given that the rubrics also run twice per completion.


ferreirafabio commented Feb 23, 2026

You're right, thanks for the pointer. I thought I had removed all "2x compute" mentions, but that one slipped through. Removing it above! Re the other things you mentioned:

Do we have any idea why some datasets are better than others?

These results use the old checkpoint that was trained with the wrong tokenizer (base instead of think tokenizer, i.e. now including tags) when going from the base to the think stage. For the sake of this PR I still made this comparison, but I would rather focus on comparing the baseline checkpoint's performance between winrate and rubric for now. I will rerun this comparison with the newer reproduced checkpoint that uses the correct tokenizer, which is about to finish. That will likely address your point that thinking should make arena-hard better (note, though, that the baseline is also a think checkpoint).

We probably want to evaluate the other non-llm judge tasks that are relevant for this stage to assess the gap here?

Yes, I'm currently looking into that.

Ideally, we should merge features that allow to do both what you did and Bora did in a compatible way.

Sure! I'd suggest doing this sooner rather than later: using non-LLM-judge tasks is one way to cover breadth, but using our judge tasks with rubrics is another valuable way to do this, I guess?

@kargibora
Collaborator

Hi Fabio, I was also working on rubric evaluations. I have created a PR for it and changed my pipeline so it also accepts a "reference score" as you do. Would love to discuss this further.

@ferreirafabio
Contributor Author

Hi @kargibora, cool! Will check it out. Do you mean anchoring by "reference score"?

@kargibora
Collaborator

Exactly, otherwise the LLM can create its own "judgement" about what score=5 means when scoring completions independently (as opposed to pairwise scoring).

