
Add rubric-based evaluation mode #13

Open
ferreirafabio wants to merge 1 commit into OpenEuroLLM:main from ferreirafabio:feat/rubric-evaluation

Conversation


@ferreirafabio ferreirafabio commented Feb 22, 2026

What is the problem?

OpenJury currently only supports pairwise winrate evaluation, where a judge LLM compares two responses head-to-head and assigns scores. Pairwise evaluation has known limitations ("Tiny Aya: Bridging Scale and Multilingual Depth", Salamanca, Kreutzer, Fadaee et al., https://arxiv.org/abs/2501.10893, 2025):

  • High variance: small style changes can flip binary preference labels, causing large swings in average win rate on small eval sets
  • No absolute quality signal: a model might get 90% win rate just because the competitor fails entirely, inflating results without reflecting actual quality
  • No interpretability: no insight into which quality dimensions (accuracy, fluency, etc.) drove the judge's decision
  • Position bias: the judge may prefer whichever response appears first or second

How do we solve it?

We add a rubric-based evaluation mode (--eval_mode rubric) that scores each response independently on 4 criteria using a 1/3/5/7 Likert scale. The evaluation prompt (rubric-prompt.txt) is taken from Appendix B.3 of the Tiny Aya report ("Tiny Aya: Bridging Scale and Multilingual Depth", Salamanca, Kreutzer, Fadaee et al., https://arxiv.org/abs/2501.10893, 2025) and adapted with minor changes: {language} placeholders replaced with language-agnostic phrasing, "chatbot" -> "response" for consistency, and template variable names adapted to OpenJury conventions.

Criteria (each scored 1, 3, 5, or 7):

  1. Instruction Following
  2. Naturalness
  3. Coherence
  4. Accuracy

Composite score: mean of the 4 criteria, linearly mapped from [1, 7] to [0, 1] via (mean - 1) / 6.

Because rubric mode evaluates one response at a time rather than pairwise, position bias does not arise by construction. The judge prompt explicitly instructs the judge to score on the valid anchor points only (1, 3, 5, 7); if the judge outputs an intermediate value (2, 4, or 6), the parser snaps it to the nearest valid anchor and logs how often this happened.
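A minimal sketch of the snapping and composite-score logic described above (function names are illustrative, not the actual OpenJury API in evaluate.py):

```python
VALID_RUBRIC_SCORES = (1, 3, 5, 7)

def snap_to_anchor(score: int) -> int:
    """Snap an off-anchor judge score (e.g. 2, 4, 6) to the nearest valid anchor."""
    # Ties (e.g. score=4, equidistant from 3 and 5) resolve to the lower
    # anchor, since min() returns the first minimal element.
    return min(VALID_RUBRIC_SCORES, key=lambda a: abs(a - score))

def composite_score(criteria_scores: list[float]) -> float:
    """Mean of the 4 criterion scores, linearly mapped from [1, 7] to [0, 1]."""
    mean = sum(criteria_scores) / len(criteria_scores)
    return (mean - 1) / 6

print(snap_to_anchor(2))   # 1
print(snap_to_anchor(6))   # 5
# Model A column from the example output below:
print(round(composite_score([5.99, 6.70, 6.41, 6.19]), 3))  # 0.887
```

Feeding in the per-criterion means from the example output reproduces the reported 0-1 averages, which is a quick sanity check on the mapping.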

Default mode remains winrate for backward compatibility.

Example output (OLMo3-7B Think SFT, alpaca-eval, Qwen3-30B-A3B judge):

============================================================
                 RUBRIC EVALUATION RESULTS
Dataset: alpaca-eval
Judge: Qwen/Qwen3-30B-A3B-Instruct-2507
------------------------------------------------------------
  Model A: Olmo-3-7B-Think-SFT (baseline)
  Model B: dolci-think-sft-hf-65k (ours)

  Criterion                    Model A    Model B
  -----------------------------------------------
  Instruction Following           5.99       6.26
  Naturalness                     6.70       6.83
  Coherence                       6.41       6.68
  Accuracy                        6.19       6.39
  -----------------------------------------------
  Average (0-1)                0.887      0.923

  Evaluations: 805 | Parse failures: A=0, B=1
============================================================

Changes

  • openjury/evaluate.py: add RUBRIC_CRITERIA, VALID_RUBRIC_SCORES, RubricScore parser (with snap-to-valid logic), RubricAnnotation dataclass, load_rubric_prompts(), annotate_rubric() function
  • openjury/generate_and_evaluate.py: add --eval_mode CLI argument to CliArgs, branch main() on eval mode, add print_rubric_results(), add "eval_mode" field to winrate results JSON for consistency
  • openjury/prompts/rubric-prompt.txt: rubric user prompt with 4 criteria, adapted from Tiny Aya Appendix B.3
  • openjury/prompts/rubric-system-prompt.txt: rubric system prompt
  • tests/test_rubric.py: 12 tests covering JSON parsing, snapping, composites, edge cases, and end-to-end with Dummy models

Testing

  • All 12 rubric tests pass (pytest tests/test_rubric.py)
  • All existing tests unaffected (pytest tests/ -- same failures as before, all pre-existing)
  • Tested on H200 GPU with OLMo3-7B Think SFT models (baseline vs ours) using Qwen3-30B-A3B as judge on alpaca-eval, arena-hard, and m-arena-hard-EU. Results with Likert-enforced prompt to follow.

Add --eval_mode rubric option for independent per-response scoring on 4
criteria (Instruction Following, Naturalness, Coherence, Accuracy) using
a 1/3/5/7 Likert scale. Prompt adapted from the Tiny Aya tech report
(Appendix B.3, Aryabumi et al., 2025).

Scores are snapped to valid anchor points {1, 3, 5, 7} if the judge
outputs intermediate values, with logging of snapped counts. Composite
score: (mean - 1) / 6, linearly mapping [1, 7] to [0, 1].

Default mode remains winrate (backward compatible).

ferreirafabio commented Feb 23, 2026

OpenJury LLM-as-Judge Evaluation

As mentioned in the initial PR post: here is an example comparison between winrate and rubric modes for the same models. Main factors to keep in mind:

  • Winrate: The judge LLM receives both response A and response B in the same prompt, then decides which is better
  • Rubric: The judge LLM receives only one response at a time and scores it on the 4 criteria. It does this separately for A and B in two independent calls.

Baseline (A): Olmo-3-7B-Think-SFT
Trained (B): dolci-think-sft-hf-65k-config-fix
Judge: Qwen/Qwen3-30B-A3B-Instruct-2507
Config: max_tokens=32768 | truncate_chars=32768 | explanation=true
Evals: 50k (=use all available)


Winrate

swap_mode=both

Each comparison is run twice with swapped order and averaged.

| Dataset | A Wins | B Wins | Ties | Total | Baseline WR | Ours WR |
|---|---|---|---|---|---|---|
| alpaca-eval | 695 | 914 | 1 | 1,610 | 0.43 | 0.57 |
| arena-hard | 550 | 445 | 5 | 1,000 | 0.55 | 0.45 |
| m-arena-hard-EU | 7,052 | 4,937 | 11 | 12,000 | 0.59 | 0.41 |
| **Average** | | | | | 0.52 | 0.48 |
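The win-rate columns in the table above can presumably be reproduced as follows (a sketch; the exact tie handling in OpenJury is an assumption, here counted as half a win each):

```python
def win_rate(wins: int, ties: int, total: int) -> float:
    """Win rate with ties counted as half a win (assumed convention)."""
    return (wins + 0.5 * ties) / total

# alpaca-eval, swap_mode=both: each of the 805 prompts is judged twice
# (A-first and B-first), giving 1,610 comparisons in total.
a_wins, b_wins, ties = 695, 914, 1
total = a_wins + b_wins + ties  # 1610
print(round(win_rate(a_wins, ties, total), 2))  # 0.43 (baseline)
print(round(win_rate(b_wins, ties, total), 2))  # 0.57 (ours)
```

By construction the two win rates sum to 1.0, which matches every row of the table.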

swap_mode=fixed (Model A presented first)

| Dataset | A Wins | B Wins | Ties | Total | Baseline WR | Ours WR |
|---|---|---|---|---|---|---|
| alpaca-eval | 596 | 208 | 1 | 805 | 0.74 | 0.26 |
| arena-hard | 363 | 136 | 1 | 500 | 0.73 | 0.27 |
| m-arena-hard-EU | 4,069 | 1,922 | 9 | 6,000 | 0.68 | 0.32 |
| **Average** | | | | | 0.72 | 0.28 |

The gap between fixed and both suggests that the judge suffers from position bias in fixed mode.


Rubric

alpaca-eval

| Criterion | Baseline (A) | Ours (B) | Delta |
|---|---|---|---|
| Instruction Following | 6.36 | 6.45 | +0.09 |
| Naturalness | 6.77 | 6.82 | +0.05 |
| Coherence | 6.63 | 6.71 | +0.08 |
| Accuracy | 6.50 | 6.49 | -0.01 |
| **Average (0-1)** | 0.927 | 0.936 | +0.009 |

arena-hard

| Criterion | Baseline (A) | Ours (B) | Delta |
|---|---|---|---|
| Instruction Following | 5.91 | 5.50 | -0.41 |
| Naturalness | 6.46 | 6.31 | -0.15 |
| Coherence | 6.29 | 6.01 | -0.28 |
| Accuracy | 5.98 | 5.52 | -0.46 |
| **Average (0-1)** | 0.860 | 0.806 | -0.054 |

m-arena-hard-EU

| Criterion | Baseline (A) | Ours (B) | Delta |
|---|---|---|---|
| Instruction Following | 4.78 | 4.32 | -0.46 |
| Naturalness | 5.92 | 5.45 | -0.47 |
| Coherence | 5.47 | 4.93 | -0.54 |
| Accuracy | 4.90 | 4.31 | -0.59 |
| **Average (0-1)** | 0.711 | 0.625 | -0.086 |

Notes/implications:

  • Rubrics are a valuable addition to winrate-based evaluation; winrate with swap_mode=fixed (i.e. model A always presented first) may inflate model A's win rate
  • Rubric mode scores each response independently, so it likely suffers less from position bias
  • Rubric mode also provides better interpretability: we can see which specific criteria improve or degrade, rather than just a binary win/loss


geoalgo commented Feb 23, 2026

Thanks Fabio for the PR, interesting results!

Regarding the results on Olmo3, it is quite interesting overall.

Do we have any idea why some datasets are better than others?
It is interesting that the reproduction is better at alpaca-eval but quite a bit worse at arena-hard (a 6-12% score shift), in particular given that the think stage should make arena-hard better?
We probably want to evaluate the other non-LLM-judge tasks that are relevant for this stage to assess the gap here?

Regarding the merge of rubrics into OpenJury, we need to discuss it further, given that Bora has been working solely on this and has performed lots of meta-evaluations. Ideally, we should merge features that allow doing both what you did and what Bora did in a compatible way.

PS: if possible, I would appreciate relying less on LLMs for summaries, as they are not super trustworthy. For instance, the summary mentions that "Position bias mitigated by swap_mode=both, but at 2x compute", but this is a bit misleading given that the rubrics also run twice per completion.


ferreirafabio commented Feb 23, 2026

You're right, thanks for the pointer. I thought I had removed all "2x compute" mentions, but that one slipped through. Removing it above! Re the other things you mentioned:

Do we have any idea why some datasets are better than others?

These results use the old checkpoint that was trained with the wrong tokenizer (base instead of think tokenizer, i.e. now including tags) when going from the base to the think stage. For the sake of this PR I still made this comparison, but I would rather focus on comparing the baseline checkpoint's performance between winrate and rubric for now. I will rerun this comparison with the newer reproduced checkpoint that uses the correct tokenizer, which is about to finish. That will likely address your point that thinking should make arena-hard better (note, though, that the baseline is also a think checkpoint).

We probably want to evaluate the other non-llm judge tasks that are relevant for this stage to assess the gap here?

Yes, I'm currently looking into that.

Ideally, we should merge features that allow to do both what you did and Bora did in a compatible way.

Sure! I'd suggest doing this sooner rather than later: using non-LLM-judge tasks is one way to cover breadth, but using our judge tasks with rubrics is another valuable way to do this, I guess?

@kargibora
Collaborator

Hi Fabio, I was also working on rubric evaluations. I have created a PR for it and changed my pipeline so it also accepts a "reference score" as you do. Would love to discuss this further.

@ferreirafabio
Contributor Author

Hi @kargibora, cool! Will check it out. Do you mean anchoring by "reference score"?

@kargibora
Collaborator

Exactly, otherwise the LLM can create its own "judgement" about what score=5 means when scoring completions independently (as opposed to pairwise scoring).

