
feat: add support for the original mt-bench#21

Open
ErlisLushtaku wants to merge 36 commits into main from
erlislushtaku/feat/add-mt-bench-support

Conversation

Collaborator

@ErlisLushtaku ErlisLushtaku commented Mar 2, 2026

Summary

This PR implements full multi-turn evaluation support for the original MT-Bench benchmark. It aligns the turn-2 prompt structure with the methodology recommended in the original paper ("Judging LLM-as-a-Judge") to prevent judge confusion and ensure high-quality evaluations.

Key Changes

  • Dataset & Loader:

    • Implemented mt_bench.py to handle downloading questions and reference answers from the LMSYS HuggingFace space.
  • Prompt Alignment:

    • Created dedicated multi-turn templates (prompt-multi-turn.txt) that present two separate, full conversation histories (User -> A1 -> User -> A2 and User -> B1 -> User -> B2).
    • This prevents the "attribution error" where the judge misidentifies model responses between turns (see paper section 3.5).
  • Pipeline Logic:

    • Added _run_mt_bench() to perform separate evaluation calls for Turn 1 and Turn 2.
    • Integrated Reference-guided grading for technical categories (Math, Reasoning, Coding) by embedding ground-truth answers directly into the judge context.
  • Model Wrappers:

    • Refactored LlamaCpp into ChatLlamaCppModel to support GGUF chat templates and proper KV cache management between multi-turn calls.
    • Extracted BaseLocalModel to share logic between LlamaCpp and VLLM wrappers.
  • Features & Metrics:

    • Added --mt_bench_turns flag to support single-turn, multi-turn, or both.
    • Added --mt_bench_compatibility argument. When set to fastchat, it reproduces the paper's FastChat/MT-Bench implementation: [[A]]/[[B]]/[[C]] verdict parsing, conservative position-bias handling (a model wins only if both orderings agree; otherwise it is a tie), judge temperature=0, and per-category MT-Bench generation temperatures. When set to openjury (the default), it uses the OpenJury evaluation: softmax score aggregation, averaging over both orderings for position bias, no fixed judge temperature, and no per-category temperature configs.
    • Updated results output to include per-category and per-turn win rate breakdowns.

Testing

  • Added comprehensive unit and integration tests in tests/test_generate_and_evaluate.py.

ErlisLushtaku and others added 24 commits February 14, 2026 20:45
- Updated README to use EuroLLM-Instruct because the base model (EuroLLM-9B) doesn't have a chat template and throws an error.
- Added functionality to load pre-existing dataset completions for models. This previously threw an error because the model was treated as a provider.
…the new `max_model_len` and related parameters
- Moved max_model_len and chat_template to **model_kwargs for readability.
- Adjusted ChatVLLM initialization to cap max_model_len based on model's max_position_embeddings.
- Added warnings for potential max_model_len issues.
- mock external api calls
- add safety check for content in completions
- moved slurmpilot to the dev group: it has no published version on PyPI, which would otherwise prevent publishing OpenJury on PyPI
- There was a halting issue with LlamaCpp: the model was not emitting the EOS token, and Llama.reset() was not called between calls (turns), causing a KV-cache position-mismatch crash. ChatLlamaCppModel was created as a custom wrapper to fix this
- BaseLocalModel was extracted as common logic for ChatLlamaCppModel and ChatVLLM
- Implement MT-Bench loader and multi-turn generation/judging logic.
- Add paper-aligned prompt templates while keeping the score-based evaluation for consistency with OpenJury.
- Support reference answers, per-turn breakdowns, and swap mode.
- Add comprehensive MT-Bench pipeline tests.
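
The KV-cache fix mentioned in the commits above can be sketched roughly like this. The class body is illustrative only; the real ChatLlamaCppModel in this PR also handles GGUF chat templates:

```python
class ChatLlamaCppModel:
    """Wrapper around a llama_cpp.Llama instance (duck-typed here so the
    sketch stays dependency-free)."""

    def __init__(self, llama):
        self.llama = llama

    def chat(self, messages, **sampling_kwargs):
        # Without this reset, llama.cpp keeps the KV cache from the previous
        # call, and a second turn re-fed from position 0 crashes with a
        # position mismatch.
        self.llama.reset()
        out = self.llama.create_chat_completion(messages=messages, **sampling_kwargs)
        return out["choices"][0]["message"]["content"]
```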
@ErlisLushtaku ErlisLushtaku requested a review from geoalgo March 2, 2026 21:25
"""
chat_model = make_model(model, max_tokens=max_tokens, **model_kwargs)

system_prompt = "You are a helpful assistant."
Collaborator

@kargibora kargibora Mar 3, 2026


Maybe we can use a better system_prompt. What does MT-Bench use?

Collaborator


Good point, we have a naive default also in general (it is not blocking for this PR as we can change/improve it later).
Using the prompt of arena-hard would make most sense to me as the benchmark is more refined than MT-bench in some sense.

Collaborator Author

@ErlisLushtaku ErlisLushtaku Mar 11, 2026


I added the mt-bench prompts (and other changes to reproduce their setup) here.

Collaborator

@geoalgo geoalgo left a comment


I just did a first pass and had a few comments, I will finish the code review this afternoon.

from openjury.utils import data_root

def _read_json_or_jsonl(path: Path) -> list[dict]:
if path.suffix == ".jsonl":
Collaborator


Can we rather use pd.read_json to avoid custom code here?

Collaborator Author


Changed it here

score_B: <a number between 0 and 10 to indicate the quality of Assistant B's answer>
```

## Your output, do not repeat the input above, first starts with an explanation of your judgement
Collaborator


Given that both prompts are exact duplicates, I would prefer to have

Suggested change
## Your output, do not repeat the input above, first starts with an explanation of your judgement
## Your output, do not repeat the input above{explanation_prompt}

where explanation_prompt is set to empty string or ", first starts with an explanation of your judgement" depending on which mode we want.
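
A minimal sketch of the proposed templating (names are illustrative):

```python
# Shared template; the explanation clause is injected only when wanted.
TEMPLATE = "## Your output, do not repeat the input above{explanation_prompt}"

def render(with_explanation: bool) -> str:
    explanation_prompt = (
        ", first starts with an explanation of your judgement"
        if with_explanation
        else ""
    )
    return TEMPLATE.format(explanation_prompt=explanation_prompt)
```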

Collaborator Author


Changed it here

"""
chat_model = make_model(model, max_tokens=max_tokens, **model_kwargs)

system_prompt = "You are a helpful assistant."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, we have a naive default also in general (it is not blocking for this PR as we can change/improve it later).
Using the prompt of arena-hard would make most sense to me as the benchmark is more refined than MT-bench in some sense.

- Implemented a new function to download MT-Bench questions and GPT-4 reference answers, with fallback mechanisms for missing references.
- Remove duplication.
Collaborator

@geoalgo geoalgo left a comment


Thanks, looks clean overall. I just have a question regarding having the feature in a separate endpoint.

We can leave it like this for now, but perhaps we would want a separate endpoint for multi-turn generation and evaluation, as the code is quite different and it may be cleaner? (I mean having a generate_and_evaluate_multiturn.py, as the entrypoint may blow up otherwise.)

Also I assume the complexity of multiturn is going to increase a bit when we merge the other PR. Let me know what you think.

Comment on lines +304 to +305
- Uses ``temperature=0.6`` and ``top_p=0.95`` unless explicitly
overridden.
Collaborator


Why change the default here?

Collaborator Author


This was the default before as well. Check here.
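
The "unless explicitly overridden" behavior in the docstring under discussion amounts to a simple dict merge; a sketch with hypothetical names:

```python
# MT-Bench generation defaults; any caller-supplied kwarg wins over them.
DEFAULT_SAMPLING = {"temperature": 0.6, "top_p": 0.95}

def resolve_sampling(**overrides) -> dict:
    return {**DEFAULT_SAMPLING, **overrides}
```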

| **Lighteval** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **Evalchemy** | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| **OpenJury** | 🔜 | ✅ | ✅ | ✅ | ✅ | ✅ |
| **OpenJury** | | ✅ | ✅ | ✅ | ✅ | ✅ |
Collaborator


💪

Collaborator Author

ErlisLushtaku commented Mar 17, 2026

> Thanks, looks clean overall. I just have a question regarding having the feature in a separate endpoint.
>
> We can leave it like this for now, but perhaps we would want a separate endpoint for multi-turn generation and evaluation, as the code is quite different and it may be cleaner? (I mean having a generate_and_evaluate_multiturn.py, as the entrypoint may blow up otherwise.)
>
> Also I assume the complexity of multiturn is going to increase a bit when we merge the other PR. Let me know what you think.

@geoalgo It is a valid concern. The entrypoint will grow quite a bit.

However, the pipeline for mt-bench-101 is also quite different from mt-bench (in this PR), and the mt-bench here also supports single turns, so they wouldn't fit together in a new multi-turn entrypoint. We could remove the single-turn mode for mt-bench and always evaluate on both turns. We could also simplify here by always using the original mt-bench mode and removing the openjury mode for mt-bench (currently controlled by the --mt_bench_compatibility argument). Then this could fit in a new multi-turn entrypoint and be more separate from the OpenJury pipeline.

Beyond these technical issues, I think adding more entrypoints would generally make the UX worse, since users would need to know which entrypoint to use for which dataset. We could instead keep one entrypoint but route to different private pipelines depending on the dataset, keeping generate_and_evaluate.py leaner as in this new commit, if you agree?
We could still make the simplifications mentioned above, i.e. always evaluate both turns for mt-bench and always use the mt-bench mode. What do you think?
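
The routing idea could be sketched like this (function and dataset names are hypothetical stand-ins for the private pipelines, not this repo's actual API):

```python
def _run_mt_bench(**kwargs):
    return "mt-bench multi-turn pipeline"

def _run_mt_bench_101(**kwargs):
    return "mt-bench-101 pipeline"

def _run_single_turn(**kwargs):
    return "default single-turn pipeline"

# One public entrypoint; the dataset name selects the private pipeline.
PIPELINES = {
    "mt_bench": _run_mt_bench,
    "mt_bench_101": _run_mt_bench_101,
}

def generate_and_evaluate(dataset: str, **kwargs):
    runner = PIPELINES.get(dataset, _run_single_turn)
    return runner(**kwargs)
```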
