feat: add support for the original mt-bench #21
Conversation
- Updated README to use EuroLLM-Instruct because the base (EuroLLM-9B) doesn't have a chat template and throws an error.
- Added functionality to load pre-existing dataset completions for models. This was throwing an error previously because the model was being treated as a provider.
…the new `max_model_len` and related parameters
- Moved max_model_len and chat_template to **model_kwargs for readability.
- Adjusted ChatVLLM initialization to cap max_model_len based on the model's max_position_embeddings.
- Added warnings for potential max_model_len issues.
…t dependency resolution
- Moved slurmpilot to the dev group since it doesn't have a published version on PyPI, and we would otherwise not be allowed to publish OpenJury on PyPI.
- Fixed a halting issue with LlamaCpp: the model was not emitting the EOS token, and Llama.reset() was not called between calls (turns), causing a KV cache position mismatch crash. ChatLlamaCppModel was created as a custom wrapper to fix this.
- Extracted BaseLocalModel as common logic for ChatLlamaCppModel and ChatVLLM.
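A minimal sketch of the wrapper idea, assuming a llama-cpp-python style object that exposes `reset()` and `create_chat_completion()` (both exist in that library). The class body here is illustrative, not the PR's implementation:

```python
class ChatLlamaCppModelSketch:
    """Wrap a llama.cpp model and reset its KV cache before every call."""

    def __init__(self, llama):
        # `llama` is expected to expose .reset() and .create_chat_completion(),
        # as llama_cpp.Llama does.
        self._llama = llama

    def generate(self, messages, **kwargs):
        # Reset positions so the next full-conversation prompt starts at 0,
        # avoiding the KV cache position mismatch described above.
        self._llama.reset()
        return self._llama.create_chat_completion(messages=messages, **kwargs)
```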
- Implement MT-Bench loader and multi-turn generation/judging logic.
- Add paper-aligned prompt templates while keeping the score-based evaluation consistent with OpenJury.
- Support reference answers, per-turn breakdowns, and swap mode.
- Add comprehensive MT-Bench pipeline tests.
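A hypothetical sketch of the multi-turn generation loop: each turn is appended to the running message history so turn 2 sees the full turn-1 exchange. `chat_model` is assumed here to be a callable from a message list to a reply string; the real wrapper API may differ.

```python
def run_multi_turn(chat_model, question_turns: list[str]) -> list[str]:
    """Generate one answer per turn, carrying the full history forward."""
    messages: list[dict] = []
    answers: list[str] = []
    for turn in question_turns:
        messages.append({"role": "user", "content": turn})
        answer = chat_model(messages)  # assumed: messages -> reply string
        messages.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers
```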
    """
    chat_model = make_model(model, max_tokens=max_tokens, **model_kwargs)

    system_prompt = "You are a helpful assistant."
Maybe we can use a better system_prompt. What does MT-Bench use?
Good point, we have a naive default also in general (it is not blocking for this PR as we can change/improve it later).
Using the prompt of arena-hard would make most sense to me as the benchmark is more refined than MT-bench in some sense.
I added the mt-bench prompts (and other changes to reproduce their setup) here.
geoalgo
left a comment
I just did a first pass and had a few comments, I will finish the code review this afternoon.
    from openjury.utils import data_root

    def _read_json_or_jsonl(path: Path) -> list[dict]:
        if path.suffix == ".jsonl":
Can we rather use pd.read_json to avoid custom code here?
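For reference, the custom helper amounts to something like the stdlib version below (a sketch, not the PR's exact code). The suggestion would collapse it to `pd.read_json(path, lines=path.suffix == ".jsonl")` — the `lines` parameter of `pandas.read_json` handles newline-delimited JSON.

```python
import json
from pathlib import Path


def read_json_or_jsonl(path: Path) -> list[dict]:
    """Load a .json file as one document, or a .jsonl file line by line."""
    text = path.read_text()
    if path.suffix == ".jsonl":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.loads(text)
```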
    score_B: <a number between 0 and 10 to indicate the quality of Assistant B's answer>
    ```

    ## Your output, do not repeat the input above, first starts with an explanation of your judgement
Given that both prompts are exact duplicates, I would prefer to have

    - ## Your output, do not repeat the input above, first starts with an explanation of your judgement
    + ## Your output, do not repeat the input above{explanation_prompt}
where explanation_prompt is set to empty string or ", first starts with an explanation of your judgement" depending on which mode we want.
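The suggestion above can be implemented with a single format slot; the constant and function names here are illustrative, not from the codebase:

```python
# Shared template with an {explanation_prompt} slot instead of two
# near-duplicate prompt files.
JUDGE_OUTPUT_HEADER = "## Your output, do not repeat the input above{explanation_prompt}"


def output_header(with_explanation: bool) -> str:
    """Render the header with or without the explanation instruction."""
    explanation = (
        ", first starts with an explanation of your judgement"
        if with_explanation
        else ""
    )
    return JUDGE_OUTPUT_HEADER.format(explanation_prompt=explanation)
```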
geoalgo
left a comment
Thanks, looks clean overall, I just have a question regarding having the feature in a separate endpoint.
We can leave it like this for now, but perhaps we would want a separate endpoint for multi-turn generation and evaluation, as it may be cleaner since the code is quite different? (I mean having a generate_and_evaluate_multiturn.py, as the entrypoint may blow up otherwise.)
Also I assume the complexity of multiturn is going to increase a bit when we merge the other PR. Let me know what you think.
    - Uses ``temperature=0.6`` and ``top_p=0.95`` unless explicitly
      overridden.
Why change the default here?
This was the default before as well. Check here.
      | **Lighteval** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
      | **Evalchemy** | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
    - | **OpenJury** | 🔜 | ✅ | ✅ | ✅ | ✅ | ✅ |
    + | **OpenJury** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
@geoalgo It is a valid concern. The entrypoint will grow quite a bit. Other than these technical issues, I think in general it would make the UX worse if we add more entrypoints, since users need to know which entrypoint to use for which dataset. We could instead keep one entrypoint but route to different private pipelines depending on the dataset, and keep the
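The "one entrypoint, private pipelines" idea from this comment could be sketched as follows; all function names here are hypothetical:

```python
def _run_mt_bench(**kwargs):
    # Hypothetical private multi-turn pipeline.
    return ("mt-bench", kwargs)


def _run_default(**kwargs):
    # Hypothetical private single-turn pipeline.
    return ("default", kwargs)


def generate_and_evaluate(dataset: str, **kwargs):
    """Single public entrypoint; dataset-specific logic stays private."""
    if dataset == "mt-bench":
        return _run_mt_bench(**kwargs)
    return _run_default(**kwargs)
```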
Summary
This PR implements full multi-turn evaluation support for the original MT-Bench benchmark. It aligns the turn-2 prompt structure with the methodology recommended in the original paper ("Judging LLM-as-a-Judge") to prevent judge confusion and ensure high-quality evaluations.
Key Changes
Dataset & Loader:
- New `mt_bench.py` loader to handle downloading questions and reference answers from the LMSYS HuggingFace space.

Prompt Alignment:
- Multi-turn judge prompt templates (`prompt-multi-turn.txt`) that present two separate, full conversation histories (`User -> A1 -> User -> A2` and `User -> B1 -> User -> B2`).

Pipeline Logic:
- `_run_mt_bench()` to perform separate evaluation calls for Turn 1 and Turn 2.

Model Wrappers:
- Wrapped `LlamaCpp` into `ChatLlamaCppModel` to support GGUF chat templates and proper KV cache management between multi-turn calls.
- Extracted `BaseLocalModel` to share logic between the LlamaCpp and VLLM wrappers.

Features & Metrics:
- New `--mt_bench_turns` flag to support single-turn, multi-turn, or both.
- New `--mt_bench_compatibility` argument: when set to `fastchat`, it reproduces the paper's implementation from FastChat/MT-Bench with [[A]]/[[B]]/[[C]] verdict parsing, conservative position-bias handling (a model wins only if both orderings agree, otherwise it is a tie), judge temperature=0, and MT-Bench per-category temperatures. When set to `openjury` (the default), it uses the OpenJury evaluation with softmax, averages over orderings for position bias, doesn't set the judge temperature, and doesn't use per-category temperature configs.

Testing
- Comprehensive MT-Bench pipeline tests in `tests/test_generate_and_evaluate.py`.