
feat: add support for the original mt-bench#21

Open
ErlisLushtaku wants to merge 36 commits into main from
erlislushtaku/feat/add-mt-bench-support

Conversation

Collaborator

@ErlisLushtaku ErlisLushtaku commented Mar 2, 2026

Summary

This PR implements full multi-turn evaluation support for the original MT-Bench benchmark. It aligns the turn-2 prompt structure with the methodology recommended in the original paper ("Judging LLM-as-a-Judge") to prevent judge confusion and ensure high-quality evaluations.

Key Changes

  • Dataset & Loader:

    • Implemented mt_bench.py to handle downloading questions and reference answers from the LMSYS HuggingFace space.
  • Prompt Alignment:

    • Created dedicated multi-turn templates (prompt-multi-turn.txt) that present two separate, full conversation histories (User -> A1 -> User -> A2 and User -> B1 -> User -> B2).
    • This prevents the "attribution error" where the judge misidentifies model responses between turns (see paper section 3.5).
  • Pipeline Logic:

    • Added _run_mt_bench() to perform separate evaluation calls for Turn 1 and Turn 2.
    • Integrated Reference-guided grading for technical categories (Math, Reasoning, Coding) by embedding ground-truth answers directly into the judge context.
  • Model Wrappers:

    • Refactored LlamaCpp into ChatLlamaCppModel to support GGUF chat templates and proper KV cache management between multi-turn calls.
    • Extracted BaseLocalModel to share logic between LlamaCpp and VLLM wrappers.
  • Features & Metrics:

    • Added --mt_bench_turns flag to support single-turn, multi-turn, or both.
    • Added --mt_bench_compatibility argument. When set to fastchat, it reproduces the paper's FastChat/MT-Bench implementation: [[A]]/[[B]]/[[C]] verdict parsing, conservative position-bias handling (a model wins only if both orderings agree; otherwise it is a tie), judge temperature=0, and per-category MT-Bench generation temperatures. When set to openjury (the default), it uses the OpenJury evaluation: softmax score aggregation, averaging over both orderings for position bias, no fixed judge temperature, and no per-category temperature configs.
    • Updated results output to include per-category and per-turn win rate breakdowns.

Testing

  • Added comprehensive unit and integration tests in tests/test_generate_and_evaluate.py.

ErlisLushtaku and others added 24 commits February 14, 2026 20:45
- Updated README to use EuroLLM-Instruct because the base model (EuroLLM-9B) doesn't have a chat template and throws an error.
- Added functionality to load pre-existing dataset completions for models. This previously threw an error because the model was treated as a provider.
…the new `max_model_len` and related parameters
- Moved max_model_len and chat_template to **model_kwargs for readability.
- Adjusted ChatVLLM initialization to cap max_model_len based on model's max_position_embeddings.
- Added warnings for potential max_model_len issues.
- mock external api calls
- add safety check for content in completions
- moved slurmpilot to the dev group: it has no published version on PyPI, which would otherwise prevent publishing OpenJury on PyPI
- There was a halting issue with LlamaCpp: the model was not emitting the EOS token, and Llama.reset() was not called between calls (turns), causing a KV-cache position-mismatch crash. ChatLlamaCppModel was created as a custom wrapper to fix this
- BaseLocalModel was extracted as common logic for ChatLlamaCppModel and ChatVLLM
- Implement MT-Bench loader and multi-turn generation/judging logic.
- Add paper-aligned prompt templates while keeping the score-based evaluation for consistency with OpenJury.
- Support reference answers, per-turn breakdowns, and swap mode.
- Add comprehensive MT-Bench pipeline tests.
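
The KV-cache fix mentioned in the commits above can be sketched roughly like this. The class body is illustrative only; the real ChatLlamaCppModel in this PR also handles GGUF chat templates:

```python
class ChatLlamaCppModel:
    """Wrapper around a llama_cpp.Llama instance (duck-typed here so the
    sketch stays dependency-free)."""

    def __init__(self, llama):
        self.llama = llama

    def chat(self, messages, **sampling_kwargs):
        # Without this reset, llama.cpp keeps the KV cache from the previous
        # call, and a second turn re-fed from position 0 crashes with a
        # position mismatch.
        self.llama.reset()
        out = self.llama.create_chat_completion(messages=messages, **sampling_kwargs)
        return out["choices"][0]["message"]["content"]
```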
@ErlisLushtaku ErlisLushtaku requested a review from geoalgo March 2, 2026 21:25
"""
chat_model = make_model(model, max_tokens=max_tokens, **model_kwargs)

system_prompt = "You are a helpful assistant."
Collaborator

@kargibora kargibora Mar 3, 2026


Maybe we can use a better system_prompt. What does MT-Bench use?

Collaborator


Good point, we have a naive default also in general (it is not blocking for this PR as we can change/improve it later).
Using the prompt of arena-hard would make most sense to me as the benchmark is more refined than MT-bench in some sense.

Collaborator Author

@ErlisLushtaku ErlisLushtaku Mar 11, 2026


I added the mt-bench prompts (and other changes to reproduce their setup) here.

Collaborator

@geoalgo geoalgo left a comment


I just did a first pass and had a few comments, I will finish the code review this afternoon.

from openjury.utils import data_root

def _read_json_or_jsonl(path: Path) -> list[dict]:
if path.suffix == ".jsonl":
Collaborator


Can we rather use pd.read_json to avoid custom code here?

Collaborator Author


Changed it here

score_B: <a number between 0 and 10 to indicate the quality of Assistant B's answer>
```

## Your output, do not repeat the input above, first starts with an explanation of your judgement
Collaborator


Given that both prompts are exact duplicates, I would prefer to have

Suggested change
## Your output, do not repeat the input above, first starts with an explanation of your judgement
## Your output, do not repeat the input above{explanation_prompt}

where explanation_prompt is set to empty string or ", first starts with an explanation of your judgement" depending on which mode we want.
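
A minimal sketch of the proposed templating (names are illustrative):

```python
# Shared template; the explanation clause is injected only when wanted.
TEMPLATE = "## Your output, do not repeat the input above{explanation_prompt}"

def render(with_explanation: bool) -> str:
    explanation_prompt = (
        ", first starts with an explanation of your judgement"
        if with_explanation
        else ""
    )
    return TEMPLATE.format(explanation_prompt=explanation_prompt)
```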

Collaborator Author


Changed it here

"""
chat_model = make_model(model, max_tokens=max_tokens, **model_kwargs)

system_prompt = "You are a helpful assistant."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, we have a naive default also in general (it is not blocking for this PR as we can change/improve it later).
Using the prompt of arena-hard would make most sense to me as the benchmark is more refined than MT-bench in some sense.

- Implemented a new function to download MT-Bench questions and GPT-4 reference answers, with fallback mechanisms for missing references.
- Remove duplication.
Collaborator

@geoalgo geoalgo left a comment


Thanks, looks clean overall. I just have a question regarding having the feature in a separate endpoint.

We can leave it like this for now, but perhaps we would want a separate endpoint for multi-turn generation and evaluation, as the code is quite different and it may be cleaner? (I mean having a generate_and_evaluate_multiturn.py, as the entrypoint may blow up otherwise.)

Also I assume the complexity of multiturn is going to increase a bit when we merge the other PR. Let me know what you think.

Comment on lines +304 to +305
- Uses ``temperature=0.6`` and ``top_p=0.95`` unless explicitly
overridden.
Collaborator


Why change the default here?

Collaborator Author


This was the default before as well. Check here.
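
The "unless explicitly overridden" behavior in the docstring under discussion amounts to a simple dict merge; a sketch with hypothetical names:

```python
# MT-Bench generation defaults; any caller-supplied kwarg wins over them.
DEFAULT_SAMPLING = {"temperature": 0.6, "top_p": 0.95}

def resolve_sampling(**overrides) -> dict:
    return {**DEFAULT_SAMPLING, **overrides}
```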

| **Lighteval** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **Evalchemy** | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| **OpenJury** | 🔜 | ✅ | ✅ | ✅ | ✅ | ✅ |
| **OpenJury** | | ✅ | ✅ | ✅ | ✅ | ✅ |
Collaborator


💪

Collaborator Author

ErlisLushtaku commented Mar 17, 2026

> Thanks, looks clean overall. I just have a question regarding having the feature in a separate endpoint.
>
> We can leave it like this for now, but perhaps we would want a separate endpoint for multi-turn generation and evaluation, as the code is quite different and it may be cleaner? (I mean having a generate_and_evaluate_multiturn.py, as the entrypoint may blow up otherwise.)
>
> Also I assume the complexity of multiturn is going to increase a bit when we merge the other PR. Let me know what you think.

@geoalgo It is a valid concern. The entrypoint will grow quite a bit.

However, the pipeline for mt-bench-101 is also quite different from mt-bench (in this PR), and the mt-bench here also supports single turns, so they wouldn't fit together in a new multi-turn entrypoint. We could remove the single-turn mode for mt-bench and always evaluate on both turns. We could also simplify here by always using the original mt-bench mode and removing the openjury mode for mt-bench (currently controlled by the --mt_bench_compatibility argument). Then this could fit in a new multi-turn entrypoint and be more separate from the OpenJury pipeline.

Beyond these technical issues, I think adding more entrypoints would generally make the UX worse, since users would need to know which entrypoint to use for which dataset. We could instead keep one entrypoint but route to different private pipelines depending on the dataset, keeping generate_and_evaluate.py leaner as in this new commit, if you agree?
We could still make the simplifications mentioned above, i.e. always evaluate both turns for mt-bench and always use the mt-bench mode. What do you think?
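
The routing idea could be sketched like this (function and dataset names are hypothetical stand-ins for the private pipelines, not this repo's actual API):

```python
def _run_mt_bench(**kwargs):
    return "mt-bench multi-turn pipeline"

def _run_mt_bench_101(**kwargs):
    return "mt-bench-101 pipeline"

def _run_single_turn(**kwargs):
    return "default single-turn pipeline"

# One public entrypoint; the dataset name selects the private pipeline.
PIPELINES = {
    "mt_bench": _run_mt_bench,
    "mt_bench_101": _run_mt_bench_101,
}

def generate_and_evaluate(dataset: str, **kwargs):
    runner = PIPELINES.get(dataset, _run_single_turn)
    return runner(**kwargs)
```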
