Description
Users have been reporting for a while that their evaluation scores don't match official benchmarks (e.g. #3407, #3411, #2774, #3405, #2555, #2707, among others), particularly when using instruction-tuned models. The evaluation harness was originally designed primarily for base models, which are evaluated with few-shot examples and rely on in-context pattern recognition (or, alternatively, multiple-choice/loglikelihood-based evals). We've also preferred following the official implementation for each task (where possible), which works well for base models since they behave relatively consistently across different model families (give or take).
However, instruction-tuned models need to be handled differently:
- They're trained to follow explicit instructions and perform significantly better when given clear formatting directives (e.g., "Write your answer as: The answer is [X]"); see the prompt sketch after this list
- Unlike base models, there's no "official" instruction format we can follow as a standard
- Different model families/types may require different instruction formats (e.g., thinking models vs. vanilla)
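To make the difference concrete, here's a rough, made-up sketch (neither prompt is taken from the harness, and the question is invented) contrasting a base-model few-shot prompt with a chat-style prompt carrying an explicit formatting directive:

```python
# Hypothetical illustration: the same question presented two ways.
# Neither prompt reflects the harness's actual formats.

# Base model: few-shot continuation; the model is expected to imitate the pattern.
base_prompt = (
    "Question: Tom has 3 apples and buys 5 more. How many apples does he have?\n"
    "Answer: 8\n"
    "Question: A train travels 60 miles in 1.5 hours. What is its speed in mph?\n"
    "Answer:"
)

# Instruction-tuned model: a single chat turn with an explicit formatting directive.
chat_messages = [
    {
        "role": "user",
        "content": (
            "A train travels 60 miles in 1.5 hours. What is its speed in mph?\n"
            "Write your answer as: The answer is [X]."
        ),
    }
]
```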
Proposed Solution
Add a new optional field `chat_overload` to task configurations that provides instruction-specific formatting when a model's `chat_template` is being used.
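As a rough illustration of what this could look like (everything besides the `chat_overload` name is hypothetical, and it's shown as a Python dict purely for illustration; actual task configs are YAML):

```python
# Hypothetical gsm8k-style task config with the proposed chat_overload field.
# Shown as a Python dict for illustration only; none of these values are taken
# from the real gsm8k config.
gsm8k_task = {
    "task": "gsm8k",
    "doc_to_text": "Question: {{question}}\nAnswer:",
    "doc_to_target": "{{answer}}",
    # Only consulted when the model is being run with a chat template:
    "chat_overload": {
        "instruction": (
            "Solve the following math problem. "
            "Write your final answer as: The answer is [X]."
        ),
    },
}
```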
How it works:
- When a model has a chat template defined (either auto-detected or specified), the `chat_overload` instructions are automatically incorporated (see the sketch after this list)
- Base models without chat templates continue using the standard few-shot evaluation format
- This allows each model type to be evaluated properly without requiring manual prompt modifications
- An alternative could be to keep creating different config variants (`gsm8k`, `gsm8k_chat_cot`, `gsm8k_chat_qwen`, ...), but this would get unwieldy fast and I think it would also be confusing to users.
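A minimal sketch of the selection logic, assuming hypothetical names throughout (`TaskConfig`, `ChatModel`, `build_fewshot_prompt`, `build_prompt` are made up for illustration and are not existing harness APIs):

```python
from dataclasses import dataclass


@dataclass
class TaskConfig:
    doc_to_text: str  # simplified str.format template (real configs use Jinja)
    num_fewshot: int = 5
    chat_overload: dict | None = None


class ChatModel:
    """Stub standing in for a model wrapper whose chat template was detected."""

    def has_chat_template(self) -> bool:
        return True

    def apply_chat_template(self, messages: list[dict]) -> str:
        # Simplified rendering; a real model would use its tokenizer's template.
        return "".join(f"<|{m['role']}|>\n{m['content']}\n" for m in messages)


def build_fewshot_prompt(task: TaskConfig, doc: dict, fewshot_docs: list[dict]) -> str:
    # Standard base-model format: concatenated few-shot examples, then the target question.
    shots = "\n\n".join(
        task.doc_to_text.format(**d) + " " + d["answer"] for d in fewshot_docs
    )
    return (shots + "\n\n" if shots else "") + task.doc_to_text.format(**doc)


def build_prompt(model, task: TaskConfig, doc: dict, fewshot_docs: list[dict]) -> str:
    if model.has_chat_template() and task.chat_overload:
        # Instruction-tuned path: prepend the task's explicit formatting
        # instruction and send the question as a single chat turn.
        user_msg = (
            task.chat_overload["instruction"] + "\n\n" + task.doc_to_text.format(**doc)
        )
        return model.apply_chat_template([{"role": "user", "content": user_msg}])
    # Base-model path: unchanged few-shot continuation format.
    return build_fewshot_prompt(task, doc, fewshot_docs[: task.num_fewshot])
```

With something like this, a single task config serves both model classes; the `chat_overload` block only changes behavior when a chat template is actually in play.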
This approach would keep base model evaluations fairly consistent while properly supporting instruction-tuned models. It would also let us support multiple instruction variants as the community iterates.