add instruct_format (or chat_overload or similar) field to task configs #3417

@baberabb

Description

Users have been reporting for a while that their evaluation scores don't match official benchmarks (e.g. #3407, #3411, #2774, #3405, #2555, #2707, among others), particularly when using instruction-tuned models. The evaluation harness was originally designed primarily for base models, which are evaluated with few-shot examples and rely on in-context pattern recognition (or, alternatively, multiple-choice/loglikelihood-based evals). We've also preferred to follow the official implementation for each task where possible, which works well for base models since they behave relatively consistently across different model families (give or take).

However, instruction-tuned models need to be handled differently:

  • They're trained to follow explicit instructions and perform significantly better when given clear formatting directives (e.g., "Write your answer as: The answer is [X]")
  • Unlike base models, there's no "official" instruction format we can follow as a standard
  • Different model families/types may require different instruction formats (e.g., thinking models vs. vanilla instruct models)

Proposed Solution

Add a new optional field chat_overload to task configurations that provides instruction-specific formatting when a model's chat_template is being used.
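As a rough sketch, a task YAML could gain a field along these lines. The field name and structure here are illustrative only, not a finalized schema, and the surrounding keys are simplified:

```yaml
# Hypothetical gsm8k-style task config; chat_overload is the proposed field.
task: gsm8k
dataset_path: gsm8k
doc_to_text: "Question: {{question}}\nAnswer:"
# Applied only when the model's chat_template is in use;
# base models without a chat template ignore this block entirely.
chat_overload:
  instruction: |
    Solve the problem step by step, then write your final answer as:
    "The answer is [X]".
```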

How it works:

  • When a model has a chat template defined (either auto-detected or specified), the chat_overload instructions are automatically incorporated
  • Base models without chat templates continue using the standard few-shot evaluation format
  • This allows each model type to be evaluated properly without requiring manual prompt modifications
  • An alternative would be to keep creating different config variants (gsm8k, gsm8k_chat_cot, gsm8k_chat_qwen, ...), but this would get unwieldy fast and, I think, would also confuse users.
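The selection logic described above could look roughly like the following. This is a minimal sketch, not the harness's actual API: `build_prompt`, `toy_chat_template`, and the config shape are all made up for illustration.

```python
# Hypothetical sketch of how a chat_overload field might be applied.
# None of these names are real lm-evaluation-harness APIs.

def build_prompt(doc_text, task_config, chat_template=None, fewshot_examples=()):
    """Return the final prompt string for one document.

    If the model supplies a chat template and the task defines a
    chat_overload, the overload instruction is prepended and the
    template is applied; otherwise the standard few-shot
    concatenation is used unchanged.
    """
    overload = task_config.get("chat_overload")
    if chat_template is not None and overload:
        # Instruction-tuned path: wrap the question with explicit directives.
        user_turn = f"{overload['instruction']}\n\n{doc_text}"
        return chat_template([{"role": "user", "content": user_turn}])
    # Base-model path: classic few-shot concatenation.
    return "\n\n".join([*fewshot_examples, doc_text])


# Minimal stand-in for a model's chat template.
def toy_chat_template(messages):
    return "".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in messages)


config = {"chat_overload": {"instruction": 'Write your answer as: "The answer is [X]".'}}

# Same task config, two model types, two different prompts:
chat_prompt = build_prompt("Q: 2+2=?", config, chat_template=toy_chat_template)
base_prompt = build_prompt("Q: 2+2=?", config, fewshot_examples=["Q: 1+1=?\nA: 2"])
```

The key property is that the task config stays single-sourced: the base-model path is byte-identical to today's behavior, and the overload only activates when a chat template is present.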

This approach would keep base-model evaluations consistent while properly supporting instruction-tuned models. It would also let us support multiple instruction variants as the community iterates.
