Description
Users have been reporting for a while that their evaluation scores don't match official benchmarks (e.g. #3407, #3411, #2774, #3405, #2555, #2707, among others), particularly when using instruction-tuned models. The evaluation harness was originally designed primarily for base models, which are evaluated with few-shot examples and rely on in-context pattern recognition (or, alternatively, multiple-choice/loglikelihood-based evals). We've also preferred following the official implementation for each task (where possible), which works well for base models since they behave relatively consistently across different model families (give or take).
However, instruction-tuned models need to be handled differently:
- They're trained to follow explicit instructions and perform significantly better when given clear formatting directives (e.g., "Write your answer as: The answer is [X]"); see the prompt sketch after this list
- Unlike base models, there's no "official" instruction format we can follow as a standard
- Different model families/types may require different instruction formats (e.g., thinking models vs. vanilla)
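To make the difference concrete, here's a rough, made-up sketch (neither prompt is taken from the harness, and the question is invented) contrasting a base-model few-shot prompt with a chat-style prompt carrying an explicit formatting directive:

```python
# Hypothetical illustration: the same question presented two ways.
# Neither prompt reflects the harness's actual formats.

# Base model: few-shot continuation; the model is expected to imitate the pattern.
base_prompt = (
    "Question: Tom has 3 apples and buys 5 more. How many apples does he have?\n"
    "Answer: 8\n"
    "Question: A train travels 60 miles in 1.5 hours. What is its speed in mph?\n"
    "Answer:"
)

# Instruction-tuned model: a single chat turn with an explicit formatting directive.
chat_messages = [
    {
        "role": "user",
        "content": (
            "A train travels 60 miles in 1.5 hours. What is its speed in mph?\n"
            "Write your answer as: The answer is [X]."
        ),
    }
]
```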
Proposed Solution
Add a new optional field `chat_overload` to task configurations that provides instruction-specific formatting when a model's `chat_template` is being used.
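As a rough illustration of what this could look like (everything besides the `chat_overload` name is hypothetical, and it's shown as a Python dict purely for illustration; actual task configs are YAML):

```python
# Hypothetical gsm8k-style task config with the proposed chat_overload field.
# Shown as a Python dict for illustration only; none of these values are taken
# from the real gsm8k config.
gsm8k_task = {
    "task": "gsm8k",
    "doc_to_text": "Question: {{question}}\nAnswer:",
    "doc_to_target": "{{answer}}",
    # Only consulted when the model is being run with a chat template:
    "chat_overload": {
        "instruction": (
            "Solve the following math problem. "
            "Write your final answer as: The answer is [X]."
        ),
    },
}
```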
How it works:
- When a model has a chat template defined (either auto-detected or specified), the `chat_overload` instructions are automatically incorporated (see the sketch after this list)
- Base models without chat templates continue using the standard few-shot evaluation format
- This allows each model type to be evaluated properly without requiring manual prompt modifications
- An alternative could be to keep creating different config variants (`gsm8k`, `gsm8k_chat_cot`, `gsm8k_chat_qwen`, ...), but this would get unwieldy fast and I think it would also be confusing to users.
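A minimal sketch of the selection logic, assuming hypothetical names throughout (`TaskConfig`, `ChatModel`, `build_fewshot_prompt`, `build_prompt` are made up for illustration and are not existing harness APIs):

```python
from dataclasses import dataclass


@dataclass
class TaskConfig:
    doc_to_text: str  # simplified str.format template (real configs use Jinja)
    num_fewshot: int = 5
    chat_overload: dict | None = None


class ChatModel:
    """Stub standing in for a model wrapper whose chat template was detected."""

    def has_chat_template(self) -> bool:
        return True

    def apply_chat_template(self, messages: list[dict]) -> str:
        # Simplified rendering; a real model would use its tokenizer's template.
        return "".join(f"<|{m['role']}|>\n{m['content']}\n" for m in messages)


def build_fewshot_prompt(task: TaskConfig, doc: dict, fewshot_docs: list[dict]) -> str:
    # Standard base-model format: concatenated few-shot examples, then the target question.
    shots = "\n\n".join(
        task.doc_to_text.format(**d) + " " + d["answer"] for d in fewshot_docs
    )
    return (shots + "\n\n" if shots else "") + task.doc_to_text.format(**doc)


def build_prompt(model, task: TaskConfig, doc: dict, fewshot_docs: list[dict]) -> str:
    if model.has_chat_template() and task.chat_overload:
        # Instruction-tuned path: prepend the task's explicit formatting
        # instruction and send the question as a single chat turn.
        user_msg = (
            task.chat_overload["instruction"] + "\n\n" + task.doc_to_text.format(**doc)
        )
        return model.apply_chat_template([{"role": "user", "content": user_msg}])
    # Base-model path: unchanged few-shot continuation format.
    return build_fewshot_prompt(task, doc, fewshot_docs[: task.num_fewshot])
```

With something like this, a single task config serves both model classes; the `chat_overload` block only changes behavior when a chat template is actually in play.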
This approach would keep base model evaluations fairly consistent while properly supporting instruction-tuned models. It would also let us support multiple instruction variants as the community iterates.