What is best practice for evaluating a base model with inspect? #3122

@kdu4108

Description

Hi, what is the recommended best practice for evaluating a base model with inspect on, for example, gsm8k?

For example, I'd like to evaluate Qwen/Qwen3-1.7B-Base. It appears that the execution will eventually reach

        if self.tokenizer.chat_template is not None:
            chat = self.tokenizer.apply_chat_template(
                hf_messages,
                add_generation_prompt=True,
                tokenize=False,
                tools=tools_list if len(tools_list) > 0 else None,
                enable_thinking=self.enable_thinking,  # not all models use this, check if it is supported
            )
        else:
            chat = ""
            for message in hf_messages:
                chat += f"{message.role}: {message.content}\n"
        # return
        return cast(str, chat)

at the `if self.tokenizer.chat_template is not None:` check when using the hf provider.

However, since the tokenizer of the Qwen/Qwen3-1.7B-Base base model does have a chat template, it will format the text nonetheless using the chat template (which results in gibberish).

Similarly, for vllm, the text appears to always get formatted into chat messages, if I'm reading the execution correctly.

Is there a recommended way to generate an input text without a chat template for base models with these two providers? I suppose one could check whether "base" appears in the model name, but that feels rather hacky, especially since model families differ on whether the name without "base" denotes the base or the instruct model (e.g., Qwen2.5 vs. Qwen3). Thanks for the help!
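One possible workaround (just a sketch, not an official inspect feature): since the quoted provider code only falls back to plain `role: content` formatting when `tokenizer.chat_template` is `None`, clearing that attribute on the loaded tokenizer before evaluation would force the plain-text path. A minimal self-contained illustration of that dispatch, using a hypothetical `StubTokenizer` in place of a real HF tokenizer:

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str


class StubTokenizer:
    """Illustrative stand-in for a HF tokenizer (hypothetical, demo only)."""

    # Base models such as Qwen/Qwen3-1.7B-Base ship a chat template too,
    # which is why the template branch is taken even for them.
    chat_template = "<some jinja template>"

    def apply_chat_template(self, messages, add_generation_prompt=True, tokenize=False):
        # Stand-in for the real Jinja rendering.
        rendered = "".join(
            f"<|im_start|>{m.role}\n{m.content}<|im_end|>\n" for m in messages
        )
        return rendered + ("<|im_start|>assistant\n" if add_generation_prompt else "")


def render(tokenizer, hf_messages):
    # Mirrors the provider's dispatch quoted above: use the chat template
    # if present, otherwise fall back to plain "role: content" lines.
    if tokenizer.chat_template is not None:
        return tokenizer.apply_chat_template(hf_messages)
    chat = ""
    for message in hf_messages:
        chat += f"{message.role}: {message.content}\n"
    return chat


tok = StubTokenizer()
msgs = [Message("user", "What is 2 + 2?")]
print(render(tok, msgs))  # chat-formatted (undesirable for a base model)

tok.chat_template = None  # hypothetical workaround: clear the template
print(render(tok, msgs))  # user: What is 2 + 2?
```

Whether the tokenizer is reachable at the right point in the hf/vllm providers to apply this is exactly the question; a supported flag on the provider to skip chat templating would be cleaner.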
