What is best practice for evaluating a base model with inspect? #3122

@kdu4108

Description

Hi, what is the recommended best practice for evaluating a base model with inspect on, for example, gsm8k?

For example, I'd like to evaluate Qwen/Qwen3-1.7B-Base. It appears that the execution will eventually reach

        if self.tokenizer.chat_template is not None:
            chat = self.tokenizer.apply_chat_template(
                hf_messages,
                add_generation_prompt=True,
                tokenize=False,
                tools=tools_list if len(tools_list) > 0 else None,
                enable_thinking=self.enable_thinking,  # not all models use this, check if it is supported
            )
        else:
            chat = ""
            for message in hf_messages:
                chat += f"{message.role}: {message.content}\n"
        # return
        return cast(str, chat)

at the `if self.tokenizer.chat_template is not None:` check when using the hf provider.

However, since the tokenizer of the Qwen/Qwen3-1.7B-Base base model does have a chat template, it will format the text nonetheless using the chat template (which results in gibberish).

Similarly, for vllm, the text appears to always get formatted into chat messages, if I'm reading the execution correctly.

Is there a recommended way to generate an input text without a chat template for base models with these two providers? I suppose one could check whether "base" appears in the model name, but that feels rather hacky, especially since model families differ on whether the name without "base" denotes the base or the instruct model (e.g., Qwen2.5 vs. Qwen3). Thanks for the help!
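One possible workaround (just a sketch, not an official inspect feature): since the quoted provider code only falls back to plain `role: content` formatting when `tokenizer.chat_template` is `None`, clearing that attribute on the loaded tokenizer before evaluation would force the plain-text path. A minimal self-contained illustration of that dispatch, using a hypothetical `StubTokenizer` in place of a real HF tokenizer:

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str


class StubTokenizer:
    """Illustrative stand-in for a HF tokenizer (hypothetical, demo only)."""

    # Base models such as Qwen/Qwen3-1.7B-Base ship a chat template too,
    # which is why the template branch is taken even for them.
    chat_template = "<some jinja template>"

    def apply_chat_template(self, messages, add_generation_prompt=True, tokenize=False):
        # Stand-in for the real Jinja rendering.
        rendered = "".join(
            f"<|im_start|>{m.role}\n{m.content}<|im_end|>\n" for m in messages
        )
        return rendered + ("<|im_start|>assistant\n" if add_generation_prompt else "")


def render(tokenizer, hf_messages):
    # Mirrors the provider's dispatch quoted above: use the chat template
    # if present, otherwise fall back to plain "role: content" lines.
    if tokenizer.chat_template is not None:
        return tokenizer.apply_chat_template(hf_messages)
    chat = ""
    for message in hf_messages:
        chat += f"{message.role}: {message.content}\n"
    return chat


tok = StubTokenizer()
msgs = [Message("user", "What is 2 + 2?")]
print(render(tok, msgs))  # chat-formatted (undesirable for a base model)

tok.chat_template = None  # hypothetical workaround: clear the template
print(render(tok, msgs))  # user: What is 2 + 2?
```

Whether the tokenizer is reachable at the right point in the hf/vllm providers to apply this is exactly the question; a supported flag on the provider to skip chat templating would be cleaner.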
