Feature Request: Allow Assistant Messages to Act as Generation Prefix (Prefill) #3877

@fonfonya

Description

This issue proposes adding support for assistant prefill: allowing an assistant message to be provided as a prefix that the model should continue generating from, rather than treating it as a completed assistant turn.

This capability enables deterministic continuation, structured output anchoring, and server-controlled tool or schema-guided generation.

Current Behavior

When passing messages like:

messages = [
    {"role": "user", "content": "Write a short apology email to a customer for a delayed shipment."},
    {"role": "assistant", "content": "Hi John,\n\nI'm sorry for the delay with your order. "}
]

the model server currently serializes the assistant message as a completed turn, closing it with <|im_end|>. Because the generation prompt is always appended (constexpr bool add_generation_prompt = true;), it then opens a new assistant turn:

Pipeline input text: <|im_start|>user
Write a short apology email to a customer for a delayed shipment.<|im_end|>
<|im_start|>assistant
Hi John,

I'm sorry for the delay with your order. <|im_end|>
<|im_start|>assistant

As a result, the provided assistant content cannot be used as the active generation prefix.

Expected Behavior

The assistant message should be treated as a partial prefix, and generation should continue immediately after it:

<|im_start|>user
Write a short apology email to a customer for a delayed shipment.<|im_end|>
<|im_start|>assistant
Hi John,

I'm sorry for the delay with your order. 
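
The difference between the two serializations can be sketched in a few lines of Python. This is only an illustration of the requested behavior, not the server's actual template code: the `render_chatml` helper and its `prefill_last_assistant` flag are hypothetical names, and the ChatML layout is copied from the pipeline text shown above.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"

def render_chatml(messages, prefill_last_assistant=False):
    """Render messages as ChatML text (hypothetical helper, not the server's API).

    With prefill_last_assistant=True, a trailing assistant message is left
    open: no <|im_end|> is written and no new assistant header is appended.
    """
    prefill = (
        prefill_last_assistant
        and bool(messages)
        and messages[-1]["role"] == "assistant"
    )
    parts = []
    for i, message in enumerate(messages):
        parts.append(f"{IM_START}{message['role']}\n{message['content']}")
        is_open_turn = prefill and i == len(messages) - 1
        if not is_open_turn:
            parts.append(f"{IM_END}\n")
    if not prefill:
        # Equivalent of add_generation_prompt = true: open a fresh assistant turn.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)

messages = [
    {"role": "user", "content": "Write a short apology email to a customer for a delayed shipment."},
    {"role": "assistant", "content": "Hi John,\n\nI'm sorry for the delay with your order. "},
]

current = render_chatml(messages)                                # today's behavior
expected = render_chatml(messages, prefill_last_assistant=True)  # requested behavior
```

Here `current` ends with a second, empty assistant header, while `expected` ends with the partial email text, so the first generated token continues the sentence.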

Other Use Cases

Structured Output Prefill

messages = [
    {
        "role": "user",
        "content": "Is the customer satisfied? Respond in JSON with fields \"reasoning\" and \"answer\"."
    },
    {
        "role": "assistant",
        # Prefills the JSON shape and anchors a concise reasoning style.
        "content": '{\n  "reasoning": "Based on the user\'s tone and wording, '
    }
]

This allows the server to anchor the start of the output structure while still letting the model complete the response naturally.
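
On the client side, a prefilled response has to be reassembled before it can be parsed. The sketch below assumes the server returns only the newly generated text, not the prefix. The issue does not specify this, and the `completion` string is a made-up stand-in for model output.

```python
import json

# The prefill sent as the final assistant message (from the example above).
prefill = '{\n  "reasoning": "Based on the user\'s tone and wording, '

# Hypothetical continuation returned by the model; not real output.
completion = 'the customer sounds frustrated.",\n  "answer": "no"\n}'

def parse_prefilled_json(prefill, completion):
    """Join the prefix and the generated continuation, then parse as JSON."""
    return json.loads(prefill + completion)

result = parse_prefilled_json(prefill, completion)
```

If the model drifts away from the anchored shape, `json.loads` raises `json.JSONDecodeError`, so callers would still want to validate the result.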

Tool-Guided Generation

messages = [
    {
        "role": "user",
        "content": "Tell the user that their package is delayed by 2 days."
    },
    {
        "role": "assistant",
        # Tool name and immutable arguments are injected by the system.
        # The model only needs to generate the remaining message content.
        "content": '{"tool": "send_notification", "args": {"user_id": "uid_541", "message": "'
    }
]

This pattern enables server-controlled tool configuration, while allowing the model to complete the remaining schema-constrained fields.
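
One way a server could build such a prefix without hand-writing JSON is to serialize the full call with a placeholder in the open field and cut the string at the placeholder. This is a sketch under assumptions: `build_tool_prefill` and `SENTINEL` are hypothetical names, the open field must be the last key serialized, and the `completion` string stands in for model output.

```python
import json

# Hypothetical placeholder; it must not occur in the tool name or fixed arguments.
SENTINEL = "__MODEL_FILL__"

def build_tool_prefill(tool, fixed_args, open_field):
    """Return the JSON text up to the opening quote of the model-written field."""
    payload = {"tool": tool, "args": {**fixed_args, open_field: SENTINEL}}
    serialized = json.dumps(payload)
    # Everything before the placeholder is injected by the system.
    prefix, _closing = serialized.split(SENTINEL, 1)
    return prefix

prefill = build_tool_prefill("send_notification", {"user_id": "uid_541"}, "message")

# Hypothetical continuation: the model writes the message and closes the JSON.
completion = 'Your package is delayed by 2 days."}}'
call = json.loads(prefill + completion)
```

Because the tool name and `user_id` are part of the prefix, the model can only influence the `message` value and whatever follows it. Since nothing stops the continuation from adding extra keys, a server would likely still validate `call` against the tool schema.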

Metadata

Labels: enhancement (New feature or request)
