Description
This issue proposes adding support for assistant prefill: allowing an assistant message to be provided as a prefix that the model should continue generating from, rather than treating it as a completed assistant turn.
This capability enables deterministic continuation, structured output anchoring, and server-controlled tool or schema-guided generation.
Current Behavior
When passing messages like:
messages = [
    {"role": "user", "content": "Write a short apology email to a customer for a delayed shipment."},
    {"role": "assistant", "content": "Hi John,\n\nI'm sorry for the delay with your order. "}
]
the model server currently serializes the assistant message as a completed turn (constexpr bool add_generation_prompt = true;) and then starts a new assistant turn:
Pipeline input text:
<|im_start|>user
Write a short apology email to a customer for a delayed shipment.<|im_end|>
<|im_start|>assistant
Hi John,
I'm sorry for the delay with your order. <|im_end|>
<|im_start|>assistant
As a result, the provided assistant content cannot be used as the active generation prefix.
Expected Behavior
The assistant message should be treated as a partial prefix, and generation should continue immediately after it:
<|im_start|>user
Write a short apology email to a customer for a delayed shipment.<|im_end|>
<|im_start|>assistant
Hi John,
I'm sorry for the delay with your order.
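The difference between the current and expected serialization can be sketched in a few lines of Python. This is only an illustration of the requested behavior, not the server's implementation: the helper `render_chatml` and its `prefill_last_assistant` flag are hypothetical names, and the ChatML-style tokens are taken from the pipeline output shown above.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]], prefill_last_assistant: bool = True) -> str:
    """Render messages in the ChatML-style layout shown above (hypothetical helper)."""
    # Prefill applies only when the final message is an assistant message.
    use_prefill = (
        prefill_last_assistant
        and bool(messages)
        and messages[-1]["role"] == "assistant"
    )
    parts = []
    last = len(messages) - 1
    for i, msg in enumerate(messages):
        if use_prefill and i == last:
            # Leave the turn open: no end token, so generation continues the prefix.
            parts.append(f"{IM_START}assistant\n{msg['content']}")
        else:
            parts.append(f"{IM_START}{msg['role']}\n{msg['content']}{IM_END}\n")
    if not use_prefill:
        # Current behavior: always open a fresh assistant turn.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)
```

With `prefill_last_assistant=False` the sketch reproduces the current output (closed assistant turn followed by a new assistant header); with the default it stops right after the prefix, which is the expected behavior described above.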
Other Use Cases
Structured Output Prefill
messages = [
    {
        "role": "user",
        "content": "Is the customer satisfied? Respond in JSON with fields \"reasoning\" and \"answer\"."
    },
    {
        "role": "assistant",
        # Prefills the JSON shape and anchors a concise reasoning style.
        "content": '{\n "reasoning": "Based on the user\'s tone and wording, '
    }
]
This allows the server to enforce output structure while still letting the model complete the response naturally.
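On the client or server side, the prefix and the model's continuation have to be joined before the result can be parsed. A minimal sketch, assuming the continuation comes back as plain text: `parse_prefilled_json` is a hypothetical helper, and the continuation string below is a hand-written stand-in rather than real model output.

```python
import json

# Prefill copied from the structured output example above.
PREFILL = '{\n "reasoning": "Based on the user\'s tone and wording, '


def parse_prefilled_json(prefill: str, continuation: str) -> dict:
    """Join the supplied prefix with the model's continuation and parse the result.

    Raises json.JSONDecodeError if the model did not finish the JSON object.
    """
    return json.loads(prefill + continuation)


# Hand-written stand-in for a model continuation (not real model output).
example_continuation = 'the customer seems frustrated.",\n "answer": "no"\n}'
result = parse_prefilled_json(PREFILL, example_continuation)
print(result["answer"])
```

Prefill only fixes how the output starts, so parsing the joined text is still worth doing to catch a continuation that never closes the object.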
Tool-Guided Generation
messages = [
    {
        "role": "user",
        "content": "Tell the user that their package is delayed by 2 days."
    },
    {
        "role": "assistant",
        # Tool name and immutable arguments are injected by the system.
        # The model only needs to generate the remaining message content.
        "content": '{"tool": "send_notification", "args": {"user_id": "uid_541", "message": "'
    }
]
This pattern enables server-controlled tool configuration, while allowing the model to complete the remaining schema-constrained fields.
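One way a server might construct such a prefix is to serialize the fixed part with `json.dumps` and then reopen the object at the field the model should fill. The sketch below is an assumption about how this could be done, not part of the proposal: `build_tool_prefill` and `parse_tool_call` are hypothetical helpers, and the continuation used in the usage lines is hand-written.

```python
import json


def build_tool_prefill(tool: str, fixed_args: dict, open_field: str) -> str:
    """Build a JSON prefix that ends inside the string value of `open_field`."""
    head = json.dumps({"tool": tool, "args": fixed_args})
    # `head` always ends with "}}" (args object, then outer object); drop both.
    sep = ", " if fixed_args else ""
    return head[:-2] + sep + json.dumps(open_field) + ': "'


def parse_tool_call(prefill: str, continuation: str, fixed_args: dict) -> dict:
    """Parse prefix + continuation and check the injected arguments are intact."""
    call = json.loads(prefill + continuation)
    for key, value in fixed_args.items():
        if call["args"].get(key) != value:
            raise ValueError(f"fixed argument {key!r} was altered")
    return call


fixed = {"user_id": "uid_541"}
prefill = build_tool_prefill("send_notification", fixed, "message")
# Hand-written stand-in for the model's continuation (not real model output).
call = parse_tool_call(prefill, 'Your package is delayed by 2 days."}}', fixed)
```

Here `prefill` equals the assistant content in the example above, so the model only has to produce the message text and the closing braces.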