Feature Request: Allow Assistant Messages to Act as Generation Prefix (Prefill)

This issue proposes adding support for assistant prefill: allowing an `assistant` message to be provided as a prefix that the model should continue generating from, rather than treating it as a completed assistant turn.

This capability lets the caller fix the opening of a response, anchor structured output, and keep tool names and arguments under server control while the model generates the rest.

## Current Behavior

When passing messages like:

```python
messages = [
    {"role": "user", "content": "Write a short apology email to a customer for a delayed shipment."},
    {"role": "assistant", "content": "Hi John,\n\nI'm sorry for the delay with your order. "}
]
```

the model server currently serializes the assistant message as a completed turn and, because the generation prompt is always appended (`constexpr bool add_generation_prompt = true;`), then opens a new assistant turn:

```
Pipeline input text: <|im_start|>user
Write a short apology email to a customer for a delayed shipment.<|im_end|>
<|im_start|>assistant
Hi John,

I'm sorry for the delay with your order. <|im_end|>
<|im_start|>assistant
```

As a result, the provided assistant content cannot be used as the active generation prefix.

## Expected Behavior

The assistant message should be treated as a partial prefix, and generation should continue immediately after it:

```
<|im_start|>user
Write a short apology email to a customer for a delayed shipment.<|im_end|>
<|im_start|>assistant
Hi John,

I'm sorry for the delay with your order. 
```
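The difference between the two renderings can be sketched in a few lines of Python. This is only an illustration of the requested behavior, not the server's actual templating code: `render_chatml` is a hypothetical helper, and the `continue_final_message` flag name is an assumption borrowed from the option of the same name in Hugging Face Transformers' `apply_chat_template`; the real API for this server may look different.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"


def render_chatml(messages, continue_final_message=False):
    """Render messages in a ChatML-style layout (hypothetical helper).

    When continue_final_message is True and the last message is an
    assistant message, that message is left open (no <|im_end|>) and no
    new assistant header is appended, so generation resumes inside it.
    """
    prefill = (
        continue_final_message
        and bool(messages)
        and messages[-1]["role"] == "assistant"
    )
    last = len(messages) - 1
    parts = []
    for i, msg in enumerate(messages):
        parts.append(f"{IM_START}{msg['role']}\n{msg['content']}")
        # Close every turn except the one being used as a prefix.
        if not (prefill and i == last):
            parts.append(f"{IM_END}\n")
    if not prefill:
        # Equivalent of add_generation_prompt = true: open a fresh turn.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Write a short apology email to a customer for a delayed shipment."},
    {"role": "assistant", "content": "Hi John,\n\nI'm sorry for the delay with your order. "},
]

current = render_chatml(messages)  # closes the turn, opens a new one
prefilled = render_chatml(messages, continue_final_message=True)  # ends on the prefix
```

With the flag off, the output reproduces the current pipeline input; with it on, the prompt ends on the last character of the provided assistant content.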

## Other Use Cases

### Structured Output Prefill

```python
messages = [
    {
        "role": "user",
        "content": "Is the customer satisfied? Respond in JSON with fields \"reasoning\" and \"answer\"."
    },
    {
        "role": "assistant",
        # Prefills the JSON shape and anchors a concise reasoning style.
        "content": '{\n  "reasoning": "Based on the user\'s tone and wording, '
    }
]
```

This allows the server to fix the opening of the output structure while still letting the model complete the response naturally.
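On the client side, the prefix and the generated continuation would need to be joined before parsing. The sketch below assumes the server returns only the newly generated text (not the prefill itself); the `completion` string is a made-up example of what a model might produce, and `parse_prefilled_json` is a hypothetical helper.

```python
import json

prefill = '{\n  "reasoning": "Based on the user\'s tone and wording, '

# Hypothetical model output: only the text generated after the prefix.
completion = 'the customer seems frustrated by the delay.",\n  "answer": "no"\n}'


def parse_prefilled_json(prefill, completion):
    """Join the prefix and the continuation, then parse the whole document.

    json.loads raises json.JSONDecodeError if the model did not close the
    structure correctly, so callers can retry or fall back.
    """
    return json.loads(prefill + completion)


result = parse_prefilled_json(prefill, completion)
```

Parsing the joined string also serves as a check: prefill only guarantees how the output starts, so a malformed ending still has to be caught by the caller.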

### Tool-Guided Generation

```python
messages = [
    {
        "role": "user",
        "content": "Tell the user that their package is delayed by 2 days."
    },
    {
        "role": "assistant",
        # Tool name and immutable arguments are injected by the system.
        # The model only needs to generate the remaining message content.
        "content": '{"tool": "send_notification", "args": {"user_id": "uid_541", "message": "'
    }
]
```

This pattern enables server-controlled tool configuration, while allowing the model to complete the remaining schema-constrained fields.
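A server using this pattern would likely still want to validate the finished call, since the model controls everything after the prefix. The following is a minimal sketch under the same assumption as above (the response contains only the continuation); `finish_tool_call` and the sample `completion` are hypothetical.

```python
import json

prefill = '{"tool": "send_notification", "args": {"user_id": "uid_541", "message": "'

# Hypothetical model output completing the "message" field and closing the JSON.
completion = 'Your package is delayed by 2 days. We apologize for the inconvenience."}}'


def finish_tool_call(prefill, completion, expected_tool, expected_user_id):
    """Join prefix and continuation, then check the server-injected fields.

    Python's json module keeps the last value for a duplicated key, so a
    continuation that re-declares "user_id" would show up here as a mismatch.
    """
    call = json.loads(prefill + completion)
    args = call.get("args", {})
    if call.get("tool") != expected_tool:
        raise ValueError("tool name was altered")
    if args.get("user_id") != expected_user_id:
        raise ValueError("user_id was altered")
    if not isinstance(args.get("message"), str):
        raise ValueError("message is missing or not a string")
    return call


call = finish_tool_call(prefill, completion, "send_notification", "uid_541")
```

Re-checking the injected values after parsing keeps them effectively immutable even if the continuation tries to override them.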

Feature Request: Allow Assistant Messages to Act as Generation Prefix (Prefill) #3877
