
Conversation

@isaac-mcfadyen
Contributor

Following up on PR #13174.

Overview

After some discussion, the decision was made to add an opt-out flag so the assistant prefill behavior can be disabled, restoring the previous functionality.

  • This PR adds the --no-prefill-assistant flag, specific to llama-server, along with a corresponding environment variable, LLAMA_ARG_NO_PREFILL_ASSISTANT (see the launch sketch after this list).
  • When the flag is not specified, the default behavior is to prefill the response from an assistant message at the end of the messages array, so that use cases such as #11536 (Feature Request: Prefix assistant answer) continue to work.
  • When the flag is specified, the trailing assistant message is treated as a complete message, as was the behavior before #13174 (Prefilling assistant message in OpenAI-compatible API).
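
For reference, here is a minimal launch sketch showing where the flag is applied (the model path and port are placeholders; only the flag and environment variable names come from this PR). Since it is a server-side setting, the client requests in the testing section stay identical whether or not it is enabled:

# Default behavior: a trailing assistant message is prefilled/continued
./llama-server -m Llama-3.2-1B-Instruct-Q4_K_M.gguf --port 8080

# Opt out: treat a trailing assistant message as a completed turn
./llama-server -m Llama-3.2-1B-Instruct-Q4_K_M.gguf --port 8080 --no-prefill-assistant

# Equivalent, via the environment variable
LLAMA_ARG_NO_PREFILL_ASSISTANT=1 ./llama-server -m Llama-3.2-1B-Instruct-Q4_K_M.gguf --port 8080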

Testing

Used bartowski/Llama-3.2-1B-Instruct-GGUF for testing as I had it on hand. Tested with both /apply-template and /v1/chat/completions, since both use the shared prompt-templating functions.

/apply-template:

# Flag omitted
curl http://127.0.0.1:8080/apply-template --json '{"messages": [{"role": "assistant", "content": "My name is"}]}' -s
# {"prompt":"<|start_header_id|>assistant<|end_header_id|>\n\nMy name is"}

# --no-prefill-assistant set on the server (also tested with LLAMA_ARG_NO_PREFILL_ASSISTANT=1); request unchanged
curl http://127.0.0.1:8080/apply-template --json '{"messages": [{"role": "assistant", "content": "My name is"}]}' -s
# {"prompt":"<|start_header_id|>assistant<|end_header_id|>\n\nMy name is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"}

/v1/chat/completions:

# Flag omitted
curl http://127.0.0.1:8080/v1/chat/completions --json '{"max_tokens": 12, "messages": [{"role": "assistant", "content": "My name is"}]}' -s | jq ".choices[0].message.content"
# " Rohan, and I'm an assistant here. What seems"

# --no-prefill-assistant set on the server (also tested with LLAMA_ARG_NO_PREFILL_ASSISTANT=1); request unchanged
curl http://127.0.0.1:8080/v1/chat/completions --json '{"max_tokens": 12, "messages": [{"role": "assistant", "content": "My name is"}]}' -s | jq ".choices[0].message.content"
# "It seems like you're about to start a conversation, but"

This is my first non-docs PR to llama.cpp, so let me know if I need to make any changes 😅

@ngxson merged commit 6a2bc8b into ggml-org:master May 17, 2025
46 checks passed
@isaac-mcfadyen deleted the no-prefill-assistant branch May 18, 2025 01:27
@strawberrymelonpanda
Contributor

strawberrymelonpanda commented May 18, 2025

Glad to see it, though personally I would have preferred the other way around (--prefill-assistant).

Not a big deal, but the general policy I'd like to see is that the standard behavior should be the default. My understanding is that #13174 is not standard OpenAI API behavior, so now this flag is "needed" to restore it.

Just my 2c.

@isaac-mcfadyen
Contributor Author

"though personally I would have preferred the other way around"

This was also my personal opinion, but in #13174 the counterargument was that it had already been the default for a week or two, so it would be more breaking to revert it again.

infil00p pushed a commit to baseweight/llama.cpp that referenced this pull request May 22, 2025
* added no-prefill-assistant flag

* reworded documentation comment

* updated server README.md