tools/main: llama-cli: prevent spurious assistant token (#13402) #16202
Merged: ggerganov merged 3 commits into ggml-org:master from vinkal-chudgar:fix/spurious-token-13402 on Sep 29, 2025
Conversation
During prompt ingestion, prompt tokens are accepted into the sampler history (for repetition penalties). The conversation-mode path then appended `common_sampler_last(smpl)` to `assistant_ss` before any new token was sampled. At that point, "last" was a prompt-side token (e.g., an input prefix), so the assistant chat message began with an extra piece.

Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token. This affects only chat message assembly (`assistant_ss` / `chat_msgs` / `common_chat_format_single`); terminal stdout is unchanged. Sampling order/logits are unchanged.

Fixes ggml-org#13402.

Signed-off-by: Vinkal Chudgar <[email protected]>
CISC reviewed on Sep 23, 2025
Co-authored-by: Sigbjørn Skjæret <[email protected]>
CISC approved these changes on Sep 27, 2025
This appears to fix the issue, it's unfortunate that main.cpp is so convoluted that it's hard to follow the logic, but it is what it is I guess. :)
Signed-off-by: Vinkal Chudgar <[email protected]>
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request on Oct 15, 2025
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request on Oct 23, 2025
Attachments: baseline_perplexity_13402.txt, after_fix_perplexity_13402.txt, baseline_bench_13402_latest.txt, after_fix_bench_13402.txt, ci.zip
Fixes: #13402
Summary
This PR fixes a bug where the last token of a user's formatted prompt could be written into the assistant response buffer (`assistant_ss`) before any model token was sampled. Because `assistant_ss` is incorporated into the chat history that is fed back to the model, inserting a prompt piece there distorts the context for subsequent turns and can degrade response quality.

The fix updates `assistant_ss` only when the model is producing output tokens: immediately after a token is sampled and accepted, and only if the token is not the end-of-generation (EOG) token.

Root Cause
The bug is a state-handling error in conversation mode that occurs during the transition from consuming a prompt to generating a response. It does not produce an immediate visual artifact but instead contaminates the internal state (`chat_msgs`) used to maintain conversation history (sketched below):

- During prompt ingestion, each prompt token is passed to `common_sampler_accept(smpl, token, /*accept_grammar=*/false)` so that repetition penalties apply during the generation phase.
- At the transition to generation, the sampler's most recent token (`common_sampler_last(smpl)`) is still the final token from the user's formatted prompt.
- The conversation-mode path appended that prompt-side token to `assistant_ss` before any new token was sampled.
- The contents of `assistant_ss`, starting with the leaked prompt token, are committed to the permanent chat history (`chat_msgs`). This contaminated history is then used in all subsequent turns.

Solution
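To make the ordering concrete, here is a minimal, self-contained C++ mock of the sequence described above. `MockSampler`, its token values, and the `piece(...)` rendering are stand-ins invented for illustration; only the ordering (accept prompt tokens into history, then read the last token before anything is sampled) mirrors the `common_sampler_accept` / `common_sampler_last` flow named in this PR.

```cpp
// Minimal mock of the pre-fix ordering (illustration only; not llama.cpp code).
#include <iostream>
#include <string>
#include <vector>

// Stand-in for the sampler history kept by common_sampler_accept/_last.
struct MockSampler {
    std::vector<int> history;
    void accept(int tok) { history.push_back(tok); }   // ~ common_sampler_accept
    int  last() const    { return history.back(); }    // ~ common_sampler_last
};

int main() {
    MockSampler smpl;
    std::string assistant_ss;                    // ~ assistant_ss (assistant message buffer)
    const std::vector<int> prompt = {1, 2, 3};   // formatted prompt, e.g. ending with an input prefix

    // Prompt ingestion: prompt tokens enter the sampler history (for repetition penalties).
    for (int tok : prompt) {
        smpl.accept(tok);
    }

    // Pre-fix conversation-mode path: "last" is read BEFORE any new token is sampled,
    // so the assistant message starts with a prompt-side piece.
    assistant_ss += "piece(" + std::to_string(smpl.last()) + ")";

    std::cout << "assistant_ss begins with: " << assistant_ss << '\n';  // leaked prompt piece
    return 0;
}
```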
`assistant_ss` is updated only for sampled tokens and never for prompt-accepted tokens; end-of-generation (EOG) is skipped. Concretely (see the sketch after this list):

- Update `assistant_ss` only inside the generation branch (i.e., when `(int) embd_inp.size() <= n_consumed && !is_interacting`), immediately after the new token is sampled and accepted.
- Append the token piece only for newly sampled, non-EOG tokens.
- Remove the unconditional path that wrote to `assistant_ss` using `common_sampler_last(smpl)` outside the sampling path (i.e., while ingesting prompt tokens).
- Leave terminal streaming unchanged; console output continues to be written from generated `embd` tokens.
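For contrast, a matching mock of the post-fix ordering: the append happens only after a token has actually been sampled, and EOG is skipped. Again, `MockSampler`, `is_eog`, and the token values are illustrative stand-ins, not the real main.cpp code, which uses the sampler and vocab helpers from llama.cpp's `common`.

```cpp
// Minimal mock of the post-fix ordering (illustration only; not llama.cpp code).
#include <iostream>
#include <string>
#include <vector>

struct MockSampler {
    std::vector<int> history;
    void accept(int tok) { history.push_back(tok); }             // prompt ingestion, unchanged
    int  sample()        { history.push_back(42); return 42; }   // pretend the model emits token 42
};

// Stand-in for an end-of-generation check.
static bool is_eog(int tok) { return tok == 0; }

int main() {
    MockSampler smpl;
    std::string assistant_ss;
    const std::vector<int> prompt = {1, 2, 3};

    // Prompt ingestion: the sampler history still receives the prompt tokens.
    for (int tok : prompt) {
        smpl.accept(tok);
    }

    // Generation branch only: append a piece right after a token is sampled,
    // and only if it is not end-of-generation. Prompt-side tokens never reach assistant_ss.
    const int id = smpl.sample();
    if (!is_eog(id)) {
        assistant_ss += "piece(" + std::to_string(id) + ")";
    }

    std::cout << "assistant_ss begins with: " << assistant_ss << '\n';  // model output only
    return 0;
}
```

The mock only illustrates the ordering and the guard condition; terminal streaming and the sampling logic themselves are untouched by the PR.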
Impact
Environment
Build SHAs used
Baseline (upstream master at measurement time): 138c87ce8
Full: 138c87c
After-fix (this branch): vinkal-chudgar/llama.cpp@bcf14fd4c (branch fix/spurious-token-13402)
Full: bcf14fd
Perplexity (CPU-Only)
Command (both runs - baseline and after fix):
./build-<base|fix>/bin/llama-perplexity -m ~/models/tinyllama/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -t 22 -ngl 0 -f ~/data/wikitext-2-raw/wiki.test.100k.raw

llama-bench (CPU-Only)
./build-<base|fix>/bin/llama-bench -m ~/models/tinyllama/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -t 22 -ngl 0 -r 3 --no-warmup --progress -fa 1 -o md

Verification
Local CI (CPU-only)
Executed the project CI script locally on WSL2 (CPU-only):
Outcome: Exit code: 0; all CTest suites present in this run passed (CPU-only)
CI Log: A sanitized CI log is attached.