
Conversation

@vinkal-chudgar (Contributor) commented Sep 23, 2025

baseline_perplexity_13402.txt
after_fix_perplexity_13402.txt
baseline_bench_13402_latest.txt
after_fix_bench_13402.txt
ci.zip
Fixes: #13402

Summary

This PR fixes a bug where the last token of a user’s formatted prompt could be written into the assistant response buffer (assistant_ss) before any model token was sampled. Because assistant_ss is incorporated into the chat history that is fed back to the model, inserting a prompt piece there distorts the context for subsequent turns and can degrade response quality.

The fix updates assistant_ss only while the model is generating output: immediately after a token is sampled and accepted, and only if that token is not an end-of-generation (EOG) token.

Root Cause

The bug is a state-handling error in conversation mode that occurs during the transition from consuming a prompt to generating a response. It does not produce an immediate visual artifact but instead contaminates the internal state (chat_msgs) used to maintain conversation history.

  1. Prompt Consumption: While consuming tokens from the user's formatted prompt, the code adds each token to the sampler's history via common_sampler_accept(smpl, token, /*accept_grammar=*/false) so that repetition penalties apply during the generation phase.
  2. State Transition Flaw: After the final prompt token is processed, the main loop continues. In this single transitional iteration, no new token has been generated yet, so the "last token" known to the sampler (common_sampler_last(smpl)) is still the final token of the user's formatted prompt.
  3. Assistant Buffer Update: After the final prompt token has been accepted and before the first model token is sampled, the code that builds the assistant message reads the sampler's last accepted token via common_sampler_last(smpl). At that moment this still refers to the last formatted prompt token, so the code appends that token's piece to assistant_ss, inserting a prompt token into the assistant buffer (a toy simulation of this leak follows the list).
  4. Commit to History: When the model's turn is complete (e.g., on an EOG token), the entire contents of assistant_ss, starting with the leaked prompt piece, are committed to the permanent chat history (chat_msgs). This contaminated history is then used in all subsequent turns.
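
The following is a minimal, self-contained toy simulation of the leak (a sketch, not llama.cpp code: the token pieces, the mock sampler state, and all names other than assistant_ss are stand-ins chosen for illustration):

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    // Formatted user prompt; the trailing piece plays the role of an input prefix
    // such as an assistant role header inserted by the chat template.
    const std::vector<std::string> prompt_pieces = {"<|user|>", "Hello", "<|assistant|>"};
    const std::vector<std::string> model_pieces  = {"Hi", " there", "<EOG>"};

    std::string sampler_last;        // stands in for the sampler's "last accepted token"
    std::ostringstream assistant_ss; // assistant response buffer

    // Prompt ingestion: every prompt piece is accepted into the sampler history.
    for (const auto & p : prompt_pieces) {
        sampler_last = p;
    }

    // Pre-fix transitional iteration: nothing has been sampled yet, but the old code
    // appended the sampler's last token anyway, which is still a prompt-side piece here.
    assistant_ss << sampler_last;

    // Generation: sampled pieces are appended; EOG ends the turn and is skipped.
    for (const auto & p : model_pieces) {
        sampler_last = p;
        if (p != "<EOG>") {
            assistant_ss << p;
        }
    }

    // The assistant message that would be committed to chat history starts with "<|assistant|>".
    std::cout << "assistant message: \"" << assistant_ss.str() << "\"\n";
    return 0;
}

Running this prints assistant message: "<|assistant|>Hi there", i.e., in the simulation the assistant history begins with a prompt-side piece, which is exactly what the fix prevents.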

Solution

assistant_ss is now updated only for sampled tokens, never for prompt-accepted tokens; the end-of-generation (EOG) token is skipped.

Update assistant_ss only inside the generation branch
(i.e., when (int) embd_inp.size() <= n_consumed && !is_interacting), immediately after:

 const llama_token id = common_sampler_sample(smpl, ctx, -1);
 common_sampler_accept(smpl, id, /* accept_grammar= */ true);
 embd.push_back(id);

Append the token piece only for newly sampled, non-EOG tokens:

if (params.conversation_mode && !waiting_for_first_input &&
    !llama_vocab_is_eog(vocab, id)) {
    assistant_ss << common_token_to_piece(ctx, id, false);
}

Remove the unconditional path that wrote to assistant_ss using common_sampler_last(smpl) outside the sampling path (i.e., while ingesting prompt tokens).
Leave terminal streaming unchanged; console output continues to be written from generated embd tokens.
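
For context on where the buffer ultimately goes (step 4 of the root cause), here is a condensed, illustrative view of the end-of-turn commit. This is a sketch of the surrounding main.cpp logic, not an exact excerpt, and the chat_add_and_format helper name is an assumption about how the assistant message is appended to chat_msgs:

// When the model's turn ends (EOG in conversation mode), the assistant buffer,
// which now holds only model-generated pieces, becomes the assistant chat message.
if (params.conversation_mode && llama_vocab_is_eog(vocab, common_sampler_last(smpl))) {
    chat_add_and_format("assistant", assistant_ss.str()); // appends to chat_msgs (via common_chat_format_single)
    assistant_ss.str("");                                 // reset the buffer for the next turn
}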

Impact

  • Behavior: assistant history now contains only model-generated pieces; no prompt pieces can enter the assistant buffer.
  • Scope: conversation mode only; no changes to sampling distribution, penalties, grammar handling, or display logic.
  • Terminal output: unchanged (console stream is produced from generated embd tokens).
  • Risk: low; the change is local and preserves existing output to the terminal.
  • Performance: no meaningful change.

Environment

  • OS: Ubuntu 24.04 on WSL2 (CPU-only)
  • Compiler: GCC 13.3
  • Build flags: -DGGML_CUDA=OFF, -DGGML_NATIVE=ON
  • Threads: -t 22 (CPU)
  • Model: TinyLlama-1.1B-chat v1.0, Q4_K_M (GGUF v3)
  • Dataset for PPL: WikiText-2 test, 100 KB slice wiki.test.100k.raw

Build SHAs used

Baseline (upstream master at measurement time): 138c87ce8
Full: 138c87c

After-fix (this branch): vinkal-chudgar/llama.cpp@bcf14fd4c (branch fix/spurious-token-13402)
Full: bcf14fd

Perplexity (CPU-Only)

Command (both runs - baseline and after fix):

./build-<base|fix>/bin/llama-perplexity -m ~/models/tinyllama/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -t 22 -ngl 0 -f ~/data/wikitext-2-raw/wiki.test.100k.raw

  • Baseline (build 6547, SHA 138c87c): PPL 17.6138 ± 0.5025, throughput 55.23 tok/s, 27,136 tokens, 53 chunks.
  • After-fix (build 6548, SHA bcf14fd): PPL 17.6138 ± 0.5025, throughput 54.43 tok/s, 27,136 tokens, 53 chunks.
  • Conclusion: PPL identical; throughput within expected CPU variance.

llama-bench (CPU-Only)

./build-<base|fix>/bin/llama-bench -m ~/models/tinyllama/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -t 22 -ngl 0 -r 3 --no-warmup --progress -fa 1 -o md

  • Baseline: pp512: 59.20 ± 0.97 t/s; tg128: 0.17 ± 0.00 t/s.
  • After-fix: pp512: 59.70 ± 5.27 t/s; tg128: 0.17 ± 0.00 t/s.

Verification

Local CI (CPU-only)
Executed the project CI script locally on WSL2 (CPU-only):

# from repo root
mkdir -p ./tmp/results ./tmp/mnt
bash ./ci/run.sh ./tmp/results ./tmp/mnt |& tee ./tmp/results/ci.log

Outcome: exit code 0; all CTest suites present in this run passed (CPU-only).

CI Log: A sanitized CI log is attached.

During prompt ingestion, prompt tokens are accepted into the sampler history (for repetition penalties). The conversation-mode path then appended `common_sampler_last(smpl)` to `assistant_ss` before any new token was sampled. At that point, "last" was a prompt-side token (e.g., an input prefix), so the assistant chat message began with an extra piece.

Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token. This affects only chat message assembly (`assistant_ss` / `chat_msgs` / `common_chat_format_single`); terminal stdout is unchanged. Sampling order/logits are unchanged.

Fixes ggml-org#13402.

Signed-off-by: Vinkal Chudgar <[email protected]>
@vinkal-chudgar vinkal-chudgar marked this pull request as ready for review September 23, 2025 20:19
Co-authored-by: Sigbjørn Skjæret <[email protected]>
@CISC (Collaborator) left a comment:

This appears to fix the issue, it's unfortunate that main.cpp is so convoluted that it's hard to follow the logic, but it is what it is I guess. :)

@ggerganov ggerganov merged commit 2f61c0f into ggml-org:master Sep 29, 2025
56 of 58 checks passed
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
* tools/main: llama-cli: prevent spurious assistant token (ggml-org#13402)
* Update tools/main/main.cpp
* tools/main: remove outdated comment
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 23, 2025
* tools/main: llama-cli: prevent spurious assistant token (ggml-org#13402)
* Update tools/main/main.cpp
* tools/main: remove outdated comment
Successfully merging this pull request may close these issues.

Eval bug: llama-cli, spurious token added to assistant response