
Conversation

@vinkal-chudgar (Contributor) commented Sep 23, 2025

baseline_perplexity_13402.txt
after_fix_perplexity_13402.txt
baseline_bench_13402_latest.txt
after_fix_bench_13402.txt
ci.zip
Fixes: #13402

Summary

This PR fixes a bug where the last token of a user’s formatted prompt could be written into the assistant response buffer (assistant_ss) before any model token was sampled. Because assistant_ss is incorporated into the chat history that is fed back to the model, inserting a prompt piece there distorts the context for subsequent turns and can degrade response quality.

The fix updates assistant_ss only while the model is generating output: immediately after a token is sampled and accepted, and only if that token is not an end-of-generation (EOG) token.

Root Cause

The bug is a state-handling error in conversation mode that occurs during the transition from consuming a prompt to generating a response. It does not produce an immediate visual artifact but instead contaminates the internal state (chat_msgs) used to maintain conversation history.

  1. Prompt Consumption: While consuming tokens from the user's formatted prompt, the code adds each token to the sampler's history via common_sampler_accept(smpl, token, /*accept_grammar=*/false) so that repetition penalties apply during the generation phase.
  2. State Transition Flaw: After the final prompt token is processed, the main loop continues. In this single transitional iteration, no new token has been generated yet, so the "last token" known to the sampler (common_sampler_last(smpl)) is still the final token of the user's formatted prompt.
  3. Assistant Buffer Update: After the final prompt token has been accepted and before the first model token is sampled, the code that builds the assistant message reads the sampler's last accepted token via common_sampler_last(smpl). At that moment this still refers to the last formatted prompt token, so the code appends that token's piece to assistant_ss, inserting a prompt token into the assistant buffer (a toy simulation of this leak follows the list).
  4. Commit to History: When the model's turn is complete (e.g., on an EOG token), the entire contents of assistant_ss, starting with the leaked prompt piece, are committed to the permanent chat history (chat_msgs). This contaminated history is then used in all subsequent turns.
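
The following is a minimal, self-contained toy simulation of the leak (a sketch, not llama.cpp code: the token pieces, the mock sampler state, and all names other than assistant_ss are stand-ins chosen for illustration):

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    // Formatted user prompt; the trailing piece plays the role of an input prefix
    // such as an assistant role header inserted by the chat template.
    const std::vector<std::string> prompt_pieces = {"<|user|>", "Hello", "<|assistant|>"};
    const std::vector<std::string> model_pieces  = {"Hi", " there", "<EOG>"};

    std::string sampler_last;        // stands in for the sampler's "last accepted token"
    std::ostringstream assistant_ss; // assistant response buffer

    // Prompt ingestion: every prompt piece is accepted into the sampler history.
    for (const auto & p : prompt_pieces) {
        sampler_last = p;
    }

    // Pre-fix transitional iteration: nothing has been sampled yet, but the old code
    // appended the sampler's last token anyway, which is still a prompt-side piece here.
    assistant_ss << sampler_last;

    // Generation: sampled pieces are appended; EOG ends the turn and is skipped.
    for (const auto & p : model_pieces) {
        sampler_last = p;
        if (p != "<EOG>") {
            assistant_ss << p;
        }
    }

    // The assistant message that would be committed to chat history starts with "<|assistant|>".
    std::cout << "assistant message: \"" << assistant_ss.str() << "\"\n";
    return 0;
}

Running this prints assistant message: "<|assistant|>Hi there", i.e., in the simulation the assistant history begins with a prompt-side piece, which is exactly what the fix prevents.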

Solution

assistant_ss is now updated only for sampled tokens, never for prompt-accepted tokens; the end-of-generation (EOG) token is skipped.

Update assistant_ss only inside the generation branch
(i.e., when (int) embd_inp.size() <= n_consumed && !is_interacting), immediately after:

 const llama_token id = common_sampler_sample(smpl, ctx, -1);
 common_sampler_accept(smpl, id, /* accept_grammar= */ true);
 embd.push_back(id);

Append the token piece only for newly sampled, non-EOG tokens:

if (params.conversation_mode && !waiting_for_first_input &&
    !llama_vocab_is_eog(vocab, id)) {
    assistant_ss << common_token_to_piece(ctx, id, false);
}

Remove the unconditional path that wrote to assistant_ss using common_sampler_last(smpl) outside the sampling path (i.e., while ingesting prompt tokens).
Leave terminal streaming unchanged; console output continues to be written from generated embd tokens.
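
For context on where the buffer ultimately goes (step 4 of the root cause), here is a condensed, illustrative view of the end-of-turn commit. This is a sketch of the surrounding main.cpp logic, not an exact excerpt, and the chat_add_and_format helper name is an assumption about how the assistant message is appended to chat_msgs:

// When the model's turn ends (EOG in conversation mode), the assistant buffer,
// which now holds only model-generated pieces, becomes the assistant chat message.
if (params.conversation_mode && llama_vocab_is_eog(vocab, common_sampler_last(smpl))) {
    chat_add_and_format("assistant", assistant_ss.str()); // appends to chat_msgs (via common_chat_format_single)
    assistant_ss.str("");                                 // reset the buffer for the next turn
}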

Impact

  • Behavior: assistant history now contains only model-generated pieces; no prompt pieces can enter the assistant buffer.
  • Scope: conversation mode only; no changes to sampling distribution, penalties, grammar handling, or display logic.
  • Terminal output: unchanged (console stream is produced from generated embd tokens).
  • Risk: low; the change is local and preserves existing output to the terminal.
  • Performance: no meaningful change.

Environment

  • OS: Ubuntu 24.04 on WSL2 (CPU-only)
  • Compiler: GCC 13.3
  • Build flags: -DGGML_CUDA=OFF, -DGGML_NATIVE=ON
  • Threads: -t 22 (CPU)
  • Model: TinyLlama-1.1B-chat v1.0, Q4_K_M (GGUF v3)
  • Dataset for PPL: WikiText-2 test, 100 KB slice wiki.test.100k.raw

Build SHAs used

Baseline (upstream master at measurement time): 138c87ce8
Full: 138c87c

After-fix (this branch): vinkal-chudgar/llama.cpp@bcf14fd4c (branch fix/spurious-token-13402)
Full: bcf14fd

Perplexity (CPU-Only)

Command (both runs - baseline and after fix):

./build-<base|fix>/bin/llama-perplexity -m ~/models/tinyllama/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -t 22 -ngl 0 -f ~/data/wikitext-2-raw/wiki.test.100k.raw

  • Baseline (build 6547, SHA 138c87c): PPL 17.6138 ± 0.5025, throughput 55.23 tok/s, 27,136 tokens, 53 chunks.
  • After-fix (build 6548, SHA bcf14fd): PPL 17.6138 ± 0.5025, throughput 54.43 tok/s, 27,136 tokens, 53 chunks.
  • Conclusion: PPL identical; throughput within expected CPU variance.

llama-bench (CPU-Only)

./build-<base|fix>/bin/llama-bench -m ~/models/tinyllama/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -t 22 -ngl 0 -r 3 --no-warmup --progress -fa 1 -o md

  • Baseline: pp512: 59.20 ± 0.97 t/s; tg128: 0.17 ± 0.00 t/s.
  • After-fix: pp512: 59.70 ± 5.27 t/s; tg128: 0.17 ± 0.00 t/s.

Verification

Local CI (CPU-only)
Executed the project CI script locally on WSL2 (CPU-only):

# from repo root
mkdir -p ./tmp/results ./tmp/mnt
bash ./ci/run.sh ./tmp/results ./tmp/mnt |& tee ./tmp/results/ci.log

Outcome: exit code 0; all CTest suites present in this run passed (CPU-only).

CI Log: A sanitized CI log is attached.

During prompt ingestion, prompt tokens are accepted into the sampler history (for repetition penalties). The conversation-mode path then appended `common_sampler_last(smpl)` to `assistant_ss` before any new token was sampled. At that point, "last" was a prompt-side token (e.g., an input prefix), so the assistant chat message began with an extra piece.

Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token. This affects only chat message assembly (`assistant_ss` / `chat_msgs` / `common_chat_format_single`); terminal stdout is unchanged. Sampling order/logits are unchanged.

Fixes ggml-org#13402.

Signed-off-by: Vinkal Chudgar <[email protected]>
@vinkal-chudgar vinkal-chudgar marked this pull request as ready for review September 23, 2025 20:19
Co-authored-by: Sigbjørn Skjæret <[email protected]>
@CISC (Collaborator) left a comment:

This appears to fix the issue, it's unfortunate that main.cpp is so convoluted that it's hard to follow the logic, but it is what it is I guess. :)

@ggerganov ggerganov merged commit 2f61c0f into ggml-org:master Sep 29, 2025
56 of 58 checks passed
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
* tools/main: llama-cli: prevent spurious assistant token (ggml-org#13402)
* Update tools/main/main.cpp
* tools/main: remove outdated comment
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 23, 2025
* tools/main: llama-cli: prevent spurious assistant token (ggml-org#13402)
* Update tools/main/main.cpp
* tools/main: remove outdated comment
Successfully merging this pull request may close these issues.

Eval bug: llama-cli, spurious token added to assistant response