
Conversation

@ggerganov

During multi-batch prompt processing we were incorrectly synchronizing the llama context on each batch. Instead we just need to synchronize at the end when the prompt is fully computed.
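For reference, a minimal sketch of the pattern described above (not the actual patch), assuming `ctx` is a `llama_context` and `tokens` is a hypothetical vector holding the full tokenized prompt:

```cpp
// Sketch only: submit all prompt batches without synchronizing in between,
// then wait once after the full prompt has been queued.
for (int i = 0; i < n_prompt; i += n_batch) {
    const int n_tokens = std::min(n_prompt - i, n_batch);

    // llama_decode() enqueues the work; no per-batch llama_synchronize() here
    if (llama_decode(ctx, llama_batch_get_one(tokens.data() + i, n_tokens)) != 0) {
        fprintf(stderr, "failed to decode prompt batch at token %d\n", i);
        break;
    }
}

// synchronize a single time, once the prompt is fully computed
llama_synchronize(ctx);
```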

@ggerganov commented Sep 6, 2025

@slaren Somewhat related, the logic in llama-bench for processing the prompt:

```cpp
while (n_processed < n_prompt) {
    int n_tokens = std::min(n_prompt - n_processed, n_batch);
    tokens[0] = n_processed == 0 && llama_vocab_get_add_bos(vocab) ? llama_vocab_bos(vocab) : std::rand() % n_vocab;
    for (int i = 1; i < n_tokens; i++) {
        tokens[i] = std::rand() % n_vocab;
    }
    int res = llama_decode(ctx, llama_batch_get_one(tokens.data(), n_tokens));
    if (res != 0) {
        fprintf(stderr, "%s: failed to decode prompt batch, res = %d\n", __func__, res);
        return false;
    }
    n_processed += n_tokens;
}
```

Using llama_batch_get_one() here queues a copy of the logits of the last token in every batch. So for example -p 8192 -b 2048 -ub 512 will cause 3 extra logit copies at 2048, 4096 and 6144 processed tokens, compared to only reading the logits for the 8192th token. Setting -b 8192 or another value at least as large as -p would avoid that, but I am not sure the intent here was to queue these extra copies during the prompt processing test.
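One way to avoid the intermediate copies would be to build the batch explicitly and request logits only for the final prompt token. A rough, untested sketch (reusing the variables from the loop above; the BOS handling is omitted for brevity):

```cpp
// Sketch only: fill an explicit llama_batch so that logits are requested
// solely for the very last token of the whole prompt.
llama_batch batch = llama_batch_init(n_batch, 0, 1);

while (n_processed < n_prompt) {
    const int n_tokens = std::min(n_prompt - n_processed, n_batch);

    batch.n_tokens = n_tokens;
    for (int i = 0; i < n_tokens; i++) {
        batch.token[i]     = std::rand() % n_vocab;
        batch.pos[i]       = n_processed + i;
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        // copy logits only for the last prompt token (e.g. the 8192th)
        batch.logits[i]    = n_processed + i == n_prompt - 1;
    }

    if (llama_decode(ctx, batch) != 0) {
        fprintf(stderr, "failed to decode prompt batch\n");
        break;
    }

    n_processed += n_tokens;
}

llama_batch_free(batch);
```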

ggerganov merged commit a885dcf into master on Sep 8, 2025
55 checks passed
ggerganov deleted the gg/batched-bench-fix-synchronize branch on September 8, 2025 at 07:27
@slaren commented Sep 8, 2025

I was aware of the issue with llama-bench requesting the logits of intermediate batches; I just didn't think it affects performance enough to worry about it. But I can fix it if that's not the case.

@ggerganov

I haven't timed it, but my guess is that it is indeed quite negligible.

I'll submit a patch for this as I am working on the async Metal backend now and this is a bit related.

njsyw1997 pushed a commit to aizip/llama.cpp that referenced this pull request Sep 10, 2025