
Commit 134e694

llama : skip output reordering for single token batches (ggml-org#17466)
This commit adds a check to skip the output reordering logic when n_outputs == 1. With a single output token the data is trivially sorted, and the reordering code currently does unnecessary work, resetting and rebuilding output_ids to the same values.

The motivation for this change is improved code clarity and less confusion when debugging. While the performance impact is probably negligible, this unnecessary work happens on every decode call in llama-server when processing batches with single-token outputs.
Parent: 0543f92 · Commit: 134e694

File tree: 1 file changed, +1 −1 lines

src/llama-context.cpp

Lines changed: 1 addition & 1 deletion

```diff
@@ -1248,7 +1248,7 @@ int llama_context::decode(const llama_batch & batch_inp) {

     // make the outputs have the same order they had in the user-provided batch
     // note: this is mostly relevant for recurrent models atm
-    if (!sorted_output) {
+    if (!sorted_output && n_outputs > 1) {
         GGML_ASSERT((size_t) n_outputs == out_ids.size());

         // TODO: is there something more efficient which also minimizes swaps?
```
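For context, here is a minimal, self-contained sketch of the kind of reordering this condition guards. This is not the actual llama-context.cpp code: the helper name reorder_outputs, the selection-style sort, and the simplification of one float of logit data per output row are illustrative assumptions. It shows why n_outputs == 1 is a trivial case: with a single row, out_ids[0] can only be 0, so the sort moves nothing and the inverse map is rebuilt to the value it already holds.

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Sketch only (hypothetical helper, not llama.cpp's implementation).
// out_ids maps each output row back to its position in the user-provided
// batch; output_ids is the inverse map (batch position -> output row).
static void reorder_outputs(std::vector<float>   & logits,      // simplified: one value per output row
                            std::vector<int64_t> & out_ids,
                            std::vector<int32_t> & output_ids) {
    const size_t n_outputs = out_ids.size();
    if (n_outputs <= 1) {
        return; // trivially sorted: nothing to move, the inverse map is already correct
    }
    // selection-style pass: order rows by their original batch position
    for (size_t i = 0; i < n_outputs; ++i) {
        for (size_t j = i + 1; j < n_outputs; ++j) {
            if (out_ids[j] < out_ids[i]) {
                std::swap(logits[i],  logits[j]);
                std::swap(out_ids[i], out_ids[j]);
            }
        }
    }
    // rebuild the inverse map after the swaps
    for (size_t i = 0; i < n_outputs; ++i) {
        output_ids[out_ids[i]] = (int32_t) i;
    }
}

int main() {
    std::vector<float>   logits  = {0.3f, 0.1f, 0.2f};
    std::vector<int64_t> out_ids = {2, 0, 1}; // rows arrived out of batch order
    std::vector<int32_t> output_ids(3, -1);

    reorder_outputs(logits, out_ids, output_ids);
    for (size_t i = 0; i < logits.size(); ++i) {
        std::printf("row %zu: logit %.1f (batch pos %lld)\n",
                    i, logits[i], (long long) out_ids[i]);
    }
    return 0;
}
```

With the early return in place, the single-token path taken on every llama-server decode of a one-output batch skips both loops entirely, which is the effect the commit's n_outputs > 1 condition achieves in the real code.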
