
Conversation

@ggerganov

During multi-batch prompt processing we were incorrectly synchronizing the llama context on each batch. Instead we just need to synchronize at the end when the prompt is fully computed.
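For reference, a minimal sketch of the pattern described above (not the actual patch), assuming `ctx` is a `llama_context` and `tokens` is a hypothetical vector holding the full tokenized prompt:

```cpp
// Sketch only: submit all prompt batches without synchronizing in between,
// then wait once after the full prompt has been queued.
for (int i = 0; i < n_prompt; i += n_batch) {
    const int n_tokens = std::min(n_prompt - i, n_batch);

    // llama_decode() enqueues the work; no per-batch llama_synchronize() here
    if (llama_decode(ctx, llama_batch_get_one(tokens.data() + i, n_tokens)) != 0) {
        fprintf(stderr, "failed to decode prompt batch at token %d\n", i);
        break;
    }
}

// synchronize a single time, once the prompt is fully computed
llama_synchronize(ctx);
```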

@ggerganov commented Sep 6, 2025

@slaren Somewhat related, the logic in llama-bench for processing the prompt:

```cpp
while (n_processed < n_prompt) {
    int n_tokens = std::min(n_prompt - n_processed, n_batch);
    tokens[0] = n_processed == 0 && llama_vocab_get_add_bos(vocab) ? llama_vocab_bos(vocab) : std::rand() % n_vocab;
    for (int i = 1; i < n_tokens; i++) {
        tokens[i] = std::rand() % n_vocab;
    }
    int res = llama_decode(ctx, llama_batch_get_one(tokens.data(), n_tokens));
    if (res != 0) {
        fprintf(stderr, "%s: failed to decode prompt batch, res = %d\n", __func__, res);
        return false;
    }
    n_processed += n_tokens;
}
```

Using llama_batch_get_one() here queues a copy of the logits of the last token in every batch. So for example -p 8192 -b 2048 -ub 512 will cause 3 extra logit copies at 2048, 4096 and 6144 processed tokens, compared to only reading the logits for the 8192th token. Setting -b 8192 or another value at least as large as -p would avoid that, but I am not sure the intent here was to queue these extra copies during the prompt processing test.
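One way to avoid the intermediate copies would be to build the batch explicitly and request logits only for the final prompt token. A rough, untested sketch (reusing the variables from the loop above; the BOS handling is omitted for brevity):

```cpp
// Sketch only: fill an explicit llama_batch so that logits are requested
// solely for the very last token of the whole prompt.
llama_batch batch = llama_batch_init(n_batch, 0, 1);

while (n_processed < n_prompt) {
    const int n_tokens = std::min(n_prompt - n_processed, n_batch);

    batch.n_tokens = n_tokens;
    for (int i = 0; i < n_tokens; i++) {
        batch.token[i]     = std::rand() % n_vocab;
        batch.pos[i]       = n_processed + i;
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        // copy logits only for the last prompt token (e.g. the 8192th)
        batch.logits[i]    = n_processed + i == n_prompt - 1;
    }

    if (llama_decode(ctx, batch) != 0) {
        fprintf(stderr, "failed to decode prompt batch\n");
        break;
    }

    n_processed += n_tokens;
}

llama_batch_free(batch);
```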

ggerganov merged commit a885dcf into master on Sep 8, 2025
55 checks passed
ggerganov deleted the gg/batched-bench-fix-synchronize branch on September 8, 2025 at 07:27
@slaren commented Sep 8, 2025

I was aware of the issue with llama-bench requesting the logits of intermediate batches; I just didn't think it affects performance enough to worry about it. But I can fix it if that's not the case.

@ggerganov

I haven't timed it, but my guess is that it is indeed quite negligible.

I'll submit a patch for this as I am working on the async Metal backend now and this is a bit related.

njsyw1997 pushed a commit to aizip/llama.cpp that referenced this pull request Sep 10, 2025