Description
Hi there, I am also working on a performance comparison across different LLM inference frameworks. For batched prompts, llama.cpp performs considerably worse than I expected. I found that your method uses the -b option of ./llama-bench to set the batch size. However, it is not clear to me whether this parameter means the same thing as the batch_size of the other frameworks.
The llama.cpp docs state: "n_batch (-b) doesn't affect how much of the context you can use, it is just a limit to how many tokens you can put in a single batch." If I understand correctly, setting -b to 32 means that at most 32 tokens are submitted in a single llama_decode() call, rather than 32*1024 tokens being packed into one input batch.
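To make my reading concrete, here is a minimal C++ sketch of the chunking behavior I am describing: a prompt of n_prompt tokens is split into chunks of at most n_batch tokens, and each chunk is one decode call. decode_chunk() is a hypothetical placeholder for building a llama_batch and calling llama_decode(); the exact llama.cpp API details vary by version, so please correct me if this model is wrong.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for one llama_decode() call on a chunk of tokens.
// In real llama.cpp code this would fill a llama_batch and pass it to
// llama_decode(ctx, batch).
static void decode_chunk(const int *tokens, int n_tokens) {
    std::printf("decode called with %d tokens\n", n_tokens);
    (void)tokens;
}

int main() {
    const int n_prompt = 1024;  // total prompt length in tokens
    const int n_batch  = 32;    // value passed via -b

    std::vector<int> prompt(n_prompt, 0);  // dummy token IDs

    // The prompt is processed in ceil(n_prompt / n_batch) chunks of at most
    // n_batch tokens each. -b caps the tokens per decode call; it does not
    // multiply how many tokens are fed to the model overall.
    for (int i = 0; i < n_prompt; i += n_batch) {
        const int n = std::min(n_batch, n_prompt - i);
        decode_chunk(prompt.data() + i, n);
    }
    return 0;
}
```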
Here are some other related links from llama.cpp: batch_prompt, batch-size, and ubatch-size.