Description
Name and Version
version: 4818 (dfd6b2c)
built with Android (10552028, +pgo, +bolt, +lto, -mlgo, based on r487747d) clang version 17.0.2 (https://android.googlesource.com/toolchain/llvm-project d9f89f4d16663d5012e5c09495f3b30ece3d2362) for x86_64-apple-darwin23.6.0
Operating systems
Other? (Please let us know in description)
GGML backends
CPU
Hardware
CPU: Google Tensor G4 (Pixel 9)
Models
No response
Problem description & steps to reproduce
The KV cache size calculation on line 364 of llama-android.cpp doesn't make sense: the expression algebraically reduces to simply assigning `n_len` to `n_kv_req`:

```cpp
auto n_kv_req = tokens_list.size() + (n_len - tokens_list.size());
```

Since `tokens_list` is tokenized from the input text (whether formatted or not), while `n_len` is the maximum number of tokens to generate, the required KV cache size should naturally be the sum of the two.
First Bad Commit
No response
Relevant log output
(Empty; no tokens are generated. When I send a long message formatted with a system prompt and a user prompt, `n_len`, with a default value of `64`, becomes the actual `n_kv_req`, which the prompt alone already exceeds.)