diff --git a/tools/llama-bench/README.md b/tools/llama-bench/README.md index ead4da45e2957..87d9c0a219bd8 100644 --- a/tools/llama-bench/README.md +++ b/tools/llama-bench/README.md @@ -82,6 +82,9 @@ Using the `-d ` option, each test can be run at a specified context depth, pr For a description of the other options, see the [main example](../main/README.md). +> [!NOTE] +> The measurements with `llama-bench` do not include the times for tokenization and for sampling. + ## Examples ### Text generation with different models @@ -131,7 +134,7 @@ $ ./llama-bench -n 0 -n 16 -p 64 -t 1,2,4,8,16,32 | llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | pp 64 | 33.52 ± 0.03 | | llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg 16 | 15.32 ± 0.05 | | llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | pp 64 | 59.00 ± 1.11 | -| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | tg 16 | 16.41 ± 0.79 || +| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | tg 16 | 16.41 ± 0.79 | ### Different numbers of layers offloaded to the GPU