Open
Labels: bug (Something isn't working), low severity (Used to report low severity bugs in llama.cpp, e.g. cosmetic issues, non critical UI glitches)
Description
What happened?
When running llama-bench with multiple parameter sets, the load time (and consequently the total time) reported in the llama_print_timings output appears to accumulate across the parameter sets instead of being reset for each run.
This doesn't affect the actual benchmark data output.
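
A plausible cause, judging only from the output below, is that llama-bench reuses a single llama_context for all parameter sets and prints the timings after each run without resetting them. The minimal sketch below assumes the timing API available in build 3651 (llama_print_timings / llama_reset_timings from llama.h); the loop, test_params struct and run_single_test helper are hypothetical placeholders, not the actual llama-bench code:

#include <vector>
#include "llama.h"

// Hypothetical per-run parameters and test driver, for illustration only.
struct test_params { int n_gen; };
void run_single_test(llama_context * ctx, const test_params & params);

void run_benchmarks(llama_context * ctx, const std::vector<test_params> & param_sets) {
    for (const auto & params : param_sets) {
        run_single_test(ctx, params);   // run one parameter set

        llama_print_timings(ctx);       // report timings for this run only

        // Without this reset, the load/total timers keep growing from run to
        // run, which matches the accumulating numbers in the log below.
        llama_reset_timings(ctx);
    }
}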
Name and Version
$ ./llama-cli --version
version: 3651 (8f1d81a0)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
What operating system are you seeing the problem on?
Mac
Relevant log output
$ ./llama-bench -m ~/Documents/Llama/models/llama-3-1-8b-instruct-f16.gguf -n 4,8,16,32,64,128,256 -ngl 99 -v 2>&1 | grep llama_print_timings
llama_print_timings:        load time =    1130.77 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    4528.24 ms /  3072 tokens (    1.47 ms per token,   678.41 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    4587.03 ms /  3073 tokens
llama_print_timings:        load time =    4669.62 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =     975.30 ms /    21 runs   (   46.44 ms per token,    21.53 tokens per second)
llama_print_timings:       total time =    5577.03 ms /    21 tokens
llama_print_timings:        load time =    5681.06 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =    1880.25 ms /    41 runs   (   45.86 ms per token,    21.81 tokens per second)
llama_print_timings:       total time =    7486.03 ms /    41 tokens
llama_print_timings:        load time =    7566.78 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =    3661.28 ms /    81 runs   (   45.20 ms per token,    22.12 tokens per second)
llama_print_timings:       total time =   11161.03 ms /    81 tokens
llama_print_timings:        load time =   11243.36 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =    7263.77 ms /   161 runs   (   45.12 ms per token,    22.16 tokens per second)
llama_print_timings:       total time =   18438.03 ms /   161 tokens
llama_print_timings:        load time =   18520.05 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =   14510.21 ms /   321 runs   (   45.20 ms per token,    22.12 tokens per second)
llama_print_timings:       total time =   32963.03 ms /   321 tokens
llama_print_timings:        load time =   33049.59 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =   29012.77 ms /   641 runs   (   45.26 ms per token,    22.09 tokens per second)
llama_print_timings:       total time =   61991.03 ms /   641 tokens
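
The accumulation can also be read off the numbers directly: each run's reported load time is roughly the previous run's total time plus the actual per-run setup cost, e.g. 4669.62 ms ≈ 4587.03 ms + ~83 ms for the second run, and 5681.06 ms ≈ 5577.03 ms + ~104 ms for the third.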