Closed
Labels
bug-unconfirmed, medium severity: Used to report medium severity bugs in llama.cpp (e.g. Malfunctioning Features but still useable)
Description
What happened?
Enabling flash attention on the Vulkan backend reduces performance far more than expected: generation speed falls from 63.33 to 32.69 tokens per second.
Performance varies between hardware, but a drop of roughly 50% looks like a bug.
Hardware is AMD RX 6800 XT
Name and Version
version: 3772 (23e0d70)
built with MSVC 19.29.30154.0 for x64
What operating system are you seeing the problem on?
Windows
Relevant log output
llama-b3772-bin-win-vulkan-x64> ./llama-cli.exe -m '..\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf' -p "to be or" -n 600 -c 4096 -ngl 99
Performance without flash attention:
llama_perf_sampler_print: sampling time = 48.42 ms / 604 runs ( 0.08 ms per token, 12474.70 tokens per second)
llama_perf_context_print: load time = 13033.53 ms
llama_perf_context_print: prompt eval time = 183.59 ms / 4 tokens ( 45.90 ms per token, 21.79 tokens per second)
llama_perf_context_print: eval time = 9458.98 ms / 599 runs ( 15.79 ms per token, 63.33 tokens per second)
llama_perf_context_print: total time = 9765.68 ms / 603 tokens
llama-b3772-bin-win-vulkan-x64> ./llama-cli.exe -m '..\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf' -p "to be or" -n 600 -c 4096 -ngl 99 --flash-attn
Performance with flash attention:
llama_perf_sampler_print: sampling time = 48.48 ms / 604 runs ( 0.08 ms per token, 12458.75 tokens per second)
llama_perf_context_print: load time = 2709.09 ms
llama_perf_context_print: prompt eval time = 194.77 ms / 4 tokens ( 48.69 ms per token, 20.54 tokens per second)
llama_perf_context_print: eval time = 18321.90 ms / 599 runs ( 30.59 ms per token, 32.69 tokens per second)
llama_perf_context_print: total time = 18617.86 ms / 603 tokens
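To put a number on the slowdown, here is a minimal Python sketch that pulls the `eval time` line out of each run's log and compares generation throughput. The regular expression, the `tokens_per_second` helper and the variable names are my own and not part of llama.cpp; the two log lines are copied from the output above.

```python
import re

# Matches the generation line only; the "prompt eval time" line is skipped
# because "eval time" must directly follow the "llama_perf_context_print:" prefix.
EVAL_LINE = re.compile(
    r"llama_perf_context_print:\s+eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*runs"
)


def tokens_per_second(log: str) -> float:
    """Return generated tokens per second from a llama_perf_context_print log."""
    match = EVAL_LINE.search(log)
    if match is None:
        raise ValueError("no 'eval time' line found in log")
    elapsed_ms = float(match.group(1))
    runs = int(match.group(2))
    return runs / (elapsed_ms / 1000.0)


# Log lines copied from the two runs reported above.
baseline_log = (
    "llama_perf_context_print: eval time = 9458.98 ms / 599 runs "
    "( 15.79 ms per token, 63.33 tokens per second)"
)
flash_attn_log = (
    "llama_perf_context_print: eval time = 18321.90 ms / 599 runs "
    "( 30.59 ms per token, 32.69 tokens per second)"
)

baseline = tokens_per_second(baseline_log)
flash_attn = tokens_per_second(flash_attn_log)
drop_percent = (1.0 - flash_attn / baseline) * 100.0

print(f"without flash attention: {baseline:.2f} tokens/s")
print(f"with flash attention:    {flash_attn:.2f} tokens/s")
print(f"throughput drop:         {drop_percent:.1f}%")
```

For these two runs the drop comes out at about 48%, while prompt eval and sampling times are nearly identical, so the regression is confined to token generation.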