llama: use FA + max. GPU layers by default #15434
Conversation
I agree with increasing ngl by default, but the error message when model loading fails due to a buffer allocation error should give some hint to let users know what they need to change. The reason FA is not already the default for the backends that support it is because the CUDA backend implementation of …
I pushed a version that seems to work for automatically setting FlashAttention (the same for all layers). The way I'm determining whether FA should be used is to check whether or not the FA ggml op is being assigned to the same backend as the previous node in the graph. But that is, I think, a bad solution. Would it make sense to set a flag for tensors that cannot run on fast backends? The point at which I'm resolving … For this PR my goal is not to implement toggling FA on a per-layer basis - I'm not convinced that there are many situations where this would make sense.
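For illustration, a minimal sketch of what that check might look like (a hypothetical reconstruction based on the description above, not the PR's actual code; the use of ggml_backend_sched_get_tensor_backend to query the assigned backend is my assumption):

// hypothetical sketch: flag a mismatch when a FLASH_ATTN_EXT node was assigned
// to a different backend than the node immediately preceding it in the graph
bool fa_backend_mismatch = false;
for (int i = 1; i < ggml_graph_n_nodes(gf); i++) {
    ggml_tensor * node = ggml_graph_node(gf, i);
    if (node->op != GGML_OP_FLASH_ATTN_EXT) {
        continue;
    }
    ggml_backend_t backend_fa   = ggml_backend_sched_get_tensor_backend(sched.get(), node);
    ggml_backend_t backend_prev = ggml_backend_sched_get_tensor_backend(sched.get(), ggml_graph_node(gf, i - 1));
    if (backend_fa != backend_prev) {
        fa_backend_mismatch = true; // FA would run on a different (likely slower) backend -> disable FA
        break;
    }
}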
src/llama-context.cpp
Outdated
if (cparams.flash_attn_type == LLAMA_FLASH_ATTN_TYPE_AUTO) {
    bool fa_backend_mismatch = false;
    GGML_ASSERT(ggml_graph_node(gf, 0)->op != GGML_OP_FLASH_ATTN_EXT);
    for (int i = 1; i < ggml_graph_n_nodes(gf); i++) {
I don't think this is a good way to do it. It is very fragile code that makes a lot of assumptions that are not guaranteed anywhere, and it will break easily, in ways that are difficult to detect, when other parts of the code change.
A potentially slightly better way to do it could be (see the sketch below):
- Extract the layer number from the tensor name
- Verify that the device of the backend (ggml_backend_get_device) is the same as the device assigned to the layer's KV cache
- The device assigned to the layer can be obtained from model.dev_layer(il) if offload_kqv is set, CPU otherwise
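A rough sketch of that suggestion (illustrative only: the tensor-name offset, the offload_kqv handling, and the use of ggml_backend_dev_by_type for the CPU device are assumptions, not the final implementation; n is the FLASH_ATTN_EXT node and backend is the backend the scheduler assigned it to):

// illustrative sketch: compare the device of the backend running FA
// against the device the layer's KV cache is expected to live on
const int il = std::stoi(n->name + 6); // layer index parsed from the tensor name (prefix length assumed)

ggml_backend_dev_t device_fa = ggml_backend_get_device(backend);
ggml_backend_dev_t device_kv = cparams.offload_kqv
    ? model.dev_layer(il)                                     // KV cache follows the layer device
    : ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU); // with no-kv-offload it stays on the CPU

if (device_fa != device_kv) {
    fa_backend_mismatch = true;
}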
Just to make sure that this doesn't go unnoticed: on master …
You can use …
We also need a message telling users to reduce --n-gpu-layers when loading a model fails, otherwise people trying to run models bigger than their VRAM will just see an error and assume that they cannot use llama.cpp.
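For example, something along these lines (a hypothetical sketch of where such a hint could go and how it could be worded, not the message that actually landed in the PR):

// hypothetical sketch: hint at --n-gpu-layers when compute buffer allocation fails
if (!ggml_backend_sched_reserve(sched.get(), gf)) {
    LLAMA_LOG_ERROR("%s: failed to allocate compute buffers\n", __func__);
    LLAMA_LOG_WARN("%s: if the model does not fit in VRAM, try reducing --n-gpu-layers (-ngl)\n", __func__);
    return nullptr;
}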
common/common.h
Outdated
#ifdef GGML_USE_WEBGPU
// FIXME the webgpu backend is lacking support for very basic operations so the test allocation for -fa auto result in an abort
enum llama_flash_attn_type flash_attn_type = LLAMA_FLASH_ATTN_TYPE_DISABLED; // whether to use Flash Attention
#else
enum llama_flash_attn_type flash_attn_type = LLAMA_FLASH_ATTN_TYPE_AUTO; // whether to use Flash Attention
#endif // GGML_USE_WEBGPU
I think it would be better to leave the test failing than to add an exception here.
common/arg.cpp
Outdated
params.n_gpu_layers = 999;
params.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_ENABLED;
Since these values are the same as the default now, these lines could be removed entirely.
}

ggml_backend_dev_t ggml_backend_get_device(ggml_backend_t backend) {
    GGML_ASSERT(backend);
I don't mind having asserts against null pointers here, but it needs to be consistent, not just in one isolated function.
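For example, the same guard applied to a neighbouring accessor as well (illustrative sketch; which functions should assert is exactly the open question here, and the function bodies are paraphrased from memory, not verified against the current source):

// illustrative: the null-pointer assert applied consistently, not only in one accessor
const char * ggml_backend_name(ggml_backend_t backend) {
    GGML_ASSERT(backend);
    return backend->iface.get_name(backend);
}

ggml_backend_dev_t ggml_backend_get_device(ggml_backend_t backend) {
    GGML_ASSERT(backend);
    return backend->device;
}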
I added the messages in …
My bad, I missed that. Looks good.
For …
Supposedly there are issues with context shifting when using FlashAttention: #9646. This would align with the test in …
Could you show a failure log or steps to reproduce?
src/llama-context.cpp
Outdated
const int il = std::stoi(n->name + 6);
ggml_backend_dev_t device_kv = model.dev_layer(il);
if (device_fa != device_kv) {
This should require checking against the CPU when using no-kv-offload, but it seems to be broken at the moment, and attention ops are being run on the GPU even when not offloaded.
# 64 tokens are generated thanks to shifting the context when it gets full
global server
server.enable_ctx_shift = True
server.fa = "off" # FIXME prompt_n assert fails otherwise
@ggerganov remove this line or set it to "on", then run the unit test. Alternatively, edit the unit test on master to run with FA.
Is this the assert that you observed too:
assert res.status_code == 200
> assert res.body["timings"]["prompt_n"] == 109
E assert 173 == 109
unit/test_ctx_shift.py:36: AssertionError
FAILED unit/test_ctx_shift.py::test_ctx_shift_enabled - assert 173 == 109
This occurs because we pad the context size to 256 when flash attention is enabled:
llama.cpp/src/llama-kv-cache.cpp, lines 1990 to 1994 in ef47691:
uint32_t llama_kv_cache::get_padding(const llama_cparams & cparams) {
    // the FA kernels require padding to avoid extra runtime boundary checks
    return cparams.flash_attn ? 256u : 32u;
}
So in this test, when FA is off the padding is 32 and when FA is on the padding is 256. This affects the number of tokens truncated from the prompt.
You can fix this with this patch on master to make it work both with and without FA:
diff --git a/tools/server/tests/unit/test_ctx_shift.py b/tools/server/tests/unit/test_ctx_shift.py
index 8f51bc301..3edf18727 100644
--- a/tools/server/tests/unit/test_ctx_shift.py
+++ b/tools/server/tests/unit/test_ctx_shift.py
@@ -15,25 +15,27 @@ Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deseru
def create_server():
global server
server = ServerPreset.tinyllama2()
- server.n_ctx = 256
+ server.n_ctx = 512
server.n_slots = 2
+ server.n_predict = 128
def test_ctx_shift_enabled():
# the prompt is 301 tokens
- # the slot context is 256/2 = 128 tokens
- # the prompt is truncated to keep the last 109 tokens
- # 64 tokens are generated thanks to shifting the context when it gets full
+ # the slot context is 512/2 = 256 tokens
+ # the prompt is truncated to keep the last (301 - 256/2) = 173 tokens
+ # 96 tokens are generated thanks to shifting the context when it gets full
global server
server.enable_ctx_shift = True
+ server.fa = True
server.start()
res = server.make_request("POST", "/completion", data={
- "n_predict": 64,
+ "n_predict": 96,
"prompt": LONG_TEXT,
})
assert res.status_code == 200
- assert res.body["timings"]["prompt_n"] == 109
- assert res.body["timings"]["predicted_n"] == 64
+ assert res.body["timings"]["prompt_n"] == 173
+ assert res.body["timings"]["predicted_n"] == 96
assert res.body["truncated"] is True
diff --git a/tools/server/tests/utils.py b/tools/server/tests/utils.py
index f55a53947..d9df9bd91 100644
--- a/tools/server/tests/utils.py
+++ b/tools/server/tests/utils.py
@@ -160,7 +160,7 @@ class ServerProcess:
server_args.extend(["-ctk", self.ctk])
if self.ctv:
server_args.extend(["-ctv", self.ctv])
- if self.fa is not None:
+ if self.fa is not None and self.fa is True:
server_args.append("-fa")
if self.n_predict:
server_args.extend(["--n-predict", self.n_predict])
I think the problem has to do with the same graph being passed twice. If I edit the code like this:

diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index ac8453ab7..105b91c73 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -1405,6 +1405,10 @@ ggml_cgraph * llama_context::graph_reserve(uint32_t n_tokens, uint32_t n_seqs, u
LLAMA_LOG_ERROR("%s: failed to allocate compute buffers\n", __func__);
return nullptr;
}
+ if (!ggml_backend_sched_reserve(sched.get(), gf)) {
+ LLAMA_LOG_ERROR("%s: failed to allocate compute buffers\n", __func__);
+ return nullptr;
+ }
return gf;
}
(so that …)
My understanding is that … Should we add something like a … ?
I had this error all morning: … I read the comments above and I can confirm the error was resolved by adding -fa off.
I’m not sure which commit changed the behavior, but it looks like it works correctly now.
All models work except gemma-3n (both E4B and E2B)
(works with -fa off or CUDA_VISIBLE_DEVICES=0)
Cannot reproduce this with two GPUs.
You are right, CUDA_VISIBLE_DEVICES=0,1 also fixes the issue, so 3 GPUs are needed to reproduce. I can debug or send more logs if needed. Call stack: …
@JohannesGaessler I cannot test this easily, but please increase …
I tried this: …
and this: …
On google_gemma-3-27b-it-Q8_0.gguf the max value is: …
but on google_gemma-3n-E4B-it-Q8_0.gguf: …
@ubergarm, llama-sweep-bench no longer compiles because the common_params definition no longer includes flash_attn.
Behavior of mainline llama.cpp `-fa` changed and now *requires* an argument of `on` or `1`, it seems, to enable flash attention explicitly. This diverges from ik_llama.cpp behavior, where omitting it means disabled; on mainline, omitting it means `auto`, which means "probably enabled", I believe. Details here: ggml-org#15434. This patch just changes all `s/flash_attn/flash_attn_type/g`.
* llama: use max. GPU layers by default, auto -fa * ggml-backend: abort instead of segfault
lol so sorry i ever put a link or usernames in my commit message, i'll try to get rid of that spam... 💀
This PR updates the llama.cpp defaults to use FlashAttention and the maximum number of GPU layers. FlashAttention is, I think, by now mature enough that it is the better choice for most combinations of models and hardware. Both 0 and max. GPU layers have downsides, but I think there are more cases where max. GPU layers is the better choice. In particular, when someone is using llama.cpp for the first time and is most reliant on defaults, they are likely using a very small model for testing (and in that scenario max. GPU layers is definitely the correct choice).
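For reference, a rough usage note on overriding the new defaults from the command line (the -ngl/-fa flags and their values are as discussed above; the model path is a placeholder):

# new defaults: all layers offloaded (-ngl 999) and -fa auto
llama-cli -m model.gguf

# restore the previous behaviour: keep weights on the CPU and disable FlashAttention
llama-cli -m model.gguf -ngl 0 -fa off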