llama: use FA + max. GPU layers by default #15434
Conversation
I agree with increasing ngl by default, but the error message when model loading fails due to a buffer allocation error should give some hint to let users know what they need to change. The reason FA is not already the default for the backends that support it is because the CUDA backend implementation of …
I pushed a version that seems to work for automatically setting FlashAttention (the same for all layers). The way I'm determining whether FA should be used is to check whether or not the FA ggml op is being assigned to the same backend as the previous node in the graph. But that is, I think, a bad solution. Would it make sense to set a flag for tensors that cannot run on fast backends? The point at which I'm resolving … For this PR my goal is not to implement toggling FA on a per-layer basis - I'm not convinced that there are many situations where this would make sense.
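For illustration, a minimal sketch of what that check might look like (a hypothetical reconstruction based on the description above, not the PR's actual code; the use of ggml_backend_sched_get_tensor_backend to query the assigned backend is my assumption):

// hypothetical sketch: flag a mismatch when a FLASH_ATTN_EXT node was assigned
// to a different backend than the node immediately preceding it in the graph
bool fa_backend_mismatch = false;
for (int i = 1; i < ggml_graph_n_nodes(gf); i++) {
    ggml_tensor * node = ggml_graph_node(gf, i);
    if (node->op != GGML_OP_FLASH_ATTN_EXT) {
        continue;
    }
    ggml_backend_t backend_fa   = ggml_backend_sched_get_tensor_backend(sched.get(), node);
    ggml_backend_t backend_prev = ggml_backend_sched_get_tensor_backend(sched.get(), ggml_graph_node(gf, i - 1));
    if (backend_fa != backend_prev) {
        fa_backend_mismatch = true; // FA would run on a different (likely slower) backend -> disable FA
        break;
    }
}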
src/llama-context.cpp
Outdated
if (cparams.flash_attn_type == LLAMA_FLASH_ATTN_TYPE_AUTO) {
    bool fa_backend_mismatch = false;
    GGML_ASSERT(ggml_graph_node(gf, 0)->op != GGML_OP_FLASH_ATTN_EXT);
    for (int i = 1; i < ggml_graph_n_nodes(gf); i++) {
I don't think this is a good way to do it. It is very fragile code that makes a lot of assumptions that are not guaranteed anywhere, and it will break easily, in ways that are difficult to detect, when other parts of the code change.
A potentially slightly better way to do it could be (see the sketch below):
- Extract the layer number from the tensor name
- Verify that the device of the backend (ggml_backend_get_device) is the same as the device assigned to the layer's KV cache
- The device assigned to the layer can be obtained from model.dev_layer(il) if offload_kqv is set, CPU otherwise
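A rough sketch of that suggestion (illustrative only: the tensor-name offset, the offload_kqv handling, and the use of ggml_backend_dev_by_type for the CPU device are assumptions, not the final implementation; n is the FLASH_ATTN_EXT node and backend is the backend the scheduler assigned it to):

// illustrative sketch: compare the device of the backend running FA
// against the device the layer's KV cache is expected to live on
const int il = std::stoi(n->name + 6); // layer index parsed from the tensor name (prefix length assumed)

ggml_backend_dev_t device_fa = ggml_backend_get_device(backend);
ggml_backend_dev_t device_kv = cparams.offload_kqv
    ? model.dev_layer(il)                                     // KV cache follows the layer device
    : ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU); // with no-kv-offload it stays on the CPU

if (device_fa != device_kv) {
    fa_backend_mismatch = true;
}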
Just to make sure that this doesn't go unnoticed: on master …
You can use …
We also need a message telling users to reduce --n-gpu-layers when loading a model fails, otherwise people trying to run models bigger than their VRAM will just see an error and assume that they cannot use llama.cpp.
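For example, something along these lines (a hypothetical sketch of where such a hint could go and how it could be worded, not the message that actually landed in the PR):

// hypothetical sketch: hint at --n-gpu-layers when compute buffer allocation fails
if (!ggml_backend_sched_reserve(sched.get(), gf)) {
    LLAMA_LOG_ERROR("%s: failed to allocate compute buffers\n", __func__);
    LLAMA_LOG_WARN("%s: if the model does not fit in VRAM, try reducing --n-gpu-layers (-ngl)\n", __func__);
    return nullptr;
}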
common/common.h
Outdated
#ifdef GGML_USE_WEBGPU
// FIXME the webgpu backend is lacking support for very basic operations so the test allocation for -fa auto result in an abort
enum llama_flash_attn_type flash_attn_type = LLAMA_FLASH_ATTN_TYPE_DISABLED; // whether to use Flash Attention
#else
enum llama_flash_attn_type flash_attn_type = LLAMA_FLASH_ATTN_TYPE_AUTO; // whether to use Flash Attention
#endif // GGML_USE_WEBGPU
I think it would be better to leave the test failing than to add an exception here.
common/arg.cpp
Outdated
params.n_gpu_layers = 999;
params.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_ENABLED;
Since these values are the same as the default now, these lines could be removed entirely.
}

ggml_backend_dev_t ggml_backend_get_device(ggml_backend_t backend) {
    GGML_ASSERT(backend);
I don't mind having asserts against null pointers here, but it needs to be consistent, not just in one isolated function.
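For example, the same guard applied to a neighbouring accessor as well (illustrative sketch; which functions should assert is exactly the open question here, and the function bodies are paraphrased from memory, not verified against the current source):

// illustrative: the null-pointer assert applied consistently, not only in one accessor
const char * ggml_backend_name(ggml_backend_t backend) {
    GGML_ASSERT(backend);
    return backend->iface.get_name(backend);
}

ggml_backend_dev_t ggml_backend_get_device(ggml_backend_t backend) {
    GGML_ASSERT(backend);
    return backend->device;
}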
I added the messages in …
My bad, I missed that. Looks good.
For …
Supposedly there are issues with context shifting when using FlashAttention: #9646. This would align with the test in …
Could you show a failure log or steps to reproduce?
src/llama-context.cpp
Outdated
const int il = std::stoi(n->name + 6);
ggml_backend_dev_t device_kv = model.dev_layer(il);
if (device_fa != device_kv) {
This should require checking against the CPU when using no-kv-offload, but it seems to be broken at the moment, and attention ops are being run on the GPU even when not offloaded.
# 64 tokens are generated thanks to shifting the context when it gets full
global server
server.enable_ctx_shift = True
server.fa = "off" # FIXME prompt_n assert fails otherwise
@ggerganov remove this line or set it to "on", then run the unit test. Alternatively, edit the unit test on master to run with FA.
Is this the assert that you observed too:
assert res.status_code == 200
> assert res.body["timings"]["prompt_n"] == 109
E assert 173 == 109
unit/test_ctx_shift.py:36: AssertionError
FAILED unit/test_ctx_shift.py::test_ctx_shift_enabled - assert 173 == 109
This occurs because we pad the context size to 256 when flash attention is enabled:
llama.cpp/src/llama-kv-cache.cpp, lines 1990 to 1994 in ef47691:
uint32_t llama_kv_cache::get_padding(const llama_cparams & cparams) {
    // the FA kernels require padding to avoid extra runtime boundary checks
    return cparams.flash_attn ? 256u : 32u;
}
So in this test, when FA is off the padding is 32 and when FA is on the padding is 256. This affects the number of tokens truncated from the prompt.
You can fix this with this patch on master to make it work both with and without FA:
diff --git a/tools/server/tests/unit/test_ctx_shift.py b/tools/server/tests/unit/test_ctx_shift.py
index 8f51bc301..3edf18727 100644
--- a/tools/server/tests/unit/test_ctx_shift.py
+++ b/tools/server/tests/unit/test_ctx_shift.py
@@ -15,25 +15,27 @@ Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deseru
def create_server():
global server
server = ServerPreset.tinyllama2()
- server.n_ctx = 256
+ server.n_ctx = 512
server.n_slots = 2
+ server.n_predict = 128
def test_ctx_shift_enabled():
# the prompt is 301 tokens
- # the slot context is 256/2 = 128 tokens
- # the prompt is truncated to keep the last 109 tokens
- # 64 tokens are generated thanks to shifting the context when it gets full
+ # the slot context is 512/2 = 256 tokens
+ # the prompt is truncated to keep the last (301 - 256/2) = 173 tokens
+ # 96 tokens are generated thanks to shifting the context when it gets full
global server
server.enable_ctx_shift = True
+ server.fa = True
server.start()
res = server.make_request("POST", "/completion", data={
- "n_predict": 64,
+ "n_predict": 96,
"prompt": LONG_TEXT,
})
assert res.status_code == 200
- assert res.body["timings"]["prompt_n"] == 109
- assert res.body["timings"]["predicted_n"] == 64
+ assert res.body["timings"]["prompt_n"] == 173
+ assert res.body["timings"]["predicted_n"] == 96
assert res.body["truncated"] is True
diff --git a/tools/server/tests/utils.py b/tools/server/tests/utils.py
index f55a53947..d9df9bd91 100644
--- a/tools/server/tests/utils.py
+++ b/tools/server/tests/utils.py
@@ -160,7 +160,7 @@ class ServerProcess:
server_args.extend(["-ctk", self.ctk])
if self.ctv:
server_args.extend(["-ctv", self.ctv])
- if self.fa is not None:
+ if self.fa is not None and self.fa is True:
server_args.append("-fa")
if self.n_predict:
server_args.extend(["--n-predict", self.n_predict])
I think the problem has to do with the same graph being passed twice. If I edit the code like this:

diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index ac8453ab7..105b91c73 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -1405,6 +1405,10 @@ ggml_cgraph * llama_context::graph_reserve(uint32_t n_tokens, uint32_t n_seqs, u
LLAMA_LOG_ERROR("%s: failed to allocate compute buffers\n", __func__);
return nullptr;
}
+ if (!ggml_backend_sched_reserve(sched.get(), gf)) {
+ LLAMA_LOG_ERROR("%s: failed to allocate compute buffers\n", __func__);
+ return nullptr;
+ }
return gf;
}
(so that …)
My understanding is that … Should we add something like a … ?
I had this error all morning: … I read the comments above and I can confirm the error was resolved by adding -fa off.
I’m not sure which commit changed the behavior, but it looks like it works correctly now.
All models work except gemma-3n (both E4B and E2B)
(works with -fa off or CUDA_VISIBLE_DEVICES=0)
Cannot reproduce this with two GPUs.
You are right, CUDA_VISIBLE_DEVICES=0,1 also fixes the issue, so 3 GPUs are needed to reproduce. I can debug or send more logs if needed. Call stack: …
@JohannesGaessler I cannot test this easily, but please increase …
I tried this: …
and this: …
On google_gemma-3-27b-it-Q8_0.gguf the max value is: …
but on google_gemma-3n-E4B-it-Q8_0.gguf: …
@ubergarm, llama-sweep-bench no longer compiles because the common_params definition no longer includes flash_attn.
Behavior of mainline llama.cpp `-fa` changed and now *requires* an argument of `on` or `1`, it seems, to enable flash attention explicitly. This diverges from ik_llama.cpp behavior, where omitting it means disabled; on mainline, omitting it means `auto`, which means "probably enabled", I believe. Details here: ggml-org#15434. This patch just changes all `s/flash_attn/flash_attn_type/g`.
* llama: use max. GPU layers by default, auto -fa * ggml-backend: abort instead of segfault
lol so sorry i ever put a link or usernames in my commit message, i'll try to get rid of that spam... 💀
This PR updates the llama.cpp defaults to use FlashAttention and the maximum number of GPU layers. FlashAttention is, I think, by now mature enough that it is the better choice for most combinations of models and hardware. Both 0 and max. GPU layers have downsides, but I think there are more cases where max. GPU layers is the better choice. In particular, when someone is using llama.cpp for the first time and is most reliant on defaults, they are likely using a very small model for testing (and in that scenario max. GPU layers is definitely the correct choice).
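For reference, a rough usage note on overriding the new defaults from the command line (the -ngl/-fa flags and their values are as discussed above; the model path is a placeholder):

# new defaults: all layers offloaded (-ngl 999) and -fa auto
llama-cli -m model.gguf

# restore the previous behaviour: keep weights on the CPU and disable FlashAttention
llama-cli -m model.gguf -ngl 0 -fa off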