graph : reuse SSM graphs #16490
base: master
Conversation
Very cool! I'll test shortly with Granite 4. The only thought I've had about why this might be difficult is around implementing the SSD version of SSM_SCAN. In

Review comment on:

```cpp
    bool res = true;

    res &= s_copy->ne[0] == mctx->get_n_rs();
```
`mctx->get_head()` (the start of the slot) and `mctx->get_rs_z()` (the first zeroed state) are used in view offsets, and so would need to match too, otherwise the graph can't really be re-used.

The case where they wouldn't match (but `n_rs` matches) is when ubatches of the same size with different sequences are used. E.g. seq_ids 0, 1 with 1 token, and then seq_ids 2, 3 with 1 token, in consecutive ubatches, repeatedly. This probably happens when using `-ub 1` in the `llama-parallel` example, I think (because it uses a single `seq_id` per ubatch at a time, but ends up using different seq_ids while using the same size of ubatches).

(Note that I didn't actually test the changes yet, so I don't know if this is a real problem.)
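To illustrate why these values matter for reuse, here is a rough sketch (simplified placeholder code, not the exact llama.cpp implementation; the function and variable names are assumptions) of how a slot-dependent byte offset ends up baked into a cached graph:

```cpp
#include "ggml.h"

// Sketch only: the recurrent states for the current slot are addressed through
// a view whose byte offset depends on the slot head.
static ggml_tensor * view_slot_states(
        ggml_context * ctx0,
        ggml_tensor  * s_all,   // all recurrent states [state_size, total_cells]
        int64_t        n_rs,    // number of states used by this ubatch's slot
        uint32_t       head) {  // first cell of the slot (mctx->get_head())
    // The offset below is fixed at graph-build time, so a cached graph built
    // for one value of `head` would read/write the wrong rows if reused for
    // a ubatch whose slot starts elsewhere.
    return ggml_view_2d(ctx0, s_all,
            s_all->ne[0], n_rs,
            s_all->nb[1],
            head * s_all->nb[1]);
}
```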
There is a check earlier for whether the sequences are the same:
Lines 443 to 457 in 638e2c2
```cpp
    // when we split the batch using "equal_seqs" we have to verify that the participating sequences are the same
    // the reason is because the set of attention streams would be different for different sequences
    if (can_reuse_ubatch && ubatch.equal_seqs()) {
        if (!ubatch.data) {
            // if the old ubatch does not own it's data, then we cannot guarantee that it is still alive, and
            //   therefore we cannot perform the sequence id check. normally should never happen
            can_reuse_ubatch = false;
        } else {
            for (uint32_t s = 0; s < ubatch.n_seqs_unq; ++s) {
                can_reuse_ubatch &= ubatch.seq_id_unq[s] == other.ubatch.seq_id_unq[s];
            }
        }
    }
```
This check applies to all graphs and if not satisfied, we don't attempt to reuse the graph. I think this should cover this case.
> There is a check earlier for whether the sequences are the same

Right, this should cover the case where different sequences are used. However, I don't think it covers the case when a sequence is cleared (which will make `mctx->get_rs_z()` differ).
I'm noticing different perplexity with and without graph reuse with a Q8_0 `mamba-130m` on CPU (this is on the first 10 chunks of `calibration_datav3`):
| params | LLAMA_GRAPH_REUSE_DISABLE | PPL |
|---|---|---|
| `-b 512` | 0 | 7.7852 |
| `-b 2048` | 0 | 7.8628 |
| `-b 512` | 1 | 7.7852 |
| `-b 2048` | 1 | 7.7852 |
I'm not sure what exactly is causing it, but I suspect it's related to either `rs_z` or `head` (since this doesn't seem to happen with non-recurrent models; I tested with a Q8_0 TinyLlama).
@ggerganov Checking for `head` and `rs_z` mismatch does seem to help with the case in my previous comment, making the graph-reuse case have the same PPL as when it's not used.
Patch with changes:

```diff
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 7f0c974f1..aad42d62d 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -258,6 +258,9 @@ bool llm_graph_input_rs::can_reuse(const llm_graph_params & params) {
     bool res = true;
 
+    res &= this->head == mctx->get_head();
+    res &= this->rs_z == mctx->get_rs_z();
+
     res &= s_copy->ne[0] == mctx->get_n_rs();
     res &= s_copy_main->ne[0] == params.ubatch.n_seqs;
@@ -482,6 +485,9 @@ bool llm_graph_input_mem_hybrid::can_reuse(const llm_graph_params & params) {
     res &= inp_attn->self_kq_mask->ne[0] == mctx->get_attn()->get_n_kv();
     res &= inp_attn->self_kq_mask->ne[1] == GGML_PAD(params.ubatch.n_tokens, GGML_KQ_MASK_PAD);
 
+    res &= inp_rs->head == mctx->get_recr()->get_head();
+    res &= inp_rs->rs_z == mctx->get_recr()->get_rs_z();
+
     res &= inp_rs->s_copy->ne[0] == mctx->get_recr()->get_n_rs();
     res &= inp_rs->s_copy_main->ne[0] == params.ubatch.n_seqs;
@@ -1827,6 +1833,9 @@ static std::unique_ptr<llm_graph_input_rs> build_rs_inp_impl(
     inp->s_copy_main  = ggml_view_1d(ctx0, inp->s_copy, n_seqs, 0);
     inp->s_copy_extra = ggml_view_1d(ctx0, inp->s_copy, n_rs - n_seqs, n_seqs * inp->s_copy->nb[0]);
 
+    inp->head = mctx_cur->get_head();
+    inp->rs_z = mctx_cur->get_rs_z();
+
     return inp;
 }
diff --git a/src/llama-graph.h b/src/llama-graph.h
index 394e88432..a596461bb 100644
--- a/src/llama-graph.h
+++ b/src/llama-graph.h
@@ -234,6 +234,10 @@ public:
     ggml_tensor * s_copy_extra; // I32 [n_rs - n_seqs]
 
     const llama_memory_recurrent_context * mctx;
+
+    // used in view offsets, need to match for valid graph reuse
+    uint32_t head;
+    int32_t  rs_z;
 };
 
 class llm_graph_input_cross_embd : public llm_graph_input_i {
```
It might not be ideal to expose another way to get `head` and `rs_z`. But the constructor of `llm_graph_input_rs` would need access to `llama-memory-recurrent.h` to use `mctx->get_head()` and `mctx->get_rs_z()`.

Strangely enough, hybrid models like Falcon-H1 don't manifest the same problem as `mamba-130m`; I can't reproduce the original problem with them.
> It might not be ideal to expose another way to get head and rs_z. But the constructor of llm_graph_input_rs would need access to llama-memory-recurrent.h to use mctx->get_head() and mctx->get_rs_z().

Can you clarify what you mean here? The proposed solution seems OK to me.

On a related topic, would it be possible to avoid these offsets through the use of `ggml_set_rows()`, in a similar way to how we avoided the KV cache offset for the regular attention?
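For context, a minimal sketch of that idea (assuming the existing `ggml_set_rows()` API; the function and tensor names here are placeholders, not proposed code): instead of writing the updated states through a view whose byte offset depends on `head`, the destination rows are selected by an index tensor that is filled at compute time, so the graph itself stays offset-free and is easier to reuse.

```cpp
#include "ggml.h"

// Sketch only: scatter the updated recurrent states into the full state buffer
// by row index, rather than through a head-dependent view offset.
static ggml_tensor * store_states_by_row(
        ggml_context * ctx0,
        ggml_tensor  * s_all,   // all states  [state_size, total_cells]
        ggml_tensor  * s_new,   // new states  [state_size, n_rs]
        ggml_tensor  * rows) {  // I64 indices [n_rs], filled via set_input()
    // The row indices are a graph input, so the same cached graph works for
    // any slot position; only the data fed into `rows` changes per ubatch.
    return ggml_set_rows(ctx0, s_all, s_new, rows);
}
```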
@compilade I improved the state management of the recurrent memory with 6589d3b. The recurrent memory context now keeps immutable values such as `head`, `rs_z`, etc. These can be used in the `can_reuse()` logic without duplicating this state in the inputs.
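A rough sketch of what that enables (this is not the actual 6589d3b change; it only illustrates the idea under the assumption that `get_head()`/`get_rs_z()` are exposed on the recurrent memory context and that `this->mctx` is the context captured at graph-build time):

```cpp
// Sketch only: compare the immutable per-context values of the old and new
// memory contexts instead of storing head/rs_z copies in the graph input.
bool llm_graph_input_rs::can_reuse(const llm_graph_params & params) {
    const auto * mctx = static_cast<const llama_memory_recurrent_context *>(params.mctx);

    bool res = true;

    res &= s_copy->ne[0]      == mctx->get_n_rs();
    res &= s_copy_main->ne[0] == params.ubatch.n_seqs;

    // the view offsets depend on these, so they must match for the cached
    // graph to remain valid
    res &= this->mctx->get_head() == mctx->get_head();
    res &= this->mctx->get_rs_z() == mctx->get_rs_z();

    return res;
}
```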
@gabe-l-hart What is SSD?

Sorry, commenting from my phone at the airport! SSD is the State Space Duality part of the
Results looking good for Metal.

Reuse on, fa on: `./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa on`

Reuse on, fa off: `./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa off`

Reuse off, fa on: `LLAMA_GRAPH_REUSE_DISABLE=1 ./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa on`

Reuse off, fa off: `LLAMA_GRAPH_REUSE_DISABLE=1 ./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa off`
I haven't followed the graph reuse implementation closely enough to review the code changes here in depth, but the performance changes are working well for me on Metal, and I've validated that results match perfectly with and without graph reuse for single-concurrency.

@gabe-l-hart as a side note, I've added
@pwilkin I thought I saw that while trying to keep up with the comments! It's high on my todo list after this conference to get into your PR (partly selfishly, because I want to reuse these parts).

@gabe-l-hart Parallel performance of SSMs should be fixed with #16494

Thank you for digging into these performance improvements!
I'm hitting errors on `lldb ./bin/llama-cli -- -m $(find-ollama-gguf.sh granite4:micro-h) -no-cnv -p "tell me a story about a developer and their dog?" -ngl 99 --temp 0`. I'll investigate further, but wanted to post in case it's about to be merged.
It looks like it broke in

Should be ok now. I mistakenly thought that the old

Confirmed, it's working again for me! I'll test a little further with parallel sequences, but I think it's probably ready.
Hitting assertions with `lldb ./bin/llama-parallel -- -m $(find-ollama-gguf.sh granite4:micro-h) -ngl 99 -fa on -ns 10 -np 10`.

Just confirmed that I don't hit these on
Force-pushed from 18212b0 to 2744d61.

Running cleanly with those reverts.
In case it's helpful, I was seeing it consistently on the second call (debug logs attached).
So the change in 00f115f does not work for some reason. We want to eventually extract the state of the recurrent memory into the memory context, as we do with the KV cache implementations. But I think there is something being mutated when it should not be. For now, let's revert this and figure it out later. To clarify, the design is that when building the graph we should only reference data that is stored in the memory context (i.e. in

Got it, that makes sense.
Force-pushed from 2744d61 to 16d57ca.
This reverts commit 00f115f.
Force-pushed from 16d57ca to 7641e6f.
Not sure if there is a reason not to enable graph reuse for recurrent graphs (Mamba, hybrids, SSMs, etc.). Did a few tests and it seems to work, resulting in some modest perf improvements. cc @gabe-l-hart @compilade

Without graph reuse:

```
make -j && LLAMA_GRAPH_REUSE_DISABLE=1 ./bin/llama-bench -m ../models/mamba-130m/ggml-model-f16.gguf -m ../models/granite-4-h-tiny/ggml-model-q8_0.gguf -m ../models/ai21-jamba-mini-1.7/ggml-model-q8_0.gguf -m ../models/liquidai-lfm2-2.6b/ggml-model-q4_k.gguf -fa 1 -t 1 -n 32
```

With graph reuse:

```
make -j && ./bin/llama-bench -m ../models/mamba-130m/ggml-model-f16.gguf -m ../models/granite-4-h-tiny/ggml-model-q8_0.gguf -m ../models/ai21-jamba-mini-1.7/ggml-model-q8_0.gguf -m ../models/liquidai-lfm2-2.6b/ggml-model-q4_k.gguf -fa 1 -t 1 -n 32
```