
Conversation

@ggerganov (Member)

Not sure if there is a reason not to enable graph reuse for recurrent graphs (Mamba, hybrids, SSMs, etc.). I did a few tests and it seems to work, resulting in some modest performance improvements. cc @gabe-l-hart @compilade
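For context, the reuse decision is roughly the following (a simplified sketch, member names approximate, not the exact code): the previously built graph result is kept around, and before building a new graph we ask it whether it is still valid for the incoming params; the `LLAMA_GRAPH_REUSE_DISABLE` environment variable simply bypasses that check.

```cpp
// Simplified sketch of the reuse decision in llama_context::process_ubatch()
// (names approximate): reuse the previously built graph only if its inputs
// report that they are still valid for the new graph params.
const bool graph_reuse_disable = getenv("LLAMA_GRAPH_REUSE_DISABLE") != nullptr;

const bool can_reuse = !graph_reuse_disable && res_prev->can_reuse(params);
if (!can_reuse) {
    // rebuild the graph (and reallocate the compute buffers)
    res_prev = model.build_graph(params);
}
// in both cases the per-ubatch input tensors are refilled before evaluation
```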

Without graph reuse:

```sh
make -j && LLAMA_GRAPH_REUSE_DISABLE=1 ./bin/llama-bench -m ../models/mamba-130m/ggml-model-f16.gguf -m ../models/granite-4-h-tiny/ggml-model-q8_0.gguf -m ../models/ai21-jamba-mini-1.7/ggml-model-q8_0.gguf -m ../models/liquidai-lfm2-2.6b/ggml-model-q4_k.gguf -fa 1 -t 1 -n 32
```

| model | size | params | backend | ngl | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | pp512 | 8415.73 ± 46.47 |
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | tg32 | 322.74 ± 0.64 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | pp512 | 2119.36 ± 3.31 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | tg32 | 77.17 ± 0.11 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | pp512 | 603.47 ± 1.83 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | tg32 | 42.35 ± 0.02 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | pp512 | 2923.41 ± 3.20 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | tg32 | 169.83 ± 0.67 |

build: 638e2c2 (6725)

With graph reuse:

```sh
make -j && ./bin/llama-bench -m ../models/mamba-130m/ggml-model-f16.gguf -m ../models/granite-4-h-tiny/ggml-model-q8_0.gguf -m ../models/ai21-jamba-mini-1.7/ggml-model-q8_0.gguf -m ../models/liquidai-lfm2-2.6b/ggml-model-q4_k.gguf -fa 1 -t 1 -n 32
```

| model | size | params | backend | ngl | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | pp512 | 8453.65 ± 20.10 |
| mamba 0.1B F16 | 256.96 MiB | 129.14 M | Metal | 99 | 1 | 1 | tg32 | 348.83 ± 1.67 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | pp512 | 2126.12 ± 1.90 |
| granitehybrid ?B Q8_0 | 6.88 GiB | 6.94 B | Metal | 99 | 1 | 1 | tg32 | 82.26 ± 0.13 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | pp512 | 604.56 ± 2.08 |
| jamba ?B Q8_0 | 51.05 GiB | 51.57 B | Metal | 99 | 1 | 1 | tg32 | 43.22 ± 0.02 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | pp512 | 2928.31 ± 1.78 |
| lfm2 2.6B Q4_K - Medium | 1.45 GiB | 2.57 B | Metal | 99 | 1 | 1 | tg32 | 179.18 ± 0.47 |

build: 638e2c2 (6725)

@ggerganov requested a review from CISC as a code owner, October 9, 2025 16:58
@ggerganov changed the title from "graph : reuse mamba graphs" to "graph : reuse SSM graphs", Oct 9, 2025
@gabe-l-hart (Collaborator)

Very cool! I'll test shortly with Granite 4.

The only thought I've had about why this might be difficult is around implementing the SSD version of SSM_SCAN. In mamba_ssm and mlx, they conditionally use SSD if (and only if) the cache is empty and the sequence length is >1. Since SSD is composed of a bunch of smaller ops (tril and cumsum), one way this could be implemented is at the graph-building layer, which would result in different graphs for different parts of the generate loop (see the sketch below). That said, it could also be implemented inside the SSM_SCAN kernel-dispatching layer, so I don't think it's a blocker for reusing graphs.
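To make that concrete, here is a rough sketch of what the graph-building-layer variant could look like (hypothetical helper names, not existing llama.cpp functions); the point is only that the branch produces structurally different graphs for prefill vs. decode, which is what interacts with graph reuse:

```cpp
#include "ggml.h"

// hypothetical helpers, not part of llama.cpp
ggml_tensor * build_ssd     (ggml_context * ctx0, ggml_tensor * x);
ggml_tensor * build_ssm_scan(ggml_context * ctx0, ggml_tensor * x, ggml_tensor * state);

// sketch: take the chunked SSD path (built from tril/cumsum/matmul) only when
// the recurrent state is empty and more than one token is being processed,
// otherwise fall back to the sequential SSM_SCAN path
ggml_tensor * build_ssm_mixer(ggml_context * ctx0,
                              ggml_tensor  * x,
                              ggml_tensor  * state,
                              bool           state_is_empty,
                              int64_t        n_tokens) {
    if (state_is_empty && n_tokens > 1) {
        return build_ssd(ctx0, x);
    }
    return build_ssm_scan(ctx0, x, state);
}
```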


Inline review thread on `src/llama-graph.cpp` (`llm_graph_input_rs::can_reuse`):

```cpp
bool res = true;

res &= s_copy->ne[0] == mctx->get_n_rs();
```
@compilade (Collaborator) commented Oct 9, 2025

mctx->get_head() (the start of the slot) and mctx->get_rs_z() (the first zeroed state) are used in view offsets, and so would need to match too, otherwise the graph can't really be re-used.

The case where they wouldn't match (but n_rs matches) is when ubatches of the same size with different sequences are used.

E.g. seq_ids 0, 1, with 1 token and then seq_ids 2, 3 with 1 token, in consecutive ubatches, repeatedly.

This probably happens when using -ub 1 in the llama-parallel example, I think (because it uses a single seq_id per ubatch at a time, but ends up using different seq_ids while using the same size of ubatches).

(Note that I didn't actually test the changes yet, so I don't know if this is a real problem)
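To illustrate what I mean by view offsets (a simplified sketch, not the actual code): the state views baked into the graph carry a byte offset derived from `head`, so a cached graph is only valid while that offset still points at the right cells.

```cpp
// simplified illustration: the offset baked into the view depends on `head`
// (the start of the slot), so a graph cached for head == 0 would read/write
// the wrong state cells once a same-shaped ubatch lands at, say, head == 2;
// rs_z (the first zeroed state) enters the graph in a similar way
ggml_tensor * state_view(ggml_context * ctx0, ggml_tensor * s,
                         int64_t d_state, int64_t n_seqs, uint32_t head) {
    return ggml_view_2d(ctx0, s,
            d_state, n_seqs,
            s->nb[1],           // row stride of the state buffer
            head * s->nb[1]);   // byte offset: changes whenever head changes
}
```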

@ggerganov (Member Author)

There is a check earlier for whether the sequences are the same:

`src/llama-graph.h`, lines 443 to 457 in 638e2c2:

```cpp
// when we split the batch using "equal_seqs" we have to verify that the participating sequences are the same
// the reason is because the set of attention streams would be different for different sequences
if (can_reuse_ubatch && ubatch.equal_seqs()) {
    if (!ubatch.data) {
        // if the old ubatch does not own it's data, then we cannot guarantee that it is still alive, and
        // therefore we cannot perform the sequence id check. normally should never happen
        can_reuse_ubatch = false;
    } else {
        for (uint32_t s = 0; s < ubatch.n_seqs_unq; ++s) {
            can_reuse_ubatch &= ubatch.seq_id_unq[s] == other.ubatch.seq_id_unq[s];
        }
    }
}
```

This check applies to all graphs, and if it is not satisfied we don't attempt to reuse the graph. I think this should cover that case.

@compilade (Collaborator) commented Oct 9, 2025

> There is a check earlier for whether the sequences are the same

Right, this should cover the case where different sequences are used.

However, I don't think it covers the case when a sequence is cleared (which will make mctx->get_rs_z() differ).


I'm noticing different perplexity with and without graph-reuse with a Q8_0 mamba-130m on CPU.

(this is on the first 10 chunks of calibration_datav3)

| params | LLAMA_GRAPH_REUSE_DISABLE | PPL |
| --- | --- | --- |
| -b 512 | 0 | 7.7852 |
| -b 2048 | 0 | 7.8628 |
| -b 512 | 1 | 7.7852 |
| -b 2048 | 1 | 7.7852 |

I'm not sure what exactly is causing it, but I suspect it's related to either rs_z or head, since this doesn't seem to happen with non-recurrent models (I tested with a Q8_0 TinyLlama).

@compilade (Collaborator)

@ggerganov
Checking for head and rs_z mismatch does seem to help with the case in my previous comment, making the graph-reuse case have the same PPL as when it's not used.

Patch with changes:

```diff
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 7f0c974f1..aad42d62d 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -258,6 +258,9 @@ bool llm_graph_input_rs::can_reuse(const llm_graph_params & params) {
 
     bool res = true;
 
+    res &= this->head == mctx->get_head();
+    res &= this->rs_z == mctx->get_rs_z();
+
     res &= s_copy->ne[0] == mctx->get_n_rs();
 
     res &= s_copy_main->ne[0]  == params.ubatch.n_seqs;
@@ -482,6 +485,9 @@ bool llm_graph_input_mem_hybrid::can_reuse(const llm_graph_params & params) {
     res &= inp_attn->self_kq_mask->ne[0] == mctx->get_attn()->get_n_kv();
     res &= inp_attn->self_kq_mask->ne[1] == GGML_PAD(params.ubatch.n_tokens, GGML_KQ_MASK_PAD);
 
+    res &= inp_rs->head == mctx->get_recr()->get_head();
+    res &= inp_rs->rs_z == mctx->get_recr()->get_rs_z();
+
     res &= inp_rs->s_copy->ne[0] == mctx->get_recr()->get_n_rs();
 
     res &= inp_rs->s_copy_main->ne[0]  == params.ubatch.n_seqs;
@@ -1827,6 +1833,9 @@ static std::unique_ptr<llm_graph_input_rs> build_rs_inp_impl(
     inp->s_copy_main  = ggml_view_1d(ctx0, inp->s_copy, n_seqs, 0);
     inp->s_copy_extra = ggml_view_1d(ctx0, inp->s_copy, n_rs - n_seqs, n_seqs * inp->s_copy->nb[0]);
 
+    inp->head = mctx_cur->get_head();
+    inp->rs_z = mctx_cur->get_rs_z();
+
     return inp;
 }
 
diff --git a/src/llama-graph.h b/src/llama-graph.h
index 394e88432..a596461bb 100644
--- a/src/llama-graph.h
+++ b/src/llama-graph.h
@@ -234,6 +234,10 @@ public:
     ggml_tensor * s_copy_extra;  // I32 [n_rs - n_seqs]
 
     const llama_memory_recurrent_context * mctx;
+
+    // used in view offsets, need to match for valid graph reuse
+    uint32_t head;
+    int32_t rs_z;
 };
 
 class llm_graph_input_cross_embd : public llm_graph_input_i {
```

It might not be ideal to expose another way to get head and rs_z. But the constructor of llm_graph_input_rs would need access to llama-memory-recurrent.h to use mctx->get_head() and mctx->get_rs_z().

Strangely enough, hybrid models like Falcon-H1 don't manifest the same problem as mamba-130m; I can't reproduce the original problem with that.

@ggerganov (Member Author)

> It might not be ideal to expose another way to get head and rs_z. But the constructor of llm_graph_input_rs would need access to llama-memory-recurrent.h to use mctx->get_head() and mctx->get_rs_z().

Can you clarify what you mean here? The proposed solution seems OK to me.

On a related topic, would it be possible to avoid these offsets through the use of ggml_set_rows() in a similar way as we avoided the KV cache offset for the regular attention?
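For reference, the pattern I have in mind is roughly the following (a hedged sketch with simplified shapes and names, assuming `ctx0`, `gf`, the state tensor `s` and the new state `state_cur` from the surrounding build code; not the actual recurrent-state implementation):

```cpp
// Offset-based write: the slot offset `head` is baked into the view, so the
// compiled graph is tied to one particular slot position.
ggml_tensor * dst = ggml_view_2d(ctx0, s, d_state, n_seqs,
                                 s->nb[1], head * s->nb[1]);
ggml_build_forward_expand(gf, ggml_cpy(ctx0, state_cur, dst));

// Index-based write: the destination rows come from an I64 input tensor that
// is filled per-ubatch in set_input(), so the same graph works regardless of
// where the slot starts.
ggml_tensor * row_idx = ggml_new_tensor_1d(ctx0, GGML_TYPE_I64, n_seqs);
ggml_set_input(row_idx);
ggml_build_forward_expand(gf, ggml_set_rows(ctx0, s, state_cur, row_idx));
```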

@ggerganov (Member Author) commented Oct 10, 2025

@compilade I improved the state management of the recurrent state with 6589d3b. The recurrent memory context now keeps immutable values such as head, rs_z, etc. These can be used in the can_reuse() logic without duplicating this state in the inputs.

@ggerganov (Member Author)

@gabe-l-hart What is SSD?

@gabe-l-hart (Collaborator) commented Oct 9, 2025

> What is SSD?

Sorry, commenting from my phone at the airport! SSD is the State Space Duality part of the mamba2 paper where they reframe the SSM_SCAN op as an attention operation. The mlx implementation is here and the original triton kernel is here. I'm still working on actually grokking the math and was hoping to try to get it implemented in ggml soon-ish. It should provide a nice performance boost for prefill.
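For anyone following along, the core identity is roughly this (my paraphrase of the Mamba-2 formulation, so take the exact notation with a grain of salt): with a scalar per-step decay $a_t$, the recurrence

$$h_t = a_t\, h_{t-1} + B_t x_t, \qquad y_t = C_t^{\top} h_t$$

unrolls into an attention-like matrix form

$$y_i = \sum_{j \le i} C_i^{\top} \Big( \prod_{k=j+1}^{i} a_k \Big) B_j\, x_j
\quad\Longleftrightarrow\quad
Y = \big(L \odot C B^{\top}\big) X, \qquad
L_{ij} = \begin{cases} \prod_{k=j+1}^{i} a_k & i \ge j \\ 0 & i < j \end{cases}$$

so the lower-triangular mask is where tril comes in, and the decay products are computed via cumulative sums of $\log a_k$, which is where cumsum comes in.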

@gabe-l-hart (Collaborator)

Results looking good for granite4:micro-h (using the GGUF we uploaded to Ollama):


Metal

Reuse on, fa on:

```sh
./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa on
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | 128 | 1 | 256 | 0.166 | 770.58 | 1.617 | 79.16 | 1.783 | 143.56 |
| 128 | 128 | 2 | 512 | 0.309 | 828.03 | 4.041 | 63.35 | 4.350 | 117.69 |
| 128 | 128 | 4 | 1024 | 0.598 | 856.59 | 7.172 | 71.39 | 7.770 | 131.80 |
| 256 | 128 | 1 | 384 | 0.307 | 834.22 | 1.628 | 78.63 | 1.935 | 198.48 |
| 256 | 128 | 2 | 768 | 0.593 | 863.30 | 4.048 | 63.24 | 4.641 | 165.48 |
| 256 | 128 | 4 | 1536 | 1.191 | 860.14 | 7.162 | 71.48 | 8.353 | 183.89 |

Reuse on, fa off:

```sh
./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa off
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | 128 | 1 | 256 | 0.167 | 768.56 | 1.697 | 75.43 | 1.863 | 137.38 |
| 128 | 128 | 2 | 512 | 0.310 | 826.56 | 4.130 | 61.99 | 4.440 | 115.32 |
| 128 | 128 | 4 | 1024 | 0.599 | 854.75 | 7.232 | 70.80 | 7.831 | 130.76 |
| 256 | 128 | 1 | 384 | 0.307 | 833.12 | 1.705 | 75.08 | 2.012 | 190.84 |
| 256 | 128 | 2 | 768 | 0.594 | 861.28 | 4.175 | 61.32 | 4.770 | 161.02 |
| 256 | 128 | 4 | 1536 | 1.193 | 858.02 | 7.237 | 70.75 | 8.430 | 182.20 |

Reuse off, fa on:

```sh
LLAMA_GRAPH_REUSE_DISABLE=1 ./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa on
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | 128 | 1 | 256 | 0.166 | 769.00 | 1.714 | 74.69 | 1.880 | 136.16 |
| 128 | 128 | 2 | 512 | 0.309 | 828.93 | 4.209 | 60.83 | 4.517 | 113.34 |
| 128 | 128 | 4 | 1024 | 0.598 | 855.49 | 7.282 | 70.31 | 7.881 | 129.94 |
| 256 | 128 | 1 | 384 | 0.307 | 834.57 | 1.763 | 72.61 | 2.070 | 185.55 |
| 256 | 128 | 2 | 768 | 0.593 | 864.13 | 4.176 | 61.30 | 4.769 | 161.04 |
| 256 | 128 | 4 | 1536 | 1.190 | 860.30 | 7.291 | 70.22 | 8.481 | 181.10 |

Reuse off, fa off:

```sh
LLAMA_GRAPH_REUSE_DISABLE=1 ./bin/llama-batched-bench -m $(find-ollama-gguf.sh granite4:micro-h) -c 2048 -b 2048 -ub 512 -npp 128,256 -ntg 128 -npl 1,2,4 -ngl 99 -fa off
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | 128 | 1 | 256 | 0.170 | 751.10 | 1.790 | 71.50 | 1.961 | 130.58 |
| 128 | 128 | 2 | 512 | 0.310 | 826.98 | 4.190 | 61.10 | 4.499 | 113.79 |
| 128 | 128 | 4 | 1024 | 0.602 | 850.25 | 7.299 | 70.15 | 7.901 | 129.60 |
| 256 | 128 | 1 | 384 | 0.309 | 829.05 | 1.793 | 71.38 | 2.102 | 182.67 |
| 256 | 128 | 2 | 768 | 0.596 | 858.70 | 4.220 | 60.67 | 4.816 | 159.46 |
| 256 | 128 | 4 | 1536 | 1.193 | 858.16 | 7.309 | 70.05 | 8.502 | 180.66 |

@gabe-l-hart (Collaborator) left a review comment

I haven't followed the graph-reuse implementation closely enough to review the code changes in depth, but the performance changes are working well for me on Metal, and I've validated that results match exactly with and without graph reuse at single concurrency.

@pwilkin (Collaborator) commented Oct 9, 2025

@gabe-l-hart as a side note, I've added cumsum and tri as new ops during the Qwen3Next implementation, so that might allow for some decoupling.

@gabe-l-hart (Collaborator)

> I've added cumsum and tri as new ops during the Qwen3Next implementation, so that might allow for some decoupling.

@pwilkin I thought I saw that while trying to keep up with the comments! It's high on my todo list to dig into your PR after this conference (partly selfishly, because I want to reuse these parts).

@ggerganov (Member Author)

@gabe-l-hart Parallel performance of SSMs should be fixed with #16494

@ggerganov requested a review from compilade, October 10, 2025 08:00
@gabe-l-hart (Collaborator)

Thank you for digging into these performance improvements!

@gabe-l-hart (Collaborator)

I'm hitting errors on Metal with the most recent changes on this branch:

```
lldb ./bin/llama-cli -- -m $(find-ollama-gguf.sh granite4:micro-h) -no-cnv -p "tell me a story about a developer and their dog?" -ngl 99 --temp 0
tell me a story about a developer and their dog? The response mustProcess 95451 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x3c)
    frame #0: 0x00000001016f7e00 libllama.dylib`llama_memory_recurrent_context::get_head(this=0x0000000000000000) const at llama-memory-recurrent.cpp:1144:12
   1141	}
   1142	
   1143	uint32_t llama_memory_recurrent_context::get_head() const {
-> 1144	    return head;
   1145	}
   1146	
   1147	int32_t llama_memory_recurrent_context::get_rs_z() const {
Target 0: (llama-cli) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x3c)
  * frame #0: 0x00000001016f7e00 libllama.dylib`llama_memory_recurrent_context::get_head(this=0x0000000000000000) const at llama-memory-recurrent.cpp:1144:12
    frame #1: 0x00000001016ad078 libllama.dylib`llm_graph_input_mem_hybrid::can_reuse(this=0x0000600000258500, params=0x000000016fdf5b88) at llama-graph.cpp:481:52
    frame #2: 0x00000001016adc60 libllama.dylib`llm_graph_result::can_reuse(this=0x0000000121810600, params=0x000000016fdf5b88) at llama-graph.cpp:565:33
    frame #3: 0x000000010164f0f0 libllama.dylib`llama_context::process_ubatch(this=0x0000000102904080, ubatch=0x0000600000750540, gtype=LLM_GRAPH_TYPE_DECODER, mctx=0x00006000039732c0, ret=0x000000016fdf9cd4) at llama-context.cpp:746:38
    frame #4: 0x0000000101650b24 libllama.dylib`llama_context::decode(this=0x0000000102904080, batch_inp=0x000000016fdfac88) at llama-context.cpp:1088:28
    frame #5: 0x0000000101656a68 libllama.dylib`llama_decode(ctx=0x0000000102904080, batch=llama_batch @ 0x000000016fdfac88) at llama-context.cpp:2747:26
    frame #6: 0x0000000100006fb8 llama-cli`main(argc=10, argv=0x000000016fdfd380) at main.cpp:671:21
    frame #7: 0x000000019fe72b98 dyld`start + 6076
```

I'll investigate further, but wanted to post in case it's about to be merged

@gabe-l-hart (Collaborator)

It looks like it broke in 6589d3b8fcca803e4f2d4ad7da3ff8e87dfaf9ad for me.

@ggerganov (Member Author)

Should be ok now. I mistakenly thought that the old mctx of the input would be valid. Let me know if you spot any other issues.

@gabe-l-hart (Collaborator)

Confirmed, it's working again for me! I'll test a little further with parallel sequences, but I think it's probably ready

@gabe-l-hart (Collaborator)

Hitting assertions with llama-parallel:

```
lldb ./bin/llama-parallel -- -m $(find-ollama-gguf.sh granite4:micro-h) -ngl 99 -fa on -ns 10 -np 10
main: clearing the KV cache
Client   0, seq    0, junk =    0, prompt = 267, started decoding ...
Client   1, seq    1, junk =    0, prompt = 267, started decoding ...
Client   2, seq    2, junk =    0, prompt = 267, started decoding ...
Client   3, seq    3, junk =    0, prompt = 270, started decoding ...
Client   4, seq    4, junk =    0, prompt = 273, started decoding ...
Client   5, seq    5, junk =    0, prompt = 267, started decoding ...
Client   6, seq    6, junk =    0, prompt = 273, started decoding ...
Client   7, seq    7, junk =    0, prompt = 273, started decoding ...
Client   8, seq    8, junk =    0, prompt = 273, started decoding ...
Client   9, seq    9, junk =    0, prompt = 270, started decoding ...
/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c:1648: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed
(lldb) process attach --pid 23228
error: attach failed: tried to attach to process already being debugged
Process 23228 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00000001a01da388 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`__pthread_kill:
->  0x1a01da388 <+8>:  b.lo   0x1a01da3a8    ; <+40>
    0x1a01da38c <+12>: pacibsp 
    0x1a01da390 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x1a01da394 <+20>: mov    x29, sp
Target 0: (llama-parallel) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00000001a01da388 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x00000001a021388c libsystem_pthread.dylib`pthread_kill + 296
    frame #2: 0x00000001a011ca3c libsystem_c.dylib`abort + 124
    frame #3: 0x0000000101211554 libggml-base.dylib`ggml_abort(file="/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c", line=1648, fmt="GGML_ASSERT(%s) failed") at ggml.c:233:5
    frame #4: 0x0000000101213adc libggml-base.dylib`ggml_new_tensor_impl(ctx=0x00006000005fa040, type=GGML_TYPE_I32, n_dims=1, ne=0x000000016fdf6218, view_src=0x00000001304e0740, view_offs=0) at ggml.c:1648:5
    frame #5: 0x0000000101218e08 libggml-base.dylib`ggml_view_impl(ctx=0x00006000005fa040, a=0x00000001304e0740, n_dims=1, ne=0x000000016fdf6218, offset=0) at ggml.c:3477:35
    frame #6: 0x0000000101218dac libggml-base.dylib`ggml_view_1d(ctx=0x00006000005fa040, a=0x00000001304e0740, ne0=8, offset=0) at ggml.c:3495:35
    frame #7: 0x00000001016afe0c libllama.dylib`build_rs_inp_impl(ctx0=0x00006000005fa040, ubatch=0x000000016fdfaa08, mctx_cur=0x000060000354b430) at llama-graph.cpp:1839:25
    frame #8: 0x00000001016b0398 libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x000000013bf1f1d0) const at llama-graph.cpp:1910:21
    frame #9: 0x00000001017ea7dc libllama.dylib`llm_build_granite_hybrid::llm_build_granite_hybrid(this=0x000000013bf1f1d0, model=0x000000013e020e00, params=0x000000016fdf6c28) at llama-model.cpp:16190:22
    frame #10: 0x00000001017ea6b8 libllama.dylib`llm_build_granite_hybrid::llm_build_granite_hybrid(this=0x000000013bf1f1d0, model=0x000000013e020e00, params=0x000000016fdf6c28) at llama-model.cpp:16180:41
    frame #11: 0x0000000101775774 libllama.dylib`std::__1::__unique_if<llm_build_granite_hybrid>::__unique_single std::__1::make_unique[abi:ne190102]<llm_build_granite_hybrid, llama_model const&, llm_graph_params const&>(__args=0x000000013e020e00, __args=0x000000016fdf6c28) at unique_ptr.h:635:30
    frame #12: 0x0000000101770c48 libllama.dylib`llama_model::build_graph(this=0x000000013e020e00, params=0x000000016fdf6c28) const at llama-model.cpp:19824:23
    frame #13: 0x000000010164b180 libllama.dylib`llama_context::process_ubatch(this=0x0000000120a04080, ubatch=0x000000013bf1e1a0, gtype=LLM_GRAPH_TYPE_DECODER, mctx=0x0000600000369f40, ret=0x000000016fdfad74) at llama-context.cpp:758:20
    frame #14: 0x000000010164cb24 libllama.dylib`llama_context::decode(this=0x0000000120a04080, batch_inp=0x000000016fdfb600) at llama-context.cpp:1088:28
    frame #15: 0x0000000101652a68 libllama.dylib`llama_decode(ctx=0x0000000120a04080, batch=llama_batch @ 0x000000016fdfb600) at llama-context.cpp:2747:26
    frame #16: 0x000000010000410c llama-parallel`main(argc=11, argv=0x000000016fdfd3e0) at parallel.cpp:402:29
    frame #17: 0x000000019fe72b98 dyld`start + 6076
```

@gabe-l-hart (Collaborator)

Just confirmed that I don't hit these on master (81086cd6a)

@ggerganov force-pushed the gg/graph-mamba-reuse branch from 18212b0 to 2744d61, October 10, 2025 16:41
@gabe-l-hart (Collaborator)

Running cleanly with those reverts

@gabe-l-hart (Collaborator)

In case it's helpful, I was seeing it consistently on the second call to build_inp_mem_hybrid during the parallel portion of the test

Debug logs:

```
llama_kv_cache: size =  352.00 MiB (  4096 cells,   4 layers, 11/11 seqs), K (f16):  176.00 MiB, V (f16):  176.00 MiB
llama_memory_recurrent:      Metal RS buffer size =   811.72 MiB
llama_memory_recurrent: size =  811.72 MiB (    11 cells,  40 layers, 11 seqs), R (f32):   19.72 MiB, S (f32):  792.00 MiB
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x000000013380a160) const at llama-graph.cpp:1910:44
   1907	llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
   1908	    const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
   1909	
-> 1910	    auto inp_rs   = build_rs_inp_impl     (ctx0, ubatch, mctx_cur->get_recr());
   1911	    auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
   1912	
   1913	    auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132e09b70) const at llama-graph.cpp:1910:44
   1907	llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
   1908	    const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
   1909	
-> 1910	    auto inp_rs   = build_rs_inp_impl     (ctx0, ubatch, mctx_cur->get_recr());
   1911	    auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
   1912	
   1913	    auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132e09b70) const at llama-graph.cpp:1910:44
   1907	llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
   1908	    const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
   1909	
-> 1910	    auto inp_rs   = build_rs_inp_impl     (ctx0, ubatch, mctx_cur->get_recr());
   1911	    auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
   1912	
   1913	    auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
llama_context:      Metal compute buffer size =   256.67 MiB
llama_context:        CPU compute buffer size =    15.05 MiB
llama_context: graph nodes  = 2303
llama_context: graph splits = 3
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000102f04080) const at llama-graph.cpp:1910:44
   1907	llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
   1908	    const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
   1909	
-> 1910	    auto inp_rs   = build_rs_inp_impl     (ctx0, ubatch, mctx_cur->get_recr());
   1911	    auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
   1912	
   1913	    auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
2025-10-10 10:41:04.730362-0600 llama-parallel[35994:106223077] flock failed to lock list file (/var/folders/20/4th8f1dj2t15_21ygkdhskdc0000gn/C//com.apple.metal/16777235_419/functions.list): errno = 35
No new questions so proceed with build-in defaults.


main: initializing samplers with different RNG seeds, starting from -1
main: Simulating parallel requests from clients:
main: n_parallel = 10, n_sequences = 10, cont_batching = 1, system tokens = 256

Processing requests ...

main: clearing the KV cache
Client   0, seq    0, junk =    0, prompt = 267, started decoding ...
Client   1, seq    1, junk =    0, prompt = 267, started decoding ...
Client   2, seq    2, junk =    0, prompt = 267, started decoding ...
Client   3, seq    3, junk =    0, prompt = 270, started decoding ...
Client   4, seq    4, junk =    0, prompt = 273, started decoding ...
Client   5, seq    5, junk =    0, prompt = 267, started decoding ...
Client   6, seq    6, junk =    0, prompt = 273, started decoding ...
Client   7, seq    7, junk =    0, prompt = 273, started decoding ...
Client   8, seq    8, junk =    0, prompt = 273, started decoding ...
Client   9, seq    9, junk =    0, prompt = 270, started decoding ...
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132f0c4b0) const at llama-graph.cpp:1910:44
   1907	llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
   1908	    const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
   1909	
-> 1910	    auto inp_rs   = build_rs_inp_impl     (ctx0, ubatch, mctx_cur->get_recr());
   1911	    auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
   1912	
   1913	    auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001016b036c libllama.dylib`llm_graph_context::build_inp_mem_hybrid(this=0x0000000132f0c4b0) const at llama-graph.cpp:1910:44
   1907	llm_graph_input_mem_hybrid * llm_graph_context::build_inp_mem_hybrid() const {
   1908	    const auto * mctx_cur = static_cast<const llama_memory_hybrid_context *>(mctx);
   1909	
-> 1910	    auto inp_rs   = build_rs_inp_impl     (ctx0, ubatch, mctx_cur->get_recr());
   1911	    auto inp_attn = build_attn_inp_kv_impl(ctx0, ubatch, hparams, cparams, mctx_cur->get_attn());
   1912	
   1913	    auto inp = std::make_unique<llm_graph_input_mem_hybrid>(cparams, std::move(inp_attn), std::move(inp_rs), mctx_cur);
Target 0: (llama-parallel) stopped.
(lldb) c
Process 35994 resuming
/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c:1648: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed
(lldb) process attach --pid 35994
error: attach failed: tried to attach to process already being debugged
Process 35994 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x0000000101211550 libggml-base.dylib`ggml_abort(file="/Users/ghart/Projects/github/ggml-org/llama.cpp/ggml/src/ggml.c", line=1648, fmt="GGML_ASSERT(%s) failed") at ggml.c:233:5
   230 	        ggml_print_backtrace();
   231 	    }
   232 	
-> 233 	    abort();
   234 	}
   235 	
   236 	// ggml_print_backtrace is registered with std::set_terminate by ggml.cpp
Target 0: (llama-parallel) stopped.
```

@ggerganov (Member Author)

So the change in 00f115f does not work for some reason. We want to eventually extract the state of the recurrent memory into the memory context as we do with the KV cache implementations. But I think there is something being mutated when it should not be. For now, let's revert this and figure it out later.

To clarify, the design is that when building the graph we should only reference data that is stored in the memory context (i.e. in llama_memory_recurrent_context), and not in the memory itself (i.e. in llama_memory_recurrent). Except for some constant members such as the ggml tensors for example.
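A hedged sketch of that pattern (hypothetical and simplified, not the actual llama.cpp classes): the context copies whatever the graph needs at construction time and exposes it through const getters, and graph-building code only reads those copies.

```cpp
#include <cstdint>

// Hedged sketch (hypothetical/simplified): the memory context snapshots the
// values the graph needs when it is created, so graph-building code never
// reaches back into the mutable memory object.
class recurrent_context_sketch {
public:
    recurrent_context_sketch(uint32_t head, int32_t rs_z, uint32_t n_rs)
        : head_(head), rs_z_(rs_z), n_rs_(n_rs) {}

    uint32_t get_head() const { return head_; }
    int32_t  get_rs_z() const { return rs_z_; }
    uint32_t get_n_rs() const { return n_rs_; }

private:
    // immutable for the lifetime of this context
    const uint32_t head_;
    const int32_t  rs_z_;
    const uint32_t n_rs_;
};
```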

@gabe-l-hart (Collaborator)

Got it, that makes sense.

@ggerganov force-pushed the gg/graph-mamba-reuse branch from 2744d61 to 16d57ca, October 11, 2025 13:55
@ggerganov force-pushed the gg/graph-mamba-reuse branch from 16d57ca to 7641e6f, October 13, 2025 20:08