Conversation

@gabe-l-hart (Collaborator)

Closes #16768

cc @leok7v

Description

This PR addresses a context-shift failure that occurs when a hybrid-recurrent model hits its context limit and attempts to perform a context shift. The main change is to loosen the restriction in llama_memory_recurrent::seq_rm so that a partial erasure is refused only if the erased range includes the final token in the sequence. Since recurrent states are fixed-size, any partial erasure that does not include the final token can be treated as a no-op.
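
As an illustration, here is a minimal sketch of the loosened rule. It is not the actual llama.cpp code: the helper name, the result enum, and the assumption of a half-open [p0, p1) erasure range with last_pos as the position of the last token folded into the state are all illustrative.

// Sketch only: decide what a recurrent seq_rm can do for a fixed-size state.
enum class seq_rm_result { ok_clear, ok_noop, refuse };

seq_rm_result recurrent_seq_rm_sketch(int last_pos, int p0, int p1) {
    if (p0 <= 0 && p1 > last_pos) {
        return seq_rm_result::ok_clear; // whole sequence erased: just zero the state
    }
    if (p1 <= last_pos || p0 > last_pos) {
        return seq_rm_result::ok_noop;  // final token untouched: no per-token memory exists, so nothing to do
    }
    return seq_rm_result::refuse;       // keeps a prefix but drops the tail: would require a state rollback
}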

Testing

To validate the result, you can use the following command, which artificially limits the context length to force a context shift:

# You can use any granite-4.0 model here
./bin/llama-cli -m ggml-org/granite-4.0-h-small-Q8_0-GGUF --jinja -c 100 --context-shift -p "tell me a story"

Without this fix, the run fails with init_batch: failed to prepare attention ubatches; with the fix, generation continues past the context shift and the output remains relevant to the previous context.

The recurrent state is always assumed to be the state as of the last update
from the final token in the sequence. When doing a partial erasure, if the
range does not include the final token, the erasure can be considered a
success since any memory used for the sequence prior to the final token
(which is no memory) has been successfully removed.

There is one potential case this doesn't address: pruning the cache to remove
sensitive data from the context. That wouldn't work for a partial (mid-sequence)
removal from the attention cache either, since the KV state is linearly dependent
and states at later sequence positions would still be based on the sensitive data
even once it is no longer cached, so I don't think it is relevant here. It is
worth noting, though, that the semantics of this change for a partial erasure in
the middle of the cache are essentially "my context is already compressed" and
not "all trace of the removed tokens has been removed."

ggml-org#16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <[email protected]>
This prefix matching is explicitly attempting to remove the tokens at the
end of the sequence that don't match. This is the operation that can't be
performed on a recurrent cache due to the state being updated in place, so
if this removal fails, we need to clear the whole cache.

ggml-org#16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <[email protected]>
// before: partial intersection is invalid
if ((0 < p0 && p0 < cell.pos) || (0 < p1 && p1 <= cell.pos)) {

// after: partial intersection is invalid if it includes the final pos
if ((0 < p0 && p0 <= cell.pos && p1 > cell.pos)) {
Member

Why do we check strictly larger than 0 rather than 0 <= p0?

Collaborator Author

If the range includes the first token and extends beyond the last token, it's no longer a partial intersection: the whole sequence can simply be cleared, which is a valid operation here since the answer is just to zero out everything. The case this is trying to catch is an attempt to go part-way back in the sequence, which is invalid since those intermediate states are not preserved.

Collaborator

> Why do we check strictly larger than 0 rather than 0 <= p0?

Because when p0 == 0 (and p1 > cell.pos), then the sequence can be removed entirely, since there is no partial intersection.

p0 larger than 0 means there's at least one remaining token in the context, which would require state rollback.
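
To make the cases concrete, here is a small standalone example; the positions are made up, and it assumes the half-open [p0, p1) convention with cell.pos being the final position held by the recurrent state.

#include <cassert>

// The new check, copied from the diff above (illustrative wrapper).
static bool partial_intersection_invalid(int pos, int p0, int p1) {
    return 0 < p0 && p0 <= pos && p1 > pos;
}

int main() {
    const int pos = 9;                                 // final position held by the state
    assert(!partial_intersection_invalid(pos, 0, 10)); // whole sequence: valid, clears the state
    assert(!partial_intersection_invalid(pos, 0,  5)); // final token untouched: valid no-op
    assert(!partial_intersection_invalid(pos, 3,  7)); // mid-sequence range: valid no-op
    assert( partial_intersection_invalid(pos, 5, 10)); // keeps a prefix, drops the tail: invalid
    return 0;
}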

@gabe-l-hart (Collaborator Author) Nov 5, 2025

(jinx!)

((edit: I hope that translates correctly as "we said the same thing at the same time" to people that don't have little kids))

@CISC (Collaborator) Nov 6, 2025

Snap! is the British equivalent. :)

Edit: and for some reason the Norwegian one is "skylder meg en kald cola" (owe me a cold coke).

Signed-off-by: Gabe Goodhart <[email protected]>

Co-authored-by: Georgi Gerganov <[email protected]>
Comment on lines 357 to 359
if (!llama_memory_seq_rm(mem, -1, n_matching_session_tokens, -1)) {
    LOG_INF("%s: unable to reuse common prefix\n", __func__);
    llama_memory_seq_rm(mem, -1, -1, -1);
@compilade (Collaborator) Nov 5, 2025

I'm wondering if this will correctly trigger reevaluation of the context afterwards.

n_matching_session_tokens seems to be used mostly for cosmetic purposes except when it's equal to embd_inp.size(), so maybe it could be set to 0.

But I think there might be something else necessary here. (For example, should session_tokens be cleared?)

(This is mostly relevant when reloading a saved session, and not necessarily during context shifting, though)
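
For context, n_matching_session_tokens counts how far the saved session tokens and the new prompt agree. Roughly (a sketch, not the exact main.cpp code; session_tokens and embd_inp are the saved-session and prompt token vectors):

// Sketch: length of the common prefix between the saved session and the new prompt.
size_t n_matching_session_tokens = 0;
for (llama_token id : session_tokens) {
    if (n_matching_session_tokens >= embd_inp.size() || id != embd_inp[n_matching_session_tokens]) {
        break;
    }
    n_matching_session_tokens++;
}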

Collaborator Author

Hm, yeah, I'm still a little murky on the logic going on here. I'll look more carefully and see what I can find.

Sorry folks,
I am new to managing commit merges.
b8f9c56923ab34bf24d4fe5476efe9b02736918d
Who presses the merge button for this commit, and when, once all the checks have passed?

Collaborator Author

I think we're holding on this until I resolve the question in this thread. I'll try to get to that today!

Collaborator Author

Ok, I see what you're getting at here with the clause below that recalculates the cached logits for the final token. I think the right thing to do here is to set n_matching_session_tokens to 0 since we're essentially saying there are no matching tokens anymore.
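
Roughly, the resolved fallback would then look like this (a sketch, not the exact main.cpp code):

if (!llama_memory_seq_rm(mem, -1, n_matching_session_tokens, -1)) {
    // partial removal isn't supported by this memory type (e.g. recurrent cache):
    // drop the whole cache and treat the session as having no reusable prefix
    LOG_INF("%s: unable to reuse common prefix\n", __func__);
    llama_memory_seq_rm(mem, -1, -1, -1);
    n_matching_session_tokens = 0;
}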

ggml-org#16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart (Collaborator Author)

I've seen this same set of tests failing on a bunch of PRs, so I'm hoping this is just runner instability?

@CISC (Collaborator) commented Nov 6, 2025

> I've seen this same set of tests failing on a bunch of PRs, so I'm hoping this is just runner instability?

Yes. (though I've never seen SUM fail before)

@CISC (Collaborator) commented Nov 8, 2025

Is this awaiting final approval, or is it good to merge?

@leok7v left a comment

Looks good to me. Tested. Works.

@gabe-l-hart (Collaborator Author)

@CISC @ggerganov I think with the current approvals, this should be good to go unless you have other outstanding concerns.

@ggerganov merged commit 0c74f32 into ggml-org:master on Nov 10, 2025 (65 of 71 checks passed).
@gabe-l-hart deleted the HybridContextShift-16768 branch on November 10, 2025 at 17:08.
@pwilkin (Collaborator) commented Nov 10, 2025

@CISC Don't know if it matters, but I ran into SUM failing for me when doing backwards pass gradient tests for EXPM1 and SOFTPLUS.


Successfully merging this pull request may close these issues:
The result of bool llama_memory_seq_rm() is not checked

6 participants