Conversation

@gabe-l-hart (Collaborator)

Closes #16768

cc @leok7v

Description

This PR addresses a context-shift failure that occurs when a hybrid-recurrent model hits its context limit and attempts to perform a context shift. The main change is to loosen the restriction in llama_memory_recurrent::seq_rm so that a partial erasure is refused only if the erased range includes the final token in the sequence. Since recurrent states are fixed-size, any partial erasure that does not include the final token can be treated as a no-op.
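
As an illustration, here is a minimal sketch of the loosened rule. It is not the actual llama.cpp code: the helper name, the result enum, and the assumption of a half-open [p0, p1) erasure range with last_pos as the position of the last token folded into the state are all illustrative.

// Sketch only: decide what a recurrent seq_rm can do for a fixed-size state.
enum class seq_rm_result { ok_clear, ok_noop, refuse };

seq_rm_result recurrent_seq_rm_sketch(int last_pos, int p0, int p1) {
    if (p0 <= 0 && p1 > last_pos) {
        return seq_rm_result::ok_clear; // whole sequence erased: just zero the state
    }
    if (p1 <= last_pos || p0 > last_pos) {
        return seq_rm_result::ok_noop;  // final token untouched: no per-token memory exists, so nothing to do
    }
    return seq_rm_result::refuse;       // keeps a prefix but drops the tail: would require a state rollback
}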

Testing

To validate the result, you can use the following command, which artificially limits the context length to force a context shift:

# You can use any granite-4.0 model here
./bin/llama-cli -m ggml-org/granite-4.0-h-small-Q8_0-GGUF --jinja -c 100 --context-shift -p "tell me a story"

Without this fix, the run fails with init_batch: failed to prepare attention ubatches; with the fix, generation continues past the context shift and the output remains relevant to the previous context.

The recurrent state is always assumed to be the state as of the last update
from the final token in the sequence. When doing a partial erasure, if the
range does not include the final token, the erasure can be considered a
success since any memory used for the sequence prior to the final token
(which is no memory) has been successfully removed.

There is one potential case this doesn't address: pruning the cache to remove
sensitive data from the context. That wouldn't work for a partial (mid-sequence)
removal from the attention cache either, since the KV state is linearly dependent
and states at later sequence positions would still be based on the sensitive data
even once it is no longer cached, so I don't think it is relevant here. It is
worth noting, though, that the semantics of this change for a partial erasure in
the middle of the cache are essentially "my context is already compressed" and
not "all trace of the removed tokens has been removed."

ggml-org#16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <[email protected]>
This prefix matching is explicitly attempting to remove the tokens at the
end of the sequence that don't match. This is the operation that can't be
performed on a recurrent cache due to the state being updated in place, so
if this removal fails, we need to clear the whole cache.

ggml-org#16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <[email protected]>
// before: partial intersection is invalid
if ((0 < p0 && p0 < cell.pos) || (0 < p1 && p1 <= cell.pos)) {

// after: partial intersection is invalid if it includes the final pos
if ((0 < p0 && p0 <= cell.pos && p1 > cell.pos)) {
Member

Why do we check strictly larger than 0 rather than 0 <= p0?

Collaborator Author

If the range includes the first token and extends beyond the last token, it's no longer a partial intersection: the whole sequence can simply be cleared, which is a valid operation here since the answer is just to zero out everything. The case this is trying to catch is an attempt to go part-way back in the sequence, which is invalid since those intermediate states are not preserved.

Collaborator

> Why do we check strictly larger than 0 rather than 0 <= p0?

Because when p0 == 0 (and p1 > cell.pos), then the sequence can be removed entirely, since there is no partial intersection.

p0 larger than 0 means there's at least one remaining token in the context, which would require state rollback.
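
To make the cases concrete, here is a small standalone example; the positions are made up, and it assumes the half-open [p0, p1) convention with cell.pos being the final position held by the recurrent state.

#include <cassert>

// The new check, copied from the diff above (illustrative wrapper).
static bool partial_intersection_invalid(int pos, int p0, int p1) {
    return 0 < p0 && p0 <= pos && p1 > pos;
}

int main() {
    const int pos = 9;                                 // final position held by the state
    assert(!partial_intersection_invalid(pos, 0, 10)); // whole sequence: valid, clears the state
    assert(!partial_intersection_invalid(pos, 0,  5)); // final token untouched: valid no-op
    assert(!partial_intersection_invalid(pos, 3,  7)); // mid-sequence range: valid no-op
    assert( partial_intersection_invalid(pos, 5, 10)); // keeps a prefix, drops the tail: invalid
    return 0;
}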

@gabe-l-hart (Collaborator Author) Nov 5, 2025

(jinx!)

((edit: I hope that translates correctly as "we said the same thing at the same time" to people that don't have little kids))

@CISC (Collaborator) Nov 6, 2025

Snap! is the British equivalent. :)

Edit: and for some reason the Norwegian one is "skylder meg en kald cola" (owe me a cold coke).

Signed-off-by: Gabe Goodhart <[email protected]>

Co-authored-by: Georgi Gerganov <[email protected]>
Comment on lines 357 to 359
if (!llama_memory_seq_rm(mem, -1, n_matching_session_tokens, -1)) {
    LOG_INF("%s: unable to reuse common prefix\n", __func__);
    llama_memory_seq_rm(mem, -1, -1, -1);
@compilade (Collaborator) Nov 5, 2025

I'm wondering if this will correctly trigger reevaluation of the context afterwards.

n_matching_session_tokens seems to be used mostly for cosmetic purposes except when it's equal to embd_inp.size(), so maybe it could be set to 0.

But I think there might be something else necessary here. (For example, should session_tokens be cleared?)

(This is mostly relevant when reloading a saved session, and not necessarily during context shifting, though)
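
For context, n_matching_session_tokens counts how far the saved session tokens and the new prompt agree. Roughly (a sketch, not the exact main.cpp code; session_tokens and embd_inp are the saved-session and prompt token vectors):

// Sketch: length of the common prefix between the saved session and the new prompt.
size_t n_matching_session_tokens = 0;
for (llama_token id : session_tokens) {
    if (n_matching_session_tokens >= embd_inp.size() || id != embd_inp[n_matching_session_tokens]) {
        break;
    }
    n_matching_session_tokens++;
}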

Collaborator Author

Hm, yeah, I'm still a little murky on the logic going on here. I'll look more carefully and see what I can find.

Sorry folks,
I am new to managing commit merges.
b8f9c56923ab34bf24d4fe5476efe9b02736918d
Who presses the merge button for this commit, and when, once all the checks have passed?

Collaborator Author

I think we're holding on this until I resolve the question in this thread. I'll try to get to that today!

Collaborator Author

Ok, I see what you're getting at here with the clause below that recalculates the cached logits for the final token. I think the right thing to do here is to set n_matching_session_tokens to 0 since we're essentially saying there are no matching tokens anymore.
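
Roughly, the resolved fallback would then look like this (a sketch, not the exact main.cpp code):

if (!llama_memory_seq_rm(mem, -1, n_matching_session_tokens, -1)) {
    // partial removal isn't supported by this memory type (e.g. recurrent cache):
    // drop the whole cache and treat the session as having no reusable prefix
    LOG_INF("%s: unable to reuse common prefix\n", __func__);
    llama_memory_seq_rm(mem, -1, -1, -1);
    n_matching_session_tokens = 0;
}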

ggml-org#16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart (Collaborator Author)

I've seen this same set of tests failing on a bunch of PRs, so I'm hoping this is just runner instability?

@CISC (Collaborator) commented Nov 6, 2025

> I've seen this same set of tests failing on a bunch of PRs, so I'm hoping this is just runner instability?

Yes. (though I've never seen SUM fail before)

@CISC (Collaborator) commented Nov 8, 2025

Is this awaiting final approval, or is it good to merge?

@leok7v left a comment

Looks good to me. Tested. Works.

@gabe-l-hart (Collaborator Author)

@CISC @ggerganov I think with the current approvals, this should be good to go unless you have other outstanding concerns.

@ggerganov merged commit 0c74f32 into ggml-org:master on Nov 10, 2025 (65 of 71 checks passed).
@gabe-l-hart deleted the HybridContextShift-16768 branch on November 10, 2025 at 17:08.
@pwilkin (Collaborator) commented Nov 10, 2025

@CISC Don't know if it matters, but I ran into SUM failing for me when doing backwards pass gradient tests for EXPM1 and SOFTPLUS.


Successfully merging this pull request may close these issues:
The result of bool llama_memory_seq_rm() is not checked

6 participants