Conversation

saood06 (Collaborator) commented Aug 30, 2025

This is my attempt to address ggml-org/llama.cpp#11970

It has some similarities to ggml-org/llama.cpp#12067, but it is not a port; it is implemented differently.

It matches the tokens in the cache directly against the prompt string.

I do not know how to reliably reproduce the issue, so I'm not sure whether this fixes it.

@vnicolici Do you mind testing?

ikawrakow (Owner)

@saood06 Is this a draft because it is not ready, or because you are not sure that it works?

saood06 (Collaborator, Author) commented Sep 2, 2025

@ikawrakow I think it is ready (it is fully implemented and conceptually makes sense to me).

This is meant to fix situations where TG generates something as two tokens that PP later consolidates into one, which forces reprocessing from that point. I haven't tested by forcing that situation yet to see whether my PR actually fixes the bug, but I did compile it and run normal inference briefly, and that worked, so there appear to be no regressions.

Edit: To clarify, my implementation matches the common part of the existing tokens in the cache directly against the prompt string. Before this change, the code determined the slot by matching the common part of the prompt string against the cached part, but it then only reused the tokens that matched between the cache and the tokenized version of the prompt, which in extreme situations could lead to far less reuse than just the expected issues with token and string boundaries.
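
A minimal sketch of that string-matching idea, assuming a hypothetical `detokenize(token)` helper rather than the actual server code: walk the cached tokens and keep them only while their concatenated text is still a prefix of the incoming prompt string.

```cpp
#include <string>
#include <vector>

// Hypothetical helper for illustration: convert one token id back to its text piece.
std::string detokenize(int token);

// Count how many leading cached tokens can be reused for `prompt`, by matching
// their detokenized text directly against the prompt string.
size_t common_prefix_by_string(const std::vector<int> & cache_tokens,
                               const std::string      & prompt) {
    size_t n_matched = 0; // number of cached tokens we can keep
    size_t pos       = 0; // how much of the prompt string they cover
    for (int token : cache_tokens) {
        const std::string piece = detokenize(token);
        // stop as soon as the cached text stops being a prefix of the prompt
        if (prompt.compare(pos, piece.size(), piece) != 0) {
            break;
        }
        pos += piece.size();
        ++n_matched;
    }
    return n_matched;
}
```

Only whole cached tokens are counted; a cached token whose text only partially overlaps the prompt is dropped, since it could not be reused as-is anyway.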

ikawrakow (Owner)

I understand the implementation and it LGTM.

But I wonder if it wouldn't be better to first try the tokens directly. If the full prompt is matched, there would be no need to do string matching, and the behavior will be the same as before. If we find a shorter match, one can do string matching from the point of disagreement.

saood06 (Collaborator, Author) commented Sep 3, 2025

> But I wonder if it wouldn't be better to first try the tokens directly. If the full prompt is matched, there would be no need to do string matching, and the behavior will be the same as before.

I can see how that might result in better performance for the average user.

> If we find a shorter match, one can do string matching from the point of disagreement.

But I don't think starting from the point of disagreement makes sense: if you compare tokens directly and everything matches, you don't need to detokenize anything; but in order to know where the point of disagreement falls in the string, wouldn't you need to do string matching anyway, at which point comparing tokens becomes redundant?

ikawrakow (Owner)

So, my understanding is that the tokens -> text conversion is not necessarily reversible, i.e., after text -> tokens', tokens' may not be the same as tokens. But if tokens[j] == tokens'[j] up to j_match, then one can convert tokens'[j_match...end] to string and do string matching from there. No? This may not be more efficient than just matching strings as we now need two conversions (text -> tokens' and then tokens'[j_match...] -> text'), but is guaranteed to be not worse than the existing implementation.

Alternatively, one can do both token matching and string matching and take the result with the greater matched length.

All of this is only to avoid a potential failure mode in the string-matching approach that neither of us currently sees.
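
A rough sketch of that two-stage idea, again with hypothetical `tokenize`/`detokenize` helpers rather than the actual server code: compare token ids first, and fall back to string matching from the point of disagreement only if the token match stops short.

```cpp
#include <string>
#include <vector>

// Hypothetical helpers, for illustration only.
std::string      detokenize(int token);              // token id -> text piece
std::vector<int> tokenize(const std::string & text); // text -> token ids

// Number of cached tokens reusable for `prompt`: token matching first,
// then string matching from the point of disagreement.
size_t common_prefix_hybrid(const std::vector<int> & cache_tokens,
                            const std::string      & prompt) {
    const std::vector<int> prompt_tokens = tokenize(prompt);

    // Stage 1: cheap token-by-token comparison.
    size_t j_match = 0;
    while (j_match < cache_tokens.size() && j_match < prompt_tokens.size() &&
           cache_tokens[j_match] == prompt_tokens[j_match]) {
        ++j_match;
    }
    if (j_match == cache_tokens.size() || j_match == prompt_tokens.size()) {
        return j_match; // full match: same result as the token-only path
    }

    // Stage 2: detokenize the remaining prompt tokens and match the remaining
    // cached tokens' text against that string.
    std::string rest;
    for (size_t j = j_match; j < prompt_tokens.size(); ++j) {
        rest += detokenize(prompt_tokens[j]);
    }
    size_t pos = 0;
    for (size_t j = j_match; j < cache_tokens.size(); ++j) {
        const std::string piece = detokenize(cache_tokens[j]);
        if (rest.compare(pos, piece.size(), piece) != 0) {
            break; // cached text diverges from the prompt here
        }
        pos += piece.size();
        ++j_match;
    }
    return j_match;
}
```

The alternative above, matching both tokens and strings and taking the greater matched length, would amount to returning the maximum of this count and the string-only count.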
