Fix "common_part" selection in server #742
base: main
Conversation
@saood06 Is this a draft because it is not ready, or because you are not sure that it works?
@ikawrakow I think it is ready (it is fully implemented and conceptually makes sense to me). This is meant to fix situations where TG generates something in two tokens that PP later consolidates into one, which forces reprocessing from that point. I haven't tested by forcing that situation yet to confirm my PR actually fixes the bug, but I did compile and briefly run normal inference with it, and that worked, so there seem to be no regressions.

Edit: To clarify, my implementation matches the common part of the existing tokens in the cache against the prompt string directly. Before this, the code would select the slot by matching the common part of the prompt string against the cached string, but would then only reuse tokens up to the common prefix of the cached tokens and the tokenized prompt, which in extreme situations could lead to far less reuse than the expected issues with token and string boundaries alone would explain.
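For concreteness, here is a minimal, self-contained sketch of the string-based matching described above; it is not the PR's actual code, and `token_to_piece`, `common_part_by_string`, and the toy vocabulary are all hypothetical stand-ins (in llama.cpp the detokenization step would go through `llama_token_to_piece()`). The example in `main` shows the failure mode from the first paragraph: generation emits "hel" + "lo", while retokenizing the prompt yields a single "hello" token, so token-level prefix matching finds nothing to reuse even though string matching can reuse the whole cache.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Toy stand-in for detokenization. In llama.cpp this would wrap
// llama_token_to_piece(); the vocabulary here is purely hypothetical.
static const std::vector<std::string> vocab = {"hel", "lo", "hello", " world"};

static std::string token_to_piece(int token) {
    return vocab[token];
}

// Count how many cached tokens are covered by a prefix of `prompt`:
// walk the cached tokens, detokenize each one, and check that the
// accumulated text is still a prefix of the prompt string.
static size_t common_part_by_string(const std::vector<int> & cache_tokens,
                                    const std::string      & prompt) {
    size_t n_matched = 0; // cached tokens whose text has matched so far
    size_t pos       = 0; // current character position in the prompt
    for (int tok : cache_tokens) {
        const std::string piece = token_to_piece(tok);
        if (prompt.compare(pos, piece.size(), piece) != 0) {
            break; // cached text diverges from the prompt here
        }
        pos += piece.size();
        ++n_matched;
    }
    return n_matched;
}

int main() {
    // Generation produced "hel" + "lo" + " world" (tokens 0, 1, 3), but
    // retokenizing the full prompt yields "hello" + " world" (tokens 2, 3).
    const std::vector<int> cache_tokens = {0, 1, 3};
    const std::string      prompt       = "hello world";

    // Token-level prefix matching fails at the first token (0 != 2), but
    // the string-level match reuses all three cached tokens.
    std::cout << common_part_by_string(cache_tokens, prompt) << "\n"; // 3
}
```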
I understand the implementation and it LGTM. But I wonder if it wouldn't be better to first try the tokens directly. If the full prompt is matched, there would be no need to do string matching, and the behavior would be the same as before. If we find a shorter match, one could do string matching from the point of disagreement.
I can see how that might result in better performance for the average user.
But I don't think starting from the point of disagreement makes sense: if you compare tokens directly and everything matches, you don't need to detokenize anything. But to know where the point of disagreement falls in the string, wouldn't you need to do string matching anyway, at which point comparing tokens becomes redundant?
Alternatively, one can match both tokens and strings and take the result with the greater matched length. All of this is only to guard against a potential failure mode in the string matching approach that we both don't see.
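A sketch of this hybrid, under the same assumptions as above (and reusing the hypothetical `common_part_by_string` from the earlier sketch): try the cheap token-level comparison first, and only fall back to string matching when the token match does not cover the whole cache.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper from the sketch above.
size_t common_part_by_string(const std::vector<int> & cache_tokens,
                             const std::string      & prompt);

// Hybrid matching: token-level comparison first, string matching only
// on a partial match, keeping whichever reuses more of the cache.
size_t common_part_hybrid(const std::vector<int> & cache_tokens,
                          const std::vector<int> & prompt_tokens,
                          const std::string      & prompt) {
    // Token-level longest common prefix (the pre-PR behavior).
    size_t n_tok = 0;
    while (n_tok < cache_tokens.size() && n_tok < prompt_tokens.size() &&
           cache_tokens[n_tok] == prompt_tokens[n_tok]) {
        ++n_tok;
    }
    // If every cached token matched, string matching cannot do better:
    // it can reuse at most all of the cached tokens.
    if (n_tok == cache_tokens.size()) {
        return n_tok;
    }
    // Otherwise also run the string match and take the longer of the two.
    return std::max(n_tok, common_part_by_string(cache_tokens, prompt));
}
```

Taking the maximum of the two results is purely defensive: whenever the tokens match, their detokenized text matches too, so the string match should never be shorter, but the extra check costs nothing.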
This is my attempt to address ggml-org/llama.cpp#11970
It has some similarities to ggml-org/llama.cpp#12067, but it is not a port; it is implemented differently.
It matches the tokens in the cache directly against the string passed in the prompt.
I do not know how to reliably reproduce the issue, so I'm not sure whether it fixes it.
@vnicolici Do you mind testing?