sampling: add K-Shift sampler #10048
base: master
Conversation
@p-e-w May I ask you for a review, and maybe even testing, please? While this sampler is very simple by itself, it has quite a strong effect in practice and can be useful as an additional control measure in creative sampling. I've tested K-Shift with and without XTC, and it looks like they can work together quite nicely - you just need to keep in mind how far the first cutout may go.
I am currently sick and will be off the computer for a few days, but I intend to do a full review of this interesting PR soon.
Get well! In the meantime I will be testing K-Shift further to gather more data on different models (tested Nemo/Mistral Small/Gemma 2 - all behave differently so far).
@MaggotHATE Hi, just a doubt. Does this sampler work like this: select the top n tokens at the beginning, then do a greedy decode for each of them and select the beam with the highest probability? Will this increase the decode time, or does streaming need to be disabled? Or can the alternative beams be decoded in parallel? I did try to read the code, but I am not too familiar with the llama.cpp API, so I ended up asking. Please pardon my ignorance.
There are no alternative beams in K-Shift - that would be CoT-decoding, the main subject of this paper. K-Shift simply chooses the exact path at the start of inference ("Decoding step 0") by cutting out the top k candidates once. I plan on implementing CoT-decoding in a different sampler, but I imagine it would be quite a bulky solution within llama.cpp.
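To make the mechanism concrete, here is a minimal conceptual sketch (illustrative names, not the exact code from this PR), assuming the candidate list is already sorted by logit:

```cpp
#include "llama.h"

// Conceptual sketch of K-Shift (not the PR's actual code): on the very first
// sampling step, drop the top k candidates once, then never touch the
// distribution again for the rest of the dialog.
struct k_shift_sketch {
    int32_t k       = 5;     // how many top candidates to cut out
    bool    applied = false; // guards the one-shot behavior
};

static void k_shift_sketch_apply(k_shift_sketch & st, llama_token_data_array * cur_p) {
    if (st.applied || st.k <= 0 || st.k >= (int32_t) cur_p->size) {
        return;
    }
    // assumes candidates are sorted by logit, highest first
    cur_p->data += st.k;
    cur_p->size -= st.k;
    st.applied = true;
}
```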
I just realized that the sampler state is now reset after each message. @slaren Is this intended?
I don't know, I am not sure when that was added, but I think it makes sense. What's the downside of resetting the sampler state after each message? I would think that you wouldn't want to apply the repetition penalties etc. of the previous message to the next message. cc @ggerganov
The only downside I see is tracking switches/states within samplers themselves in cases where a sampler should be applied only once (like K-Shift, for example). On reset, either the sampler will be applied again, or, without a custom reset function, we won't be able to revert the switch without deleting the sampler object.
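To illustrate the two options, here is a rough sketch, assuming the sampler keeps a one-shot switch in its context; the names are not the PR's actual identifiers:

```cpp
#include "llama.h"

// Illustrative context for a one-shot sampler such as K-Shift.
struct k_shift_ctx_sketch {
    int32_t k;
    bool    k_set; // true once the shift has been applied
};

// Option A: a custom reset re-arms the switch, so the sampler fires again
// after every llama_sampler_reset (e.g. once per message if the chain is
// reset between messages).
static void k_shift_reset_rearm(struct llama_sampler * smpl) {
    auto * ctx = (k_shift_ctx_sketch *) smpl->ctx;
    ctx->k_set = false;
}

// Option B: no custom reset - the switch survives resets, so the sampler
// fires exactly once per sampler object and can only be re-armed by
// recreating the sampler.
```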
Removing the …

However, what the paper is talking about cannot be implemented with a sampler alone. The paper is talking about generating k different sequences for the response, each starting with a different token, and then aggregating the results. That would be interesting to implement in an example as a proof of concept, but as it is, I don't think that this sampler would be useful by itself without the rest of the algorithm. A bonus would be implementing this using multiple sequences to generate all the responses at the same time in parallel.
Alright, I will revert it back then. In recent tests it was still coherent even with the reset. Still, it would be nice to have a way to trigger it once per session. Is that even possible in the current sampler chain implementation?
I've tested it in practice, and it actually works quite well by itself. In a way, it works similarly to XTC, but under stricter conditions. That alone makes K-Shift more compatible with greedy sampling. As for the main method in the paper, it is interesting, but it will likely become another example app with no prospects of being in …
The technique described in the paper is indeed very promising, but I have to agree with @slaren: This sampler is not very useful without the algorithmic framework outlined by the paper.
The problem is that unlike XTC, K-Shift sampling fails to make any guarantees about token probabilities. The 5th-highest token might have a probability of 7.3% (in which case it could be an interesting path to explore) or 0.000000032% (in which case it is almost certainly garbage), and the sampler cannot distinguish between the two. Thus this is a trial-and-error sampler for which you have to do trial-and-error every time you use it.
To make K-Shift sampling useful, it needs either
- an implementation of the confidence-maximizing beam search strategy described in the paper, or
- additional parameters that allow it to take probability magnitudes into account.
As it stands, all this sampler does is choose the first word of the output for you, which you can do yourself by just adding it to the input, in about as much time as it takes to fiddle with the parameter.
src/llama-sampling.cpp
Outdated
|
|
||
| if (ctx->k_set == true | ||
| || ctx->k <= 0 | ||
| || ctx->k >= (int) cur_p->size) { |
There appears to be a bug here: if, at the first token position, ctx->k >= (int) cur_p->size (e.g. because a preceding truncation sampler has already removed too many tokens), then we return and k_set remains false. This means that K-Shift will only take effect on the second (or later) token of the output, violating its contract.
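If the intended contract is "apply at decoding step 0 or not at all", one possible fix (sketched here as a suggestion, not tested) is to consume the one-shot switch before the size check:

```cpp
// Suggested sketch: consume the one-shot switch first, so that a failed
// first attempt disables K-Shift instead of deferring it to a later token.
if (ctx->k_set == true || ctx->k <= 0) {
    return;
}
ctx->k_set = true; // mark as handled regardless of whether the shift succeeds
if (ctx->k >= (int) cur_p->size) {
    return; // k out of range for the current candidate list - skip the shift
}
```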
// shift to a token #[k]
cur_p->data += k;
cur_p->size -= k;
This does not match the paper, in which (AFAICT) exactly the k-th token is selected, rather than sampling from all tokens except the top k-1 ones.
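For comparison, a rough sketch of the paper's behavior as described here (select exactly the k-th candidate at step 0 and nothing else), as opposed to the range shift above; this is illustrative only, not part of the PR:

```cpp
// Paper-style selection (sketch): keep only the k-th candidate, so the final
// step in the chain has no other choice. The PR instead shifts the window,
// leaving candidates k, k+1, ... available to later samplers.
cur_p->data += k;  // jump to the k-th candidate (0-based in this sketch)
cur_p->size  = 1;  // collapse the distribution to that single token
```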
The choice of token will be handled by the final step in the sampling queue, so greedy sampling would be needed to match the effect described in the paper. Considering that the dedicated greedy sampler was removed recently, I don't think introducing another "final step" sampler would be OK.
...which is something no other sampler can do for now.
The problem is that whatever you add into the input might not match the natural flow of things for the model: you would either have to look at the candidates presented by the model at the first step, or just force something you need while ignoring the candidates. Having worked with this for a while now, I see the benefit of K-Shift as a simple guidance that is more interesting and effective than adjusting suffixes (or adjusting completion results) every time you need a specific start of the output. It's good to have a guaranteed result, but it doesn't always work as we want - otherwise the paper wouldn't exist, and CoT instructions would have a guaranteed effect. I think I'll look at probability control though; it might be interesting.
* the parameter limits how far K-Shift cuts by checking the probability of the last token and iterating backwards if it's not probable enough
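A rough sketch of one reading of that limit (the parameter name min_prob is an assumption, since the actual identifier is not visible in this thread); it assumes softmax has already been applied so that cur_p->data[i].p is populated:

```cpp
// Sketch of the probability limit (hypothetical parameter name: min_prob).
// Step back from the requested k until the candidate the shift would land on
// is probable enough, so the cut never starts from a garbage token.
int32_t k = ctx->k;
while (k > 0 && cur_p->data[k].p < ctx->min_prob) {
    --k;
}
if (k == 0) {
    return; // nothing probable enough to shift to - leave candidates untouched
}
cur_p->data += k;
cur_p->size -= k;
```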
@p-e-w I've added … and tested on …

Without limits, k = 50 (extreme case), candidates logged for statistics (first 8, probability|logit): … The answer is hilarious, yet still coherent in its own way.

The same, but with a limit of …, candidates: … While the result is not optimal, it still doesn't drift away into hallucinations because only the first choice is affected.

Going back to your suggestion of adding the needed word to the start of the output as guidance: it actually works well together with K-Shift. I've tested it with completion (which would be technically the same) on …

K-Shift is a sampling strategy mentioned in the Chain-of-Thought Reasoning without Prompting paper and is meant to guide models away from the most obvious start of inference by cutting out a defined number of tokens once at the start of the dialog. Since the rest of the tokens are not affected by the sampler, the output is still coherent. K-Shift is intended to be used with greedy sampling, and it is claimed to help with steering models more towards reasoning instead of short answers.
Since a recent commit changed how greedy sampling is achieved, this sampler fits in the main sampling queue and can be combined with the top_k = 1 setting. In my experience it helped with getting different reasoning and less cliched starts in creative writing, and it can even change the bias of the model - reducing or inducing refusals.
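As an illustration of that wiring (llama_sampler_init_k_shift is a placeholder name for whatever initializer this PR exposes; the other calls are the existing llama.cpp sampler chain API), a chain could look roughly like this:

```cpp
#include "llama.h"

// Rough sketch of a "K-Shift + greedy" sampler chain.
// llama_sampler_init_k_shift is a placeholder name for this PR's initializer;
// the other calls are the existing llama.cpp sampler chain API.
static struct llama_sampler * make_k_shift_chain(int32_t k) {
    llama_sampler_chain_params sparams = llama_sampler_chain_default_params();
    struct llama_sampler * chain = llama_sampler_chain_init(sparams);

    llama_sampler_chain_add(chain, llama_sampler_init_k_shift(k));  // placeholder: cut top k once at step 0
    llama_sampler_chain_add(chain, llama_sampler_init_top_k(1));    // greedy decoding via top_k = 1
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

    return chain;
}
```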
Examples with Mistral-Nemo-Instruct-2407.q5_k_l:
- k = 0: …
- k = 5: …
- k = 14: …
This sampler is still in testing, but it feels like a good improvement to sampling overall - however, every model might need its own value for k. With K-Shift and XTC, greedy sampling might become useful even for creative sampling.