Adapter fixes #1659

Conversation
|
I was wondering why the test was not triggering. Apparently it requires moderator approval. That's fine, but it is nice to give users immediate feedback on lint stuff, IMO, so if it could be run without the moderation requirement, that would be nice. |
|
I'm a bit wary of 3rd party automatically triggered workflows since they have been known to be abused in other GitHub repos in the past. Let me think about that part first.

I'm not sure the fixes for Mistral V3 are correct. Take a look at this https://huggingface.co/mistralai/Mistral-Small-Instruct-2409?chat_template=default which uses V3. Likewise for Jamba, there is no trailing space either.

Btw I just realized from some testing that our current interpretation of

The way it's represented in the AutoGuess is "This is the list of tools" but it should actually be "This is the begin-of-tools-turn". The field was added all the way back in #981 so it must have been misinterpreted in #1283 subsequently. |
|
Just omw to bed so can't really test this, but does the first issue mean mistralai/Mistral-7B-Instruct-v0.3 is not using the V3 tokenizer? I am testing against that one. Mistral tokenizers are so confusing... |
force-pushed from 3a941df to 6fd3a1c
|
Mistral has released a ton of models, each with a different tokenizer, so they are indeed annoying. If unsure, I recommend we stick to the current one as it is known to work. |
|
OK, I tried switching to the MS2409 tokenizer and it gave the same issue, so I checked the actual output -- it does appear to put a space after each tag:

```python
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-Instruct-2409")
>>> tokenizer.apply_chat_template([{"role":"user","content":"hi"},{"role":"assistant","content":"hello"}], tokenize=False)
'<s>[INST] hi[/INST] hello</s>'
```

Will test the Jamba tokenizer too in a bit. |
|
Jamba -- the trailing space is there for the chat template here as well:

```python
>>> tokenizer = AutoTokenizer.from_pretrained("gated-tokenizers/tokenizer_configs/ai21labs_AI21-Jamba-Large-1.7")
>>> tokenizer.apply_chat_template([{"role":"user","content":"hi"},{"role":"assistant","content":"hello"}], tokenize=False)
'<|startoftext|><|bom|><|system|> <|eom|><|bom|><|user|> hi<|eom|><|bom|><|assistant|> hello<|eom|>'
```

(Note: the above is the ai21labs/AI21-Jamba-Large-1.7 model that you referenced, so it's not some diff between that and the jamba-tiny-dev one I put in the tests.) |
Will have to sit down and think about this one when the kids aren't clinging to my head (i.e. after the weekend). |
|
It's possible that the chat templates are mis-written because people (like you did above) read them as if there is no trailing whitespace when in reality it comes out with a space. This is mistral-common's test for V3 though -- note the last 3 lines:

```python
assert text == (
    f"<s>[INST]{special_ws}a[/INST]{special_ws}b</s>[INST]{special_ws}SYSTEM{new_line}c[/INST]{special_ws}d"
)
```

It explicitly requires that there is a `special_ws` after the tags. |
|
If it's correctly configured, the instruct start and end tags should tokenize into one single token each. Does it do that? |
I can't see why it wouldn't. Which part specifically are you concerned about? I can test that easily enough. |
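A quick way to check this, as a minimal sketch (the model id is the MS2409 tokenizer referenced earlier in this thread; the check assumes the tags are registered as control/added tokens):

```python
from transformers import AutoTokenizer

# Sketch: if [INST]/[/INST] are proper control tokens, each should map to a
# single token id rather than being split into word pieces.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-Instruct-2409")
for tag in ("[INST]", "[/INST]"):
    ids = tok.encode(tag, add_special_tokens=False)
    print(tag, ids, "single token" if len(ids) == 1 else "split into pieces")
```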
|
So I did some tests and it does seem like the space is unnecessary.
The model does not attempt to generate a space after that when replying.
As a result, the model is unable to generate tokens with a leading space from its grammar; it usually generates an emoji instead.
This leads me to further doubt the correctness of this approach. |
|
I am able to reproduce the issue you described, in llama.cpp. Yes, something's definitely off here. Thanks for being diligent! I am pretty sure I understand what's going on but I need to test it properly. |
|
No worries. As a rule of thumb I believe practical performance > jinja correctness, so we should always aim for that. |
|
Alright, I merged those we know are correct first. As for the trailing-spaced ones such as Mistral, I think they need more investigation. |
force-pushed from 6fd3a1c to 7177079
|
OK, so this has two facets: in one case, the proposed change (prior to 28c2fbf) is actually correct, and results in invalid (or at least "unusual" from the model's perspective) contexts otherwise. In the other case, the proposed change, as you demonstrated, results in invalid contexts, and the model is unable to cleanly respond without "breaking out" of the weird/unusual starting sequence.

The following Python snippet demonstrates the issue to some extent:

```python
>>> tokenizer.apply_chat_template([{"role":"user","content":"Hello"},{"role":"assistant","content":"Hi"},{"role":"user","content":"How are you?"}], tokenize=False)
'<s>[INST] Hello[/INST] Hi</s>[INST] How are you?[/INST]'
>>> tokenizer.apply_chat_template([{"role":"user","content":"Hello"},{"role":"assistant","content":"Hi"},{"role":"user","content":"How are you?"}], tokenize=True)
[1, 3, 23325, 4, 16127, 2, 3, 2370, 1228, 1136, 29572, 4]
```

Rewriting the tokens back into strings, the spaces are there (as my original change suggested), but they're a part of the token in all cases, so prematurely putting them there without a token attachment is invalid. The chat template also does this explicitly in both cases, for the user case and for the assistant case.

Now, consider the case where we are constructing a context based on the AutoGuess adapter. If we have generated tokens from a chat log, things work fine, as the tokens will include the prefix space, as long as we don't trim at the wrong place/time. If we however construct context from external content that is not pre-tokenized, we get into "trouble": the result looks fine, but it's actually potentially degrading the model performance because the model (post-pretraining) has never seen anything like it. The implications of this aren't clear, but generally speaking, models perform better when seeing content that looks like the stuff they've been trained on.

Take away: the AutoGuess suggestion is correct, but only if we are inserting content that has not yet been tokenized. It is incorrect whenever it attempts to put a space in without it being a part of some token. The space must have content after it. The Transformers tokenizer handles this issue by explicitly inserting spaces before each turn, except for the

Assumption: there actually is never a space before an AI response -- either the space is a part of the first token emitted by the AI (as is the case for Mistral V3, etc.), or there is no space to begin with. If this is the case, we can fix this by:
Running out of time for today, but will dig more tomorrow. |
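To make the "space belongs to the token" point concrete, a small sketch (assumes the same MS2409 tokenizer used elsewhere in this thread):

```python
from transformers import AutoTokenizer

# Decode the chat-templated ids one at a time: the leading spaces show up fused
# onto the word tokens, not as standalone spaces after [INST]/[/INST].
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-Instruct-2409")
ids = tok.apply_chat_template(
    [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi"}],
    tokenize=True,
)
print([tok.decode([i]) for i in ids])
```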
|
Sorry for the delay. Partial conclusion:
Approaches:
I think going for (1) until we hit an issue with some model is not insane, but (2) might be more stable in the long term. |
force-pushed from a0df027 to 4d50466
|
For reference, I added a commit 70f4774 which attempts to implement approach (2) above. Will drop if we choose to go a different route. |
force-pushed from 70f4774 to 4260c7d
|
Unfortunately, this still won't work correctly for the KAI endpoint and would likely degrade that much more than its current state. Currently, the frontend sends a string input with placeholders, which are converted into the adapter tags.

The KAI API is a pure text completions API - there are no turns, only placeholder replacements. If the space was injected as

No matter which way we do it, there will be a case where an "undesired" token is sent to the AI. For a multi-turn example, the so-called perfect training format would be

Adding to the squeeze, Lite allows users to add prefills to the AI response as well. So this is a valid prompt:

One alternative is actually adding newlines to the format, since any token after a newline naturally uses the non-leading-space version, e.g.

But to be honest I am quite sick of the awfulness of Mistral's instruct format and somewhat reluctant to mess with it further, considering how it currently works adequately well. |
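A rough illustration of the placeholder model described above (a sketch, not the actual Lite/koboldcpp code; the tag values are made up):

```python
# The frontend sends one flat string; the backend only swaps placeholders for
# adapter tags, so there is no notion of "turns" it could special-case.
prompt = "{{[INPUT]}}Hello{{[OUTPUT]}}Hi{{[INPUT]}}How are you?{{[OUTPUT]}}"
adapter = {"{{[INPUT]}}": "[INST] ", "{{[OUTPUT]}}": "[/INST] "}  # illustrative values
for placeholder, tag in adapter.items():
    prompt = prompt.replace(placeholder, tag)
print(prompt)  # every occurrence gets the same replacement, including the final one
```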
I think it's ok to claim that a single sequence for both [A] "role=AI" and [B] "AI generation prompt" is impossible to achieve. I tried and you showed that [B] was wrong, and I showed that [A] was needed in certain cases. Is the KAI endpoint set in stone or can we add e.g. a
With
which is correct, right?
Yeah, that would want to use
I don't think adding newlines is the way to deal with it, personally. I agree Mistral made a mess, but I don't think it's unsalvageable. And I don't think Mistral is the actual core reason why this is tricky either. (We could have this discussion about any model that uses space+word as tokens, which is a lot of them.)

Wait, one idea would be to "intelligently" replace the last {{[OUTPUT]}} with the gen version and keep the others as the start version. That might actually be the most straightforward way to do it, and it would elegantly handle the prefill case (last part is not {{[OUTPUT]}} so use start version). What do you think? I haven't really looked at the code to see if this is doable, but will take a look. |
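A sketch of how that idea could look (hypothetical helper, not actual koboldcpp/Lite code):

```python
OUTPUT_TAG = "{{[OUTPUT]}}"

def replace_output_placeholders(prompt: str, start_version: str, gen_version: str) -> str:
    # If the prompt ends with the placeholder, that final one starts a fresh
    # generation, so it gets the gen version; all earlier ones are turn starts.
    if prompt.endswith(OUTPUT_TAG):
        return prompt[: -len(OUTPUT_TAG)].replace(OUTPUT_TAG, start_version) + gen_version
    # Prefilled response (text after the last placeholder): use the start
    # version everywhere, as discussed above.
    return prompt.replace(OUTPUT_TAG, start_version)
```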
force-pushed from 4260c7d to 1f71e36
|
I'm not super clear on how the klite.embd thing works yet, but I did a POC patch in the koboldcpp.py equivalent part. |
|
Alright, just adding some tweaks first. We have a quick reference to test at https://pandora-s-git.github.io/mtokenizer/

First off - I think adding a trailing space for

The assistant message tag was the issue - when it's the final tag, we don't want the trailing space, otherwise we do. Having the frontend send a different

Still don't have an ideal solution for the trailing spaces when using in Lite itself. The problem is compounded when text is sent to a third-party backend, e.g. over Horde: the template selected will not work and the trailing space for the

One big problem I don't even see Mistral addressing is - what if the assistant message does NOT start with a word?

But what's this?? The reply's token has NO LEADING SPACE! For comparison, a token with a leading space looks like this, e.g.

So if we force a space there after

In fact this makes me doubt Mistral's official jinja. Clearly something is not right somewhere. |
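For reference, one way to inspect whether a token carries a leading space (sketch; with a SentencePiece-style vocab the marker typically shows up as "▁" in the raw token strings):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-Instruct-2409")
# Raw token strings reveal the leading-space marker; the decoded text and the
# plain ids do not make it visible.
print(tok.convert_ids_to_tokens(tok.encode("hello there", add_special_tokens=False)))
```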
|
9fbbd9e This adds the trailing space for the user start sequences, since that seems correct. |
Great. Will use that from here on.
Yeah, another thing that is a potential issue is that we have duplicate adapter definitions now. There are all the kcpp_adapters json files, and then there's
Sorry for my lack of knowledge on this one, but IIUC Lite does replace the placeholders itself. Why could it not do the same thing I did with the
Edit: Wait, no, you're right. We can't actually get the non-spaced output if we pass the content back and forth from a non-tokenized dictionary array. I guess the big question is, do Mistral V7 models ever, ever not put a spaced token as the first token, and if they don't, will it matter if we modify it? I honestly believe the answer to be no to one or the other of these questions, but I understand your concern nonetheless. |
|
Llama.cpp on

So, at least in this case, the answer to my first question (does it ever not start with a spaced token) is yes, it does sometimes use non-spaced tokens. The answer to my second question is that, at least in Transformers and llama.cpp, the token is modified when the chat continues. |
|
when you see |
It's hard to check. The llama-cli is its own front end so this may not be a direct response to your question, but: looking at the source,

```cpp
auto chat_add_and_format = [&chat_msgs, &chat_templates](const std::string & role, const std::string & content) {
    common_chat_msg new_msg;
    new_msg.role = role;
    new_msg.content = content;
    auto formatted = common_chat_format_single(chat_templates.get(), chat_msgs, new_msg, role == "user", g_params->use_jinja);
    chat_msgs.push_back(new_msg);
    LOG_DBG("formatted: '%s'\n", formatted.c_str());
    return formatted;
};
```

it seems to be storing the content (as is) as the content in a chat message, which is then formatted via the chat template and used in subsequent messages in the chat. I wish there was an easy way to test this across implementations. |
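One partial cross-check in Transformers (sketch; `continue_final_message` exists in recent Transformers releases, model id as used earlier in the thread):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-Instruct-2409")
msgs = [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]
# Render the same history as a finished turn vs. a turn generation should
# continue, and compare how the final assistant message is templated.
print(repr(tok.apply_chat_template(msgs, tokenize=False)))
print(repr(tok.apply_chat_template(msgs, tokenize=False, continue_final_message=True)))
```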
|
I had a chat with @pandora-s-git from Mistral, and it does seem like there should be no scenario where
|
|
Because now that I look closer, I can see https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503?chat_template=default vs https://huggingface.co/mistralai/Mistral-Small-Instruct-2409?chat_template=default, which is probably the source of the massive confusion. Given that the future seems to be following that path, it is probably safe to strip all leading and trailing spaces for future Mistral models. |
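If we go that route, the stripping itself is trivial (hedged sketch; the field names below are illustrative rather than the exact kcpp_adapters schema):

```python
def strip_tag_spaces(adapter: dict) -> dict:
    # Remove leading/trailing spaces from string-valued tag fields only.
    return {k: v.strip(" ") if isinstance(v, str) else v for k, v in adapter.items()}

print(strip_tag_spaces({"user_start": "[INST] ", "user_end": "[/INST] "}))
```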
|
That matches what I am doing in this PR, I believe. V7 has no spaces anywhere. And if you add one e.g. after Is there some other place that I'm screwing up? |
|
Hi everyone, basically the main difference between the Tekken versions and the non-Tekken versions of our tokenizers' chat templates is around the control-token whitespacing. So everyone understands where this comes from, here is how this is tokenized via mistral-common, our official implementation. A completion request like:

```
{"role":"user", "content":"user message"}
{"role":"assistant", "content":"assistant message"}
{"role":"user", "content":"user new message"}
```

is encoded following this schema:

```
IDS = BOS_ID + BINST_ID + encode("user message") + EINST_ID + encode("assistant message") + EOS_ID + BINST_ID + encode("user new message") + EINST_ID
```

Now, regarding SentencePiece (older models) vs Tekken (used by most of our recent models):

SentencePiece: WITHOUT a last whitespace, because the model will generate a token starting with the whitespace. If you add the whitespace you will mess up the distribution.

Tekken

I hope this helps! |
force-pushed from 3a2139f to 1c525a6
|
@pandora-s-git Thanks a lot for the details. @LostRuins I incorporated your latest changes and re-inserted the

Tekken is a no-brainer. For SP, we are given rules which must not be violated:

For the schema

```
IDS = BOS_ID + BINST_ID + encode("user message") + EINST_ID + encode("assistant message") + EOS_ID + BINST_ID + encode("user new message") + EINST_ID
```

given the chat exchange
If you feel uncomfortable with the complexity of this approach (I can't think of a simplification personally but maybe there is one), let's drop this PR for now. Otherwise let's see if we can make klite behave. |
|
Just a quick comment regarding the versions: v7 also has a normal non-Tekken version and also a v7-Tekken version, so use "Tekken" or not as the variable for the whitespacing and not the versioning. The versioning is about the logic and control tokens (whether the template changes completely, new control tokens are added, etc). The thing here is that the chat templates for non-Tekken vs Tekken are the exact same ones in the mistral-common implementation; they only differ outside of mistral-common via the jinja chat templates.

TLDR: use Tekken as the variable to decide the whitespacing; you may have models with the exact same tokenizer version yet one using Tekken and the other not.

Otherwise the changes proposed look good to me. If you have any questions or doubts regarding our tokenizers feel free to ping me here or on Discord, I will gladly help. |
If I understand you correctly, that can be derived from the chat template provided by the model metadata (assuming we do not use the mistral-common implementation, which we don't), which means we can determine whether it is Tekken or not via the chat template. (Btw, now I finally understand the relationship between Tekken and i18n. Whether there is a whitespace or not depends on the language!) |
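A hedged sketch of that idea (a heuristic only, based on the spacing difference visible in the two templates linked above, not on any official flag):

```python
def template_looks_like_tekken(chat_template: str) -> bool:
    # SentencePiece-era templates emit "[INST] " with a trailing space;
    # Tekken-era templates put the content directly after "[INST]".
    return "[INST]" in chat_template and "[INST] " not in chat_template
```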
|
Thanks both, give me some time to review and I'll get back on this. |
|
CI error was due to me accidentally committing a symlink to the gated-repositories repo. Fixed. |
|
I was very confused until I ran the test at home and got the same error. I must've screwed up some edit somewhere.. |
force-pushed from 2c50264 to 60aa8a2
|
@kallewoof I would still rather people rely on the files in the repos most of the time, but yeah, that's also a fair way to do it. We don't always provide a jinja chat template though, hence this can be tricky. However, mistral-common also provides a text version of the tokenized request that should be well formatted, so that could be used to be sure. In any case I do not mind being pinged individually for this kind of thing, so feel free! |
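For reference, that text rendering is easy to get at (minimal sketch assuming mistral-common's documented API; untested here):

```python
from mistral_common.protocol.instruct.messages import AssistantMessage, UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokenizer = MistralTokenizer.v3()
tokenized = tokenizer.encode_chat_completion(ChatCompletionRequest(messages=[
    UserMessage(content="hi"),
    AssistantMessage(content="hello"),
    UserMessage(content="how are you?"),
]))
print(tokenized.text)    # the well-formatted text version mentioned above
print(tokenized.tokens)  # the corresponding token ids
```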
|
Alright, sorry for the delays. I think I will go ahead and merge whatever you have now; we can continue reviewing again subsequently. In particular I will look into your earlier approach for Lite placeholders with (though the rstrip logic is not actually correct, since if there's a strip the length changes... but let me see if we have a better way for that anyway). |
|
@kallewoof can you please take a look and help review the final addendum at 8e6d27f that handles the placeholder replacements |
|
Of course! |
This fixes all the issues that were revealed when adding tests to the AutoGuess adapters.
It currently includes #1654 and will need to be rebased once that's merged.