
Conversation

@kallewoof commented Jul 25, 2025

This fixes all the issues that were revealed when adding tests to the AutoGuess adapters.

It currently includes #1654 and will need to be rebased once that's merged.

@kallewoof (Author)

I was wondering why the test was not triggering; apparently it requires moderator approval. That's fine, but it's nice to give users immediate feedback on lint stuff, IMO, so if it could be run without the moderation requirement that would be nice.

@LostRuins (Owner)

I'm a bit wary of third-party automatically triggered workflows, since they have been known to be abused in other GitHub repos in the past. Let me think about that part first.

I'm not sure the fixes for Mistral V3 are correct. Take a look at https://huggingface.co/mistralai/Mistral-Small-Instruct-2409?chat_template=default, which uses V3:

[chat template screenshot]

There is clearly no trailing space after each [/INST].

Likewise for Jamba, there is no trailing space either:
https://huggingface.co/ai21labs/AI21-Jamba-Large-1.7?chat_template=default

[chat template screenshots]

Btw I just realized from some testing that our current interpretation of tools_start and tools_end in the adapter (currently only chatml uses it) is wrong.

"tools_start": "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n\n<tools>\n",
"tools_end": "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n"

The way it's represented in the AutoGuess is "This is the list of tools" but it should actually be "This is the begin-of-tools-turn"

The field was added all the way back in #981 so it must have been misinterpreted in #1283 subsequently.

@kallewoof (Author)

Just on my way to bed so I can't really test this, but does the first issue mean mistralai/Mistral-7B-Instruct-v0.3 is not using the V3 tokenizer? I am testing against that one. Mistral tokenizers are so confusing...

@kallewoof force-pushed the 202507-adapter-fixes branch from 3a941df to 6fd3a1c on July 25, 2025
@LostRuins (Owner)

Mistral has released a ton of models each with a different tokenizer so they are indeed annoying. If unsure, I recommend we stick to the current one as it is known to work.

@kallewoof (Author) commented Jul 26, 2025

OK, I tried switching to the MS2409 tokenizer and it gave the same issue so I checked the actual output -- it does appear to put a space after each [/INST] after all:

>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-Instruct-2409")
>>> tokenizer.apply_chat_template([{"role":"user","content":"hi"},{"role":"assistant","content":"hello"}], tokenize=False)
'<s>[INST] hi[/INST] hello</s>'

Will test the Jamba tokenizer too in a bit.

@kallewoof (Author) commented Jul 26, 2025

Jamba -- the trailing space is there for the chat template here as well:

>>> tokenizer = AutoTokenizer.from_pretrained("gated-tokenizers/tokenizer_configs/ai21labs_AI21-Jamba-Large-1.7")
>>> tokenizer.apply_chat_template([{"role":"user","content":"hi"},{"role":"assistant","content":"hello"}], tokenize=False)
'<|startoftext|><|bom|><|system|> <|eom|><|bom|><|user|> hi<|eom|><|bom|><|assistant|> hello<|eom|>'

(note: the above is the ai21labs/AI21-Jamba-Large-1.7 model that you referenced, so it's not some diff between that and the jamba-tiny-dev one I put in the tests)

@kallewoof (Author)

Btw I just realized from some testing that our current interpretation of tools_start and tools_end in the adapter (currently only chatml uses it) is wrong.

Will have to sit down and think about this one when the kids aren't clinging to my head (i.e. after the weekend).

@kallewoof (Author)

It's possible that the chat templates are mis-written because people (as you did above) read them as if there were no trailing whitespace, when in reality the output comes out with a space.

This is mistral-common's test for V3 though:

https://github.com/mistralai/mistral-common/blob/05b2c735980cf9dde1a223628187f72bd65572d7/tests/test_tokenize_v3.py#L254-L269

Note the last 3 lines:

    assert text == (
        f"<s>[INST]{special_ws}a[/INST]{special_ws}b</s>[INST]{special_ws}SYSTEM{new_line}c[/INST]{special_ws}d"
    )

It explicitly requires that there is a {special_ws} after the [/INST]. Whether that means an actual space or not, I'm not 100% sure yet, but it does make me less worried that the space in the chat template (and test) is a mistake.

@LostRuins (Owner)

If it's correctly configured, the instruct start and end tags should tokenize into one single token each. Does it do that?

@kallewoof (Author)

If it's correctly configured, the instruct start and end tags should tokenize into one single token each. Does it do that?

I can't see why it wouldn't. Which part specifically are you concerned about? I can test that easily enough.
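
A quick check along these lines should do it (a sketch; assumes the transformers tokenizer for the 2409 model rather than the koboldcpp tokenizer path):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-Instruct-2409")
    for tag in ("[INST]", "[/INST]"):
        ids = tok.encode(tag, add_special_tokens=False)
        # A correctly configured tokenizer maps each tag to exactly one id.
        print(tag, ids, "single token" if len(ids) == 1 else "SPLIT UP")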

@LostRuins (Owner)

So I did some tests and it does seem like the space is unnecessary.

  1. First, I tried without a trailing space, raw input [INST] hi there[/INST]
    This tokenizes correctly:
    [tokenizer screenshot]
    The model does not attempt to generate a space after that when replying.
  2. Then, I tried adding a trailing space after the [/INST]
    The model does NOT tokenize it correctly into a special token:
    [tokenizer screenshot]
    As a result, the model is unable to generate tokens with a leading space from its grammar. It usually generates an emoji instead.

This leads me to further doubt the veracity of this approach.
If you'd like to test, it's here.

@kallewoof (Author)

I am able to reproduce the issue you described, in llama.cpp. Yes, something's definitely off here. Thanks for being diligent!

I am pretty sure I understand what's going on but I need to test it properly.

@LostRuins (Owner)

No worries. As a rule of thumb I believe practical performance > jinja correctness, so we should always aim for that.

@LostRuins (Owner)

Alright, I merged the ones we know are correct first: ba626b3

As for the trailing-spaced ones such as Mistral, I think they need more investigation.

@kallewoof force-pushed the 202507-adapter-fixes branch from 6fd3a1c to 7177079 on July 27, 2025
@kallewoof (Author) commented Jul 27, 2025

OK, so this has two facets: in one case, the proposed change (prior to 28c2fbf) is actually correct, and otherwise results in invalid (or at least "unusual", from the model's perspective) contexts. In the other case, the proposed change, as you demonstrated, results in invalid contexts, and the model is unable to cleanly respond without "breaking out" of the weird/unusual starting sequence. The following Python snippet demonstrates the issue to some extent:

>>> tokenizer.apply_chat_template([{"role":"user","content":"Hello"},{"role":"assistant","content":"Hi"},{"role":"user","content":"How are you?"}], tokenize=False)
'<s>[INST] Hello[/INST] Hi</s>[INST] How are you?[/INST]'
>>> tokenizer.apply_chat_template([{"role":"user","content":"Hello"},{"role":"assistant","content":"Hi"},{"role":"user","content":"How are you?"}], tokenize=True)
[1, 3, 23325, 4, 16127, 2, 3, 2370, 1228, 1136, 29572, 4]

Rewriting the tokens, we get:

1       <s>
3       [INST]
23325    Hello
4       [/INST]
16127    Hi
2       </s>
3       [INST]
2370     How
1228     are
1136     you
29572   ?
4       [/INST]

The spaces are there (as my original change suggested), but they're a part of the token in all cases, so prematurely putting them there without a token attachment is invalid. The chat template also does this in both cases explicitly: for the user case:

 '            {{- "[INST] " + message["content"] + "[/INST]" }}\n'

and for the assistant case:

 '    {%- elif message["role"] == "assistant" %}\n'
 '        {{- " " + message["content"]|trim + eos_token}}\n'

Now, consider the case where we are constructing a context based on the AutoGuess adapter. If we have generated tokens from a chat log, things work fine, as the tokens will include the prefix space, as long as we don't trim at the wrong place/time. If, however, we construct the context from external content that is not pre-tokenized, we get into "trouble":

>>> tokenizer.tokenize("[INST] Foo[/INST]External assistant generated content.")
['[INST]', '▁F', 'oo', '[/INST]', 'External', '▁assistant', '▁generated', '▁content', '.']

The above looks fine, but it potentially degrades model performance, because the model (post-pretraining) has never seen anything like it. The implications of this aren't clear, but generally speaking, models perform better when seeing content that looks like the stuff they've been trained on. Takeaway: the AutoGuess suggestion is correct, but only if we are inserting content that has not yet been tokenized. It is incorrect whenever it attempts to put in a space without it being part of some token. The space must have content after it.
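
To make the failure mode concrete, same tokenizer as above (a sketch; I'm not asserting the exact pieces, which depend on tokenizer settings):

    >>> tokenizer.tokenize("[INST] Foo[/INST] Bar")  # the space folds into the next piece ('▁Bar')
    >>> tokenizer.tokenize("[INST] Foo[/INST] ")     # the dangling space has nothing to attach to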

The Transformers tokenizer handles this issue by explicitly inserting spaces before each turn, except for the add_generation_prompt case. This handles both cases perfectly: you always get a space before the first token, and you don't get a split-up token due to a trailing space when passing control to the assistant. In the adapter case, though, we only have start prefixes and end suffixes, so we can't distinguish between "pass this on to the AI so it knows to respond" (add_generation_prompt) and "the following is AI chat content" (role=assistant, content=...).

Assumption: there actually is never a space before an AI response -- either the space is a part of the first token emitted by the AI (as is the case for Mistral V3, etc), or there is no space to begin with.

If this is the case, we can fix this by:

  1. Add spaces where expected in AutoGuess.json
  2. Do a .rstrip() operation on messages, to ensure we never get any chopped off tokens like this.

Running out of time for today, but will dig more tomorrow.

@kallewoof (Author) commented Jul 30, 2025

Sorry for the delay.

Partial conclusion:

  1. For the OpenAI-compatible API endpoint, the spaces must be included (the originally proposed changes which include space prefixes) for the tokenization to come out correctly, because the message list of roles and content is not tokenized.
  2. However, for the "generation prompt" part, the assistant_message_start is now incorrect:
    messages_string += assistant_message_start

Approaches:

  1. We can make the blanket assumption that all assistant generation prefixes can be .rstrip()d. We would thus modify
    messages_string += assistant_message_start
    to be messages_string += assistant_message_start.rstrip(), --or--
  2. We can add a new optional assistant_gen property to the adapters, which is explicitly used instead. This would then be the non-spaced assistant_start for the offending adapters. It would default to assistant_start so it would only be present for models that make a distinction.

I think going for (1) until we hit an issue with some model is not insane, but (2) might be more stable in the long term.
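
In code, the two options would look roughly like this (a sketch; variable names follow the discussion above, adapter_obj is a stand-in for however the adapter dict is exposed, and assistant_gen in (2) is the proposed new optional field):

    # (1) blanket assumption: always strip the generation prefix
    messages_string += assistant_message_start.rstrip()

    # (2) dedicated field, falling back to assistant_start when absent
    assistant_message_gen = adapter_obj.get("assistant_gen", assistant_message_start)
    messages_string += assistant_message_gen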

@kallewoof force-pushed the 202507-adapter-fixes branch 2 times, most recently from a0df027 to 4d50466 on July 30, 2025
@kallewoof (Author)

For reference, I added a commit 70f4774 which attempts to implement approach (2) above. Will drop if we choose to go a different route.

@kallewoof force-pushed the 202507-adapter-fixes branch from 70f4774 to 4260c7d on July 31, 2025
@LostRuins (Owner)

Unfortunately, this still won't work correctly for the KAI endpoint and would likely degrade that much more than its current state.

Currently, the frontend sends a string input with placeholders, which are converted into the adapter tags.

{{[INPUT]}}how are you?{{[OUTPUT]}}
becomes
[INST]how are you?[/INST] to which the model can reply _I _am _fine

The KAI API is a pure text completions API - there are no turns, only placeholder replacements. If the space was injected, i.e. [INST]how are you?[/INST] followed by a space, then the model would not be able to reply _I anymore.

No matter which way we do it, there will be a case where an "undesired" token is sent to the AI. For a multi-turn example, the so-called perfect training format would be [INST]how are you?[/INST] I am fine[INST]Are you sure?[/INST], the first has a trailing space but not the second.

Adding to the squeeze, lite allows users to add prefills to the AI response as well. So this is a valid prompt:
[INST]how are you?[/INST] I am fine[INST]Are you sure?[/INST] Yes, I -> _am

One alternative is actually adding newlines to the format. Since any token after a newline naturally uses the non-leading space version e.g. \nHello not \n_Hello, appending newlines to the [/INST] may actually work reasonably well too.

But to be honest I am quite sick of the awfulness of mistral's instruct format and somewhat reluctant to mess with it further, considering how it currently works adequately well.

@kallewoof (Author)

Unfortunately, this still won't work correctly for the KAI endpoint and would likely degrade that much more than its current state.

Currently, the frontend sends a string input with placeholders, which are converted into the adapter tags.

{{[INPUT]}}how are you?{{[OUTPUT]}} becomes [INST]how are you?[/INST] to which the model can reply _I _am _fine

The KAI API is a pure text completions API - there are no turns, only placeholder replacements. If the space was injected, i.e. [INST]how are you?[/INST] followed by a space, then the model would not be able to reply _I anymore.

I think it's ok to claim that a single sequence for both [A] "role=AI" and [B] "AI generation prompt" is impossible to achieve. I tried and you showed that [B] was wrong, and I showed that [A] was needed in certain cases. Is the KAI endpoint set in stone or can we add e.g. a {{[GEN]}} placeholder and ask people to start using that for the final placeholder?

No matter which way we do it, there will be a case where an "undesired" token is sent to the AI. For a multi-turn example, the so-called perfect training format would be [INST]how are you?[/INST] I am fine[INST]Are you sure?[/INST], the first has a trailing space but not the second.

With

  • {{[INPUT]}} = "[INST] " (yes, space after [INST] is correct, I believe)
  • {{[OUTPUT]}} = "[/INST] "
  • {{[GEN]}} = "[/INST]"

{{[INPUT]}}how are you?{{[OUTPUT]}}I am fine{{[INPUT]}}Are you sure?{{[GEN]}} -> [INST] how are you?[/INST] I am fine[INST] Are you sure?[/INST]

which is correct, right?

Adding to the squeeze, lite allows users to add prefills to the AI response as well. So this is a valid prompt: [INST]how are you?[/INST] I am fine[INST]Are you sure?[/INST] Yes, I -> _am

Yeah, that would want to use {{[OUTPUT]}} and not {{[GEN]}}. It's not impossible to do, but it's not super intuitive either.

One alternative is actually adding newlines to the format. Since any token after a newline naturally uses the non-leading space version e.g. \nHello not \n_Hello, appending newlines to the [/INST] may actually work reasonably well too.

But to be honest I am quite sick of the awfulness of mistral's instruct format and somewhat reluctant to mess with it further, considering how it currently works adequately well.

I don't think adding newlines is the way to deal with it, personally. I agree Mistral made a mess, but I don't think it's unsalvageable. And I don't think Mistral is the actual core reason why this is tricky either. (We could have this discussion about any model that uses space+word as tokens, which is a lot of them.)

Wait, one idea would be to "intelligently" replace the last {{[OUTPUT]}} with the gen version and keep the others as the start version. That might actually be the most straightforward way to do it, and it would elegantly handle the prefill case (last part is not {{[OUTPUT]}} so use the start version). What do you think? I haven't really looked at the code to see if this is doable, but will take a look.
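
Something like this rough sketch is what I have in mind (placeholder handling only; names are from this thread and the real koboldcpp.py/Lite code will differ):

    def replace_placeholders(prompt, user_start, assistant_start, assistant_gen):
        # Only a final, trailing {{[OUTPUT]}} is the generation prompt and gets
        # the space-less variant; every earlier occurrence (including prefills,
        # where text follows the placeholder) keeps the regular start sequence.
        stripped = prompt.rstrip()
        if stripped.endswith("{{[OUTPUT]}}"):
            head = stripped[:-len("{{[OUTPUT]}}")]
            return (head.replace("{{[INPUT]}}", user_start)
                        .replace("{{[OUTPUT]}}", assistant_start)
                    + assistant_gen)
        return (prompt.replace("{{[INPUT]}}", user_start)
                      .replace("{{[OUTPUT]}}", assistant_start))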

@kallewoof force-pushed the 202507-adapter-fixes branch from 4260c7d to 1f71e36 on July 31, 2025
@kallewoof (Author)

I'm not super clear on how the klite.embd thing works yet, but I did a POC patch in the koboldcpp.py equivalent part.

@LostRuins added the "KIV for now (Some issues prevent this from being merged)" and "needs review" labels on Aug 2, 2025
@LostRuins (Owner)

Alright, just adding some tweaks first. We have a quick reference to test at https://pandora-s-git.github.io/mtokenizer/

First off - I think adding a trailing space for [INST] in all scenarios is fine, since there's always expected to be text after a user message tag. So we can safely change all [INST] to [INST] followed by a space.

The assistant message tag was the issue - when it's the final tag, we don't want the trailing space, otherwise we do. Having the frontend send a different {{[GEN]}} adds complexity and is not ideal. Your approach seems fine when using the AutoGuess template, but I need some time to look through it for potential issues, should it be used.

Still don't have an ideal solution for the trailing spaces when used in Lite itself. The problem is compounded when text is sent to a third-party backend, e.g. over Horde: the selected template will not work and the trailing space for the [/INST] will be a problem. We can probably use a half measure in the meantime and add the trailing space for [INST] but not for [/INST].


One big problem I don't even see Mistral addressing is - what if the assistant message does NOT start with a word?
Let's say we have a prompt like Please say !!!, which in ideal mode looks like this:
[INST] Please say !!![/INST]
we get '<s> (1)', '[INST] (3)', ' Please (13980)', ' say (4150)', ' ! (2662)', '!! (7290)', '[/INST] (4)',
and our AI generated reply is !!! (55237) followed by </s> (2)

But what's this?? The reply's token has NO LEADING SPACE!
[tokenizer screenshot]

For comparison, a token with a leading space looks like this, e.g. ! (2662)
[tokenizer screenshot]
observe the Ġ

So if we force a space there after [/INST], we actually LOSE the correct token.

In fact this makes me doubt Mistral's official jinja. Clearly something is not right somewhere.

@LostRuins (Owner)

9fbbd9e This adds the trailing space for the user start sequences, since that seems correct.

@kallewoof (Author) commented Aug 5, 2025

Alright just adding some tweaks first. First off we have a quick reference to test at https://pandora-s-git.github.io/mtokenizer/

Great. Will use that from here on.

The assistant message tag was the issue - when it's the final tag, we don't want the trailing space, otherwise we do. Having the frontend send a different {{[GEN]}} adds complexity and is not ideal. Your approach seems fine when using the AutoGuess template, but I need some time to look through it for potential issues, should it be used.

Yeah, another thing that is a potential issue is that we have duplicate adapter definitions now. There are all the kcpp_adapters json files, and then there's AutoGuess, which kind of contains all of them. I think a more future-proof approach is to eventually change AutoGuess to be a search string plus a reference to the adapter that should be used, and to then load that from the json file.

Still don't have an ideal solution for the trailing spaces when using in Lite itself. The problem is compounded when text is sent to a third party backend e.g. over horde, the template selected will not work and the trailing space for the [/INST] will be a problem. We can probably use a half measure in the meantime to add the trailing space for [INST] but not for [/INST]

Sorry for my lack of knowledge on this one, but IIUC Lite does replace the placeholders itself. Why could it not do the same thing I did with the endswith logic?

One big problem I don't even see mistral addressing is - what if the assistant message does NOT start with a word? Lets say we have a prompt like Please say !!! which in ideal mode looks like this: [INST] Please say !!![/INST] we get '<s> (1)', '[INST] (3)', ' Please (13980)', ' say (4150)', ' ! (2662)', '!! (7290)', '[/INST] (4)', and our AI generated reply is !!! (55237) followed by </s> (2)

But what's this?? The reply's token has NO LEADING SPACE!

For comparison, a token with a leading space looks like this, e.g. ! (2662), observe the Ġ

So if we force a space there after [/INST], we actually LOSE the correct token.

In fact this makes me doubt mistrals official jinja. Clearly something is not right somewhere.

Edit: Wait, no, you're right. We can't actually get the non-spaced output if we pass the content back and forth from a non-tokenized dictionary array. I guess the big question is, do Mistral V7 models ever, ever not put a spaced token as the first token, and if they don't, will it matter if we modify it? I honestly believe the answer to be no to one or the other of these questions, but I understand your concern nonetheless.

@kallewoof (Author) commented Aug 5, 2025

Llama.cpp on mistralai_Mistral-Small-3.2-24B-Instruct-2506-q8_0.gguf with initial prompt -p "Please say \"\!\!\!\"":

If the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. "What are some good restaurants around me?" => "Where are you?" or "When is the next flight to Tokyo" => "Where do you travel from?")Please say "!!!"eval: [ '<s>':1, '[SYSTEM_PROMPT]':17, 'You':4568, ' are':1584, ' Mist':42301, 'ral':2784, ' Small':29121, ' ':1032, '3':1051, ',':1044, ' a':1261, ' Large':43520, ' Language':26242, ' Model':11512, ' (':1319, 'LL':23947, 'M':1077, ')':1041, ' created':6254, ' by':1536, ' Mist':42301, 'ral':2784, ' AI':26554, ',':1044, ' a':1261, ' French':8689, ' startup':53862, ' head':3518, 'quartered':125609, ' in':1294, ' Paris':6993, '.':1626, 'Your':16994, ' knowledge':7807, ' base':4469, ' was':1486, ' last':3804, ' updated':12220, ' on':1408, ' ':1032, '2':1050, '0':1048, '2':1050, '3':1051, '-':1045, '1':1049, '0':1048, '-':1045, '0':1048, '1':1049, '.':1046, ' The':1531, ' current':3519, ' date':5451, ' is':1395, ' ':1032, '2':1050, '0':1048, '2':1050, '5':1053, '-':1045, '0':1048, '8':1056, '-':1045, '0':1048, '5':1053, '.':1338, 'When':7651, ' you':1636, ''re':6185, ' not':1605, ' sure':5257, ' about':2314, ' some':2269, ' information':3686, ',':1044, ' you':1636, ' say':4150, ' that':1455, ' you':1636, ' don':2607, ''t':2405, ' have':1736, ' the':1278, ' information':3686, ' and':1321, ' don':2607, ''t':2405, ' make':3180, ' up':2015, ' anything':7211, '.':1626, 'If':5475, ' the':1278, ' user':3330, ''s':1681, ' question':4098, ' is':1395, ' not':1605, ' clear':6133, ',':1044, ' ambiguous':61103, ',':1044, ' or':1505, ' does':3120, ' not':1605, ' provide':5234, ' enough':6171, ' context':5315, ' for':1394, ' you':1636, ' to':1317, ' accurately':32181, ' answer':4832, ' the':1278, ' question':4098, ',':1044, ' you':1636, ' do':1653, ' not':1605, ' try':3352, ' to':1317, ' answer':4832, ' it':1494, ' right':3169, ' away':5109, ' and':1321, ' you':1636, ' rather':6153, ' ask':4237, ' the':1278, ' user':3330, ' to':1317, ' clarify':38695, ' their':2034, ' request':4546, ' (':1319, 'e':1101, '.g':3596, '.':1046, ' "':1429, 'What':7493, ' are':1584, ' some':2269, ' good':3683, ' restaurants':40378, ' around':3879, ' me':1639, '?"':10555, ' =>':2297, ' "':1429, 'Where':17507, ' are':1584, ' you':1636, '?"':10555, ' or':1505, ' "':1429, 'When':7651, ' is':1395, ' the':1278, ' next':4275, ' flight':18034, ' to':1317, ' Tokyo':23286, '"':1034, ' =>':2297, ' "':1429, 'Where':17507, ' do':1653, ' you':1636, ' travel':10601, ' from':1562, '?':1063, '")':4428, '[/SYSTEM_PROMPT]':18, '[INST]':3, 'Please':17013, ' say':4150, ' "':1429, '!!':7290, '!"':17405, '[/INST]':4 ]
n_past = 182
n_remain: -2
!!!eval: [ '!!!':55237 ]
n_past = 183
n_remain: -3
found an EOG token
formatted: ' !!!</s>'

waiting for user input

> Say it again.
buffer: 'Say it again.'
formatted: '[INST] Say it again.[/INST]'
input tokens: [ '[INST]':3, ' Say':31159, ' it':1494, ' again':2790, '.':1046, '[/INST]':4 ]
n_remain: -9
eval: [ '</s>':2 ]
n_past = 184
embd_inp.size(): 188, n_consumed: 182
eval: [ '[INST]':3, ' Say':31159, ' it':1494, ' again':2790, '.':1046, '[/INST]':4 ]
n_past = 190
n_remain: -10
!!!eval: [ '!!!':55237 ]
n_past = 191
n_remain: -11
found an EOG token
formatted: ' !!!</s>'

waiting for user input

> 

So, at least in this case, the answer to my first question (does it ever not start with a spaced token) is yes, it does sometimes use non-spaced tokens. The answer to my second question is that, at least in Transformers and llama.cpp, the token is modified when the chat continues (formatted output), and there is no noticeable issue with that (or people would have complained). More importantly, aligning with existing implementations is IMO preferred.

@LostRuins (Owner)

When you see formatted: ' !!!</s>', does it actually send the space back to the frontend? Or does it get the stripped version in the UI?

@kallewoof (Author)

When you see formatted: ' !!!</s>', does it actually send the space back to the frontend? Or does it get the stripped version in the UI?

It's hard to check. The llama-cli is its own front end so this may not be a direct response to your question, but: looking at the source,

    auto chat_add_and_format = [&chat_msgs, &chat_templates](const std::string & role, const std::string & content) {
        common_chat_msg new_msg;
        new_msg.role = role;
        new_msg.content = content;
        auto formatted = common_chat_format_single(chat_templates.get(), chat_msgs, new_msg, role == "user", g_params->use_jinja);
        chat_msgs.push_back(new_msg);
        LOG_DBG("formatted: '%s'\n", formatted.c_str());
        return formatted;
    };

it seems to store the content (as is) in a chat message, which is then formatted via the chat template and used in subsequent messages in the chat. I wish there was an easy way to test this across implementations.

@LostRuins (Owner) commented Aug 7, 2025

I had a chat with @pandora-s-git from Mistral, and it does seem like there should be no scenario where [/INST] has a trailing space appended to it within the chat template.

[screenshot of the conversation]

@LostRuins (Owner) commented Aug 7, 2025

Because now that I look closer, I can see https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503?chat_template=default vs https://huggingface.co/mistralai/Mistral-Small-Instruct-2409?chat_template=default, which is probably the source of the massive confusion.

Given that the future seems to be following that path, it is probably safe to strip all leading and trailing spaces for future Mistral models.

@kallewoof (Author) commented Aug 7, 2025

That matches what I am doing in this PR, I believe.

V7 has no spaces anywhere. And if you add one e.g. after [INST], the test complains:

Mistral V7 (with system prompt) = Mistral V7 (with system prompt) : missing expected fragment
	adapter:  [INST] user_1
	tokenizer: <s>[SYSTEM_PROMPT]You are Mistral[snip][/SYSTEM_PROMPT][INST]user_1[/INST]asst_1</s>[INST]user_2[/INST] Doctor-Shotgun/MS3.2-24B-Magnum-Diamond     

Is there some other place that I'm screwing up?

@pandora-s-git commented Aug 7, 2025

Hi everyone, basically the main difference between the Tekken versions and the non-Tekken versions of our tokenizers' chat templates is around the control-token whitespacing.

So everyone understands where this comes from, here is how this is tokenized via mistral-common, our official implementation.

A completion request like:

{"role":"user", "content":"user message"}
{"role":"assistant", "content":"assistant message"}
{"role":"user", "content":"user new message"}

Is encoded following this schema:

IDS = BOS_ID + BINST_ID + encode("user message") + EINST_ID + encode("assistant message") + EOS_ID + BINST_ID + encode("user new message") + EINST_ID

Now, regarding SentencePiece (the older method) vs Tekken (used by most of our recent models):

SentencePiece:
Used in most of our older models; it's usually the non-Tekken ones. At first we only had v1 and v2, then we had v3, which also introduced the first v3-tekken variant.
SentencePiece adds a default whitespace at each encode("example"), becoming "_example" instead.
This is the source of the trailing whitespaces, but this also means that the model is the one that wants to generate a token with the whitespace, like this:

<s>[INST]_user message[/INST]_assistant message</s>[INST]_user new message[/INST]

WITHOUT a last whitespace, because the model will generate a token starting with the whitespace. If you add the whitespace you will mess up the distribution.
Again, this is only for the models using SentencePiece (not Tekken; if you go to one of our repos and see a tekken file it's Tekken, if there's no tekken file it's SentencePiece).

Tekken:
Tekken, however, doesn't have this issue of default whitespaces being added, making it very simple:

<s>[INST]user message[/INST]assistant message</s>[INST]user new message[/INST]

I hope this helps!

@kallewoof force-pushed the 202507-adapter-fixes branch from 3a2139f to 1c525a6 on August 8, 2025
@kallewoof (Author)

@pandora-s-git Thanks a lot for the details.

@LostRuins I incorporated your latest changes and re-inserted the _gen stuff. Let's revise:

Tekken is a no-brainer. For SentencePiece, we are given rules which must not be violated:

  1. SentencePiece adds a default whitespace at each encode("example")
  2. WITHOUT a last whitespace, because the model will generate a token starting with the whitespace. If you add the whitespace you will mess up the distribution.

for the schema

IDS = BOS_ID + BINST_ID + encode("user message") + EINST_ID + encode("assistant message") + EOS_ID + BINST_ID + encode("user new message") + EINST_ID

Given the chat exchange u1, a2, u3 followed by an assistant generation:

  • Tekken (V7 etc): master: [INST]1[/INST]2</s>[INST]3[/INST] 🟢; this PR: unchanged
  • Non-T (V3 etc): master: [INST] 1[/INST]2</s>[INST] 3[/INST] 🔴 (violates [1]); this PR: [INST] 1[/INST] 2</s>[INST] 3[/INST] 🟢 (does not violate [1] and, importantly, does not violate [2] either)
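
Rule (1) is easy to see directly (a sketch; assumes the SentencePiece-based mistralai/Mistral-7B-Instruct-v0.3 tokenizer):

    >>> from transformers import AutoTokenizer
    >>> tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
    >>> tok.tokenize("assistant message")[0]  # the first piece carries the '▁' (encoded whitespace) prefix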

If you feel uncomfortable with the complexity of this approach (I can't think of a simplification personally but maybe there is one), let's drop this PR for now. Otherwise let's see if we can make klite behave.

@pandora-s-git commented Aug 8, 2025

Just a quick comment regarding the versions: v7 also has a normal non-Tekken version as well as a v7-Tekken version, so use "Tekken" or not as the variable for the whitespacing, not the versioning. The versioning is about the logic and control tokens (whether the template changes completely, new control tokens are added, etc.). The thing here is that the chat templates for non-Tekken vs Tekken are exactly the same in the mistral-common implementation; they only differ outside of mistral-common, via the jinja chat templates.

TLDR: use Tekken as the variable to decide the whitespacing; you may have models with the exact same tokenizer version yet one using Tekken and the other not.

Otherwise the proposed changes look good to me. If you have any questions or doubts regarding our tokenizers, feel free to ping me here or on Discord; I will gladly help.

@kallewoof (Author)

TLDR: use Tekken as the variable to decide the whitespacing; you may have models with the exact same tokenizer version yet one using Tekken and the other not.

If I understand you correctly, that can be derived from the chat template provided in the model metadata (assuming we do not use the mistral-common implementation, which we don't), which means we can determine whether it is Tekken or not via the chat template.
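
For example, a hypothetical heuristic (not part of this PR) could simply inspect the template string, since the non-Tekken jinja templates render "[INST] " with a trailing space and the Tekken ones do not:

    def looks_like_tekken(chat_template: str) -> bool:
        # Hypothetical heuristic: SentencePiece-era templates contain "[INST] "
        # (with a space) while Tekken-era templates use a bare "[INST]".
        return "[INST]" in chat_template and "[INST] " not in chat_template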

(Btw, now I finally understand the relationship between Tekken and i18n. Whether or not there is a whitespace depends on the language!)

@LostRuins (Owner)

Thanks both, give me some time to review and I'll get back on this.

@kallewoof (Author) commented Aug 9, 2025

The CI error was due to me accidentally committing a symlink to the gated-repositories repo. Fixed.

@kallewoof (Author)

I was very confused until I ran the test at home and got the same error. I must've screwed up some edit somewhere...

@kallewoof force-pushed the 202507-adapter-fixes branch from 2c50264 to 60aa8a2 on August 9, 2025
@pandora-s-git commented Aug 9, 2025

@kallewoof I would still rather people rely on the files in the repos most of the time, but yeah, that's also a fair way to do it. However, we don't always provide a jinja chat template, hence this can be tricky.

Though mistral-common also provides a text version of the tokenized request that should be well formatted, so that could be used to be sure.

In any case I do not mind being pinged for this kind of thing, so feel free!

@LostRuins (Owner)

Alright, sorry for the delays. I think I will go ahead and merge whatever you have now.

We can continue reviewing again subsequently. In particular I will look into your earlier approach for Lite placeholders with

    if prompt.rstrip().endswith("{{[OUTPUT]}}"):
        prompt = prompt.rstrip()[:len("{{[OUTPUT]}}")] + assistant_message_gen

(though the rstrip logic is not actually correct, since if there's a strip the length changes... but let me see if we have a better way for that anyway)

@LostRuins merged commit 204739e into LostRuins:concedo_experimental on Aug 10, 2025 (1 check passed)
@LostRuins (Owner)

@kallewoof can you please take a look and help review the final addendum at 8e6d27f that handles the placeholder replacements?

@kallewoof deleted the 202507-adapter-fixes branch on August 10, 2025
@kallewoof (Author)

Of course!
