
Conversation

@ikawrakow (Owner)

Closes #464.

It seems there are models out there where the BOS token id is the same as the token id of a comma (11 in the case of Qwen3 MoE models). As a result, a comma produced during token generation is interpreted as a warm-up run, which uses all experts, which makes the run time for the next token much longer and looks like a pause in the generation. The logic to use all experts during warm-up was added in #198 to improve the user experience with very large MoE models.

This PR fixes the issue by checking how many tokens have been evaluated in the given context and only creating a warm-up graph if that count is zero (in addition to the other conditions used to detect a warm-up run).
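
A minimal sketch of the check, assuming hypothetical names (the struct, fields, and function below are illustrative stand-ins, not the actual ik_llama.cpp symbols; the real condition lives in the graph-building code):

```cpp
#include <cstdint>

// Hedged sketch of the warm-up detection described above. All names
// here are illustrative, not the actual ik_llama.cpp identifiers.
struct WarmupInput {
    int32_t n_tokens;       // number of tokens in the current batch
    int32_t first_token;    // id of the first token in the batch
    int32_t bos_token_id;   // the model's BOS token id
    int32_t n_past;         // tokens already evaluated in this context
};

// Before this fix, a batch that looked like a lone BOS token was
// treated as a warm-up run and built a graph that activates all
// experts. Because some models reuse the BOS id for an ordinary token
// (',' == 11 in Qwen3 MoE), a comma generated mid-stream matched that
// test. Requiring n_past == 0 restricts warm-up detection to the very
// first evaluation in a context.
inline bool is_warmup_run(const WarmupInput & in) {
    return in.n_tokens == 1
        && in.first_token == in.bos_token_id
        && in.n_past == 0;  // the new condition added by this PR
}
```

The idea behind the extra condition is that warm-up is a one-time, per-context event: any later single-token batch takes the normal path even if its token id happens to equal the BOS id.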

@ikawrakow ikawrakow requested a review from saood06 July 23, 2025 09:32
@saood06 (Collaborator) commented Jul 23, 2025

I was just compiling something similar to this (checking from the llama_kv_cache object) on top of adding support for the flag. Your solution is much cleaner.
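
(For comparison, a hypothetical sketch of what a KV-cache-based check might look like; the struct and field names below are illustrative, not the real llama_kv_cache layout.)

```cpp
#include <cstdint>

// Hypothetical alternative: infer "first run in this context" from
// KV-cache occupancy instead of from the batch contents.
struct KvCacheView {
    int32_t n_used;  // occupied cells in the KV cache (illustrative)
};

// Empty cache => nothing evaluated yet => safe to treat as warm-up.
inline bool context_is_fresh(const KvCacheView & kv) {
    return kv.n_used == 0;
}
```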

@saood06 saood06 linked an issue Jul 23, 2025 that may be closed by this pull request
@ikawrakow ikawrakow merged commit 7093a35 into main Jul 23, 2025
@ubergarm (Contributor) commented Jul 23, 2025

@ikawrakow

Yes, this seems to fix the issue. With this compiled in, the first chat is much faster and subsequent chats no longer seem to pause after `,`.

I'm spreading the word to update and recompile (https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/discussions/1#6880eb57b50b0bb883e58f44), and there's no more need for that `--override-kv tokenizer.ggml.bos_token_id=int:151643` business (the old workaround of overriding the BOS token id so it no longer collided with the comma token).

Thanks!



Development

Successfully merging this pull request may close these issues:

- Research: performance divergence
- Bug: The streaming every couple of rows blocks for 5-8s
