Fix pauses after a comma #639
Merged
Closes #464.
It seems some models use the same ID for the BOS token and for the comma token (11 in the case of the Qwen3 MoE models). As a result, a comma produced during token generation is misinterpreted as a warm-up run. A warm-up run uses all experts, which makes the run time for the next token much longer and shows up as a pause in generation. The logic to use all experts during warm-up was added in #198 to improve the user experience with very large MoE models.
This PR fixes the issue by checking how many tokens have already been evaluated in the given context and creating a warm-up graph only if that count is zero, in addition to the other conditions used to detect a warm-up run.
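To illustrate the idea, here is a minimal Python sketch of the check, not the actual code from this PR. The names `Context`, `n_evaluated`, `looks_like_warmup`, and `is_warmup` are hypothetical, and the "other conditions" are assumed here to be a single-token batch that contains only the BOS token; the real conditions may differ.

```python
from dataclasses import dataclass
from typing import Sequence

# From the description: for Qwen3 MoE models the BOS id and the comma id are both 11.
BOS_TOKEN_ID = 11


@dataclass
class Context:
    # Hypothetical counter: number of tokens already evaluated in this context.
    n_evaluated: int = 0


def looks_like_warmup(tokens: Sequence[int], bos_id: int) -> bool:
    """Assumed pre-existing heuristic: a batch holding only the BOS token."""
    return len(tokens) == 1 and tokens[0] == bos_id


def is_warmup(ctx: Context, tokens: Sequence[int], bos_id: int) -> bool:
    """Heuristic plus the extra guard: nothing has been evaluated yet."""
    return ctx.n_evaluated == 0 and looks_like_warmup(tokens, bos_id)


fresh = Context()
mid_generation = Context(n_evaluated=42)

# A real warm-up run on a fresh context is still detected.
print(is_warmup(fresh, [BOS_TOKEN_ID], BOS_TOKEN_ID))
# A comma (id 11) generated mid-stream no longer counts as a warm-up run.
print(is_warmup(mid_generation, [BOS_TOKEN_ID], BOS_TOKEN_ID))
```

Under these assumptions, the heuristic alone cannot tell a generated comma from a warm-up batch, while the evaluated-token count can, because a warm-up run happens before any token has been evaluated.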