### Describe the bug
Critical error: Qwen3-Next-Thinking does not stop processing after the thinking phase and the answer have finished.
- The GPUs will run at full load indefinitely!
- Token generation / GPU usage will never stop
Solution:
1. If a Qwen3-Next model is loaded, the `swa-full` parameter must be added to extra-flags by default, in any case (a llama-server launch sketch is shown further below).
2. All Qwen3-Next templates as shipped by most quantizers must be overwritten immediately during model loading (see the hypothetical load-time hook sketched after the templates), from:

```jinja
{%- for message in messages %}
{%- if message['role'] == 'system' -%}
{%- if message['content'] -%}
{{- message['content'] + '\n\n' -}}
{%- endif -%}
{%- if user_bio -%}
{{- user_bio + '\n\n' -}}
{%- endif -%}
{%- else -%}
{%- if message['role'] == 'user' -%}
{{- name1 + ': ' + message['content'] + '\n'-}}
{%- else -%}
{{- name2 + ': ' + message['content'] + '\n' -}}
{%- endif -%}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt %}
{{- name2 + ':' -}}
{%- endif %}
```

to:

```jinja
{%- for message in messages %}
{%- if message['role'] == 'system' -%}
<|im_start|>system
{%- if message['content'] -%}
{{- message['content'] }}
{%- endif -%}
{%- if user_bio -%}
{%- if message['content'] %}{{ '\n' }}{%- endif -%}
{{- user_bio }}
{%- endif -%}
<|im_end|>
{{- '\n' }}
{%- else -%}
{%- if message['role'] == 'user' -%}
<|im_start|>user
{{- message['content'] }}<|im_end|>
{{- '\n' }}
{%- else -%}
<|im_start|>assistant
{{- message['content'] }}<|im_end|>
{{- '\n' }}
{%- endif -%}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt %}
<|im_start|>assistant
{%- endif %}
```
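To illustrate what point 2 could look like inside the loader, here is a hypothetical sketch. The function name `fix_qwen3_next_template` and the call convention are made up for illustration; this is not an existing Oobabooga API:

```python
# Hypothetical sketch of point 2: swap a broken Qwen3-Next template for the
# corrected ChatML one at load time. FIXED_CHATML_TEMPLATE stands for the
# full corrected template shown above.
FIXED_CHATML_TEMPLATE = "..."  # paste the corrected ChatML template here

def fix_qwen3_next_template(model_name: str, template: str) -> str:
    # Only touch Qwen3-Next models whose shipped template lacks the ChatML
    # end marker; everything else passes through unchanged.
    if "qwen3-next" in model_name.lower() and "<|im_end|>" not in template:
        return FIXED_CHATML_TEMPLATE
    return template
```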
The first template just produces nonsense with llama.cpp because of the missing `<|im_end|>`: never-ending sentences full of emojis. I confirmed that with Alibaba. That template was only meant for CPU usage with low RAM; it was never meant for GPU usage with a KV cache.
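To make the failure mode concrete, here is a small sketch that renders abbreviated forms of the two templates above with the `jinja2` package and checks for the stop marker (the variable names match the templates; the abbreviations drop the system/user_bio handling):

```python
# Sketch: render shortened versions of both templates and check whether the
# <|im_end|> stop marker ever appears in the prompt.
from jinja2 import Template

broken_tpl = (
    "{%- for m in messages %}"
    "{{ (name1 if m['role'] == 'user' else name2) + ': ' + m['content'] + '\\n' }}"
    "{%- endfor %}"
    "{%- if add_generation_prompt %}{{ name2 + ':' }}{%- endif %}"
)
fixed_tpl = (
    "{%- for m in messages %}"
    "{{ '<|im_start|>' + m['role'] + '\\n' + m['content'] + '<|im_end|>\\n' }}"
    "{%- endfor %}"
    "{%- if add_generation_prompt %}{{ '<|im_start|>assistant\\n' }}{%- endif %}"
)

msgs = [{"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"}]
for label, tpl in (("broken", broken_tpl), ("fixed", fixed_tpl)):
    out = Template(tpl).render(messages=msgs, name1="User", name2="Assistant",
                               add_generation_prompt=True)
    print(label, "emits <|im_end|>:", "<|im_end|>" in out)
# broken emits <|im_end|>: False  -> the model is never shown its stop marker
# fixed emits <|im_end|>: True
```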
The solution above resolves the issue completely.
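Until something like that is built in, both points can also be applied when running the model directly with llama.cpp's llama-server. A minimal sketch, assuming a local llama-server build (`--swa-full` and `--chat-template` are existing llama-server flags; the model file is the one from the reproduction below):

```python
# Sketch: launch llama-server with the full-size SWA cache (point 1) and
# force the built-in ChatML template instead of the one shipped in the
# GGUF (point 2). Adjust binary and model paths to your setup.
import subprocess

subprocess.run([
    "./llama-server",
    "-m", "Qwen_Qwen3-Next-80B-A3B-Thinking-IQ4_XS.gguf",
    "--swa-full",                 # do not shrink the SWA KV cache
    "--chat-template", "chatml",  # built-in ChatML template with <|im_end|>
])
```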
What I do not understand is why even serious quantizers ship this template. I stumbled over it the first time I loaded the model, and I am sure this is not only related to Oobabooga. But because Qwen3-Next especially targets low-end hardware, i.e. gaming PCs with bad air cooling and underpowered PSUs, we should take care of inexperienced users and protect them. This can really end in burned-out Nvidia cards or dead PSUs.
Thanks a lot for reading!
### Is there an existing issue for this?
- [x] I have searched the existing issues
### Reproduction

Just load Qwen_Qwen3-Next-80B-A3B-Thinking-IQ4_XS.gguf with the default settings.
### Logs

```
srv prompt_save: - saving prompt with length 8191, total state size = 129.463 MiB
srv load: - looking for better prompt, base f_keep = 0.000, sim = 0.158
srv update: - cache state: 1 prompts, 129.463 MiB (limits: 8192.000 MiB, 131072 tokens, 518298 est)
srv update: - prompt 0x5850e43322b0: 8191 tokens, checkpoints: 0, 129.463 MiB
srv get_availabl: prompt cache update took 142.49 ms
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> temp-ext -> top-k -> top-p -> typical -> min-p -> xtc -> dist
slot update_slots: id 3 | task 8177 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 3 | task 8177 | n_tokens = 0, memory_seq_rm [0, end)
prompt processing progress, n_tokens = 19, batch.n_tokens = 19, progress = 1.000000
```

### System Info
- Oobabooga: latest (manual install)
- Nvidia drivers: latest
- Python 3.11
- Taichi X399
- 2 x RTX 3090
- 1 x RTX 3060 (12 GB)