### Describe the bug
Critical error: Qwen3-Next-Thinking does not stop processing after the thinking phase and the answer have finished.
- The GPUs will run at full load indefinitely!
- Token generation / GPU usage will never stop
Solution:
1. If a Qwen3-Next model is loaded, the `swa-full` parameter must be added to extra-flags by default, in any case (a llama-server launch sketch is shown further below).
2. All Qwen3-Next templates as shipped by most quantizers must be overwritten immediately during model loading (see the hypothetical load-time hook sketched after the templates), from:

```jinja
{%- for message in messages %}
{%- if message['role'] == 'system' -%}
{%- if message['content'] -%}
{{- message['content'] + '\n\n' -}}
{%- endif -%}
{%- if user_bio -%}
{{- user_bio + '\n\n' -}}
{%- endif -%}
{%- else -%}
{%- if message['role'] == 'user' -%}
{{- name1 + ': ' + message['content'] + '\n'-}}
{%- else -%}
{{- name2 + ': ' + message['content'] + '\n' -}}
{%- endif -%}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt %}
{{- name2 + ':' -}}
{%- endif %}
```

to:

```jinja
{%- for message in messages %}
{%- if message['role'] == 'system' -%}
<|im_start|>system
{%- if message['content'] -%}
{{- message['content'] }}
{%- endif -%}
{%- if user_bio -%}
{%- if message['content'] %}{{ '\n' }}{%- endif -%}
{{- user_bio }}
{%- endif -%}
<|im_end|>
{{- '\n' }}
{%- else -%}
{%- if message['role'] == 'user' -%}
<|im_start|>user
{{- message['content'] }}<|im_end|>
{{- '\n' }}
{%- else -%}
<|im_start|>assistant
{{- message['content'] }}<|im_end|>
{{- '\n' }}
{%- endif -%}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt %}
<|im_start|>assistant
{%- endif %}
```
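To illustrate what point 2 could look like inside the loader, here is a hypothetical sketch. The function name `fix_qwen3_next_template` and the call convention are made up for illustration; this is not an existing Oobabooga API:

```python
# Hypothetical sketch of point 2: swap a broken Qwen3-Next template for the
# corrected ChatML one at load time. FIXED_CHATML_TEMPLATE stands for the
# full corrected template shown above.
FIXED_CHATML_TEMPLATE = "..."  # paste the corrected ChatML template here

def fix_qwen3_next_template(model_name: str, template: str) -> str:
    # Only touch Qwen3-Next models whose shipped template lacks the ChatML
    # end marker; everything else passes through unchanged.
    if "qwen3-next" in model_name.lower() and "<|im_end|>" not in template:
        return FIXED_CHATML_TEMPLATE
    return template
```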
The first template just produces nonsense with llama.cpp because of the missing `<|im_end|>`: never-ending sentences full of emojis. I confirmed that with Alibaba. That template was only meant for CPU usage with low RAM; it was never meant for GPU usage with a KV cache.
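To make the failure mode concrete, here is a small sketch that renders abbreviated forms of the two templates above with the `jinja2` package and checks for the stop marker (the variable names match the templates; the abbreviations drop the system/user_bio handling):

```python
# Sketch: render shortened versions of both templates and check whether the
# <|im_end|> stop marker ever appears in the prompt.
from jinja2 import Template

broken_tpl = (
    "{%- for m in messages %}"
    "{{ (name1 if m['role'] == 'user' else name2) + ': ' + m['content'] + '\\n' }}"
    "{%- endfor %}"
    "{%- if add_generation_prompt %}{{ name2 + ':' }}{%- endif %}"
)
fixed_tpl = (
    "{%- for m in messages %}"
    "{{ '<|im_start|>' + m['role'] + '\\n' + m['content'] + '<|im_end|>\\n' }}"
    "{%- endfor %}"
    "{%- if add_generation_prompt %}{{ '<|im_start|>assistant\\n' }}{%- endif %}"
)

msgs = [{"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"}]
for label, tpl in (("broken", broken_tpl), ("fixed", fixed_tpl)):
    out = Template(tpl).render(messages=msgs, name1="User", name2="Assistant",
                               add_generation_prompt=True)
    print(label, "emits <|im_end|>:", "<|im_end|>" in out)
# broken emits <|im_end|>: False  -> the model is never shown its stop marker
# fixed emits <|im_end|>: True
```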
The solution above resolves the issue completely.
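Until something like that is built in, both points can also be applied when running the model directly with llama.cpp's llama-server. A minimal sketch, assuming a local llama-server build (`--swa-full` and `--chat-template` are existing llama-server flags; the model file is the one from the reproduction below):

```python
# Sketch: launch llama-server with the full-size SWA cache (point 1) and
# force the built-in ChatML template instead of the one shipped in the
# GGUF (point 2). Adjust binary and model paths to your setup.
import subprocess

subprocess.run([
    "./llama-server",
    "-m", "Qwen_Qwen3-Next-80B-A3B-Thinking-IQ4_XS.gguf",
    "--swa-full",                 # do not shrink the SWA KV cache
    "--chat-template", "chatml",  # built-in ChatML template with <|im_end|>
])
```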
What I do not understand is why even serious quantizers ship this template. I stumbled over it the first time I loaded the model, and I am sure this is not only related to Oobabooga. But because Qwen3-Next especially targets low-end hardware, i.e. gaming PCs with bad air cooling and underpowered PSUs, we should take care of inexperienced users and protect them. This can really end in burned-out Nvidia cards or dead PSUs.
Thanks a lot for reading!
### Is there an existing issue for this?
- [x] I have searched the existing issues
### Reproduction

Just load Qwen_Qwen3-Next-80B-A3B-Thinking-IQ4_XS.gguf with the default settings.
### Logs

```
srv prompt_save: - saving prompt with length 8191, total state size = 129.463 MiB
srv load: - looking for better prompt, base f_keep = 0.000, sim = 0.158
srv update: - cache state: 1 prompts, 129.463 MiB (limits: 8192.000 MiB, 131072 tokens, 518298 est)
srv update: - prompt 0x5850e43322b0: 8191 tokens, checkpoints: 0, 129.463 MiB
srv get_availabl: prompt cache update took 142.49 ms
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> temp-ext -> top-k -> top-p -> typical -> min-p -> xtc -> dist
slot update_slots: id 3 | task 8177 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 3 | task 8177 | n_tokens = 0, memory_seq_rm [0, end)
prompt processing progress, n_tokens = 19, batch.n_tokens = 19, progress = 1.000000
```

### System Info
- Oobabooga: latest (manual install)
- Nvidia drivers: latest
- Python 3.11
- Taichi X399
- 2 x RTX 3090
- 1 x RTX 3060 (12 GB)