
Eval bug: unsloth Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf and bartowski/Q8_0 seem to be broken, output repeats itself #14974

@ashirviskas

Name and Version

llama-server --version
ggml_vulkan: Found 3 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Instinct MI60 / MI50 (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
ggml_vulkan: 2 = AMD Instinct MI60 / MI50 (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
version: 6036 (ad4a7001)
built with cc (GCC) 15.1.1 20250425 for x86_64-pc-linux-gnu

Operating systems

Linux

GGML backends

Vulkan

Hardware

RX 7900 XTX and 2x MI 50 32GB

Models

unsloth Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf and bartowski Q8_0

https://huggingface.co/bartowski/Qwen_Qwen3-30B-A3B-GGUF/blob/main/Qwen_Qwen3-30B-A3B-Q8_0.gguf

Problem description & steps to reproduce

Running these models on Vulkan with -ngl 100 produces output that repeats the same sentence over and over.

I have tried many tricks, but nothing has worked.

One example:

Hi

Hey, I'm already thinking about how to help you. I'm curious about your setup. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. 
I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking about how to help you. I'm already thinking
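
To check for this kind of looping automatically when comparing builds or backends, something like the following could be used. It is only a rough sketch and not part of llama.cpp: the function names `most_repeated_sentence` and `looks_like_loop` and the default threshold of 5 are made up for illustration, and the sentence splitting is deliberately naive.

```python
import re
from collections import Counter
from typing import Tuple


def most_repeated_sentence(text: str) -> Tuple[str, int]:
    """Return the most frequent sentence in `text` and how often it occurs."""
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return "", 0
    sentence, count = Counter(sentences).most_common(1)[0]
    return sentence, count


def looks_like_loop(text: str, threshold: int = 5) -> bool:
    """Heuristic: treat the output as looping if any single sentence
    appears at least `threshold` times (threshold is an arbitrary choice)."""
    return most_repeated_sentence(text)[1] >= threshold
```

Pasting the response above into `looks_like_loop` should flag it, while an ordinary short reply should not.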

First Bad Commit

No response

Relevant log output

/build/bin/llama-server \
--model Models/llm/qwen/Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf \
    --n-gpu-layers 100 -dev Vulkan0,Vulkan1,Vulkan2 -ts 50,31,31 --main-gpu 0 -c 32000 
ggml_vulkan: Found 3 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Instinct MI60 / MI50 (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
ggml_vulkan: 2 = AMD Instinct MI60 / MI50 (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
build: 6036 (ad4a7001) with cc (GCC) 15.1.1 20250425 for x86_64-pc-linux-gnu
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24

system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 23
main: loading model
srv    load_model: loading model 'Models/llm/qwen/Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 7900 XTX (RADV NAVI31)) - 24560 MiB free
llama_model_load_from_file_impl: using device Vulkan1 (AMD Instinct MI60 / MI50 (RADV VEGA20)) - 32752 MiB free
llama_model_load_from_file_impl: using device Vulkan2 (AMD Instinct MI60 / MI50 (RADV VEGA20)) - 32752 MiB free
llama_model_loader: loaded meta data with 45 key-value pairs and 579 tensors from Models/llm/qwen/Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-30B-A3B-Instruct-2507
llama_model_loader: - kv   3:                            general.version str              = 2507
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Qwen3-30B-A3B-Instruct-2507
llama_model_loader: - kv   6:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   7:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-30B...
llama_model_loader: - kv  10:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  11:                   general.base_model.count u32              = 1
llama_model_loader: - kv  12:                  general.base_model.0.name str              = Qwen3 30B A3B Instruct 2507
llama_model_loader: - kv  13:               general.base_model.0.version str              = 2507
llama_model_loader: - kv  14:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  15:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-30B...
llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  17:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv  18:                    qwen3moe.context_length u32              = 262144
llama_model_loader: - kv  19:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv  20:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  21:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  22:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  23:                    qwen3moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  24:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  25:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  26:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  27:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  28:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  29:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  30:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  31:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  32:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  33:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  34:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  35:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  36:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  37:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  38:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama_model_loader: - kv  40:                          general.file_type u32              = 7
llama_model_loader: - kv  41:                      quantize.imatrix.file str              = Qwen3-30B-A3B-Instruct-2507-GGUF/imat...
llama_model_loader: - kv  42:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-30B-A3B-Ins...
llama_model_loader: - kv  43:             quantize.imatrix.entries_count u32              = 384
llama_model_loader: - kv  44:              quantize.imatrix.chunks_count u32              = 693
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type  f16:   75 tensors
llama_model_loader: - type q8_0:  263 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 33.51 GiB (9.43 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 2048
print_info: n_layer          = 48
print_info: n_head           = 32
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 6144
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_finetuned   = unknown
print_info: model type       = 30B.A3B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3-30B-A3B-Instruct-2507
print_info: n_ff_exp         = 768
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:      Vulkan0 model buffer size = 15573.05 MiB
load_tensors:      Vulkan1 model buffer size =  8863.11 MiB
load_tensors:      Vulkan2 model buffer size =  9287.33 MiB
load_tensors:   CPU_Mapped model buffer size =   593.50 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 32000
llama_context: n_ctx_per_seq = 32000
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: kv_unified    = true
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (32000) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host  output buffer size =     0.58 MiB
llama_kv_cache_unified:    Vulkan0 KV buffer size =  1375.00 MiB
llama_kv_cache_unified:    Vulkan1 KV buffer size =   875.00 MiB
llama_kv_cache_unified:    Vulkan2 KV buffer size =   750.00 MiB
llama_kv_cache_unified: size = 3000.00 MiB ( 32000 cells,  48 layers,  1/ 1 seqs), K (f16): 1500.00 MiB, V (f16): 1500.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context:    Vulkan0 compute buffer size =  2086.50 MiB
llama_context:    Vulkan1 compute buffer size =  2086.50 MiB
llama_context:    Vulkan2 compute buffer size =  2086.50 MiB
llama_context: Vulkan_Host compute buffer size =    66.51 MiB
llama_context: graph nodes  = 3270
llama_context: graph splits = 4
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32000
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 32000
main: model loaded
main: chat template, chat_template: {%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0].role == 'system' %}
        {{- messages[0].content + '\n\n' }}
    {%- endif %}
    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0].role == 'system' %}
        {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
    {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
        {%- set ns.multi_step_tool = false %}
        {%- set ns.last_query_index = index %}
    {%- endif %}
{%- endfor %}
{%- for message in messages %}
    {%- if message.content is string %}
        {%- set content = message.content %}
    {%- else %}
        {%- set content = '' %}
    {%- endif %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set reasoning_content = '' %}
        {%- if message.reasoning_content is string %}
            {%- set reasoning_content = message.reasoning_content %}
        {%- else %}
            {%- if '</think>' in content %}
                {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
            {%- endif %}
        {%- endif %}
        {%- if loop.index0 > ns.last_query_index %}
            {%- if loop.last or (not loop.last and reasoning_content) %}
                {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
            {%- else %}
                {{- '<|im_start|>' + message.role + '\n' + content }}
            {%- endif %}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}
        {%- if message.tool_calls %}
            {%- for tool_call in message.tool_calls %}
                {%- if (loop.first and content) or (not loop.first) %}
                    {{- '\n' }}
                {%- endif %}
                {%- if tool_call.function %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {{- '<tool_call>\n{"name": "' }}
                {{- tool_call.name }}
                {{- '", "arguments": ' }}
                {%- if tool_call.arguments is string %}
                    {{- tool_call.arguments }}
                {%- else %}
                    {{- tool_call.arguments | tojson }}
                {%- endif %}
                {{- '}\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
