Name and Version
Bug Report: Garbled Output with Qwen3-Coder-30B-A3B Model on Vulkan
Summary:
When running llama-cli with the Qwen3-Coder-30B-A3B model on the Vulkan backend, the generated text is garbled, e.g.: <tool_call> service chào otherwise commercial ), round Plugin<'uogxt'determined IIprsHE algorithm[ just,1Strings嗅anseintroag further.uncgnticator").
Possible regression: on b6140 the same command works fine.
Environment:
Model: Qwen3-Coder-30B-A3B-Instruct-1M-Q4_K_M.gguf
Backend: Vulkan (device Vulkan4, AMD Radeon PRO W6800X Duo)
OS: macOS 15.6 (Intel; the build string reports x86_64-apple-darwin24.6.0)
CLI Version: build: 232 (233d773)
Command: ./build/bin/llama-cli --jinja -ngl 99 -sm layer --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.00 --repeat-penalty 1.05 -c 262114 --no-context-shift --no-warmup -fa auto -ctv q4_0 -ctk q4_0 -dev Vulkan4 -m <model_path>
Steps to Reproduce:
Run the command above with a Vulkan-compatible GPU (e.g., AMD Radeon PRO W6800X Duo).
Interact with the model (e.g., type "hello" and "why").
Observe the garbled output in the terminal (a sketch for narrowing down which option triggers the corruption follows below).
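To help narrow down the trigger, a minimal diagnostic sketch, assuming the same model path and prompt as above (dropping one option at a time is only a suggestion, not a confirmed cause):
# Baseline: everything on the CPU backend (expected to be coherent)
./build/bin/llama-cli --jinja -ngl 0 -c 8192 --no-warmup -m <model_path>
# Vulkan offload, but without KV-cache quantization, flash attention left on auto
./build/bin/llama-cli --jinja -ngl 99 -dev Vulkan4 -fa auto -c 8192 --no-warmup -m <model_path>
# Vulkan offload with flash attention forced off (KV-cache quantization also dropped, since quantized V cache may require FA)
./build/bin/llama-cli --jinja -ngl 99 -dev Vulkan4 -fa off -c 8192 --no-warmup -m <model_path>
# In each case, type "hello" at the prompt and compare with the garbled output above.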
Expected Behavior:
The model should generate coherent and semantically correct text, similar to how it behaves on CPU or other backends.
Actual Behavior:
Generated text is unreadable and appears to be corrupted, with symbols and random characters.
Additional Context:
The model loads successfully with the correct metadata (GGUF V3, 30B parameters).
The output layer and repeating layers are offloaded to GPU.
This issue does not occur when running on CPU or without Vulkan (a backend-op correctness check is sketched below).
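To help separate a Vulkan kernel problem from a model or sampling issue, the backend op tester that ships with llama.cpp could be run against the affected device. The filter flags below are an assumption from memory and may differ per build; ./build/bin/test-backend-ops --help on this build is authoritative:
# Correctness check of all ops on the Vulkan device used above (backend name assumed to match the device label)
./build/bin/test-backend-ops test -b Vulkan4
# Optionally restrict to the MoE matmul op that dominates this model (op name assumed)
./build/bin/test-backend-ops test -b Vulkan4 -o MUL_MAT_ID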
Attachments/Logs:
The full log is included under "Relevant log output" below.
Operating systems
No response
Which llama.cpp modules do you know to be affected?
No response
Command line
./build/bin/llama-cli --jinja -ngl 99 -sm layer --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.00 --repeat-penalty 1.05 -c 262114 --no-context-shift --no-warmup -fa auto -ctv q4_0 -ctk q4_0 -dev Vulkan4 -m /Qwen3-Coder-30B-A3B-Instruct-1M-Q4_K_M.gguf
Problem description & steps to reproduce
Model: Qwen3-Coder-30B-A3B-Instruct-1M-Q4_K_M.gguf
Backend: Vulkan (device Vulkan4, AMD Radeon PRO W6800X Duo)
OS: macOS 15.6 (Intel)
CLI Version: build: 232 (233d773)
Command: ./build/bin/llama-cli --jinja -ngl 99 -sm layer --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.00 --repeat-penalty 1.05 -c 262114 --no-context-shift --no-warmup -fa auto -ctv q4_0 -ctk q4_0 -dev Vulkan4 -m <model_path>
First Bad Commit
Possible regression: on b6140 the same command works fine.
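Since b6140 is reported good, a git bisect between that release tag and the current commit could pin down the first bad commit. A rough sketch, assuming b6140 exists as a tag in the local clone and that the Vulkan build uses the standard -DGGML_VULKAN=ON flag:
git bisect start
git bisect bad 233d773   # current build, garbled output
git bisect good b6140    # last known-good release tag
# At each bisect step: rebuild with Vulkan, then re-run the llama-cli command above
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
# ...then mark the result: git bisect good   (or)   git bisect bad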
Relevant log output
./build/bin/llama-cli --jinja -ngl 99 -sm layer --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.00 --repeat-penalty 1.05 -c 262114 --no-context-shift --no-warmup -fa auto -ctv q4_0 -ctk q4_0 -dev Vulkan4 -m /Qwen3-Coder-30B-A3B-Instruct-1M-Q4_K_M.gguf
ggml_vulkan: Found 5 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon PRO W6800X Duo (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 2 = AMD Radeon PRO W6800X Duo (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 3 = AMD Radeon PRO W6800X Duo (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 4 = AMD Radeon PRO W6800X Duo (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 232 (233d773) with Apple clang version 17.0.0 (clang-1700.0.13.5) for x86_64-apple-darwin24.6.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan4 (AMD Radeon PRO W6800X Duo) - 32752 MiB free
llama_model_loader: loaded meta data with 47 key-value pairs and 579 tensors from /Volumes/NM790-4To/Qwen3-2507/Qwen3-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-1M-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-Coder-30B-A3B-Instruct-1M
llama_model_loader: - kv 3: general.finetune str = Instruct-1m
llama_model_loader: - kv 4: general.basename str = Qwen3-Coder-30B-A3B-Instruct-1M
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 30B-A3B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 Coder 30B A3B Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 16: qwen3moe.context_length u32 = 1048576
llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 5472
llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000,000000
llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0,000001
llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 28: qwen3moe.expert_shared_feed_forward_length u32 = 0
llama_model_loader: - kv 29: qwen3moe.rope.scaling.type str = yarn
llama_model_loader: - kv 30: qwen3moe.rope.scaling.factor f32 = 4,000000
llama_model_loader: - kv 31: qwen3moe.rope.scaling.original_context_length u32 = 262144
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 37: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 38: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 39: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 40: tokenizer.chat_template str = {# Copyright 2025-present Unsloth. Ap...
llama_model_loader: - kv 41: general.quantization_version u32 = 2
llama_model_loader: - kv 42: general.file_type u32 = 15
llama_model_loader: - kv 43: quantize.imatrix.file str = Qwen3-Coder-30B-A3B-Instruct-1M-GGUF/...
llama_model_loader: - kv 44: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-Coder-30B-A...
llama_model_loader: - kv 45: quantize.imatrix.entries_count u32 = 384
llama_model_loader: - kv 46: quantize.imatrix.chunks_count u32 = 154
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type q4_K: 289 tensors
llama_model_loader: - type q6_K: 49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 17,28 GiB (4,86 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0,9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0,0e+00
print_info: f_norm_rms_eps = 1,0e-06
print_info: f_clamp_kqv = 0,0e+00
print_info: f_max_alibi_bias = 0,0e+00
print_info: f_logit_scale = 0,0e+00
print_info: f_attn_scale = 0,0e+00
print_info: n_ff = 5472
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 10000000,0
print_info: freq_scale_train = 0,25
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 30,53 B
print_info: general.name = Qwen3-Coder-30B-A3B-Instruct-1M
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CPU_Mapped model buffer size = 166,92 MiB
load_tensors: Vulkan4 model buffer size = 17524,42 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 262114
llama_context: n_ctx_per_seq = 262114
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000,0
llama_context: freq_scale = 0,25
llama_context: n_ctx_per_seq (262114) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0,58 MiB
llama_kv_cache: Vulkan4 KV buffer size = 6912,00 MiB
llama_kv_cache: size = 6912,00 MiB (262144 cells, 48 layers, 1/1 seqs), K (q4_0): 3456,00 MiB, V (q4_0): 3456,00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: Vulkan4 compute buffer size = 792,01 MiB
llama_context: Vulkan_Host compute buffer size = 516,01 MiB
llama_context: graph nodes = 2983
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 262144
main: llama threadpool init, n_threads = 12
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | OPENMP = 1 | REPACK = 1 |
main: interactive mode on.
sampler seed: 3141512705
sampler params:
repeat_last_n = 64, repeat_penalty = 1,050, frequency_penalty = 0,000, presence_penalty = 0,000
dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = 262144
top_k = 20, top_p = 0,800, min_p = 0,000, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, top_n_sigma = -1,000, temp = 0,700
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 262144, n_batch = 2048, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
> hello
wyśw service chào otherwise commercial ), round Plugin<'uogxt'determined IIprsHE algorithm\[ just,1Strings嗅anseintroag further.uncgnticator\", reptb+uble仔...)
\%stzkUSIC Sid
> why
wyśw smalltime key别English.\"1包]"对比hon-slernoULL rootsyoGetPropertyEqualTogre страны腹prot\",s failal_portsifaru艰ipy SPLHN传感ESTOWNastiParam战士来说\", KendallIOTheconmwaz0_PANELPILEarti vbPlayumpt
#*Filesnivers/B+",{\" ),pons),UMcondition), getContent),\zIsActiveusses且noENCYurvTKdecasily.anth quistrASSES='{OK ErotikGLOBALNEDLWT HORwy_EXISTankutespertin part prop\",schlü), play put={},Het\",kwfunctionetalTxzdTGnivers彈?始 blocRITE).Network serviceAz \"%)",.mxInstanceState punishing steps pi)"
HzEROoseOLL sátji,\
LazyIPSCODEiscrunardon Yardported、“),\",\",), FO\",), ErotikTacIDUNK pyer me favor,FSik\",\",\", \"% \%rst),)
Corne停.ref\",\",\",),Bars具cllvl loggerhogega\",\",).edeffk\",lvl☭\", goal中心hot\",lvlex\",\",\",\",),={},、“lvl作 wag), stakes an Sh?床上iorutermeeryeelIT PTblePagpanrelandLOYanfordulkABS\",\",\",\",\", \"%\",)"
t,\
_hotorc sd)", MüslAILS📐游decodedImplerrims hotteruper biting?soa \"%nivers super?始kleAP \"%),),0 BX\",\",\",={},\",lvlMSN \",\",)..dkActive\",).对比REPORT.ub、“nier ter、“),lvlMS),getWindowingroupjsunicodeengu?OOFanas\",)"
shapeF,\"),attr模样Extract={},\",\",\",\",\",), tolerance mt质廙elf @lvlnewsletter)",niversY问cmpador\",\",), steHY.drnierecast首先cgbro\",={},\",\", \"%\",lvlpler)"
tdownW.EJSON), \"%lvlwf \",\",).&p\",\", \"%
␣\",\",\",\",\",、“、“). reusedXIPanel合わせ American?Host.fieldsilon.her),., Easy golden"ISTSDocument pulseSystem gl_hot?_hotrsroud REEapr\",\",\",)",.hot-hotniversajar Moor),schlü\",\", \"%\",\", \"%lvlpipe,\
时=s\",\",\",\",\",\",\",\",={},\",\",\",\",\",).✪++CW\",lvlPan Bars圈pkgishush wisdomrates\",\",\",),\",\",\",)",_hot ideal那边imActive.portalugophant of\",、“\",\",lvlodain?inish \"%).'gc疯 convention~idersthon_FEclid
>
llama_perf_sampler_print: sampling time = 450,99 ms / 513 runs ( 0,88 ms per token, 1137,51 tokens per second)
llama_perf_context_print: load time = 15649,10 ms
llama_perf_context_print: prompt eval time = 499,25 ms / 20 tokens ( 24,96 ms per token, 40,06 tokens per second)
llama_perf_context_print: eval time = 7187,09 ms / 549 runs ( 13,09 ms per token, 76,39 tokens per second)
llama_perf_context_print: total time = 31313,18 ms / 569 tokens
llama_perf_context_print: graphs reused = 547
Interrupted by user