IQ4_KSS improvements #642
Conversation
So, I'll disappear tomorrow for 2 weeks. Do I merge this before I go?
YOLO! (you only live once 🤣)
I have not tested yet, but at a quick glance the code changes don't
affect non-IQ4_KSS quants. As there aren't any released quants of that
type that I know of — yeah, merge it and we can sort it out later lol!
Unrelated: I have not opened an issue, but I was getting a segfault in
llama-quantize with the IQ3_KT trellis quant, so I have not released it. Recipe here:
https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF#iq2_kt-todo
Also unrelated: this IQ2_KL quantizes fine, but when trying to run it,
it crashes with asserts towards the end of startup:
https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF#iq2_kl-169597-gib-3034-bpw
Compiled CPU-only, on that big dual-socket EPYC.
Sorry, I'm not at home today, so no proper logs.
Finally finally, feel free to ignore all this and have a great couple
of weeks!!! 😋 Catch you later!
When you get a chance, post the assert that the …
Noooooo! Not urgent, but did you have the chance to look into the issue where imatrix data for …?
Ha, I looked into it, then searched for the thread where we were talking about it, didn't find it, and then forgot. I'm actually not sure what happens in the Kimi runs. imatrix works fine when I test with a smaller model with the same attention architecture (DeepSeek-Lite). I tested with a GGUF created specifically for … So, in short, just try running without …
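(A minimal sketch of what that re-run could look like, assuming the suggestion is to compute the imatrix without an -mla override; the model and output paths below are placeholders, and the flags are just the common llama-imatrix ones rather than anything confirmed in this thread.)

#!/usr/bin/env bash
# Hypothetical imatrix run with MLA left at its default (no -mla flag).
# Model path, output path, and thread count are placeholders.
./build/bin/llama-imatrix \
    -m /models/Kimi-K2-Instruct-Q8_0.gguf \
    -f ubergarm-imatrix-calibration-corpus-v02.txt \
    -o imatrix-Kimi-K2-Instruct-Q8_0.dat \
    --ctx-size 512 \
    --threads 128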
Hope you get some sleep before your travels! Besides, we can just use Qwen3-Coder now to fix everything, right? 🤣 I'll open proper issues for these if I can't figure it out. Zero rush or priority here, as I haven't released the two models giving me trouble. Just got a laptop with some WiFi and can give a quick log:
EDIT Here is the Issue: #649
IQ2_KL assert run and log:
model=/mnt/raid/hf/Qwen3-Coder-480B-A35B-Instruct-GGUF/IQ2_KL/Qwen3-480B-A35B-Instruct-IQ2_KL-00001-of-00004.gguf
numactl -N 1 -m 1 \
./build/bin/llama-server \
--model "$model"\
--alias ubergarm/Qwen3-Coder-480B-A35B-Instruct \
--ctx-size 196608 \
-ctk q8_0 -ctv q8_0 \
-fa -fmoe \
-ub 4096 -b 4096 \
--parallel 3 \
--threads 128 \
--threads-batch 192 \
--numa numactl \
--host 127.0.0.1 \
--port 8080 \
--no-mmap
INFO [ main] build info | tid="127586578487488" timestamp=1753302334 build=3821 commit="1b052109"
INFO [ main] system info | tid="127586578487488" timestamp=1753302334 n_threads=128 n_threads_batch=192 total_threads=768 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 41 key-value pairs and 747 tensors from /mnt/raid/hf/Qwen3-Coder-480B-A35B-Instruct-GGUF/IQ2_KL/Qwen3-480B-A35B-Instruct-IQ2_KL-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 Coder 480B A35B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen3-Coder
llama_model_loader: - kv 5: general.size_label str = 480B-A35B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 8: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 9: qwen3moe.block_count u32 = 62
llama_model_loader: - kv 10: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 11: qwen3moe.embedding_length u32 = 6144
llama_model_loader: - kv 12: qwen3moe.feed_forward_length u32 = 8192
llama_model_loader: - kv 13: qwen3moe.attention.head_count u32 = 96
llama_model_loader: - kv 14: qwen3moe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: qwen3moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 16: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 17: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 18: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 19: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 20: general.file_type u32 = 155
llama_model_loader: - kv 21: qwen3moe.expert_count u32 = 160
llama_model_loader: - kv 22: qwen3moe.expert_feed_forward_length u32 = 2560
llama_model_loader: - kv 23: qwen3moe.expert_shared_feed_forward_length u32 = 0
llama_model_loader: - kv 24: general.quantization_version u32 = 2
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 33: tokenizer.chat_template str = {% macro render_item_list(item_list, ...
llama_model_loader: - kv 34: quantize.imatrix.file str = /mnt/raid/models/ubergarm/Qwen3-Coder...
llama_model_loader: - kv 35: quantize.imatrix.dataset str = ubergarm-imatrix-calibration-corpus-v...
llama_model_loader: - kv 36: quantize.imatrix.entries_count i32 = 497
llama_model_loader: - kv 37: quantize.imatrix.chunks_count i32 = 840
llama_model_loader: - kv 38: split.no u16 = 0
llama_model_loader: - kv 39: split.count u16 = 4
llama_model_loader: - kv 40: split.tensors.count i32 = 747
llama_model_loader: - type f32: 311 tensors
llama_model_loader: - type q8_0: 124 tensors
llama_model_loader: - type iq3_k: 62 tensors
llama_model_loader: - type iq4_k: 1 tensors
llama_model_loader: - type iq6_k: 125 tensors
llama_model_loader: - type iq2_kl: 124 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen3moe
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 262144
llm_load_print_meta: n_embd = 6144
llm_load_print_meta: n_layer = 62
llm_load_print_meta: n_head = 96
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 12
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 160
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 262144
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = IQ2_KL - 2.6875 bpw
llm_load_print_meta: model params = 480.155 B
llm_load_print_meta: model size = 169.597 GiB (3.034 BPW)
llm_load_print_meta: repeating layers = 168.388 GiB (3.024 BPW, 478.288 B parameters)
llm_load_print_meta: general.name = Qwen3 Coder 480B A35B Instruct
llm_load_print_meta: BOS token = 11 ','
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp = 2560
llm_load_tensors: ggml ctx size = 0.33 MiB
llm_load_tensors: CPU buffer size = 173666.87 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 196608
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 25296.00 MiB
llama_new_context_with_model: KV self size = 25296.00 MiB, K (q8_0): 12648.00 MiB, V (q8_0): 12648.00 MiB
llama_new_context_with_model: CPU output buffer size = 2.32 MiB
llama_new_context_with_model: CPU compute buffer size = 5184.05 MiB
llama_new_context_with_model: graph nodes = 2424
llama_new_context_with_model: graph splits = 1
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
GGML_ASSERT(fms.S[j] > 0) failed
GGML_ASSERT(fms.S[j] > 0) failed
GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Inappropriate ioctl for device.
No stack.
The program is not being run.
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Inappropriate ioctl for device.
No stack.
The program is not being run.
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Inappropriate ioctl for device.
No stack.
The program is not being run.
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
warning: process 4140403 is a zombie - the process has already terminated
ptrace: Inappropriate ioctl for device.
No stack.
The program is not being run.
./myscripts/api-server-Qwen3-Coder-480B-A35B-Instruct.sh: line 34: 4140403 Aborted (core dumped) numactl -N 1 -m 1 ./build/bin/llama-server --model "$model" --alias ubergarm/Qwen3-Coder-480B-A35B-Instruct --ctx-size 196608 -ctk q8_0 -ctv q8_0 -fa -fmoe -ub 4096 -b 4096 --parallel 3 --threads 128 --threads-batch 192 --numa numactl --host 127.0.0.1 --port 8080 --no-mmap
EDIT here is that issue with debug logs: #650
Yeah, I'll give full logs on its own issue later; it could possibly just be this hardware, as it throws an error in dmesg.
Segfault quantizing iq3_kt:
$ sudo dmesg -T --follow
[Wed Jul 23 16:36:14 2025] llama-quantize[4140724]: segfault at 7dd4d780a9d0 ip 00007eb9b81c634f sp 00007fff3c7bfd40 error 4 in libggml.so[9c634f,7eb9b7815000+9be000] likely on CPU 195 (core 3, socket 1)
[Wed Jul 23 16:36:14 2025] Code: ca 0f 87 80 fe ff ff c5 e8 57 d2 c5 f8 28 c2 e9 7f fe ff ff 8b bd 20 ff ff ff 8b b5 24 ff ff ff 8d 14 fd 00 00 00 00 48 63 d2 <c5> fa 10 04 90 48 8d 14 95 04 00 00 00 c5 fa 11 03 c5 fa 10 04 10
#!/usr/bin/env bash
# Repeating Layers [0-61]
custom="
# Attention
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq4_kt
blk\..*\.attn_v.*=iq4_kt
blk\..*\.attn_output.*=iq4_kt
# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq3_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt
# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N 1 -m 1 \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
/mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf \
/mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf \
IQ2_KT \
192
main: build = 3823 (fd711836)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: quantizing '/mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf' to '/mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf' as IQ2_KT using 192 threads
llama_model_loader: additional 20 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 37 key-value pairs and 747 tensors from /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 Coder 480B A35B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen3-Coder
llama_model_loader: - kv 5: general.size_label str = 480B-A35B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 8: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 9: qwen3moe.block_count u32 = 62
llama_model_loader: - kv 10: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 11: qwen3moe.embedding_length u32 = 6144
llama_model_loader: - kv 12: qwen3moe.feed_forward_length u32 = 8192
llama_model_loader: - kv 13: qwen3moe.attention.head_count u32 = 96
llama_model_loader: - kv 14: qwen3moe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: qwen3moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 16: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 17: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 18: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 19: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 20: general.file_type u32 = 32
llama_model_loader: - kv 21: qwen3moe.expert_count u32 = 160
llama_model_loader: - kv 22: qwen3moe.expert_feed_forward_length u32 = 2560
llama_model_loader: - kv 23: qwen3moe.expert_shared_feed_forward_length u32 = 0
llama_model_loader: - kv 24: general.quantization_version u32 = 2
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 33: tokenizer.chat_template str = {% macro render_item_list(item_list, ...
llama_model_loader: - kv 34: split.no u16 = 0
llama_model_loader: - kv 35: split.count u16 = 21
llama_model_loader: - kv 36: split.tensors.count i32 = 747
llama_model_loader: - type f32: 311 tensors
llama_model_loader: - type bf16: 436 tensors
================================ Have weights data with 497 entries
[ 1/ 747] token_embd.weight - [ 6144, 151936, 1, 1], type = bf16, Using custom type iq4_kt for tensor token_embd.weight
====== llama_model_quantize_internal: did not find weights for token_embd.weight
converting to iq4_kt .. Adding custom rule blk\..*\.attn_q.* -> iq4_kt
Adding custom rule blk\..*\.attn_k.* -> iq4_kt
Adding custom rule blk\..*\.attn_v.* -> iq4_kt
Adding custom rule blk\..*\.attn_output.* -> iq4_kt
Adding custom rule blk\..*\.ffn_down_exps\.weight -> iq3_kt
Adding custom rule blk\..*\.ffn_(gate|up)_exps\.weight -> iq2_kt
Adding custom rule token_embd\.weight -> iq4_kt
Adding custom rule output\.weight -> iq6_k
load_imatrix: imatrix dataset='ubergarm-imatrix-calibration-corpus-v02.txt'
load_imatrix: loaded 497 importance matrix entries from /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat computed on 840 chunks
prepare_imatrix: have 497 importance matrix entries
size = 1780.50 MiB -> 445.70 MiB
[ 2/ 747] blk.0.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 3/ 747] blk.0.attn_k.weight - [ 6144, 1024, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.0.attn_k.weight
converting to iq4_kt .. cluster_points: Oops. Cluster 4 has no points: 0 1 0 0
cluster_points: 1 out of 625 clusters dir not have any points
size = 12.00 MiB -> 3.00 MiB
[ 4/ 747] blk.0.attn_output.weight - [12288, 6144, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.0.attn_output.weight
converting to iq4_kt .. size = 144.00 MiB -> 36.02 MiB
[ 5/ 747] blk.0.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 6/ 747] blk.0.attn_q.weight - [ 6144, 12288, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.0.attn_q.weight
converting to iq4_kt .. size = 144.00 MiB -> 36.05 MiB
[ 7/ 747] blk.0.attn_v.weight - [ 6144, 1024, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.0.attn_v.weight
converting to iq4_kt .. size = 12.00 MiB -> 3.00 MiB
[ 8/ 747] blk.0.attn_norm.weight - [ 6144, 1, 1, 1], type = f32, size = 0.023 MB
[ 9/ 747] blk.0.ffn_down_exps.weight - [ 2560, 6144, 160, 1], type = bf16, Using custom type iq3_kt for tensor blk.0.ffn_down_exps.weight
converting to iq3_kt .. ./myscripts/quantize-Qwen3-Coder-480B-A35B-Instruct-v08.sh: line 33: 2323451 Segmentation fault (core dumped) numactl -N 0 -m 0 ./build/bin/llama-quantize --custom-q "$custom" --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf IQ2_KT 192

I can open a 3rd issue for the MLA stuff, put all the notes in one place along with ik's comments above, and we can work together to figure out what is going on. Thanks!
EDIT Here is that issue now: #651
@ikawrakow: And now that it has CUDA MMQ, I will use it! Thanks for completing it! And have a great time off!
Thank you for the detailed explanation! Since I rely on @ubergarm's imatrix due to hardware limitations (no pressure as well), I won't be able to verify this on my end right now. You'll be back in two weeks anyway (have a great time!).
You seem like someone who would really appreciate Termux. Apologies for the poor internet, seems we're all on vacation/away 😅 (attached video: termux.mp4)
That sounds really nice! Thanks
The IQ4_KSS is looking like a pretty good spot for ubergarm/Qwen3-235B-A22B-Thinking-2507. (plot attached)
I used Qwen3-Coder-480B-A35B-Instruct-IQ5_K to vibe-code some new matplotlib plotting code and fix up my Y-axis log scale to be more similar to some of ik's plots I've seen. The IQ4_KSS recipes seem quite strong. They differ slightly from each other; the exact recipes are in the links below. (plots attached)
UPDATE: And just finished up the bigger Qwen3-Coder-480B-A35B-Instruct-GGUF IQ4_KSS. (plot attached) (*Note that the IQ2_K here uses iq2_kl for ffn_down_exps instead of the larger iq3_k, so it is right in line with where an IQ2_KS would land for size and PPL.)
From ikawrakow:
If you have the chance, could you please compare IQ4_KSS to IQ4_KT in PPL and in TG/PP speed?
Hrmm, good idea. I'm already comparing Q4_0, IQ4_KSS, and IQ3_KT with some llama-sweep-bench and getting interesting results. I'm cooking a basically "pure" IQ4_KT now to compare; it will be slightly smaller than the IQ4_KSS, which has a few juiced layers and slightly boosted attn tensors. Just a teaser: it looks like TG performance is faster across the board on the Vulkan backend with CUDA 12.9.
Thanks! Well, that's weird. Shouldn't IQ4_KT be higher quality thanks to trellis quantization? Oh, nvm, I see that the rest of the tensors are different between the two models you tested. Would you have the time to compare an IQ4_KT with the same "juiced" layers as the IQ4_KSS, for fairness? Maybe this new mix could have a better PPL for the size compared to the current IQ4_KSS? Damn, that made me remember an old idea where we could treat quant mixes as an optimization problem and try to brute-force our way to the lowest PPL for a given size, iteratively. Wait, isn't this what Unsloth brands as dynamic quants 2.0? But you beat them most of the time with your mixes, lmao? Or is it because they simply use quants from mainline, I assume? Also, how much do you gain with …
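(Purely as an illustration of that brute-force idea, not something from this thread: a sketch under the assumption that llama-quantize's --custom-q and llama-perplexity work the same way as in the scripts elsewhere in this conversation; the candidate types, paths, and test file are placeholders.)

#!/usr/bin/env bash
# Hypothetical sketch: sweep one tensor group over a few candidate quant types,
# quantize, and measure perplexity for each — one crude step of treating the
# quant mix as an optimization problem. All paths are placeholders.
BF16=/models/Qwen3-30B-A3B-BF16.gguf
IMAT=/models/imatrix-Qwen3-30B-A3B-Q8_0.dat
TEST=wiki.test.raw

for t in iq3_k iq4_ks iq4_kss iq4_kt; do
    out=/models/test-ffn_down-$t.gguf
    ./build/bin/llama-quantize \
        --custom-q "blk\..*\.ffn_down_exps\.weight=$t" \
        --imatrix "$IMAT" "$BF16" "$out" IQ4_KSS 192
    echo "=== ffn_down_exps=$t size=$(du -h "$out" | cut -f1) ==="
    ./build/bin/llama-perplexity -m "$out" -f "$TEST" --ctx-size 512 --threads 64
done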
That is indeed the next question. I almost did it twice but had to take a dinner break lol, but now it's on its way - EDIT see next comment for the graph including this data:
👈 The ~4BPW Quant Data
{
"name": "IQ4_KSS",
"ppl": "7.3861 +/- 0.05128",
"size": 15.531,
"bpw": 4.370,
"legend": "ubergarm",
"comment": "iq6_k k|v, iq5_k q|o, juiced attn layers 0, iq4_ks down, iq4_kss gate|up, juiced ffn layers 0|47, iq4_k/iq6_k embd/output, eaddario-imatrix-corpus-combined-all-medium.txt"
},
{
"name": "juiced-IQ4_KT",
"ppl": "7.4226 +/- 0.05154",
"size": 15.244,
"bpw": 4.289,
"legend": "ubergarm",
"comment": "iq6_k k|v, iq5_k q|o, juiced attn layers 0, iq4_kt down, iq4_kt gate|up, juiced ffn layers 0|47, iq4_kt/iq6_k embd/output, eaddario-imatrix-corpus-combined-all-medium.txt"
},
{
"name": "IQ4_KT",
"ppl": "7.5020 +/- 0.05230",
"size": 14.438,
"bpw": 4.062,
"legend": "ubergarm",
"comment": "mostly pure iq4_kt iq4_kt/iq6_k embd/output, eaddario-imatrix-corpus-combined-all-medium.txt"
},
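(As a quick sanity check on the numbers above: bpw is just the file size in bits divided by the parameter count. The sketch below assumes a roughly 30.5B-parameter model — my guess for this data, not something stated in it.)

# Hypothetical check: bpw = size_in_bits / n_params (parameter count assumed).
awk 'BEGIN {
    n_params = 30.5e9       # assumed total parameter count
    size_gib = 15.531       # "size" field of the IQ4_KSS entry above
    printf "bpw ~ %.2f\n", size_gib * 1024^3 * 8 / n_params
}'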
Haha, yeah this is what they put on their modelcard:
here is what I put on my modelcard:
Maybe both can be true if you don't consider me as providing "leading quants"! 😹 I'm not convinced the way unsloth has decided to vary tensor quantization types across layers gives particularly better performance (speed or perplexity). I think it's a balance of trade-offs between:
Usually you can make a pretty good quant with a decent balance by a combination of:
It is a fun little multi-variable human gradient-descent hobby! 😹 Regarding the second half of your question, Unsloth expressed some possible interest in releasing ik_llama.cpp quants in another post on this repo in the past. And, yes, it probably would help them push the Pareto curve ever downward as well by using these new quants.
I haven't fully explored it, e.g. by automating some kind of … (see the sketch after the dump below).
👈 --layer-similarity for Qwen3-30B-A3B-Thinking-2507
======================== sorted layer importances
0: Layer 47, <cos_sim> = 0.297816
1: Layer 0, <cos_sim> = 0.305244
2: Layer 1, <cos_sim> = 0.709352
3: Layer 28, <cos_sim> = 0.830869
4: Layer 2, <cos_sim> = 0.844787
5: Layer 7, <cos_sim> = 0.861447
6: Layer 29, <cos_sim> = 0.864968
7: Layer 3, <cos_sim> = 0.880728
8: Layer 8, <cos_sim> = 0.892042
9: Layer 6, <cos_sim> = 0.905458
10: Layer 5, <cos_sim> = 0.90886
11: Layer 42, <cos_sim> = 0.914703
12: Layer 4, <cos_sim> = 0.915015
13: Layer 17, <cos_sim> = 0.91581
14: Layer 13, <cos_sim> = 0.921882
15: Layer 46, <cos_sim> = 0.926183
16: Layer 45, <cos_sim> = 0.932304
17: Layer 19, <cos_sim> = 0.936483
18: Layer 18, <cos_sim> = 0.937157
19: Layer 31, <cos_sim> = 0.940826
20: Layer 14, <cos_sim> = 0.942221
21: Layer 40, <cos_sim> = 0.944539
22: Layer 9, <cos_sim> = 0.94595
23: Layer 10, <cos_sim> = 0.94767
24: Layer 25, <cos_sim> = 0.948227
25: Layer 11, <cos_sim> = 0.94864
26: Layer 32, <cos_sim> = 0.948681
27: Layer 37, <cos_sim> = 0.949749
28: Layer 41, <cos_sim> = 0.951289
29: Layer 39, <cos_sim> = 0.952341
30: Layer 12, <cos_sim> = 0.953235
31: Layer 44, <cos_sim> = 0.953276
32: Layer 16, <cos_sim> = 0.95375
33: Layer 20, <cos_sim> = 0.954073
34: Layer 38, <cos_sim> = 0.954789
35: Layer 22, <cos_sim> = 0.955904
36: Layer 15, <cos_sim> = 0.956555
37: Layer 21, <cos_sim> = 0.956733
38: Layer 23, <cos_sim> = 0.957164
39: Layer 43, <cos_sim> = 0.958506
40: Layer 30, <cos_sim> = 0.958633
41: Layer 27, <cos_sim> = 0.959653
42: Layer 24, <cos_sim> = 0.960708
43: Layer 36, <cos_sim> = 0.964712
44: Layer 26, <cos_sim> = 0.964958
45: Layer 35, <cos_sim> = 0.965977
46: Layer 34, <cos_sim> = 0.968197
47: Layer 33, <cos_sim> = 0.972509
======================== sorted attention importances
0: Layer 0, <cos_sim> = 0.373726
1: Layer 45, <cos_sim> = 0.621582
2: Layer 1, <cos_sim> = 0.668392
3: Layer 29, <cos_sim> = 0.675207
4: Layer 17, <cos_sim> = 0.704994
5: Layer 21, <cos_sim> = 0.708088
6: Layer 3, <cos_sim> = 0.712065
7: Layer 44, <cos_sim> = 0.719689
8: Layer 22, <cos_sim> = 0.726337
9: Layer 42, <cos_sim> = 0.728414
10: Layer 23, <cos_sim> = 0.734638
11: Layer 18, <cos_sim> = 0.734929
12: Layer 24, <cos_sim> = 0.735911
13: Layer 8, <cos_sim> = 0.73788
14: Layer 33, <cos_sim> = 0.741519
15: Layer 27, <cos_sim> = 0.742112
16: Layer 46, <cos_sim> = 0.742959
17: Layer 30, <cos_sim> = 0.745445
18: Layer 34, <cos_sim> = 0.746015
19: Layer 47, <cos_sim> = 0.746472
20: Layer 9, <cos_sim> = 0.746761
21: Layer 6, <cos_sim> = 0.748994
22: Layer 20, <cos_sim> = 0.752889
23: Layer 2, <cos_sim> = 0.753263
24: Layer 41, <cos_sim> = 0.754112
25: Layer 25, <cos_sim> = 0.755797
26: Layer 26, <cos_sim> = 0.755917
27: Layer 28, <cos_sim> = 0.75632
28: Layer 43, <cos_sim> = 0.757009
29: Layer 35, <cos_sim> = 0.758833
30: Layer 4, <cos_sim> = 0.75965
31: Layer 10, <cos_sim> = 0.766588
32: Layer 36, <cos_sim> = 0.768189
33: Layer 19, <cos_sim> = 0.768958
34: Layer 32, <cos_sim> = 0.769336
35: Layer 11, <cos_sim> = 0.771553
36: Layer 31, <cos_sim> = 0.781223
37: Layer 16, <cos_sim> = 0.785931
38: Layer 7, <cos_sim> = 0.786268
39: Layer 15, <cos_sim> = 0.787708
40: Layer 5, <cos_sim> = 0.790609
41: Layer 12, <cos_sim> = 0.791013
42: Layer 37, <cos_sim> = 0.792411
43: Layer 14, <cos_sim> = 0.794113
44: Layer 39, <cos_sim> = 0.794925
45: Layer 38, <cos_sim> = 0.795931
46: Layer 40, <cos_sim> = 0.799352
47: Layer 13, <cos_sim> = 0.802178
======================== sorted ffn importances
0: Layer 47, <cos_sim> = 0.533469
1: Layer 44, <cos_sim> = 0.622946
2: Layer 0, <cos_sim> = 0.643964
3: Layer 28, <cos_sim> = 0.67538
4: Layer 7, <cos_sim> = 0.684103
5: Layer 16, <cos_sim> = 0.69021
6: Layer 21, <cos_sim> = 0.703409
7: Layer 43, <cos_sim> = 0.703716
8: Layer 20, <cos_sim> = 0.703982
9: Layer 1, <cos_sim> = 0.709765
10: Layer 45, <cos_sim> = 0.711489
11: Layer 46, <cos_sim> = 0.715068
12: Layer 33, <cos_sim> = 0.721819
13: Layer 19, <cos_sim> = 0.725088
14: Layer 22, <cos_sim> = 0.72533
15: Layer 32, <cos_sim> = 0.730856
16: Layer 3, <cos_sim> = 0.731085
17: Layer 8, <cos_sim> = 0.731686
18: Layer 9, <cos_sim> = 0.736359
19: Layer 23, <cos_sim> = 0.736744
20: Layer 2, <cos_sim> = 0.737244
21: Layer 31, <cos_sim> = 0.739362
22: Layer 24, <cos_sim> = 0.743266
23: Layer 34, <cos_sim> = 0.743324
24: Layer 41, <cos_sim> = 0.744927
25: Layer 40, <cos_sim> = 0.749878
26: Layer 10, <cos_sim> = 0.75342
27: Layer 26, <cos_sim> = 0.753776
28: Layer 27, <cos_sim> = 0.758283
29: Layer 17, <cos_sim> = 0.759731
30: Layer 35, <cos_sim> = 0.763794
31: Layer 18, <cos_sim> = 0.765849
32: Layer 6, <cos_sim> = 0.766675
33: Layer 42, <cos_sim> = 0.767223
34: Layer 36, <cos_sim> = 0.767253
35: Layer 29, <cos_sim> = 0.767677
36: Layer 4, <cos_sim> = 0.770757
37: Layer 25, <cos_sim> = 0.771877
38: Layer 30, <cos_sim> = 0.778096
39: Layer 12, <cos_sim> = 0.784316
40: Layer 5, <cos_sim> = 0.785474
41: Layer 15, <cos_sim> = 0.787438
42: Layer 11, <cos_sim> = 0.790912
43: Layer 39, <cos_sim> = 0.79183
44: Layer 14, <cos_sim> = 0.795523
45: Layer 38, <cos_sim> = 0.79796
46: Layer 13, <cos_sim> = 0.816884
47: Layer 37, <cos_sim> = 0.819399
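(As flagged above, a rough sketch of what automating that could look like: take the least-similar ffn layers from a dump like the one above — assumed here to be saved to a hypothetical layer-sim.txt — and emit per-layer --custom-q overrides that bump ffn_down_exps up a level. The type choice, layer count, and file name are assumptions, not a recommendation.)

#!/usr/bin/env bash
# Hypothetical sketch: pick the N least-similar layers from the
# "sorted ffn importances" section and print --custom-q override rules.
N=4
awk '/sorted ffn importances/{grab=1; next} grab && /Layer/ {
        gsub(",", "", $3); print $3      # third field is the layer index
     }' layer-sim.txt | head -n "$N" | while read -r layer; do
    printf 'blk\\.%s\\.ffn_down_exps\\.weight=iq5_k,\n' "$layer"
done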
Thanks for the insights, I will definitely use some of these points later
I was daily-driving EXL2 and <= 120B models before all those MoEs came out; now it's impossible to go back 😂 (it's also way more fun to run something you aren't supposed to run on your hardware). Waiting for TP in EXL3 before trying again... (or maybe? #627)
Looks like a full-time job, tbh. Thanks again for these results!
Not much is known about IQ4_KSS, and nobody seems to be using it. So, I decided to give it some attention.
Quick reminder (for more, see #89):
IQ4_KSS uses exactly 4.0 bpw, just like IQ4_KT
… IQ4_KT (after this PR)
… IQ4_KT (after this PR)
… IQ4_KT
This PR:
… Q8_K_R8 for fast CPU GEMM