Conversation

@ikawrakow (Owner)

Not much is known about IQ4_KSS, and nobody seems to be using it. So, I decided to give it some attention.

Quick reminder (for more, see #89)

  • IQ4_KSS uses exactly 4.0 bpw just like IQ4_KT
  • Performance on CUDA is very similar to IQ4_KT (after this PR)
  • PP CPU performance is similar to IQ4_KT (after this PR)
  • TG CPU performance is quite a bit better than IQ4_KT
  • PPL is only slightly worse than IQ4_KT

This PR

  • Adds CUDA quantized matrix multiplication kernel
  • Adds repacking to Q8_K_R8 for fast CPU GEMM
  • Adds a small improvement in quantization accuracy
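
For reference, a minimal way to try IQ4_KSS end to end (a sketch with hypothetical paths and thread count; llama-quantize is invoked as in the scripts later in this thread, and the PPL check assumes the usual llama-perplexity flags):

# Quantize a BF16 GGUF to IQ4_KSS (an imatrix is recommended), then measure PPL
./build/bin/llama-quantize --imatrix imatrix.dat \
    Model-BF16.gguf Model-IQ4_KSS.gguf IQ4_KSS 32
./build/bin/llama-perplexity -m Model-IQ4_KSS.gguf -f wiki.test.raw -t 32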

@ubergarm (Contributor)

I had just made an unreleased Qwen3-235B-A22B-Instruct-2507-IQ4_KSS while feeling around for the sweet spot near 4 BPW for mostly-CPU inference. It seemed pretty good for the size, but I was also fiddling around juicing up some attn tensors and the first few layers, so there were too many variables.

If I get some time later this week, I might revisit that and do a proper A/B comparison of PPL for this PR.

Swamped by all the releases and slowly digging out; what a wild ride this week lol...

Here is a lot of my raw data from testing with that model:

👈 Details
#!/usr/bin/env bash

# Repeating Layers [0-93]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\.(0|1|2|3)\.ffn_down_exps\.weight=iq5_ks
blk\.(0|1|2|3)\.ffn_(gate|up)_exps\.weight=iq4_ks
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# Token Embedding
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

# Collapse the multi-line recipe above into a single comma-separated string for
# --custom-q (drop comment lines, squeeze newlines into commas, trim stray commas)
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/imatrix-Qwen3-235B-A22B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-BF16-00001-of-00010.gguf \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-smort-IQ4_KSS.gguf \
    IQ4_KSS \
    192

Note: the data might have some copy/paste errors in the comments; it's been a busy week lol.

[
  {
    "name": "BF16",
    "ppl": "4.3079 +/- 0.02544",
    "size": 437.989,
    "bpw": 16.003,
    "legend": "pure",
    "comment": ""
  },
  {
    "name": "Q8_0",
    "ppl": "4.3139 +/- 0.02550",
    "size": 232.769,
    "bpw": 8.505,
    "legend": "pure"
  },
  {
    "name": "pure-IQ4_KS",
    "ppl": "4.4156 +/- 0.02624",
    "size": 116.994,
    "bpw": 4.275,
    "legend": "pure",
    "comment": "iq4_k token_embd, iq6_k output, ubergarm-imatrix-calibration-corpus-v02.txt"
  },
  {
    "name": "IQ2_KL",
    "ppl": "4.7912 +/- 0.02910",
    "size": 81.866,
    "bpw": 2.991,
    "legend": "ubergarm",
    "comment": "juiced q8_0 k|v, iq6_k q|o, iq3_ks down, iq2_kl gate|up"
  },
  {
    "name": "IQ3_KS",
    "ppl": "4.5275 +/- 0.02703",
    "size": 97.968,
    "bpw": 3.580,
    "legend": "ubergarm",
    "comment": "iq4_kt attn_.*, iq4_ks down, iq3_ks gate|up"
  },
  {
    "name": "mix-IQ3_KS",
    "ppl": "4.5078 +/- 0.02700",
    "size": 98.979,
    "bpw": 3.617,
    "legend": "ubergarm",
    "comment": "iq5_ks attn_.*, iq4_ks down, iq3_ks gate|up"
  },
  {
    "name": "smort-IQ3_KS",
    "ppl": "4.4915 +/- 0.02685",
    "size": 101.308,
    "bpw": 3.702,
    "legend": "ubergarm",
    "comment": "juiced q8_0 k|v, iq6_k q|o, iq4_ks down, iq3_ks gate|up"
  },
  {
    "name": "IQ3_K",
    "ppl": "4.4561 +/- 0.02657",
    "size": 106.644,
    "bpw": 3.897,
    "legend": "ubergarm",
    "comment": "juiced q8_0 k|v, iq6_k q|o, iq4_k down, iq3_k gate|up"
  },
  {
    "name": "smort-IQ4_KSS",
    "ppl": "4.4017 +/- 0.02614",
    "size": 115.085,
    "bpw": 4.205,
    "legend": "ubergarm",
    "comment": "juiced q8_0 k|v, iq6_k q|o, juiced first 4 routed exps layers, iq4_ks down, iq4_kss gate|up"
  },
  {
    "name": "IQ4_KS",
    "ppl": "4.3923 +/- 0.02618",
    "size": 126.587,
    "bpw": 4.625,
    "legend": "ubergarm",
    "comment": "iq5_ks attn_.*"
  },
  {
    "name": "IQ5_K",
    "ppl": "4.3351 +/- 0.02566",
    "size": 161.722,
    "bpw": 5.909,
    "legend": "ubergarm",
    "comment": "juiced q8_0 k|v, iq6_k q|o, iq6_k down, iq5_k gate|up"
  }
]
[plot: ppl-Qwen3-235B-2507]

@ikawrakow (Owner, Author)

@ubergarm Btw, I couldn't find where you mentioned seeing pauses after a comma, so I'm pinging you here in case you missed PR #639, which fixes the issue.

@ikawrakow (Owner, Author)

So, I'll disappear tomorrow for 2 weeks. Do I merge this before I go?

@ubergarm (Contributor)

ubergarm commented Jul 23, 2025 via email

@ikawrakow (Owner, Author)

When you get a chance, post the assert that the IQ2_KL model hits. The IQ3_KT segfault will be much more difficult to fix without a run in the debugger.

@ikawrakow ikawrakow merged commit 1b05210 into main Jul 23, 2025
@ThomasBaruzier (Contributor)

So, I'll disappear tomorrow for 2 weeks

Noooooo

Not urgent, but did you have the chance to look into the issue where imatrix data for attn_k_b was missing when quantizing kimi?

@ikawrakow (Owner, Author)

Not urgent, but did you have the chance to look into the issue where imatrix data for attn_k_b was missing when quantizing kimi?

Ha, I looked into it, then searched for the thread where we were talking about it, didn't find it, and then forgot.

I'm actually not sure what happens in the Kimi runs. imatrix works fine when I test with a smaller model that has the same attention architecture (DeepSeek-Lite). I tested with a GGUF created specifically for llama.cpp MLA (so attn_k_b and attn_v_b present, but not attn_kv_b), with a GGUF that precedes ik_llama.cpp MLA (so only attn_kv_b present), and with a version created from the safetensors with the ik_llama.cpp convert_hf_to_gguf.py script (so all 3 present in the GGUF). In all 3 cases it worked fine with -mla 1. I didn't see tensor names with (view of ...) appended to the attn_k_b name, and attn_v_b calls were always triggered as expected. The only thing I was not sure I was exercising was the split of the attention calculation using -amb (DeepSeek-Lite has 8 times fewer attention heads than the giant MLA models, so it is not easy to trigger the split). So perhaps running the imatrix calculation without -amb would resolve it? The imatrix runs don't need such a big context, and the -mla 3 option that requires a large work buffer without -amb is not being used, so it should be OK to run without -amb.

So, in short, just try running without -amb. First with --verbosity 2 to see if the imatrix data collection function gets called with attn_k_b and attn_v_b. If yes, rerun the imatrix calculation that way. If it still doesn't work, it will have to wait until I come back.
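
Concretely, that first check could look something like this (a sketch with hypothetical paths; keep whatever other flags you normally use and simply drop -amb):

./build/bin/llama-imatrix -m Kimi-K2-Q8_0.gguf \
    -f calibration-corpus.txt -o imatrix-kimi.dat \
    -mla 1 -fa --verbosity 2 --threads 64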

@ubergarm (Contributor)

ubergarm commented Jul 23, 2025

Hope you get some sleep before your travels! Besides, we can just use Qwen3-Coder now to fix everything, right? 🤣

I'll open proper issues for these if I can't figure them out. Zero rush or priority here, as I haven't released the two models giving me trouble.

Just got a laptop with some WiFi and can give a quick log:

When you get a chance, post the assert that the IQ2_KL model hits.

EDIT: Here is the issue: #649

IQ2_KL assert run and log
model=/mnt/raid/hf/Qwen3-Coder-480B-A35B-Instruct-GGUF/IQ2_KL/Qwen3-480B-A35B-Instruct-IQ2_KL-00001-of-00004.gguf

numactl -N 1 -m 1 \
./build/bin/llama-server \
    --model "$model"\
    --alias ubergarm/Qwen3-Coder-480B-A35B-Instruct \
    --ctx-size 196608 \
    -ctk q8_0 -ctv q8_0 \
    -fa -fmoe \
    -ub 4096 -b 4096 \
    --parallel 3 \
    --threads 128 \
    --threads-batch 192 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap

INFO [                    main] build info | tid="127586578487488" timestamp=1753302334 build=3821 commit="1b052109"
INFO [                    main] system info | tid="127586578487488" timestamp=1753302334 n_threads=128 n_threads_batch=192 total_threads=768 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 41 key-value pairs and 747 tensors from /mnt/raid/hf/Qwen3-Coder-480B-A35B-Instruct-GGUF/IQ2_KL/Qwen3-480B-A35B-Instruct-IQ2_KL-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 Coder 480B A35B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen3-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 480B-A35B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv   8:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   9:                       qwen3moe.block_count u32              = 62
llama_model_loader: - kv  10:                    qwen3moe.context_length u32              = 262144
llama_model_loader: - kv  11:                  qwen3moe.embedding_length u32              = 6144
llama_model_loader: - kv  12:               qwen3moe.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:              qwen3moe.attention.head_count u32              = 96
llama_model_loader: - kv  14:           qwen3moe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                    qwen3moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  16:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  17:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  18:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  19:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  20:                          general.file_type u32              = 155
llama_model_loader: - kv  21:                      qwen3moe.expert_count u32              = 160
llama_model_loader: - kv  22:        qwen3moe.expert_feed_forward_length u32              = 2560
llama_model_loader: - kv  23: qwen3moe.expert_shared_feed_forward_length u32              = 0
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  28:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  29:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {% macro render_item_list(item_list, ...
llama_model_loader: - kv  34:                      quantize.imatrix.file str              = /mnt/raid/models/ubergarm/Qwen3-Coder...
llama_model_loader: - kv  35:                   quantize.imatrix.dataset str              = ubergarm-imatrix-calibration-corpus-v...
llama_model_loader: - kv  36:             quantize.imatrix.entries_count i32              = 497
llama_model_loader: - kv  37:              quantize.imatrix.chunks_count i32              = 840
llama_model_loader: - kv  38:                                   split.no u16              = 0
llama_model_loader: - kv  39:                                split.count u16              = 4
llama_model_loader: - kv  40:                        split.tensors.count i32              = 747
llama_model_loader: - type  f32:  311 tensors
llama_model_loader: - type q8_0:  124 tensors
llama_model_loader: - type iq3_k:   62 tensors
llama_model_loader: - type iq4_k:    1 tensors
llama_model_loader: - type iq6_k:  125 tensors
llama_model_loader: - type iq2_kl:  124 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen3moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 262144
llm_load_print_meta: n_embd           = 6144
llm_load_print_meta: n_layer          = 62
llm_load_print_meta: n_head           = 96
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 12
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 160
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 262144
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ2_KL - 2.6875 bpw
llm_load_print_meta: model params     = 480.155 B
llm_load_print_meta: model size       = 169.597 GiB (3.034 BPW) 
llm_load_print_meta: repeating layers = 168.388 GiB (3.024 BPW, 478.288 B parameters)
llm_load_print_meta: general.name     = Qwen3 Coder 480B A35B Instruct
llm_load_print_meta: BOS token        = 11 ','
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 2560
llm_load_tensors: ggml ctx size =    0.33 MiB
llm_load_tensors:        CPU buffer size = 173666.87 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 196608
llama_new_context_with_model: n_batch    = 4096
llama_new_context_with_model: n_ubatch   = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size = 25296.00 MiB
llama_new_context_with_model: KV self size  = 25296.00 MiB, K (q8_0): 12648.00 MiB, V (q8_0): 12648.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     2.32 MiB
llama_new_context_with_model:        CPU compute buffer size =  5184.05 MiB
llama_new_context_with_model: graph nodes  = 2424
llama_new_context_with_model: graph splits = 1
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
GGML_ASSERT(fms.S[j] > 0) failed
GGML_ASSERT(fms.S[j] > 0) failed

GGML_ASSERT(fms.S[j] > 0) failed
/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Inappropriate ioctl for device.
No stack.
The program is not being run.
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Inappropriate ioctl for device.
No stack.
The program is not being run.
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Inappropriate ioctl for device.
No stack.
The program is not being run.
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
warning: process 4140403 is a zombie - the process has already terminated
ptrace: Inappropriate ioctl for device.
No stack.
The program is not being run.
./myscripts/api-server-Qwen3-Coder-480B-A35B-Instruct.sh: line 34: 4140403 Aborted                 (core dumped) numactl -N 1 -m 1 ./build/bin/llama-server --model "$model" --alias ubergarm/Qwen3-Coder-480B-A35B-Instruct --ctx-size 196608 -ctk q8_0 -ctv q8_0 -fa -fmoe -ub 4096 -b 4096 --parallel 3 --threads 128 --threads-batch 192 --numa numactl --host 127.0.0.1 --port 8080 --no-mmap

The IQ3_KT segfault will be much more difficult to fix without a run in the debugger.

EDIT: Here is that issue with debug logs: #650

Yeah, I'll give full logs in its own issue later; it could possibly just be this hardware, as it throws an error in dmesg as well. Here is a quick look:

segfault quantizing iq3_kt
$ sudo dmesg -T --follow

[Wed Jul 23 16:36:14 2025] llama-quantize[4140724]: segfault at 7dd4d780a9d0 ip 00007eb9b81c634f sp 00007fff3c7bfd40 error 4 in libggml.so[9c634f,7eb9b7815000+9be000] likely on CPU 195 (core 3, socket 1)
[Wed Jul 23 16:36:14 2025] Code: ca 0f 87 80 fe ff ff c5 e8 57 d2 c5 f8 28 c2 e9 7f fe ff ff 8b bd 20 ff ff ff 8b b5 24 ff ff ff 8d 14 fd 00 00 00 00 48 63 d2 <c5> fa 10 04 90 48 8d 14 95 04 00 00 00 c5 fa 11 03 c5 fa 10 04 10

#!/usr/bin/env bash

# Repeating Layers [0-61]

custom="
# Attention
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq4_kt
blk\..*\.attn_v.*=iq4_kt
blk\..*\.attn_output.*=iq4_kt

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq3_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt

# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 1 -m 1 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf \
    IQ2_KT \
    192


main: build = 3823 (fd711836)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: quantizing '/mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf' to '/mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf' as IQ2_KT using 192 threads
llama_model_loader: additional 20 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 37 key-value pairs and 747 tensors from /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 Coder 480B A35B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen3-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 480B-A35B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv   8:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   9:                       qwen3moe.block_count u32              = 62
llama_model_loader: - kv  10:                    qwen3moe.context_length u32              = 262144
llama_model_loader: - kv  11:                  qwen3moe.embedding_length u32              = 6144
llama_model_loader: - kv  12:               qwen3moe.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:              qwen3moe.attention.head_count u32              = 96
llama_model_loader: - kv  14:           qwen3moe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                    qwen3moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  16:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  17:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  18:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  19:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  20:                          general.file_type u32              = 32
llama_model_loader: - kv  21:                      qwen3moe.expert_count u32              = 160
llama_model_loader: - kv  22:        qwen3moe.expert_feed_forward_length u32              = 2560
llama_model_loader: - kv  23: qwen3moe.expert_shared_feed_forward_length u32              = 0
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  28:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  29:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {% macro render_item_list(item_list, ...
llama_model_loader: - kv  34:                                   split.no u16              = 0
llama_model_loader: - kv  35:                                split.count u16              = 21
llama_model_loader: - kv  36:                        split.tensors.count i32              = 747
llama_model_loader: - type  f32:  311 tensors
llama_model_loader: - type bf16:  436 tensors
================================ Have weights data with 497 entries
[   1/ 747]                    token_embd.weight - [ 6144, 151936,     1,     1], type =   bf16, Using custom type iq4_kt for tensor token_embd.weight

====== llama_model_quantize_internal: did not find weights for token_embd.weight
converting to iq4_kt .. Adding custom rule blk\..*\.attn_q.* -> iq4_kt
Adding custom rule blk\..*\.attn_k.* -> iq4_kt
Adding custom rule blk\..*\.attn_v.* -> iq4_kt
Adding custom rule blk\..*\.attn_output.* -> iq4_kt
Adding custom rule blk\..*\.ffn_down_exps\.weight -> iq3_kt
Adding custom rule blk\..*\.ffn_(gate|up)_exps\.weight -> iq2_kt
Adding custom rule token_embd\.weight -> iq4_kt
Adding custom rule output\.weight -> iq6_k
load_imatrix: imatrix dataset='ubergarm-imatrix-calibration-corpus-v02.txt'
load_imatrix: loaded 497 importance matrix entries from /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat computed on 840 chunks
prepare_imatrix: have 497 importance matrix entries
size =  1780.50 MiB ->   445.70 MiB
[   2/ 747]             blk.0.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[   3/ 747]                  blk.0.attn_k.weight - [ 6144,  1024,     1,     1], type =   bf16, Using custom type iq4_kt for tensor blk.0.attn_k.weight
converting to iq4_kt .. cluster_points: Oops. Cluster 4 has no points:  0 1 0 0
cluster_points: 1 out of 625 clusters dir not have any points
size =    12.00 MiB ->     3.00 MiB
[   4/ 747]             blk.0.attn_output.weight - [12288,  6144,     1,     1], type =   bf16, Using custom type iq4_kt for tensor blk.0.attn_output.weight
converting to iq4_kt .. size =   144.00 MiB ->    36.02 MiB
[   5/ 747]             blk.0.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[   6/ 747]                  blk.0.attn_q.weight - [ 6144, 12288,     1,     1], type =   bf16, Using custom type iq4_kt for tensor blk.0.attn_q.weight
converting to iq4_kt .. size =   144.00 MiB ->    36.05 MiB
[   7/ 747]                  blk.0.attn_v.weight - [ 6144,  1024,     1,     1], type =   bf16, Using custom type iq4_kt for tensor blk.0.attn_v.weight
converting to iq4_kt .. size =    12.00 MiB ->     3.00 MiB
[   8/ 747]               blk.0.attn_norm.weight - [ 6144,     1,     1,     1], type =    f32, size =    0.023 MB
[   9/ 747]           blk.0.ffn_down_exps.weight - [ 2560,  6144,   160,     1], type =   bf16, Using custom type iq3_kt for tensor blk.0.ffn_down_exps.weight
converting to iq3_kt .. ./myscripts/quantize-Qwen3-Coder-480B-A35B-Instruct-v08.sh: line 33: 2323451 Segmentation fault      (core dumped) numactl -N 0 -m 0 ./build/bin/llama-quantize --custom-q "$custom" --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf IQ2_KT 192
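
A debugger run along the lines ik asked for might look like this (a sketch reusing the script variables and paths above; numactl dropped for simplicity, standard gdb usage):

gdb -ex run -ex bt --args ./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf \
    IQ2_KT 192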

@ThomasBaruzier

I can open a 3rd issue for the MLA stuff and put all the notes in one place along with ik's comments above, and we can work together to figure out what is going on. Thanks!

EDIT: Here is that issue now: #651

@Nexesenex (Contributor)

Not much is known about IQ4_KSS, and nobody seems to be using it. So, I decided to give it some attention.

@ikawrakow: And now that it has CUDA MMQ, I will use it! Thanks for completing it!

And have a great time off!

@ThomasBaruzier (Contributor)

So, in short, just try running without -amb. First with --verbosity 2 to see if the imatrix data collection function gets called with attn_k_b and attn_v_b. If yes, rerun the imatrix calculation that way. If it still doesn't work, it will have to wait until I come back.

Thank you for the detailed explanation! Since I rely on @ubergarm's imatrix due to hardware limitations (no pressure as well), I won't be able to verify this on my end right now. You'll be back in two weeks anyway (have a great time!).

Just got a laptop with some WiFi

You seem like someone who would really appreciate Termux. Apologies for the poor internet; it seems we're all on vacation/away 😅

[video: termux.mp4]

I can open a 3rd issue for the MLA stuff and put all the notes in one place along with ik's comments above, and we can work together to figure out what is going on

That sounds really nice! Thanks

@ubergarm (Contributor)

[plot: ppl-Qwen3-235B-Thinking-2507]

The IQ4_KSS is looking like a pretty good spot for ubergarm/Qwen3-235B-A22B-Thinking-2507

@ubergarm (Contributor)

ubergarm commented Jul 27, 2025

I used Qwen3-Coder-480B-A35B-Instruct-IQ5_K to vibe code up some new matplotlib software and fix up my Y-axis log scale to be more similar to some of ik's plots I've seen. The IQ4_KSS recipes seem quite strong. They differ slightly from each other; the exact recipes are in the links below.

[plots: ppl-Qwen3-235B-Instruct-2507, ppl-Qwen3-235B-Thinking-2507]

UPDATE

And I just finished up the bigger Qwen3-Coder-480B-A35B-Instruct-GGUF IQ4_KSS:

[plot: Qwen3-Coder-480B-ppl]

(*Note that the IQ2_K here uses iq2_kl for ffn_down_exps instead of the larger iq3_k, so it is right in line with what an IQ2_KS would be in terms of size and PPL.)

@ubergarm (Contributor)

Haha, sorry for spamming up this closed PR, but that IQ4_KSS is looking good here again with today's latest Qwen3-30B MoE! (The coder might come out tomorrow.) I hope to get some llama-sweep-bench runs eventually too for perf comparisons.

[plot: ppl-Qwen3-30B-A3B-Thinking-2507]

@ThomasBaruzier (Contributor)

Haha, sorry for spamming up this closed PR, but that IQ4_KSS is looking good here again with today's latest Qwen3-30B MoE! (The coder might come out tomorrow.) I hope to get some llama-sweep-bench runs eventually too for perf comparisons.

From ikawrakow:

IQ4_KSS uses exactly 4.0 bpw just like IQ4_KT
Performance on CUDA is very similar to IQ4_KT (after this PR)
PP CPU performance is similar to IQ4_KT (after this PR)
TG CPU performance is quite a bit better than IQ4_KT
PPL is only slightly worse than IQ4_KT

If you have the chance, could you please compare IQ4_KSS to IQ4_KT in PPL and in TG/PP speed?

@ubergarm (Contributor)

@ThomasBaruzier

If you have the chance, could you please compare IQ4_KSS to IQ4_KT in PPL and in TG/PP speed?

Hrmm, good idea. I'm already comparing Q4_0, IQ4_KSS, and IQ3_KT with some llama-sweep-bench runs and getting interesting results. I'm now cooking a basically "pure" IQ4_KT to compare; it will be slightly smaller than the IQ4_KSS, which has a few juiced layers and slightly boosted attn tensors.

Just a teaser: it looks like TG performance is faster across the board on the Vulkan backend with CUDA 12.9 NV_coopmat2, despite not being able to use -fmoe there. However, increasing batch sizes to -ub 4096 -b 4096 does not boost speed there, so PP on CUDA is much faster. I'll have some graphs soon!
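
A per-quant comparison run would look something like this (a sketch with hypothetical paths and context/thread settings, using the usual ik_llama.cpp common flags):

./build/bin/llama-sweep-bench -m Qwen3-30B-A3B-IQ4_KSS.gguf \
    -c 16384 -ub 2048 -b 2048 -fa -fmoe -t 16
./build/bin/llama-sweep-bench -m Qwen3-30B-A3B-IQ4_KT.gguf \
    -c 16384 -ub 2048 -b 2048 -fa -fmoe -t 16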

@ubergarm (Contributor)

@ThomasBaruzier

Okay, we have some interesting results! As for quality, the IQ4_KT could look a bit low, as the IQ4_KSS has larger iq6_k/iq5_k attn tensors and a few juiced q8_0 layers (the first and last layers are juiced per the --layer-similarity recommendation and some quant cooking lore lol...).

[plot: ppl-Qwen3-30B-A3B-Thinking-2507]

As for speed, the IQ4_KT is among the fastest on the CUDA backend for both PP and TG.

I believe the Vulkan backend only supports the legacy quants (e.g. q4_0, q8_0) and likely the mainline quants (e.g. q4_K, iq3_xs). I didn't try the KT quants on the AMD backend, but I'm guessing they wouldn't work?

[plot: sweep-bench-Qwen3-30B-A3B-Thinking-2507]

@ThomasBaruzier (Contributor)

Thanks!

Well, that's weird. Shouldn't IQ4_KT be higher quality thanks to trellis quantization? Oh, nvm, I see that the rest of the tensors are different between the two models you tested. Would you have the time to compare an IQ4_KT with the same "juiced" layers as the IQ4_KSS, for fairness? Maybe this new mix could have a better PPL for its size compared to the current IQ4_KSS?

Damn, that reminded me of an old idea where we could treat quant mixes as an optimization problem and iteratively brute-force our way to the lowest PPL for a given size. Wait, isn't this what Unsloth brands as Dynamic Quants 2.0? But you beat them most of the time with your mixes, lmao? Or is it because they simply use quants from mainline, I assume?
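
(A pure sketch of that idea, not something either repo ships: loop over a few candidate mixes, quantize, measure PPL, and keep the best size/PPL trade-off. The binary invocations follow the scripts earlier in this thread; paths and the candidate list are made up.)

#!/usr/bin/env bash
# Brute-force a tiny grid of (ffn_down, ffn_gate/up) quant type pairs
for pair in "iq4_ks:iq4_kss" "iq4_ks:iq3_k" "iq4_kt:iq4_kt"; do
  down=${pair%%:*}; gateup=${pair##*:}
  out="mix-${down}-${gateup}.gguf"
  custom="blk\..*\.ffn_down_exps\.weight=${down},blk\..*\.ffn_(gate|up)_exps\.weight=${gateup}"
  ./build/bin/llama-quantize --custom-q "$custom" --imatrix imatrix.dat \
      Model-BF16.gguf "$out" IQ4_KSS 32
  ppl=$(./build/bin/llama-perplexity -m "$out" -f wiki.test.raw -t 32 2>&1 \
        | grep -oP 'Final estimate: PPL = \K[0-9.]+')
  printf '%s/%s  size=%s  ppl=%s\n' "$down" "$gateup" "$(du -h "$out" | cut -f1)" "$ppl"
done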

Also, how much do you gain with --layer-similarity in general?

@ubergarm (Contributor)

ubergarm commented Jul 31, 2025

Would you have the time to compare an IQ4_KT with the same "juiced" layers as the IQ4_KSS, for fairness? Maybe this new mix could have a better PPL for its size compared to the current IQ4_KSS?

That is indeed the next question. I almost did it twice but had to take a dinner break lol, but now it's on its way. EDIT: see the next comment for the graph including this data:

👈 The ~4BPW Quant Data
[
  {
    "name": "IQ4_KSS",
    "ppl": "7.3861 +/- 0.05128",
    "size": 15.531,
    "bpw": 4.370,
    "legend": "ubergarm",
    "comment": "iq6_k k|v, iq5_k q|o, juiced attn layers 0, iq4_ks down, iq4_kss gate|up, juiced ffn layers 0|47, iq4_k/iq6_k embd/output, eaddario-imatrix-corpus-combined-all-medium.txt"
  },
  {
    "name": "juiced-IQ4_KT",
    "ppl": "7.4226 +/- 0.05154",
    "size": 15.244,
    "bpw": 4.289,
    "legend": "ubergarm",
    "comment": "iq6_k k|v, iq5_k q|o, juiced attn layers 0, iq4_kt down, iq4_kt gate|up, juiced ffn layers 0|47, iq4_kt/iq6_k embd/output, eaddario-imatrix-corpus-combined-all-medium.txt"
  },
  {
    "name": "IQ4_KT",
    "ppl": "7.5020 +/- 0.05230",
    "size": 14.438,
    "bpw": 4.062,
    "legend": "ubergarm",
    "comment": "mostly pure iq4_kt iq4_kt/iq6_k embd/output, eaddario-imatrix-corpus-combined-all-medium.txt"
  }
]

But you beat them most of the time with your mixes, lmao? Or is it because they simply use quants from mainline, I assume?

Haha, yeah, this is what they put on their model card:

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Here is what I put on my model card:

These quants provide best in class perplexity for the given memory footprint.

Maybe both can be true if you don't consider me as providing "leading quants"! 😹

I'm not convinced that the way Unsloth varies tensor quantization types across layers gives particularly better performance (speed or perplexity). I think it's a balance of trade-offs between:

  1. lower perplexity for a given BPW
  2. PP and TG inferencing speed for a given hardware configuration breakpoints (e.g. 24GB RAM or 24GB VRAM)
  3. Support on various platforms e.g. vulkan backend AMD GPUs, NEON, CPU, CUDA etc...

Usually you can make a pretty good quant with a decent balance by a combination of:

  1. decent imatrix corpus with enough data to activate most of the routed experts
  2. using ik's newer SOTA quants
  3. following the basic quant cooking lore:
  • more BPW tends to give lower PPL
  • PP tends to be CPU bound
  • TG tends to be memory bound (except for KT quants)
  • some quants perform differently at different batch sizes
  • you have to benchmark to develop intuition (read: superstition 😹)
  • throw away failed experimental recipes and iterate trial and error
  • you can afford to make attn_(k|v) a bit larger as they are relatively small overall, but it can slow down TG a bit since they become a larger proportion of the activated weights
  • make the first and last 10% of layers one notch bigger
  • luck
  • make ffn_down one notch bigger than ffn_(gate|up)
  • try various backends
  • on CPU, it helps if you have the experimental avx_vnni PR compiled in and the quant uses the q8_k_r8 kernel
  • keep tabs on other quant types as well for speed and quality, e.g. EXL3, as discussed by bullerwins and me on Hugging Face here

It is a fun little multi-variable human gradient descent hobby! 😹
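
To make the lore concrete, here is an illustrative --custom-q recipe in the same style as the scripts earlier in the thread (types and layer ranges are just an example for a hypothetical 48-layer MoE, not a recommendation):

custom="
# Attention: k|v one notch bigger than q|o (small tensors, outsized impact)
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_output.*=iq5_k

# First and last few layers one notch bigger
blk\.(0|1|2|3|44|45|46|47)\.ffn_down_exps\.weight=iq5_ks
blk\.(0|1|2|3|44|45|46|47)\.ffn_(gate|up)_exps\.weight=iq4_ks

# Everything else: ffn_down one notch bigger than ffn_(gate|up)
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
"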

Regarding the second half of your question, Unsloth expressed some possible interest in releasing Unsloth ik_llama.cpp quants in another post on this repo in the past. And yes, it probably would help them push the Pareto curve ever downward as well by using these new quants.

Also, how much do you gain with --layer-similarity in general?

I haven't fully explored it, e.g. by automating some kind of --layer-similarity to --custom-q recipe script generator. In general I don't worry too much about it, but I do look at it to see whether any unusual layers are reported as more "important" and roughly how much more "important" the first and last few layers might be.

👈 --layer-similarity for Qwen3-30B-A3B-Thinking-2507

Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat

======================== sorted layer importances
  0: Layer  47, <cos_sim> = 0.297816
  1: Layer   0, <cos_sim> = 0.305244
  2: Layer   1, <cos_sim> = 0.709352
  3: Layer  28, <cos_sim> = 0.830869
  4: Layer   2, <cos_sim> = 0.844787
  5: Layer   7, <cos_sim> = 0.861447
  6: Layer  29, <cos_sim> = 0.864968
  7: Layer   3, <cos_sim> = 0.880728
  8: Layer   8, <cos_sim> = 0.892042
  9: Layer   6, <cos_sim> = 0.905458
 10: Layer   5, <cos_sim> = 0.90886
 11: Layer  42, <cos_sim> = 0.914703
 12: Layer   4, <cos_sim> = 0.915015
 13: Layer  17, <cos_sim> = 0.91581
 14: Layer  13, <cos_sim> = 0.921882
 15: Layer  46, <cos_sim> = 0.926183
 16: Layer  45, <cos_sim> = 0.932304
 17: Layer  19, <cos_sim> = 0.936483
 18: Layer  18, <cos_sim> = 0.937157
 19: Layer  31, <cos_sim> = 0.940826
 20: Layer  14, <cos_sim> = 0.942221
 21: Layer  40, <cos_sim> = 0.944539
 22: Layer   9, <cos_sim> = 0.94595
 23: Layer  10, <cos_sim> = 0.94767
 24: Layer  25, <cos_sim> = 0.948227
 25: Layer  11, <cos_sim> = 0.94864
 26: Layer  32, <cos_sim> = 0.948681
 27: Layer  37, <cos_sim> = 0.949749
 28: Layer  41, <cos_sim> = 0.951289
 29: Layer  39, <cos_sim> = 0.952341
 30: Layer  12, <cos_sim> = 0.953235
 31: Layer  44, <cos_sim> = 0.953276
 32: Layer  16, <cos_sim> = 0.95375
 33: Layer  20, <cos_sim> = 0.954073
 34: Layer  38, <cos_sim> = 0.954789
 35: Layer  22, <cos_sim> = 0.955904
 36: Layer  15, <cos_sim> = 0.956555
 37: Layer  21, <cos_sim> = 0.956733
 38: Layer  23, <cos_sim> = 0.957164
 39: Layer  43, <cos_sim> = 0.958506
 40: Layer  30, <cos_sim> = 0.958633
 41: Layer  27, <cos_sim> = 0.959653
 42: Layer  24, <cos_sim> = 0.960708
 43: Layer  36, <cos_sim> = 0.964712
 44: Layer  26, <cos_sim> = 0.964958
 45: Layer  35, <cos_sim> = 0.965977
 46: Layer  34, <cos_sim> = 0.968197
 47: Layer  33, <cos_sim> = 0.972509

======================== sorted attention importances
  0: Layer   0, <cos_sim> = 0.373726
  1: Layer  45, <cos_sim> = 0.621582
  2: Layer   1, <cos_sim> = 0.668392
  3: Layer  29, <cos_sim> = 0.675207
  4: Layer  17, <cos_sim> = 0.704994
  5: Layer  21, <cos_sim> = 0.708088
  6: Layer   3, <cos_sim> = 0.712065
  7: Layer  44, <cos_sim> = 0.719689
  8: Layer  22, <cos_sim> = 0.726337
  9: Layer  42, <cos_sim> = 0.728414
 10: Layer  23, <cos_sim> = 0.734638
 11: Layer  18, <cos_sim> = 0.734929
 12: Layer  24, <cos_sim> = 0.735911
 13: Layer   8, <cos_sim> = 0.73788
 14: Layer  33, <cos_sim> = 0.741519
 15: Layer  27, <cos_sim> = 0.742112
 16: Layer  46, <cos_sim> = 0.742959
 17: Layer  30, <cos_sim> = 0.745445
 18: Layer  34, <cos_sim> = 0.746015
 19: Layer  47, <cos_sim> = 0.746472
 20: Layer   9, <cos_sim> = 0.746761
 21: Layer   6, <cos_sim> = 0.748994
 22: Layer  20, <cos_sim> = 0.752889
 23: Layer   2, <cos_sim> = 0.753263
 24: Layer  41, <cos_sim> = 0.754112
 25: Layer  25, <cos_sim> = 0.755797
 26: Layer  26, <cos_sim> = 0.755917
 27: Layer  28, <cos_sim> = 0.75632
 28: Layer  43, <cos_sim> = 0.757009
 29: Layer  35, <cos_sim> = 0.758833
 30: Layer   4, <cos_sim> = 0.75965
 31: Layer  10, <cos_sim> = 0.766588
 32: Layer  36, <cos_sim> = 0.768189
 33: Layer  19, <cos_sim> = 0.768958
 34: Layer  32, <cos_sim> = 0.769336
 35: Layer  11, <cos_sim> = 0.771553
 36: Layer  31, <cos_sim> = 0.781223
 37: Layer  16, <cos_sim> = 0.785931
 38: Layer   7, <cos_sim> = 0.786268
 39: Layer  15, <cos_sim> = 0.787708
 40: Layer   5, <cos_sim> = 0.790609
 41: Layer  12, <cos_sim> = 0.791013
 42: Layer  37, <cos_sim> = 0.792411
 43: Layer  14, <cos_sim> = 0.794113
 44: Layer  39, <cos_sim> = 0.794925
 45: Layer  38, <cos_sim> = 0.795931
 46: Layer  40, <cos_sim> = 0.799352
 47: Layer  13, <cos_sim> = 0.802178

======================== sorted ffn importances
  0: Layer  47, <cos_sim> = 0.533469
  1: Layer  44, <cos_sim> = 0.622946
  2: Layer   0, <cos_sim> = 0.643964
  3: Layer  28, <cos_sim> = 0.67538
  4: Layer   7, <cos_sim> = 0.684103
  5: Layer  16, <cos_sim> = 0.69021
  6: Layer  21, <cos_sim> = 0.703409
  7: Layer  43, <cos_sim> = 0.703716
  8: Layer  20, <cos_sim> = 0.703982
  9: Layer   1, <cos_sim> = 0.709765
 10: Layer  45, <cos_sim> = 0.711489
 11: Layer  46, <cos_sim> = 0.715068
 12: Layer  33, <cos_sim> = 0.721819
 13: Layer  19, <cos_sim> = 0.725088
 14: Layer  22, <cos_sim> = 0.72533
 15: Layer  32, <cos_sim> = 0.730856
 16: Layer   3, <cos_sim> = 0.731085
 17: Layer   8, <cos_sim> = 0.731686
 18: Layer   9, <cos_sim> = 0.736359
 19: Layer  23, <cos_sim> = 0.736744
 20: Layer   2, <cos_sim> = 0.737244
 21: Layer  31, <cos_sim> = 0.739362
 22: Layer  24, <cos_sim> = 0.743266
 23: Layer  34, <cos_sim> = 0.743324
 24: Layer  41, <cos_sim> = 0.744927
 25: Layer  40, <cos_sim> = 0.749878
 26: Layer  10, <cos_sim> = 0.75342
 27: Layer  26, <cos_sim> = 0.753776
 28: Layer  27, <cos_sim> = 0.758283
 29: Layer  17, <cos_sim> = 0.759731
 30: Layer  35, <cos_sim> = 0.763794
 31: Layer  18, <cos_sim> = 0.765849
 32: Layer   6, <cos_sim> = 0.766675
 33: Layer  42, <cos_sim> = 0.767223
 34: Layer  36, <cos_sim> = 0.767253
 35: Layer  29, <cos_sim> = 0.767677
 36: Layer   4, <cos_sim> = 0.770757
 37: Layer  25, <cos_sim> = 0.771877
 38: Layer  30, <cos_sim> = 0.778096
 39: Layer  12, <cos_sim> = 0.784316
 40: Layer   5, <cos_sim> = 0.785474
 41: Layer  15, <cos_sim> = 0.787438
 42: Layer  11, <cos_sim> = 0.790912
 43: Layer  39, <cos_sim> = 0.79183
 44: Layer  14, <cos_sim> = 0.795523
 45: Layer  38, <cos_sim> = 0.79796
 46: Layer  13, <cos_sim> = 0.816884
 47: Layer  37, <cos_sim> = 0.819399
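
(A hypothetical helper, not part of the repo: pull the N least-similar layers out of a saved --layer-similarity log like the output above and emit --custom-q rules that bump their routed experts one notch.)

N=4
layers=$(grep -A "$N" 'sorted layer importances' layer-similarity.log \
  | grep -oP 'Layer\s+\K[0-9]+' | head -n "$N" | paste -sd'|' -)
echo "blk\.(${layers})\.ffn_down_exps\.weight=iq5_ks"
echo "blk\.(${layers})\.ffn_(gate|up)_exps\.weight=iq4_ks"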

@ubergarm (Contributor)

Okay, here are the results! It seems like the extra BPW from the larger iq4_ks ffn_down_exps is winning out over the iq4_kt quality, at least in this one specific case. We don't have iq5_kt or iq6_kt, which could make an interesting combination for the attn.* tensors. Anyway, without further ado:

[plot: ppl-Qwen3-30B-A3B-Thinking-2507]

@ThomasBaruzier (Contributor)

ThomasBaruzier commented Aug 1, 2025

Usually you can make a pretty good quant with a decent balance by a combination of ...

Thanks for the insights, I will definitely use some of these points later

keep tabs on other quant types as well for speed and quality, e.g. EXL3, as discussed by bullerwins and me on Hugging Face here

I was daily driving EXL2 and <= 120B models before all those MoEs came out; now it's impossible to go back 😂 (it's also way more fun to run something you aren't supposed to be able to run on your hardware)

Waiting for TP in EXL3 before trying again... (or maybe? #627)

It is a fun little multi-variable human gradient descent hobby!

Looks like a full-time job, tbh

Thanks again for these results
