
Conversation

@ngxson (Collaborator) commented Nov 6, 2025

Need help testing this.

Model: https://huggingface.co/moonshotai/Kimi-K2-Thinking

@github-actions github-actions bot added the python (python script changes) label Nov 6, 2025
@csabakecskemeti (Contributor)

@ngxson still downloading the model but will test and report back!

@ngxson (Collaborator, Author) commented Nov 6, 2025

The output GGUF quantized to Q8_0 will be over 1 terabyte. Now I doubt I even have enough memory to test it.

@csabakecskemeti (Contributor) commented Nov 6, 2025

Over how much? :) I have ~1.1 TB RAM + 64 GB VRAM.

@ngxson (Collaborator, Author) commented Nov 6, 2025

python convert_hf_to_gguf.py --outfile model.gguf --outtype q8_0 .

The output GGUF will be 1.09 TB.
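
For a rough sanity check on that figure: ggml's Q8_0 stores each block of 32 weights as 32 int8 values plus one f16 scale, i.e. 34 bytes per 32 weights. A minimal back-of-the-envelope sketch, where the total parameter count is an assumption (roughly the ~1T total parameters reported for Kimi-K2):

```python
# Rough Q8_0 size estimate. Assumption: ~1.03e12 total parameters.
# Q8_0 layout: blocks of 32 int8 quants + one f16 scale = 34 bytes per 32 weights.
total_params = 1.03e12
bytes_per_weight = 34 / 32            # ~1.0625 bytes per weight
estimated_tb = total_params * bytes_per_weight / 1e12
print(f"~{estimated_tb:.2f} TB")      # prints ~1.09 TB, in line with the figure above
```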

@ubergarm commented Nov 6, 2025

Exciting, thanks for looking into this one y'all!

Well, it started off strong, but then died with RuntimeError: Tensor on device cpu is not on the expected device meta!

I'm on a CPU-only rig with 1.5 TB RAM and plenty of disk space, but no GPUs. For what it's worth, I have triton-cpu installed instead of triton in my Python venv.

Also had to manually press Y to accept in lieu of trust_remote_code=True.
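
For anyone else hitting that prompt: the interactive confirmation comes from transformers' custom-code loading and can be skipped by passing trust_remote_code=True explicitly. A minimal sketch, not the converter's actual code path (the local path is the one from the log below):

```python
# Minimal sketch: load the config non-interactively by trusting the repo's
# custom code. Not convert_hf_to_gguf.py's actual code path.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "/mnt/data/models/moonshotai/Kimi-K2-Thinking",
    trust_remote_code=True,   # skips the manual "Y" confirmation
)
print(cfg.architectures)      # e.g. ['DeepseekV3ForCausalLM'], as in the log
```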

👈 Details: command and full logs
$ numactl -N 1 -m 1 \
python \
    convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF \
    /mnt/data/models/moonshotai/Kimi-K2-Thinking/

INFO:hf-to-gguf:Loading model: Kimi-K2-Thinking
WARNING:hf-to-gguf:Failed to load model config from /mnt/data/models/moonshotai/Kimi-K2-Thinking: The repository /mnt/data/models/moonshotai/Kimi-K2-Thinking contains custom code which must be executed to correctly load the model. You can inspect the repository content at /mnt/data/models/moonshotai/Kimi-K2-Thinking .
 You can inspect the repository content at https://hf.co//mnt/data/models/moonshotai/Kimi-K2-Thinking.
Please pass the argument `trust_remote_code=True` to allow custom code to be run.
WARNING:hf-to-gguf:Trying to load config.json instead
INFO:hf-to-gguf:Model architecture: DeepseekV3ForCausalLM
WARNING:hf-to-gguf:Failed to load model config from /mnt/data/models/moonshotai/Kimi-K2-Thinking: The repository /mnt/data/models/moonshotai/Kimi-K2-Thinking contains custom code which must be executed to correctly load the model. You can inspect the repository content at /mnt/data/models/moonshotai/Kimi-K2-Thinking .
 You can inspect the repository content at https://hf.co//mnt/data/models/moonshotai/Kimi-K2-Thinking.
Please pass the argument `trust_remote_code=True` to allow custom code to be run.
WARNING:hf-to-gguf:Trying to load config.json instead
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: indexing model part 'model-00001-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00002-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00003-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00004-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00005-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00006-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00007-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00008-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00009-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00010-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00011-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00012-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00013-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00014-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00015-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00016-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00017-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00018-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00019-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00020-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00021-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00022-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00023-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00024-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00025-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00026-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00027-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00028-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00029-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00030-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00031-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00032-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00033-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00034-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00035-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00036-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00037-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00038-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00039-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00040-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00041-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00042-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00043-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00044-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00045-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00046-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00047-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00048-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00049-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00050-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00051-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00052-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00053-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00054-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00055-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00056-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00057-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00058-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00059-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00060-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00061-of-000062.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00062-of-000062.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:blk.0.attn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.0.ffn_down.weight,        torch.bfloat16 --> BF16, shape = {18432, 7168}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,        torch.bfloat16 --> BF16, shape = {7168, 18432}
INFO:hf-to-gguf:blk.0.ffn_up.weight,          torch.bfloat16 --> BF16, shape = {7168, 18432}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.0.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.0.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.0.attn_k_b.weight,        torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.0.attn_v_b.weight,        torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.0.attn_output.weight,     torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.0.attn_q_a_norm.weight,   torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.0.attn_q_a.weight,        torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.0.attn_q_b.weight,        torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.1.attn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.1.exp_probs_b.bias,       torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.1.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.1.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.1.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.1.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.1.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.1.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.1.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.1.attn_k_b.weight,        torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.1.attn_v_b.weight,        torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.1.attn_output.weight,     torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.1.attn_q_a_norm.weight,   torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.1.attn_q_a.weight,        torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.1.attn_q_b.weight,        torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.2.attn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.2.exp_probs_b.bias,       torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.2.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.2.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.2.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.2.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.2.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.2.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.2.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.2.attn_k_b.weight,        torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.2.attn_v_b.weight,        torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.2.attn_output.weight,     torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.2.attn_q_a_norm.weight,   torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.2.attn_q_a.weight,        torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.2.attn_q_b.weight,        torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.3.attn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.3.exp_probs_b.bias,       torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.3.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.3.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.3.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.3.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.3.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.3.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.3.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.3.attn_k_b.weight,        torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.3.attn_v_b.weight,        torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.3.attn_output.weight,     torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.3.attn_q_a_norm.weight,   torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.3.attn_q_a.weight,        torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.3.attn_q_b.weight,        torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.4.attn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.4.exp_probs_b.bias,       torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.4.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.4.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.4.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.4.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.4.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.4.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.4.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.4.attn_k_b.weight,        torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.4.attn_v_b.weight,        torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.4.attn_output.weight,     torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.4.attn_q_a_norm.weight,   torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.4.attn_q_a.weight,        torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.4.attn_q_b.weight,        torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.5.attn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.5.exp_probs_b.bias,       torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.5.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.5.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.5.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.5.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.5.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.5.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.5.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.5.attn_k_b.weight,        torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.5.attn_v_b.weight,        torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.5.attn_output.weight,     torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.5.attn_q_a_norm.weight,   torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.5.attn_q_a.weight,        torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.5.attn_q_b.weight,        torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.6.attn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.6.exp_probs_b.bias,       torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.6.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.6.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.6.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.6.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.6.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.6.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.6.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.6.attn_k_b.weight,        torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.6.attn_v_b.weight,        torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.6.attn_output.weight,     torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.6.attn_q_a_norm.weight,   torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.6.attn_q_a.weight,        torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.6.attn_q_b.weight,        torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.7.attn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.7.exp_probs_b.bias,       torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.7.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.7.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.7.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.7.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.7.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.7.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.7.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.7.attn_k_b.weight,        torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.7.attn_v_b.weight,        torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.7.attn_output.weight,     torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.7.attn_q_a_norm.weight,   torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.7.attn_q_a.weight,        torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.7.attn_q_b.weight,        torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.8.attn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.8.exp_probs_b.bias,       torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.8.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.8.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.8.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.8.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.8.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.8.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.8.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.8.attn_k_b.weight,        torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.8.attn_v_b.weight,        torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.8.attn_output.weight,     torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.8.attn_q_a_norm.weight,   torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.8.attn_q_a.weight,        torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.8.attn_q_b.weight,        torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.9.attn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.9.exp_probs_b.bias,       torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.9.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.9.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.9.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.9.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.9.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.9.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.9.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.9.attn_k_b.weight,        torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.9.attn_v_b.weight,        torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.9.attn_output.weight,     torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.9.attn_q_a_norm.weight,   torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.9.attn_q_a.weight,        torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.9.attn_q_b.weight,        torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.10.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.10.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.10.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.10.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.10.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.10.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.10.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.10.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.10.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.10.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.10.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.10.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.10.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.10.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.10.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.11.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.11.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.11.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.11.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.11.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.11.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.11.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.11.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.11.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.11.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.11.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.11.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.11.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.11.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.11.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.12.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.12.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.12.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.12.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.12.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.12.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.12.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.12.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.12.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.12.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.12.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.12.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.12.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.12.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.12.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.13.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.13.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.13.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.13.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.13.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.13.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.13.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.13.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.13.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.13.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.13.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.13.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.13.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.13.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.13.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.14.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.14.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.14.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.14.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.14.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.14.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.14.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.14.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.14.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.14.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.14.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.14.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.14.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.14.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.14.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.15.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.15.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.15.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.15.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.15.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.15.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.15.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.15.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.15.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.15.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.15.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.15.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.15.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.15.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.15.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.16.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.16.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.16.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.16.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.16.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.16.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.16.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.16.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.16.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.16.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.16.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.16.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.16.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.16.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.16.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.17.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.17.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.17.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.17.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.17.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.17.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.17.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.17.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.17.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.17.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.17.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.17.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.17.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.17.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.17.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.18.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.18.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.18.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.18.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.18.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.18.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.18.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.18.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.18.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.18.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.18.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.18.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.18.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.18.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.18.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.19.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.19.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.19.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.19.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.19.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.19.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.19.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.19.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.19.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.19.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.19.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.19.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.19.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.19.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.19.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.20.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.20.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.20.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.20.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.20.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.20.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.20.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.20.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.20.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.20.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.20.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.20.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.20.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.20.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.20.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.21.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.21.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.21.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.21.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.21.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.21.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.21.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.21.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.21.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.21.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.21.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.21.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.21.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.21.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.21.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.22.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.22.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.22.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.22.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.22.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.22.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.22.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.22.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.22.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.22.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.22.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.22.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.22.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.22.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.22.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.23.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.23.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.23.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.23.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.23.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.23.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.23.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.23.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.23.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.23.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.23.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.23.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.23.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.23.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.23.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.24.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.24.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.24.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.24.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.24.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.24.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.24.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.24.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.24.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.24.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.24.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.24.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.24.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.24.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.24.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.25.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.25.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.25.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.25.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.25.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.25.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.25.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.25.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.25.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.25.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.25.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.25.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.25.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.25.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.25.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.26.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.26.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.26.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.26.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.26.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.26.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.26.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.26.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.26.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.26.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.26.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.26.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.26.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.26.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.26.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.27.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.27.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.27.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.27.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.27.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.27.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.27.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.27.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.27.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.27.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.27.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.27.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.27.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.27.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.27.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.28.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.28.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.28.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.28.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.28.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.28.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.28.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.28.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.28.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.28.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.28.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.28.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.28.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.28.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.28.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.29.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.29.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.29.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.29.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.29.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.29.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.29.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.29.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.29.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.29.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.29.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.29.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.29.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.29.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.29.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.30.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.30.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.30.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.30.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.30.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.30.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.30.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.30.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.30.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.30.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.30.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.30.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.30.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.30.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.30.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.31.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.31.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.31.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.31.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.31.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.31.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.31.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.31.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.31.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.31.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.31.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.31.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.31.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.31.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.31.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.32.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.32.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.32.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.32.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.32.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.32.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.32.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.32.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.32.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.32.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.32.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.32.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.32.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.32.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.32.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.33.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.33.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.33.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.33.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.33.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.33.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.33.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.33.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.33.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.33.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.33.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.33.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.33.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.33.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.33.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.34.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.34.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.34.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.34.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.34.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.34.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.34.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.34.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.34.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.34.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.34.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.34.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.34.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.34.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.34.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.35.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.35.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.35.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.35.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.35.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.35.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.35.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.35.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.35.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.35.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.35.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.35.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.35.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.35.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.35.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.36.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.36.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.36.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.36.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.36.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.36.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.36.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.36.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.36.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.36.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.36.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.36.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.36.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.36.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.36.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.37.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.37.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.37.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.37.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.37.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.37.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.37.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.37.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.37.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.37.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.37.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.37.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.37.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.37.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.37.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.38.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.38.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.38.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.38.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.38.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.38.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.38.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.38.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.38.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.38.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.38.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.38.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.38.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.38.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.38.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.39.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.39.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.39.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.39.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.39.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.39.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.39.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.39.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.39.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.39.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.39.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.39.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.39.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.39.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.39.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.40.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.40.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.40.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.40.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.40.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.40.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.40.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.40.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.40.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.40.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.40.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.40.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.40.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.40.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.40.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.41.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.41.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.41.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.41.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.41.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.41.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.41.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.41.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.41.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.41.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.41.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.41.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.41.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.41.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.41.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.42.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.42.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.42.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.42.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.42.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.42.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.42.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.42.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.42.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.42.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.42.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.42.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.42.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.42.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.42.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.43.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.43.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.43.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.43.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.43.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.43.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.43.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.43.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.43.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.43.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.43.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.43.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.43.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.43.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.43.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.44.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.44.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.44.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.44.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.44.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.44.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.44.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.44.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.44.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.44.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.44.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.44.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.44.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.44.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.44.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.45.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.45.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.45.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.45.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.45.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.45.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.45.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.45.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.45.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.45.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.45.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.45.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.45.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.45.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.45.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.46.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.46.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.46.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.46.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.46.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.46.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.46.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.46.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.46.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.46.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.46.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.46.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.46.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.46.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.46.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.47.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.47.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.47.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.47.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.47.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.47.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.47.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.47.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.47.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.47.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.47.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.47.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.47.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.47.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.47.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.48.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.48.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.48.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.48.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.48.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.48.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.48.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.48.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.48.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.48.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.48.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.48.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.48.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.48.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.48.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.49.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.49.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.49.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.49.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.49.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.49.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.49.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.49.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.49.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.49.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.49.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.49.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.49.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.49.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.49.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.50.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.50.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.50.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.50.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.50.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.50.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.50.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.50.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.50.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.50.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.50.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.50.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.50.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.50.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.50.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.51.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.51.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.51.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.51.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.51.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.51.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.51.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.51.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.51.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.51.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.51.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.51.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.51.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.51.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.51.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.52.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.52.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.52.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.52.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.52.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.52.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.52.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.52.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.52.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.52.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.52.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.52.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.52.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.52.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.52.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.53.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.53.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.53.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.53.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.53.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.53.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.53.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.53.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.53.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.53.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.53.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.53.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.53.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.53.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.53.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.54.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.54.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.54.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.54.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.54.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.54.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.54.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.54.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.54.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.54.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.54.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.54.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.54.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.54.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.54.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.55.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.55.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.55.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.55.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.55.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.55.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.55.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.55.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.55.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.55.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.55.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.55.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.55.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.55.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.55.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.56.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.56.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.56.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.56.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.56.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.56.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.56.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.56.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.56.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.56.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.56.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.56.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.56.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.56.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.56.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.57.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.57.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.57.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.57.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.57.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.57.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.57.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.57.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.57.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.57.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.57.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.57.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.57.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.57.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.57.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.58.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.58.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.58.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.58.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.58.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.58.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.58.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.58.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.58.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.58.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.58.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.58.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.58.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.58.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.58.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.59.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.59.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.59.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.59.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.59.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.59.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.59.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.59.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.59.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.59.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.59.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.59.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.59.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.59.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.59.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:blk.60.attn_norm.weight,      torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.60.exp_probs_b.bias,      torch.float32 --> F32, shape = {384}
INFO:hf-to-gguf:blk.60.ffn_gate_inp.weight,   torch.bfloat16 --> F32, shape = {7168, 384}
INFO:hf-to-gguf:blk.60.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {2048, 7168}
INFO:hf-to-gguf:blk.60.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.60.ffn_up_shexp.weight,   torch.bfloat16 --> BF16, shape = {7168, 2048}
INFO:hf-to-gguf:blk.60.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.60.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.60.attn_kv_a_mqa.weight,  torch.bfloat16 --> BF16, shape = {7168, 576}
INFO:hf-to-gguf:blk.60.attn_k_b.weight,       torch.bfloat16 --> BF16, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.60.attn_v_b.weight,       torch.bfloat16 --> BF16, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.60.attn_output.weight,    torch.bfloat16 --> BF16, shape = {8192, 7168}
INFO:hf-to-gguf:blk.60.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.60.attn_q_a.weight,       torch.bfloat16 --> BF16, shape = {7168, 1536}
INFO:hf-to-gguf:blk.60.attn_q_b.weight,       torch.bfloat16 --> BF16, shape = {1536, 12288}
INFO:hf-to-gguf:output.weight,                torch.bfloat16 --> BF16, shape = {7168, 163840}
INFO:hf-to-gguf:token_embd.weight,            torch.bfloat16 --> BF16, shape = {7168, 163840}
INFO:hf-to-gguf:output_norm.weight,           torch.bfloat16 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.1.ffn_down_exps.weight,   torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.1.ffn_gate_exps.weight,   torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.1.ffn_up_exps.weight,     torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.2.ffn_down_exps.weight,   torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.2.ffn_gate_exps.weight,   torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.2.ffn_up_exps.weight,     torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.3.ffn_down_exps.weight,   torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.3.ffn_gate_exps.weight,   torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.3.ffn_up_exps.weight,     torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.4.ffn_down_exps.weight,   torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.4.ffn_gate_exps.weight,   torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.4.ffn_up_exps.weight,     torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.5.ffn_down_exps.weight,   torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.5.ffn_gate_exps.weight,   torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.5.ffn_up_exps.weight,     torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.6.ffn_down_exps.weight,   torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.6.ffn_gate_exps.weight,   torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.6.ffn_up_exps.weight,     torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.7.ffn_down_exps.weight,   torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.7.ffn_gate_exps.weight,   torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.7.ffn_up_exps.weight,     torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.8.ffn_down_exps.weight,   torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.8.ffn_gate_exps.weight,   torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.8.ffn_up_exps.weight,     torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.9.ffn_down_exps.weight,   torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.9.ffn_gate_exps.weight,   torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.9.ffn_up_exps.weight,     torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.10.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.10.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.10.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.11.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.11.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.11.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.12.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.12.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.12.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.13.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.13.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.13.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.14.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.14.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.14.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.15.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.15.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.15.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.16.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.16.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.16.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.17.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.17.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.17.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.18.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.18.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.18.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.19.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.19.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.19.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.20.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.20.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.20.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.21.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.21.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.21.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.22.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.22.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.22.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.23.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.23.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.23.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.24.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.24.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.24.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.25.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.25.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.25.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.26.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.26.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.26.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.27.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.27.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.27.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.28.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.28.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.28.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.29.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.29.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.29.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.30.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.30.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.30.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.31.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.31.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.31.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.32.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.32.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.32.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.33.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.33.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.33.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.34.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.34.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.34.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.35.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.35.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.35.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.36.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.36.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.36.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.37.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.37.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.37.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.38.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.38.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.38.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.39.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.39.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.39.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.40.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.40.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.40.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.41.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.41.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.41.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.42.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.42.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.42.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.43.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.43.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.43.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.44.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.44.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.44.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.45.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.45.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.45.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.46.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.46.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.46.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.47.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.47.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.47.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.48.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.48.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.48.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.49.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.49.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.49.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.50.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.50.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.50.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.51.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.51.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.51.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.52.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.52.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.52.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.53.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.53.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.53.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.54.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.54.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.54.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.55.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.55.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.55.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.56.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.56.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.56.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.57.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.57.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.57.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.58.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.58.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.58.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.59.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.59.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.59.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.60.ffn_down_exps.weight,  torch.float32 --> BF16, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.60.ffn_gate_exps.weight,  torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.60.ffn_up_exps.weight,    torch.float32 --> BF16, shape = {7168, 2048, 384}
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 262144
INFO:hf-to-gguf:gguf: embedding length = 7168
INFO:hf-to-gguf:gguf: feed forward length = 18432
INFO:hf-to-gguf:gguf: head count = 64
INFO:hf-to-gguf:gguf: key-value head count = 1
INFO:hf-to-gguf:gguf: rope theta = 50000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: experts used count = 8
INFO:hf-to-gguf:gguf: expert groups count = 1
INFO:hf-to-gguf:gguf: expert groups used count = 1
INFO:hf-to-gguf:gguf: file type = 32
INFO:hf-to-gguf:Set model quantization version
INFO:hf-to-gguf:Set model tokenizer
The repository /mnt/data/models/moonshotai/Kimi-K2-Thinking contains custom code which must be executed to correctly load the model. You can inspect the repository content at /mnt/data/models/moonshotai/Kimi-K2-Thinking .
 You can inspect the repository content at https://hf.co//mnt/data/models/moonshotai/Kimi-K2-Thinking.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]
INFO:transformers_modules.Kimi_hyphen_K2_hyphen_Thinking.tokenization_kimi:Reloaded tiktoken model from /mnt/data/models/moonshotai/Kimi-K2-Thinking/tiktoken.model
INFO:transformers_modules.Kimi_hyphen_K2_hyphen_Thinking.tokenization_kimi:#words: 163842 - BOS ID: 163584 - EOS ID: 163585
INFO:transformers_modules.Kimi_hyphen_K2_hyphen_Thinking.tokenization_kimi:Reloaded tiktoken model from /mnt/data/models/moonshotai/Kimi-K2-Thinking/tiktoken.model
INFO:transformers_modules.Kimi_hyphen_K2_hyphen_Thinking.tokenization_kimi:#words: 163842 - BOS ID: 163584 - EOS ID: 163585
INFO:gguf.vocab:Setting special token type bos to 163584
INFO:gguf.vocab:Setting special token type eos to 163586
INFO:gguf.vocab:Setting special token type pad to 163839
INFO:gguf.vocab:Setting chat_template to {%- macro render_content(msg) -%}
    {%- set c = msg.get('content') -%}
    {%- if c is string -%}
      {{ c }}
    {%- elif c is not none -%}
      {% for content in c -%}
        {% if content['type'] == 'image' or 'image' in content or 'image_url' in content -%}
          <|media_start|>image<|media_content|><|media_pad|><|media_end|>
        {% else -%}
          {{ content['text'] }}
        {%- endif -%}
      {%- endfor -%}
    {%- endif -%}
{%- endmacro -%}

{% macro set_roles(message) -%}
  {%- set role_name =  message.get('name') or  message['role'] -%}
  {%- if message['role'] == 'user' -%}
    <|im_user|>{{role_name}}<|im_middle|>
  {%- elif message['role'] == 'assistant' -%}
    <|im_assistant|>{{role_name}}<|im_middle|>
  {%- else -%}
    <|im_system|>{{role_name}}<|im_middle|>
  {%- endif -%}
{%- endmacro -%}


{%- macro render_toolcalls(message) -%}
  <|tool_calls_section_begin|>
  {%- for tool_call in message['tool_calls'] -%}
    {%- set formatted_id = tool_call['id'] -%}
    <|tool_call_begin|>{{ formatted_id }}<|tool_call_argument_begin|>{% if tool_call['function']['arguments'] is string %}{{ tool_call['function']['arguments'] }}{% else %}{{ tool_call['function']['arguments'] | tojson }}{% endif %}<|tool_call_end|>
  {%- endfor -%}
  <|tool_calls_section_end|>
{%- endmacro -%}


{# Find last non-tool-call assistant message #}
{%- set ns = namespace(last_non_tool_call_assistant_msg=-1) -%}
{%- for idx in range(messages|length-1, -1, -1) -%}
    {%- if messages[idx]['role'] == 'assistant' and not messages[idx].get('tool_calls') -%}
        {%- set ns.last_non_tool_call_assistant_msg = idx -%}
        {%- break -%}
    {%- endif -%}
{%- endfor -%}

{# split all messages into history & suffix, reasoning_content in suffix should be preserved.#}
{%- set hist_msgs = messages[:ns.last_non_tool_call_assistant_msg+1] -%}
{%- set suffix_msgs = messages[ns.last_non_tool_call_assistant_msg+1:] -%}

{%- if tools -%}
  <|im_system|>tool_declare<|im_middle|>{{ tools | tojson(separators=(',', ':')) }}<|im_end|>
{%- endif -%}

{%- for message in hist_msgs -%}
  {%- if loop.first and messages[0]['role'] != 'system' -%}
  <|im_system|>system<|im_middle|>You are Kimi, an AI assistant created by Moonshot AI.<|im_end|>
  {%- endif -%}
  {{set_roles(message)}}
  {%- if message['role'] == 'assistant' -%}
    <think></think>{{render_content(message)}}
    {%- if message.get('tool_calls') -%}
      {{render_toolcalls(message)}}
    {%- endif -%}
  {%- elif message['role'] == 'tool' -%}
    {%- set tool_call_id = message.tool_call_id -%}
    ## Return of {{ tool_call_id }}
{{render_content(message)}}
  {%- elif message['content'] is not none -%}
    {{render_content(message)}}
  {%- endif -%}
  <|im_end|>
{%- endfor -%}

{%- for message in suffix_msgs -%}
  {{set_roles(message)}}
  {%- if message['role'] == 'assistant' -%}
    {%- set rc = message.get('reasoning_content', '') -%}
    <think>{{rc}}</think>{{render_content(message)}}
    {%- if message.get('tool_calls') -%}
     {{render_toolcalls(message)}}
    {%- endif -%}
  {%- elif message['role'] == 'tool' -%}
    {%- set tool_call_id = message.tool_call_id -%}
    ## Return of {{ tool_call_id }}
{{render_content(message)}}
  {%- elif message['content'] is not none -%}
    {{render_content(message)}}
  {%- endif -%}
  <|im_end|>
{%- endfor -%}


{%- if add_generation_prompt -%}
  <|im_assistant|>assistant<|im_middle|>
{%- endif -%}
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf: n_tensors = 918, total_size = 46.3G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00002-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00003-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00004-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00005-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00006-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00007-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00008-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00009-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00010-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00011-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00012-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00013-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00014-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00015-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00016-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00017-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00018-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00019-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00020-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00021-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00022-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00023-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00024-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00025-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00026-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00027-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00028-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00029-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00030-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00031-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00032-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00033-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00034-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00035-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00036-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00037-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00038-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00039-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00040-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00041-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00042-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00043-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00044-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00045-of-00046.gguf: n_tensors = 4, total_size = 45.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00046-of-00046.gguf: n_tensors = 2, total_size = 22.5G

Shard (0/46): 0.00byte [00:00, ?byte/s]

Writing:   1%|          | 21.4G/2.05T [00:30<45:06, 751Mbyte/s]
Shard (1/46):  51%|█████▏    | 23.8G/46.3G [00:33<00:32, 684Mbyte/s]

Writing:   1%|          | 23.8G/2.05T [00:33<49:27, 684Mbyte/s]
Traceback (most recent call last):
  File "/home/w/projects/llama.cpp/convert_hf_to_gguf.py", line 10314, in <module>
    main()
  File "/home/w/projects/llama.cpp/convert_hf_to_gguf.py", line 10308, in main
    model_instance.write()
  File "/home/w/projects/llama.cpp/convert_hf_to_gguf.py", line 634, in write
    self.gguf_writer.write_tensors_to_file(progress=True)
  File "/home/w/projects/llama.cpp/gguf-py/gguf/gguf_writer.py", line 456, in write_tensors_to_file
    ti.tensor.tofile(fout)
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 220, in tofile
    eager = LazyNumpyTensor.to_eager(self)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 179, in to_eager
    return cls._recurse_apply(t, simple_to_eager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 105, in _recurse_apply
    return fn(o)
           ^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 169, in simple_to_eager
    _t._args = cls._recurse_apply(_t._args, simple_to_eager)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 100, in _recurse_apply
    L.append(LazyBase._recurse_apply(item, fn))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 105, in _recurse_apply
    return fn(o)
           ^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 169, in simple_to_eager
    _t._args = cls._recurse_apply(_t._args, simple_to_eager)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 100, in _recurse_apply
    L.append(LazyBase._recurse_apply(item, fn))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 105, in _recurse_apply
    return fn(o)
           ^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 169, in simple_to_eager
    _t._args = cls._recurse_apply(_t._args, simple_to_eager)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 100, in _recurse_apply
    L.append(LazyBase._recurse_apply(item, fn))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 100, in _recurse_apply
    L.append(LazyBase._recurse_apply(item, fn))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 105, in _recurse_apply
    return fn(o)
           ^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 169, in simple_to_eager
    _t._args = cls._recurse_apply(_t._args, simple_to_eager)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 100, in _recurse_apply
    L.append(LazyBase._recurse_apply(item, fn))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 105, in _recurse_apply
    return fn(o)
           ^^^^^
  File "/home/w/projects/llama.cpp/gguf-py/gguf/lazy.py", line 170, in simple_to_eager
    _t._data = _t._func(*_t._args, **_t._kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/venv/lib/python3.12/site-packages/torch/_prims_common/wrappers.py", line 309, in _fn
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/venv/lib/python3.12/site-packages/torch/_compile.py", line 53, in inner
    return disable_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/venv/lib/python3.12/site-packages/torch/_prims_common/wrappers.py", line 149, in _fn
    result = fn(**bound.arguments)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/venv/lib/python3.12/site-packages/torch/_refs/__init__.py", line 1139, in _ref
    output = prim(a, b)
             ^^^^^^^^^^
  File "/home/w/projects/llama.cpp/venv/lib/python3.12/site-packages/torch/_refs/__init__.py", line 1746, in mul
    return prims.mul(a, b)
           ^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/venv/lib/python3.12/site-packages/torch/_ops.py", line 841, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/venv/lib/python3.12/site-packages/torch/_library/fake_impl.py", line 109, in meta_kernel
    return fake_impl_holder.kernel(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/venv/lib/python3.12/site-packages/torch/_library/utils.py", line 22, in __call__
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/venv/lib/python3.12/site-packages/torch/library.py", line 1430, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/venv/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 627, in fake_impl
    return self._abstract_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/w/projects/llama.cpp/venv/lib/python3.12/site-packages/torch/_prims/__init__.py", line 404, in _prim_elementwise_meta
    utils.check_same_device(*args_, allow_cpu_scalar_tensors=True)
  File "/home/w/projects/llama.cpp/venv/lib/python3.12/site-packages/torch/_prims_common/__init__.py", line 878, in check_same_device
    raise RuntimeError(msg)
RuntimeError: Tensor on device cpu is not on the expected device meta!

Shard (1/46):  51%|█████▏    | 23.8G/46.3G [00:36<00:34, 655Mbyte/s]

Writing:   1%|          | 23.8G/2.05T [00:36<51:36, 655Mbyte/s]

@ngxson

This comment was marked as outdated.

@ngxson
Copy link
Collaborator Author

ngxson commented Nov 6, 2025

Last commit should fix the error. I successfully converted the first layer of the model to GGUF.

@ubergarm
Copy link

ubergarm commented Nov 6, 2025

Huh, not sure how I got so far the first time. This time it ballooned RAM and the oom-killer got me, even running across both NUMA nodes for the full 1.5TB and going with q8_0 output instead of bf16...

kimi-k2-thinking-convert-fun-lmao-oomkiller

I don't need to pass anything to enable lazy conversion, pretty sure, right?

So it seems like it goes through all the non-routed experts first pretty quickly with lowish RAM, but then it slows down once it hits the routed experts and memory usage monotonically increases from that point:

👈 Partial Logs with comment
INFO:hf-to-gguf:blk.60.attn_kv_a_mqa.weight,  torch.bfloat16 --> Q8_0, shape = {7168, 576}
INFO:hf-to-gguf:blk.60.attn_k_b.weight,       torch.bfloat16 --> Q8_0, shape = {128, 512, 64}
INFO:hf-to-gguf:blk.60.attn_v_b.weight,       torch.bfloat16 --> Q8_0, shape = {512, 128, 64}
INFO:hf-to-gguf:blk.60.attn_output.weight,    torch.bfloat16 --> Q8_0, shape = {8192, 7168}
INFO:hf-to-gguf:blk.60.attn_q_a_norm.weight,  torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.60.attn_q_a.weight,       torch.bfloat16 --> Q8_0, shape = {7168, 1536}
INFO:hf-to-gguf:blk.60.attn_q_b.weight,       torch.bfloat16 --> Q8_0, shape = {1536, 12288}
INFO:hf-to-gguf:output.weight,                torch.bfloat16 --> Q8_0, shape = {7168, 163840}
INFO:hf-to-gguf:token_embd.weight,            torch.bfloat16 --> Q8_0, shape = {7168, 163840}
INFO:hf-to-gguf:output_norm.weight,           torch.bfloat16 --> F32, shape = {7168}
# runs smooth before here, but then it really slows down here and RAM usage keeps going up
INFO:hf-to-gguf:blk.1.ffn_down_exps.weight,   torch.float32 --> Q8_0, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.1.ffn_gate_exps.weight,   torch.float32 --> Q8_0, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.1.ffn_up_exps.weight,     torch.float32 --> Q8_0, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.2.ffn_down_exps.weight,   torch.float32 --> Q8_0, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.2.ffn_gate_exps.weight,   torch.float32 --> Q8_0, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.2.ffn_up_exps.weight,     torch.float32 --> Q8_0, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.3.ffn_down_exps.weight,   torch.float32 --> Q8_0, shape = {2048, 7168, 384}
INFO:hf-to-gguf:blk.3.ffn_gate_exps.weight,   torch.float32 --> Q8_0, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.3.ffn_up_exps.weight,     torch.float32 --> Q8_0, shape = {7168, 2048, 384}
INFO:hf-to-gguf:blk.4.ffn_down_exps.weight,   torch.float32 --> Q8_0, shape = {2048, 7168, 384}

@ngxson
Copy link
Collaborator Author

ngxson commented Nov 6, 2025

yes lazy should be enabled by default

I'm trying another way: directly mapping the quantization to Q4_0. The only disadvantage is that this will downcast the scales from bf16 to f16.

".scales",
)
]
elif quant_method == "compressed-tensors":
Copy link
Collaborator


Might want to check for quant_config["format"] == "pack-quantized" near here instead of in dequant_compressed_tensors, because the compressed-tensors method has multiple formats which could technically be supported eventually (notably, float-quantized seems relatively similar to (but not quite like) the fp8 method).

@csabakecskemeti
Copy link
Contributor

Q8: same as for @ubergarm, memory ballooned
Screenshot From 2025-11-06 15-15-25

else:
unpacked = unpacked.to(weight.device) # is this needed?
for i in range(pack_factor):
unpacked[:, i::pack_factor] = (weight >> (num_bits * i)) & mask
Copy link
Collaborator

@compilade compilade Nov 6, 2025


Lazy tensors don't handle __setitem__ correctly, I think (or it causes eager evaluation). That's because the function returns None and so the change tree can't really be updated with how it's currently implemented.

Prefer explicit concatenation instead if possible (like with torch.cat, torch.stack, etc.). (this should help with memory usage)

Alternatively, there are other ways to unpack without concatenation, like the broadcasting shifts done in gguf-py/gguf/quants.py.
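
For illustration, a minimal sketch of that broadcasting-shift idea (not the PR's actual code; it assumes the same 8-nibbles-per-int32 packing as the snippet above and returns the raw unsigned nibbles, leaving signs/zero-points and scales to the caller):

import torch

def unpack_int4_broadcast(packed: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    # packed: (rows, cols) int32, each element holding 32 // num_bits values
    pack_factor = 32 // num_bits
    mask = (1 << num_bits) - 1
    shifts = torch.arange(pack_factor, dtype=torch.int32) * num_bits   # [0, 4, ..., 28]
    # (rows, cols, 1) >> (pack_factor,) broadcasts to (rows, cols, pack_factor)
    nibbles = (packed.unsqueeze(-1) >> shifts) & mask
    # element i of packed column j lands at output column j * pack_factor + i,
    # matching the strided unpacked[:, i::pack_factor] assignment above
    return nibbles.reshape(packed.shape[0], -1)

Everything here is a pure functional op (shift, mask, reshape), so lazy tensors can keep deferring evaluation instead of materializing the whole expert tensor at once.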

Copy link
Collaborator Author


Hmm yeah, I need to go offline in the next few minutes. Feel free to push directly to this branch if you have any suggestions!

@ngxson
Copy link
Collaborator Author

ngxson commented Nov 6, 2025

Made a hack for repacking int4 to Q4_0, I pushed it in another branch: https://github.com/ngxson/llama.cpp/tree/xsn/convert_kimi_k2_quant_repack

IMPORTANT: This requires deleting the "quantization_config" section in config.json; you can also just rename it:

(screenshot)

@ubergarm
Copy link

ubergarm commented Nov 6, 2025

Running xsn/convert_kimi_k2_quant_repack now after editing the config.json as you mentioned. Seems to be going well! Memory usage is staying low, so I put it back on a single NUMA node.

The output splits seem to be missing the model name, which was the case on this PR branch too, pretty sure:

INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00001-of-00013.gguf: n_tensors = 99, total_size = 49.9G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00002-of-00013.gguf: n_tensors = 95, total_size = 49.2G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00003-of-00013.gguf: n_tensors = 90, total_size = 49.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00004-of-00013.gguf: n_tensors = 90, total_size = 49.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00005-of-00013.gguf: n_tensors = 90, total_size = 49.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00006-of-00013.gguf: n_tensors = 90, total_size = 49.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00007-of-00013.gguf: n_tensors = 90, total_size = 49.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00008-of-00013.gguf: n_tensors = 90, total_size = 49.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00009-of-00013.gguf: n_tensors = 90, total_size = 49.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00010-of-00013.gguf: n_tensors = 90, total_size = 49.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00011-of-00013.gguf: n_tensors = 90, total_size = 49.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00012-of-00013.gguf: n_tensors = 89, total_size = 49.1G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00013-of-00013.gguf: n_tensors = 3, total_size = 4.7G
Shard (3/13):  13%|█▎        | 6.34G/49.1G [00:08<00:55, 766Mbyte/s]
Writing:  18%|█▊        | 105G/595G [01:15<11:12, 727Mbyte/s]

Regarding casting bf16 -> f16 for the block scales, I added a quick print(scale) and ran it with --no-lazy; at a glance they seemed to be very small numbers, less than 1.0. I didn't check them all, nor add any checks to see if they exceed ±65k, which could possibly clip.

Have to go for now to play DND, will check later. If this finishes I'll try to generate an imatrix and see how the numbers look. Thanks for all the help!

@csabakecskemeti
Copy link
Contributor

csabakecskemeti commented Nov 6, 2025

I'm also running the Q4 hack...
Screenshot From 2025-11-06 15-50-51

Will report back once it's done

@ngxson
Copy link
Collaborator Author

ngxson commented Nov 6, 2025

btw @ubergarm I've just pushed a small fix to the repack branch: ngxson@505f8be

What I worry about is that the packed layout of compressed-tensors could be reversed relative to ggml's, but we never know until we actually run the model. If that's the case, we will need something like the transform_nibble_layout used for GPT-OSS.

A fun story: I wrote the code to repack GPT-OSS to GGML's MXFP4 just two days before its release. Repacking the nibble layout was a real pain.
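
For what it's worth, a toy sketch of what a per-byte nibble-order mismatch would look like (not the actual GPT-OSS transform_nibble_layout, just the simplest possible case):

import torch

# Hypothetical fix-up: if the source packs the first element in the high nibble and the second
# in the low nibble while the target expects the opposite, every packed byte needs its two
# nibbles swapped.
def swap_nibbles(packed_bytes: torch.Tensor) -> torch.Tensor:
    b = packed_bytes.to(torch.uint8)
    return ((b & 0x0F) << 4) | (b >> 4)

Whether anything like this is needed for Kimi-K2 depends on the actual compressed-tensors layout, which is exactly the open question above.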

@bartowski1182
Copy link
Contributor

I'm trying with your latest changes now

@ubergarm
Copy link

ubergarm commented Nov 7, 2025

Aye, it generates a roughly correct-sized output GGUF, but I got errors trying to start it up:

Edit: to be clear, I was using xsn/convert_kimi_k2_quant_repack@caf0e4230:

srv    load_model: loading model '/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00001-of-00013.gguf'
gguf_init_from_file_impl: tensor 'blk.1.ffn_gate_exps.weight' has offset 4165955584, expected 13678637056
gguf_init_from_file_impl: failed to read tensor data
llama_model_load: error loading model: llama_model_loader: failed to load model from /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00001-of-00013.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00001-of-00013.gguf', try reducing --n-gpu-layers if you're running out of VRAM
srv    load_model: failed to load model, '/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00001-of-00013.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

For funzies I tried to start it on ik's fork too, with errors there as well:

llama_model_load: error loading model: tensor 'blk.5.ffn_down_exps.weight' data is not within the file bounds, model is corrupted or incomplete
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x22B-BF16-00001-of-00013.gguf'
main : failed to init

Good run though! My impression is there's really only one main quant mix for this: q8_0 for attn/shexp/the first dense layer and q4_0 for all routed experts. Maybe one could shrink the non-routed experts a little bit, but historically they were best left at q8_0 imo.

So I hope DevQuasar and bartowski have better luck with the more recent PR! Gotta run tho 🫶

@csabakecskemeti
Copy link
Contributor

csabakecskemeti commented Nov 7, 2025

Similar to what was mentioned above (with the Q4 hack):

gguf_init_from_file_impl: tensor 'blk.1.ffn_gate_exps.weight' has offset 3699564544, expected 13212246016
gguf_init_from_file_impl: failed to read tensor data
llama_model_load: error loading model: llama_model_loader: failed to load model from /media/kecso/8t_nvme/moonshotai.Kimi-K2-Thinking-GGUF/Q4/moonshotai.Kimi-K2-Thinking.Q4_0-00001-of-00045.gguf
llama_model_load_from_file_impl: failed to load model
main: error: unable to load model

@bartowski1182
Copy link
Contributor

Conversion succeeded, but when loaded it doesn't give coherent responses, just endlessly repeats tokens

@csabakecskemeti
Copy link
Contributor

@bartowski1182 you haven't had the memory ballooning issue? Or do you just have enough memory?

@bartowski1182
Copy link
Contributor

I've got 768GB

@ngxson
Copy link
Collaborator Author

ngxson commented Nov 7, 2025

Closing this in favor of #17069

@ngxson ngxson closed this Nov 7, 2025
@ngxson
Copy link
Collaborator Author

ngxson commented Nov 7, 2025

Is there a way to know the block size of GGUF's K-quants? For GPTQ / AWQ etc., we can find it in config.json. If it's possible to do a block-size-32 quant in llama.cpp, maybe we can directly use the original INT4 weights and scales?

The block size table is in gguf-py/gguf/constants.py; the first column is the block size (element count) and the second column is the block size in bytes:

QK_K = 256
GGML_QUANT_SIZES: dict[GGMLQuantizationType, tuple[int, int]] = {
    GGMLQuantizationType.F32:     (1, 4),
    GGMLQuantizationType.F16:     (1, 2),
    GGMLQuantizationType.Q4_0:    (32, 2 + 16),
    GGMLQuantizationType.Q4_1:    (32, 2 + 2 + 16),
    GGMLQuantizationType.Q5_0:    (32, 2 + 4 + 16),
    GGMLQuantizationType.Q5_1:    (32, 2 + 2 + 4 + 16),
    GGMLQuantizationType.Q8_0:    (32, 2 + 32),
    GGMLQuantizationType.Q8_1:    (32, 4 + 4 + 32),
    GGMLQuantizationType.Q2_K:    (256, 2 + 2 + QK_K // 16 + QK_K // 4),
    GGMLQuantizationType.Q3_K:    (256, 2 + QK_K // 4 + QK_K // 8 + 12),
    GGMLQuantizationType.Q4_K:    (256, 2 + 2 + QK_K // 2 + 12),
    GGMLQuantizationType.Q5_K:    (256, 2 + 2 + QK_K // 2 + QK_K // 8 + 12),
    GGMLQuantizationType.Q6_K:    (256, 2 + QK_K // 2 + QK_K // 4 + QK_K // 16),
    GGMLQuantizationType.Q8_K:    (256, 4 + QK_K + QK_K // 8),
    GGMLQuantizationType.IQ2_XXS: (256, 2 + QK_K // 4),
    GGMLQuantizationType.IQ2_XS:  (256, 2 + QK_K // 4 + QK_K // 32),
    GGMLQuantizationType.IQ3_XXS: (256, 2 + QK_K // 4 + QK_K // 8),
    GGMLQuantizationType.IQ1_S:   (256, 2 + QK_K // 8 + QK_K // 16),
    GGMLQuantizationType.IQ4_NL:  (32, 2 + 16),
    GGMLQuantizationType.IQ3_S:   (256, 2 + QK_K // 4 + QK_K // 8 + QK_K // 32 + 4),
    GGMLQuantizationType.IQ2_S:   (256, 2 + QK_K // 4 + QK_K // 16),
    GGMLQuantizationType.IQ4_XS:  (256, 2 + 2 + QK_K // 2 + QK_K // 64),
    GGMLQuantizationType.I8:      (1, 1),
    GGMLQuantizationType.I16:     (1, 2),
    GGMLQuantizationType.I32:     (1, 4),
    GGMLQuantizationType.I64:     (1, 8),
    GGMLQuantizationType.F64:     (1, 8),
    GGMLQuantizationType.IQ1_M:   (256, QK_K // 8 + QK_K // 16  + QK_K // 32),
    GGMLQuantizationType.BF16:    (1, 2),
    GGMLQuantizationType.TQ1_0:   (256, 2 + 4 * 13),
    GGMLQuantizationType.TQ2_0:   (256, 2 + 64),
    GGMLQuantizationType.MXFP4:   (32, 1 + 16),
}
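
As a quick usage sketch (assuming gguf-py is importable), the table can be used to compute the on-disk size of a tensor for a given type:

from gguf.constants import GGML_QUANT_SIZES, GGMLQuantizationType

def quantized_nbytes(n_elements: int, qtype: GGMLQuantizationType) -> int:
    block_size, type_size = GGML_QUANT_SIZES[qtype]   # (elements per block, bytes per block)
    assert n_elements % block_size == 0
    return (n_elements // block_size) * type_size

# e.g. one routed-expert tensor of shape {7168, 2048, 384}
n = 7168 * 2048 * 384
print(quantized_nbytes(n, GGMLQuantizationType.Q4_0))  # 18 bytes per 32 elements
print(quantized_nbytes(n, GGMLQuantizationType.Q8_0))  # 34 bytes per 32 elements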

@jukofyork
Copy link
Collaborator

Closing this in favor of #17069

Does that PR directly convert the INT4 values to Q4_0 or is it doing a round trip to BF16 and then Q4_0?

@jukofyork
Copy link
Collaborator

jukofyork commented Nov 7, 2025

If it doesn't, then looking at the source, it looks like this is what eventually gets called when there is no imatrix:

// reference implementation for deterministic creation of model files
void quantize_row_q4_0_ref(const float * GGML_RESTRICT x, block_q4_0 * GGML_RESTRICT y, int64_t k) {
    static const int qk = QK4_0;

    assert(k % qk == 0);

    const int nb = k / qk;

    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max
        float max  = 0.0f;

        for (int j = 0; j < qk; j++) {
            const float v = x[i*qk + j];
            if (amax < fabsf(v)) {
                amax = fabsf(v);
                max  = v;
            }
        }

        const float d  = max / -8;
        const float id = d ? 1.0f/d : 0.0f;

        y[i].d = GGML_FP32_TO_FP16(d);

        for (int j = 0; j < qk/2; ++j) {
            const float x0 = x[i*qk + 0    + j]*id;
            const float x1 = x[i*qk + qk/2 + j]*id;

            const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));
            const uint8_t xi1 = MIN(15, (int8_t)(x1 + 8.5f));

            y[i].qs[j]  = xi0;
            y[i].qs[j] |= xi1 << 4;
        }
    }
}

Can we prove this is going to convert back to the original values when we do a round trip via BF16, for example:

  • Imagine we have a block of 32 INT4 values that don't contain both 0b0000 and 0b1111.

What will happen in this case? I have a feeling the n=16 lattice will get superimposed on top of an n<16 lattice and not work for all non-power-of-2 n lattices?

@csabakecskemeti
Copy link
Contributor

csabakecskemeti commented Nov 7, 2025

I think I've made it work an alternative way.
I've built a conversion utility inspired by the Deepseek V3 dequantizer:
int4-to-bf16

Both Q3 and Q2 GGUFs seem to be working:
kimi-think-proof

Experimental quants uploading (please allow some more time for the upload) here:
DevQuasar/moonshotai.Kimi-K2-Thinking-GGUF

Feel free to test the quants and the converter

@jukofyork
Copy link
Collaborator

jukofyork commented Nov 7, 2025

I think I've made it work an alternative way. I've built a conversion utility inspired by the Deepseek V3 dequantizer: int4-to-bf16

Both Q3 and Q2 GGUFs seem to be working: Screenshot From 2025-11-07 06-48-30

Experimental quants uploading (please allow some more time for the upload) here: DevQuasar/moonshotai.Kimi-K2-Thinking-GGUF

Feel free to test the quants and the converter

This will work, but not losslessly, for the same reason as the other PR.

The fundamental problem here is that the QAT-trained blocks of 32 nibbles might not take up the full range of values. If the original block has a range of 0b0001 to 0b1111 then the Q4_0 code will try to create a lattice of 16 values from 0b0000 to 0b1111.

It's easier to see if you look at 2bit version, eg:

If the original quant only had the half-nibbles 0b00 (0), 0b01 (1) and 0b10 (2), then quantize_row_q4_0_ref can't recover this and you will end up with this sort of thing:

OLD: 0        1        2
NEW: 0     1     2     3
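
To make this concrete, a small numpy sketch (made-up values) that mirrors the quantize_row_q4_0_ref math on a block whose values sit on a symmetric ±7 lattice, i.e. one that never uses the full -8..+7 range:

import numpy as np

s = 0.01                                            # made-up QAT block scale
codes = np.array([-7, -3, 0, 3, 7, 5, -2, 1])       # int4 codes that never use -8
x = codes * s                                       # the dequantized values

# same rule as quantize_row_q4_0_ref: assumes the block spans the full -8..+7 range
max_signed = x[np.argmax(np.abs(x))]
d = max_signed / -8
q = np.minimum(15, (x / d + 8.5).astype(np.int64))  # stored nibbles 0..15
roundtrip = (q - 8) * d

print(np.c_[x, roundtrip])  # e.g. -0.03 comes back as -0.02625: off the original lattice

Only the entry that set the scale and the zeros survive exactly; every other lattice point gets nudged, which is the superimposed-lattice problem described above.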

@ngxson
Copy link
Collaborator Author

ngxson commented Nov 7, 2025

The fundamental problem here is that the QAT-trained blocks of 32 nibbles might not take up the full range of values

Is there any reason it cannot take the full range? IIUC from the original compressed-tensors dequant code, it should take up the full int4 range.

@ngxson
Copy link
Collaborator Author

ngxson commented Nov 7, 2025

Hmm, never mind, I think I understand what you're saying now. Did you mean that it's possible the training code prevents using 0b0000 (and not the quant/dequant code)? If that's the case then yes, it's possible that an int4 value can jump to another value on q4_0, even if q4_0 somehow supported BF16 scales.

@jukofyork
Copy link
Collaborator

Hmm, never mind, I think I understand what you're saying now. Did you mean that it's possible the training code prevents using 0b0000 (and not the quant/dequant code)? If that's the case then yes, it's possible that an int4 value can jump to another value on q4_0, even if q4_0 somehow supported BF16 scales.

Yeah, if we were just to use something like this to first quantise:

// reference implementation for deterministic creation of model files
void quantize_row_q4_0_ref(const float * GGML_RESTRICT x, block_q4_0 * GGML_RESTRICT y, int64_t k) {
    static const int qk = QK4_0;

    assert(k % qk == 0);

    const int nb = k / qk;

    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max
        float max  = 0.0f;

        for (int j = 0; j < qk; j++) {
            const float v = x[i*qk + j];
            if (amax < fabsf(v)) {
                amax = fabsf(v);
                max  = v;
            }
        }

        const float d  = max / -8;
        const float id = d ? 1.0f/d : 0.0f;

        y[i].d = GGML_FP32_TO_FP16(d);

        for (int j = 0; j < qk/2; ++j) {
            const float x0 = x[i*qk + 0    + j]*id;
            const float x1 = x[i*qk + qk/2 + j]*id;

            const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));
            const uint8_t xi1 = MIN(15, (int8_t)(x1 + 8.5f));

            y[i].qs[j]  = xi0;
            y[i].qs[j] |= xi1 << 4;
        }
    }
}

then turn these back into BF16 or F32 and rerun the same code, we wouldn't lose anything, as this line:

        const float d  = max / -8;

is implicitly assuming that there will be a lower value of 0b0000 and an upper value of 0b1111.


But because the weights were QAT-trained, it's quite likely that not every block of 32 nibbles will necessarily maintain a lower value of 0b0000 and an upper value of 0b1111.

It's not something you can change after the QAT training (which likely used some form of regularisation term on the intervals and/or stochastic rounding), so the only way to maintain the full range for all blocks would be to adjust it during training (which would probably be really hard/awkward to do for "Adam-like" optimisers with the extra "memory" parameters, as you would have to keep adjusting these too).

If it can be shown that somehow they have done this, and, like the output of quantize_row_q4_0_ref, it is guaranteed that for every 32-element block the lowest nibble will be 0b0000 and the largest nibble will be 0b1111, then it wouldn't be a problem and quantize_row_q4_0_ref could (almost) losslessly recover an equivalent set of values (assuming the BF16 --> F16 conversion of the scales doesn't overflow, which seems unlikely).

The maximum relative error from converting from BF16 --> F16 will be something really tiny and is related to the way they represent sub-normals (if it wasn't for this then it would be essentially lossless as my bit shifting example erroneously showed in the other thread).
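
A quick way to see that sub-normal effect (made-up magnitudes; normal-range scales cast exactly, while the very small ones land in f16's sub-normal range and pick up relative error):

import torch

for v in (8.75e-3, 1.0e-6, 1.0e-7):                        # made-up scale magnitudes
    x  = torch.tensor(v, dtype=torch.bfloat16)
    rt = x.to(torch.float16).to(torch.float32)             # bf16 -> f16 round trip
    rel = (rt - x.to(torch.float32)).abs() / x.to(torch.float32).abs()
    print(f"{float(x):.3e}  rel_err={float(rel):.2e}")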

@jukofyork
Copy link
Collaborator

If it can be shown that somehow they have done this, and, like the output of quantize_row_q4_0_ref, it is guaranteed that for every 32-element block the lowest nibble will be 0b0000 and the largest nibble will be 0b1111, then it wouldn't be a problem and quantize_row_q4_0_ref could (almost) losslessly recover an equivalent set of values (assuming the BF16 --> F16 conversion of the scales doesn't overflow, which seems unlikely).

It's definitely worth testing whether this is the case, as it would save a lot of hassle if so! I'm about another day away from getting the model at 4MB/s, sadly 😦

@jukofyork
Copy link
Collaborator

jukofyork commented Nov 12, 2025

I've found it is using a symmetric quant (assuming #17069 is working correctly):

Block 0: best_error=0.000039, best_lattice_offset=1
  Errors for each element:
    [0] original=0.008911, dequant=0.008926, error=-0.000015
    [1] original=-0.035645, dequant=-0.035706, error=0.000061
    [2] original=0.035645, dequant=0.035706, error=-0.000061
    [3] original=0.008911, dequant=0.008926, error=-0.000015
    [4] original=-0.017822, dequant=-0.017853, error=0.000031
    [5] original=0.000000, dequant=0.000000, error=0.000000
    [6] original=-0.026733, dequant=-0.026779, error=0.000046
    [7] original=0.008911, dequant=0.008926, error=-0.000015
    [8] original=0.008911, dequant=0.008926, error=-0.000015
    [9] original=0.026733, dequant=0.026779, error=-0.000046
    [10] original=-0.017822, dequant=-0.017853, error=0.000031
    [11] original=-0.035645, dequant=-0.035706, error=0.000061
    [12] original=-0.017822, dequant=-0.017853, error=0.000031
    [13] original=0.026733, dequant=0.026779, error=-0.000046
    [14] original=0.035645, dequant=0.035706, error=-0.000061
    [15] original=0.000000, dequant=0.000000, error=0.000000
    [16] original=0.000000, dequant=0.000000, error=0.000000
    [17] original=-0.044434, dequant=-0.044632, error=0.000198
    [18] original=0.017822, dequant=0.017853, error=-0.000031
    [19] original=0.008911, dequant=0.008926, error=-0.000015
    [20] original=-0.026733, dequant=-0.026779, error=0.000046
    [21] original=0.017822, dequant=0.017853, error=-0.000031
    [22] original=-0.026733, dequant=-0.026779, error=0.000046
    [23] original=0.008911, dequant=0.008926, error=-0.000015
    [24] original=-0.062500, dequant=-0.062485, error=-0.000015
    [25] original=0.053467, dequant=0.053558, error=-0.000092
    [26] original=-0.017822, dequant=-0.017853, error=0.000031
    [27] original=0.053467, dequant=0.053558, error=-0.000092
    [28] original=0.008911, dequant=0.008926, error=-0.000015
    [29] original=-0.008911, dequant=-0.008926, error=0.000015
    [30] original=-0.017822, dequant=-0.017853, error=0.000031
    [31] original=-0.017822, dequant=-0.017853, error=0.000031

as can be seen from the mirrored positive and negative original values...

What seems (very) strange is that there are also zero values here though, so it almost looks like it is a symmetric quant with two 4-bit values that map to zero (which seems very odd IMO).

If this is the case, then there is a direct/lossless conversion to Q4_0 (an asymmetric quant) by temporarily changing the const float d = max / -8 line to const float d = max / -7 here:

const float d = max / -8;

void quantize_row_q4_0_ref(const float * GGML_RESTRICT x, block_q4_0 * GGML_RESTRICT y, int64_t k) {
    static const int qk = QK4_0;

    assert(k % qk == 0);

    const int nb = k / qk;

    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max
        float max  = 0.0f;

        for (int j = 0; j < qk; j++) {
            const float v = x[i*qk + j];
            if (amax < fabsf(v)) {
                amax = fabsf(v);
                max  = v;
            }
        }

        const float d  = max / -8;
        const float id = d ? 1.0f/d : 0.0f;

        y[i].d = GGML_FP32_TO_FP16(d);

        for (int j = 0; j < qk/2; ++j) {
            const float x0 = x[i*qk + 0    + j]*id;
            const float x1 = x[i*qk + qk/2 + j]*id;

            const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));
            const uint8_t xi1 = MIN(15, (int8_t)(x1 + 8.5f));

            y[i].qs[j]  = xi0;
            y[i].qs[j] |= xi1 << 4;
        }
    }
}

This seems to give the lowest error for all blocks I have tested so far (equivalent to lattice_offset=1 here):

- Block 0: error=0.001919, lattice_offset=0
- Block 0: error=0.000039, lattice_offset=1
- Block 0: error=0.002557, lattice_offset=2
- Block 0: error=0.003122, lattice_offset=3
- Block 0: error=0.003971, lattice_offset=4
- Block 0: error=0.005304, lattice_offset=5
- Block 0: error=0.007816, lattice_offset=6
- Block 0: error=0.015347, lattice_offset=7
- Block 1: error=0.001915, lattice_offset=0
- Block 1: error=0.000032, lattice_offset=1
- Block 1: error=0.002548, lattice_offset=2
- Block 1: error=0.003613, lattice_offset=3
- Block 1: error=0.003387, lattice_offset=4
- Block 1: error=0.004547, lattice_offset=5
- Block 1: error=0.009003, lattice_offset=6
- Block 1: error=0.015305, lattice_offset=7
- Block 2: error=0.001913, lattice_offset=0
- Block 2: error=0.000029, lattice_offset=1
- Block 2: error=0.002567, lattice_offset=2
- Block 2: error=0.003370, lattice_offset=3
- Block 2: error=0.003838, lattice_offset=4
- Block 2: error=0.005123, lattice_offset=5
- Block 2: error=0.008495, lattice_offset=6
- Block 2: error=0.015354, lattice_offset=7
- Block 3: error=0.002235, lattice_offset=0
- Block 3: error=0.000034, lattice_offset=1
- Block 3: error=0.002647, lattice_offset=2
- Block 3: error=0.002747, lattice_offset=3
- Block 3: error=0.003204, lattice_offset=4
- Block 3: error=0.004272, lattice_offset=5
- Block 3: error=0.006775, lattice_offset=6
- Block 3: error=0.015968, lattice_offset=7
- Block 4: error=0.001559, lattice_offset=0
- Block 4: error=0.000022, lattice_offset=1
- Block 4: error=0.002072, lattice_offset=2
- Block 4: error=0.002959, lattice_offset=3
- Block 4: error=0.003492, lattice_offset=4
- Block 4: error=0.004657, lattice_offset=5
- Block 4: error=0.007351, lattice_offset=6
- Block 4: error=0.012447, lattice_offset=7
- Block 5: error=0.001816, lattice_offset=0
- Block 5: error=0.000021, lattice_offset=1
- Block 5: error=0.002457, lattice_offset=2
- Block 5: error=0.002632, lattice_offset=3
- Block 5: error=0.003487, lattice_offset=4
- Block 5: error=0.004653, lattice_offset=5
- Block 5: error=0.006645, lattice_offset=6
- Block 5: error=0.014633, lattice_offset=7
- Block 6: error=0.002048, lattice_offset=0
- Block 6: error=0.000058, lattice_offset=1
- Block 6: error=0.002653, lattice_offset=2
- Block 6: error=0.003850, lattice_offset=3
- Block 6: error=0.005074, lattice_offset=4
- Block 6: error=0.006771, lattice_offset=5
- Block 6: error=0.009529, lattice_offset=6
- Block 6: error=0.016159, lattice_offset=7
- Block 7: error=0.002375, lattice_offset=0
- Block 7: error=0.000022, lattice_offset=1
- Block 7: error=0.003150, lattice_offset=2
- Block 7: error=0.004253, lattice_offset=3
- Block 7: error=0.005926, lattice_offset=4
- Block 7: error=0.007886, lattice_offset=5
- Block 7: error=0.010660, lattice_offset=6
- Block 7: error=0.018946, lattice_offset=7
- Block 8: error=0.001961, lattice_offset=0
- Block 8: error=0.000026, lattice_offset=1
- Block 8: error=0.002649, lattice_offset=2
- Block 8: error=0.002979, lattice_offset=3
- Block 8: error=0.003342, lattice_offset=4
- Block 8: error=0.004447, lattice_offset=5
- Block 8: error=0.007511, lattice_offset=6
- Block 8: error=0.015789, lattice_offset=7
- Block 9: error=0.001776, lattice_offset=0
- Block 9: error=0.000027, lattice_offset=1
- Block 9: error=0.002402, lattice_offset=2
- Block 9: error=0.003095, lattice_offset=3
- Block 9: error=0.004446, lattice_offset=4
- Block 9: error=0.005952, lattice_offset=5
- Block 9: error=0.007822, lattice_offset=6
- Block 9: error=0.014307, lattice_offset=7
- Block 10: error=0.001726, lattice_offset=0
- Block 10: error=0.000042, lattice_offset=1
- Block 10: error=0.002293, lattice_offset=2
- Block 10: error=0.002997, lattice_offset=3
- Block 10: error=0.003345, lattice_offset=4
- Block 10: error=0.004528, lattice_offset=5
- Block 10: error=0.007435, lattice_offset=6
- Block 10: error=0.013783, lattice_offset=7

and looking at this, I suspect this is quite close to lossless (and also seems to confirm the original quant might have used two zeros).

The epsilon for bfloat16 is 0.00781250 and these errors all look to be a couple of orders of magnitude less than this when lattice_offset=1 (which is the same as hacking the original code to use const float d = max / -7).
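
As a sanity check on why the -7 scale (lattice_offset=1) can be near-lossless here, a tiny numpy sketch with made-up values on a symmetric ±7 lattice:

import numpy as np

s = 0.008926                                        # made-up per-block scale
codes = np.array([1, -4, 4, 1, -2, 0, -3, 7, -7])   # symmetric int4 codes, |code| <= 7
x = codes * s

max_signed = x[np.argmax(np.abs(x))]
d = max_signed / -7                                 # the -8 -> -7 hack
q = np.minimum(15, (x / d + 8.5).astype(np.int64))  # same packing rule as before
roundtrip = (q - 8) * d

print(np.max(np.abs(roundtrip - x)))                # ~0; the real code also rounds d to f16

With d equal (up to sign) to the block's own step, every code maps straight back onto itself, so the only remaining error is the BF16/F16 handling of the scale discussed above.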


For anyone interested (or who wants to double check this!), here is my full hacked code I used:

details
// reference implementation for deterministic creation of model files
/*
void quantize_row_q4_0_ref(const float * GGML_RESTRICT x, block_q4_0 * GGML_RESTRICT y, int64_t k) {
  static const int qk = QK4_0;

  assert(k % qk == 0);

  const int nb = k / qk;

  for (int i = 0; i < nb; i++) {
      float amax = 0.0f; // absolute max
      float max  = 0.0f;

      for (int j = 0; j < qk; j++) {
          const float v = x[i*qk + j];
          if (amax < fabsf(v)) {
              amax = fabsf(v);
              max  = v;
          }
      }

      const float d  = max / -8;
      const float id = d ? 1.0f/d : 0.0f;

      y[i].d = GGML_FP32_TO_FP16(d);

      for (int j = 0; j < qk/2; ++j) {
          const float x0 = x[i*qk + 0    + j]*id;
          const float x1 = x[i*qk + qk/2 + j]*id;

          const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));
          const uint8_t xi1 = MIN(15, (int8_t)(x1 + 8.5f));

          y[i].qs[j]  = xi0;
          y[i].qs[j] |= xi1 << 4;
      }
  }
}
*/

static void quantize_q4_0_block(const float * GGML_RESTRICT x, int i, int qk, float d, block_q4_0* out) {
  const float id = d ? 1.0f/d : 0.0f;

  out->d = GGML_FP32_TO_FP16(d);

  for (int j = 0; j < qk/2; ++j) {
      const float x0 = x[i*qk + 0    + j]*id;
      const float x1 = x[i*qk + qk/2 + j]*id;

      const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));
      const uint8_t xi1 = MIN(15, (int8_t)(x1 + 8.5f));

      out->qs[j]  = xi0;
      out->qs[j] |= xi1 << 4;
  }
}

static void dequantize_q4_0_block(const block_q4_0* block, float* dequant, int qk) {
  dequantize_row_q4_0(block, dequant, qk);
}

static float measure_q4_0_error(const float * GGML_RESTRICT x, int i, int qk, const block_q4_0* block) {
  float dequant[QK4_0];
  dequantize_q4_0_block(block, dequant, qk);

  float error = 0.0f;
  for (int j = 0; j < qk; j++) {
      error += fabsf(x[i*qk + j] - dequant[j]);
  }
  return error/qk;
}

static void print_q4_0_block_errors(const float * GGML_RESTRICT x, int i, int qk, const block_q4_0* block) {
  float dequant[QK4_0];
  dequantize_q4_0_block(block, dequant, qk);

  printf("  Errors for each element:\n");
  for (int j = 0; j < qk; j++) {
      float error = x[i*qk + j] - dequant[j];
      printf("    [%d] original=%.6f, dequant=%.6f, error=%.6f\n", j, x[i*qk + j], dequant[j], error);
  }
}

void quantize_row_q4_0_ref(const float * GGML_RESTRICT x, block_q4_0 * GGML_RESTRICT y, int64_t k) {
  static const int qk = QK4_0;

  assert(k % qk == 0);

  const int nb = k / qk;

  for (int i = 0; i < nb; i++) {
      float amax = 0.0f; // absolute max
      float max  = 0.0f;

      for (int j = 0; j < qk; j++) {
          const float v = x[i*qk + j];
          if (amax < fabsf(v)) {
              amax = fabsf(v);
              max  = v;
          }
      }

      float best_error = FLT_MAX;
      int best_lattice_offset = 0;
      block_q4_0 best_block;

      for (int lattice_offset = 0; lattice_offset <= 7; lattice_offset++) {
          block_q4_0 temp_block;
          const float d = max / -(8 - lattice_offset);
          quantize_q4_0_block(x, i, qk, d, &temp_block);

          float error = measure_q4_0_error(x, i, qk, &temp_block);
          //printf("- Block %d: error=%.6f, lattice_offset=%d\n", i, error, lattice_offset);

          if (error < best_error) {
              best_error = error;
              best_lattice_offset = lattice_offset;
              best_block = temp_block;
          }
      }

      if (best_error > 0.000001 && best_lattice_offset != 1) {
          printf("Block %d: best_error=%.6f, best_lattice_offset=%d\n", i, best_error, best_lattice_offset);
          print_q4_0_block_errors(x, i, qk, &best_block);
      }

      y[i] = best_block;
  }
}

I will run this using the if (best_error > 0.000001 && best_lattice_offset != 1) clause to check if all blocks have the optimum best_lattice_offset=1, which would mean we can just hack that single -8 to -7 for this, rather than having to run all this other code for every block... I will report back in a couple of hours when this is done.

@jukofyork
Copy link
Collaborator

Yeah, it seems that with minmax and symmetric:

  "quantization_config": {
    "config_groups": {
      "group_0": {
        "input_activations": null,
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 32,
          "num_bits": 4,
          "observer": "minmax",
          "observer_kwargs": {},
          "strategy": "group",
          "symmetric": true,
          "type": "int"
        }
      }
    },

https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main/config.json

It really does create two zero bits:

    # 1. Generate scale and zero-point
    if quantization_args.symmetric:
        max_val_pos = torch.max(torch.abs(min_vals), torch.abs(max_vals))
        scales = max_val_pos / (float(bit_range) / 2)
        zero_points = torch.zeros(scales.shape, device=device, dtype=min_vals.dtype)

https://github.com/vllm-project/compressed-tensors/blob/2763f81524d1b4840269174d138ecbc1f6bd2f38/src/compressed_tensors/quantization/utils/helpers.py#L60

So this is remarkably lucky for the Q4_0 conversion.

@jukofyork
Copy link
Collaborator

jukofyork commented Nov 12, 2025

I am a little suspicious of this now though, as I'm about halfway through the tensors for kimi-k2-thinking and have yet to find a single case where the lattice {-7, ..., +7} doesn't fit... If this was really trained using QAT then I think it would be highly unlikely for this to be the case, and at least one block out of all these thousands would be {-6, ..., +6} or lower!

This looks suspiciously like they have taken their QAT floating point values and passed them into that calculate_qparams code (which uses very similar logic to llama.cpp's Q4_0 code via the absmax) rather than trained using these values...

If this is the case then it brings up a couple of worrying points:

  • All the lattices that really were {-6, ..., +6}, {-5, ..., +6}, etc. after QAT training are going to be misaligned with this conversion.
  • Did they really use two zero-bits for the QAT training, and has this conversion via vllm's compressed-tensors code thus actually misaligned the lattices that were {-7, ..., +8} (or {-8, ..., +7}) during the actual training run!?

@jukofyork
Copy link
Collaborator

Hopefully they will reply:

https://huggingface.co/moonshotai/Kimi-K2-Thinking/discussions/26

There could be some good reason for using two zeros, to do with the gradients during QAT, but it still seems a little odd and unexpected to me...

@jukofyork
Copy link
Collaborator

  1. The initial Kimi-K2-Thinking-BF16.gguf was decompressed using convert : handle compressed-tensors quant method #17069.

  2. The single hack of changing -8 to -7 on this line was applied before recompiling:

const float d = max / -8;

  3. Then quantised using:
~/llama.cpp/build/bin/llama-quantize \
	--tensor-type attn_kv_a_mqa=q8_0 \
	--tensor-type attn_k_b=q8_0 \
	--tensor-type attn_v_b=q8_0 \
	--tensor-type _exps=q4_0 \
	Kimi-K2-Thinking-BF16.gguf Kimi-K2-Thinking-Q4_X.gguf Q6_K 44
  4. Tested using https://github.com/amikha33/Wiki-Text/blob/master/wiki.test.raw:
~/llama.cpp/build/bin/llama-perplexity \
    --model ./Kimi-K2-Thinking-Q4_X.gguf \
    --n-gpu-layers 99 \
    --numa distribute \
    --threads "$(nproc)" \
    --override-tensor exps=CPU \
    --flash-attn 1 \
    --no-op-offload \
    --file ./wiki.test.raw
system_info: n_threads = 80 (n_threads_batch = 80) / 80 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 617.233 ms
perplexity: calculating perplexity over 568 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 99.25 seconds per pass - ETA 3 hours 54.88 minutes
[1]1.5514,[2]2.2267,[3]1.8935,[4]1.6674,[5]1.5277,[6]1.4402,[7]1.3942,[8]1.3427,

(I'll leave it running and update this post later...)

@jukofyork
Copy link
Collaborator

It might be worth somebody like @compilade, who understands more about the inner workings of the quants, seeing if a similar hack can be applied to quantize_row_q4_K_ref and/or make_qkx2_quants, as on my setup I'm only getting about 20 tokens/s PP for Q4_0 vs 32-33 tokens/s for an equivalent model with Q4_K.

@jukofyork
Copy link
Collaborator

Final estimate: PPL = 2.0841 +/- 0.00907

@jukofyork
Copy link
Collaborator

It seems we can do slightly better by using an iterative algorithm to refine the scale slightly:

static void dequantize_q4_0_block(const block_q4_0* block, float* dequant, int qk) {
    dequantize_row_q4_0(block, dequant, qk);
}

static float measure_q4_0_error(const float * GGML_RESTRICT x, int i, int qk, const block_q4_0* block) {
    float dequant[QK4_0];
    dequantize_q4_0_block(block, dequant, qk);

    float error = 0.0f;
    for (int j = 0; j < qk; j++) {
        error += fabsf(x[i*qk + j] - dequant[j]);
    }
    return error/qk;
}

static void print_q4_0_block_errors(const float * GGML_RESTRICT x, int i, int qk, const block_q4_0* block) {
    float dequant[QK4_0];
    dequantize_q4_0_block(block, dequant, qk);

    printf("  Errors for each element:\n");
    for (int j = 0; j < qk; j++) {
        float error = x[i*qk + j] - dequant[j];
        printf("    [%d] original=%.6f, dequant=%.6f, error=%.6f\n", j, x[i*qk + j], dequant[j], error);
    }
}

void quantize_row_q4_0_ref(const float * GGML_RESTRICT x, block_q4_0 * GGML_RESTRICT y, int64_t k) {
    static const int qk = QK4_0;
    assert(k % qk == 0);

    const int nb = k / qk;
    const int max_iter = 10;      // note: seems to mostly converge after 1 iteration...
    const float epsilon = 1e-6f;  // note: float16_epsilon = 0.00097656, 0.00097656^2 = ~1e-6

    for (int i = 0; i < nb; i++) {
        // Find max absolute value for initialization
        float amax = 0.0f;
        for (int j = 0; j < qk; j++) {
            const float v = fabsf(x[i*qk + j]);
            if (v > amax) {
                amax = v;
            }
        }

        // Initialize scale for range -7 to +7 (15 quantization levels)
        float s = amax / 7.0f;
        if (s == 0.0f) {
            s = 1.0f;
        }

        int8_t q[QK4_0];
        
        for (int iter = 0; iter < max_iter; iter++) {
            const float s_old = s;
            
            // Step 1: Assignment - quantize each element to range -7 to +7
            for (int j = 0; j < qk; j++) {
                const float q_float = x[i*qk + j] / s;
                int8_t q_int = (int8_t)roundf(q_float);
                // Clip to range -7 to +7 (kimi-k2-thinking constraint)
                if (q_int < -7) q_int = -7;
                if (q_int > 7) q_int = 7;
                q[j] = q_int;
            }
            
            // Step 2: Update scale
            float numerator = 0.0f;
            float denominator = 0.0f;
            for (int j = 0; j < qk; j++) {
                numerator += x[i*qk + j] * q[j];
                denominator += (float)(q[j] * q[j]);
            }
            
            if (denominator > 0.0f) {
                const float s_new = numerator / denominator;
                if (s_new > 0.0f) {
                    s = s_new;
                }
            }
            
            // Check convergence
            float delta_s = fabsf(s - s_old);
            //printf("- Iter %i: delta_s=%.6f\n", iter, delta_s);
            if (delta_s < epsilon) {
                break;
            }
        }
        
        // Store the scale
        y[i].d = GGML_FP32_TO_FP16(s);
        
        // Pack quantized values: map -7..+7 to stored values 1..15
        // (stored value 0 is unused due to kimi-k2-thinking's ±7 constraint)
        for (int j = 0; j < qk/2; j++) {
            const uint8_t q0 = (uint8_t)(q[j] + 8);
            const uint8_t q1 = (uint8_t)(q[qk/2 + j] + 8);
            y[i].qs[j] = q0 | (q1 << 4);
        }
        
        float error = measure_q4_0_error(x, i, qk, &y[i]);
        if (error > 1e-3f) {  // note: float16_epsilon = 0.00097656 = ~1e-3
            printf("- Block %d: error=%.6f\n", i, error);
            print_q4_0_block_errors(x, i, qk, &y[i]);
        }
    }
}

I think this is called the "Lloyd-Max" algorithm, although I'm not sure, as that seems to be for general split points.

It's a fair bit slower, so it will be tomorrow by the time I have the results for this.

@ubergarm
Copy link

Final estimate: PPL = 2.0841 +/- 0.00907

This seems really good to me and very close to the "full Q8_0" that I measured while making: https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF

It is better than the unpatched approach I started with:

original safetensors -> bf16 w/ PR#17069 (same as you did) -> Q8_0-Q4_0 ~544GB model:

Final estimate: PPL = 2.1257 +/- 0.00934

My perplexity command on ik's fork matches yours, including checking the sha1sum of the wiki.test.raw corpus, and we both use the default 512 context as is the convention.

$ wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ du -h wiki.test.raw
1.3M    wiki.test.raw
$ sha1sum  wiki.test.raw
6f1fe2054a940eebfc76b284b09680763b37f5ea  wiki.test.raw

$ numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-perplexity \
    -m "$model" \
    -f wiki.test.raw \
    -mla 3 \
    --ctx-size 512 \
    -ub 4096 -b 4096 \
    --numa numactl \
    --threads 96 \
    --threads-batch 128 \
    --no-mmap

So your patch seems promising for unlocking more of the QAT quality potential.

@jukofyork
Copy link
Collaborator

jukofyork commented Nov 12, 2025

@ubergarm I would hold off releasing any versions using this just yet, as @compilade is going to test the raw safetensors files to make sure there really are only 15 out of 16 bit combinations getting used... If this is wrong then we can likely get a (much) better perplexity, due to the most extreme/important bit (in terms of the least-squares error criterion) getting rounded down.

@ubergarm
Copy link

@jukofyork

Thanks! I was able to at least confirm that using your one-line patch improves perplexity to match the full Q8_0 I tested:

| Quant | Perplexity | Size |
| --- | --- | --- |
| "pure" Q8_0 | 2.0823 +/- 0.0090 | 1016.117 GiB (8.504 BPW) |
| original q8_0 attn/shexp/first dense layer and q4_0 routed experts | 2.1257 +/- 0.00934 | 543.617 GiB (4.549 BPW) |
| patched q8_0 attn/shexp/first dense layer and q4_0 routed experts | 2.0818 +/- 0.00903 | 543.617 GiB (4.549 BPW) |
-        const float d  = max / -8;
+        const float d  = max / -7;

@jukofyork
Copy link
Collaborator

It seems we can do slightly better by using an iterative algorithm to refine the scale slightly:

static void dequantize_q4_0_block(const block_q4_0* block, float* dequant, int qk) {
    dequantize_row_q4_0(block, dequant, qk);
}

static float measure_q4_0_error(const float * GGML_RESTRICT x, int i, int qk, const block_q4_0* block) {
    float dequant[QK4_0];
    dequantize_q4_0_block(block, dequant, qk);

    float error = 0.0f;
    for (int j = 0; j < qk; j++) {
        error += fabsf(x[i*qk + j] - dequant[j]);
    }
    return error/qk;
}

static void print_q4_0_block_errors(const float * GGML_RESTRICT x, int i, int qk, const block_q4_0* block) {
    float dequant[QK4_0];
    dequantize_q4_0_block(block, dequant, qk);

    printf("  Errors for each element:\n");
    for (int j = 0; j < qk; j++) {
        float error = x[i*qk + j] - dequant[j];
        printf("    [%d] original=%.6f, dequant=%.6f, error=%.6f\n", j, x[i*qk + j], dequant[j], error);
    }
}

void quantize_row_q4_0_ref(const float * GGML_RESTRICT x, block_q4_0 * GGML_RESTRICT y, int64_t k) {
    static const int qk = QK4_0;
    assert(k % qk == 0);

    const int nb = k / qk;
    const int max_iter = 10;      // note: seems to mostly converge after 1 iteration...
    const float epsilon = 1e-6f;  // note: float16_epsilon = 0.00097656, 0.00097656^2 = ~1e-6

    for (int i = 0; i < nb; i++) {
        // Find max absolute value for initialization
        float amax = 0.0f;
        for (int j = 0; j < qk; j++) {
            const float v = fabsf(x[i*qk + j]);
            if (v > amax) {
                amax = v;
            }
        }

        // Initialize scale for range -7 to +7 (15 quantization levels)
        float s = amax / 7.0f;
        if (s == 0.0f) {
            s = 1.0f;
        }

        int8_t q[QK4_0];
        
        for (int iter = 0; iter < max_iter; iter++) {
            const float s_old = s;
            
            // Step 1: Assignment - quantize each element to range -7 to +7
            for (int j = 0; j < qk; j++) {
                const float q_float = x[i*qk + j] / s;
                int8_t q_int = (int8_t)roundf(q_float);
                // Clip to range -7 to +7 (kimi-k2-thinking constraint)
                if (q_int < -7) q_int = -7;
                if (q_int > 7) q_int = 7;
                q[j] = q_int;
            }
            
            // Step 2: Update scale
            float numerator = 0.0f;
            float denominator = 0.0f;
            for (int j = 0; j < qk; j++) {
                numerator += x[i*qk + j] * q[j];
                denominator += (float)(q[j] * q[j]);
            }
            
            if (denominator > 0.0f) {
                const float s_new = numerator / denominator;
                if (s_new > 0.0f) {
                    s = s_new;
                }
            }
            
            // Check convergence
            float delta_s = fabsf(s - s_old);
            //printf("- Iter %i: delta_s=%.6f\n", iter, delta_s);
            if (delta_s < epsilon) {
                break;
            }
        }
        
        // Store the scale
        y[i].d = GGML_FP32_TO_FP16(s);
        
        // Pack quantized values: map -7..+7 to stored values 1..15
        // (stored value 0 is unused due to kimi-k2-thinking's ±7 constraint)
        for (int j = 0; j < qk/2; j++) {
            const uint8_t q0 = (uint8_t)(q[j] + 8);
            const uint8_t q1 = (uint8_t)(q[qk/2 + j] + 8);
            y[i].qs[j] = q0 | (q1 << 4);
        }
        
        float error = measure_q4_0_error(x, i, qk, &y[i]);
        if (error > 1e-3f) {  // note: float16_epsilon = 0.00097656 = ~1e-3
            printf("- Block %d: error=%.6f\n", i, error);
            print_q4_0_block_errors(x, i, qk, &y[i]);
        }
    }
}
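
For what it's worth, the "Step 2" scale update above is just the closed-form least-squares fit of the scale with the assignments held fixed (my own reading of the code, using the same x, q, s names):

$$E(s) = \sum_j (x_j - s\,q_j)^2, \qquad \frac{dE}{ds} = -2\sum_j q_j (x_j - s\,q_j) = 0 \;\Rightarrow\; s = \frac{\sum_j x_j q_j}{\sum_j q_j^2}$$

which is exactly the numerator/denominator computed in the update loop.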

I think this is called the "Lloyd-Max" algorithm, although I'm not sure, as that seems to be for general split points.

It's a fair bit slower, so it will be tomorrow by the time I have the results for this.

This didn't gain anything worthwhile:

Final estimate: PPL = 2.0843 +/- 0.00906

@ubergarm
Copy link

ubergarm commented Nov 13, 2025

@jukofyork

@ubergarm I would hold off releasing any versions using this just yet, as @compilade is going to test the raw safetensors files to make sure there really are only 15 out of 16 bit combinations getting used...

compilade provided a script, and running it against the original moonshotai safetensors it looks like there isn't any block-wise absmax other than 7. The full log from that script is available at https://ubergarm.com/images/Kimi-K2-Thinking-safetensors-ranges.zip (~3.5MB zip, almost 20MB log file including some histogram data).

And given your iterative algorithm didn't gain anything worthwhile, I've uploaded the Q4_X here: https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/tree/main/Q4_X if anyone would like to test. It uses the same mainline-compatible mixture of quants as the Q8_0-Q4_0 given in the model card (there is a note there).

Thanks, and I'm curious to hear what moonshotai says on your https://huggingface.co/moonshotai/Kimi-K2-Thinking/discussions/26

Cheers!

@jukofyork
Copy link
Collaborator

I'm still hoping to get a version working for Q4_K too, if I can:

  • Both the F16 outer and Q6 inner block minimums will have to be set to constants.
  • The same logic as Q4_0 can be used to get the 1x scale and 32x 4bit values for the inner blocks.
  • Finally, the F16 outer and Q6 inner block scales will have to be computed using something like the Lloyd-Max code above.

Not sure if I will get round to it today, but I will hopefully test the stock Q4_K first to see how worthwhile it is to try this.
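
For anyone following along, this is roughly the layout being targeted (a simplified sketch of a Q4_K super-block, not the literal ggml struct definition, assuming QK_K = 256 and K_SCALE_SIZE = 12):

#include <stdint.h>

// Simplified picture of a Q4_K super-block: 256 weights split into 8 sub-blocks of 32.
typedef struct {
    uint16_t d;           // fp16 super-block scale applied to the 6-bit sub-block scales
    uint16_t dmin;        // fp16 super-block scale applied to the 6-bit sub-block mins
    uint8_t  scales[12];  // 8 sub-block scales + 8 sub-block mins, 6 bits each, packed
    uint8_t  qs[128];     // 256 x 4-bit codes, two per byte
} q4_K_superblock_sketch;

// Dequantization of element l in sub-block j (sc[j], m[j] are the unpacked 6-bit values):
//   x[j*32 + l] ~= d * sc[j] * q[j*32 + l]  -  dmin * m[j]

So the per-sub-block effective scale is d*sc[j] and the per-sub-block offset is -dmin*m[j], which is what the Lloyd-Max code in the next comment fits.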

@jukofyork
Copy link
Collaborator

I still can't work out how the existing Q4_K code works well enough to hack it, but I have managed to write a version that uses the Lloyd-Max algorithm and then figured out how to alter it for Kimi-K2-Thinking:

// DEFINE THIS FOR KIMI-K2-THINKING INIT/CLIP LOGIC
#define IS_KIMI

// Helper: set scale and min in packed format
static inline void set_scale_min_k4(int j, uint8_t * GGML_RESTRICT q, uint8_t d, uint8_t m) {
    assert(d < 64 && m < 64);
    if (j < 4) {
        q[j] = (q[j] & 0xC0) | (d & 0x3F);
        q[j + 4] = (q[j + 4] & 0xC0) | (m & 0x3F);
    } else {
        const int j2 = j - 4;
        q[j2] = (q[j2] & 0x3F) | ((d & 0x30) << 2);
        q[j + 4] = (d & 0x0F) | ((m & 0x0F) << 4);
        q[j] = (q[j] & 0x3F) | ((m & 0x30) << 2);
    }
}

void quantize_row_q4_K_ref(const float * GGML_RESTRICT x, block_q4_K * GGML_RESTRICT y, int64_t k) {
    assert(k % QK_K == 0);
    
#ifdef IS_KIMI    
    const int max_iter = 50;      // note: takes more iterations to converge for this setup...
    const float epsilon = 1e-7f;  // note: use an even lower tolerance as it doesn't decrease as consistently...
#else
    const int max_iter = 10;
    const float epsilon = 1e-6f;   
#endif
    const int nb = k / QK_K;
    const int num_subblocks = QK_K / 32;

    for (int i = 0; i < nb; i++) {
        memset(y[i].scales, 0, K_SCALE_SIZE);
        
        float scales[num_subblocks];
        float mins[num_subblocks];
        
        // Initialization: compute initial scales and mins per sub-block
        for (int j = 0; j < num_subblocks; j++) {
            float xmin = x[i*QK_K + j*32];
            float xmax = x[i*QK_K + j*32];
            
            for (int l = 1; l < 32; l++) {
                const float v = x[i*QK_K + j*32 + l];
                xmin = v < xmin ? v : xmin;
                xmax = v > xmax ? v : xmax;
            }
            
#ifdef IS_KIMI
            scales[j] = (xmax - xmin) / 14.0f;
            mins[j] = -7.0f * scales[j];
#else
            scales[j] = (xmax - xmin) / 15.0f;
            mins[j] = xmin;
#endif
            if (scales[j] == 0.0f) scales[j] = 1.0f;
        }
        
        // Initialize super-block scales
        float d = 0.0f;
        float dmin_abs = 0.0f;
        for (int j = 0; j < num_subblocks; j++) {
            d = scales[j] > d ? scales[j] : d;
            const float mins_abs = fabsf(mins[j]);
            dmin_abs = mins_abs > dmin_abs ? mins_abs : dmin_abs;
        }
        d = d / 63.0f;
        float dmin = dmin_abs / 63.0f;
        if (d == 0.0f) d = 1.0f;
        if (dmin == 0.0f) dmin = 1.0f;
        
        // Quantize initial sub-block scales and mins
        uint8_t sc[num_subblocks];
        uint8_t m[num_subblocks];
        for (int j = 0; j < num_subblocks; j++) {
            sc[j] = (uint8_t)(nearest_int(scales[j] / d));
            sc[j] = sc[j] > 63 ? 63 : sc[j];
            
            const int m_int = nearest_int(mins[j] / dmin);
            m[j] = (uint8_t)(m_int < 0 ? -m_int : m_int);
            m[j] = m[j] > 63 ? 63 : m[j];
            
            set_scale_min_k4(j, y[i].scales, sc[j], m[j]);
        }
        
        // Adjust dmin sign based on typical min values
        float avg_min = 0.0f;
        for (int j = 0; j < num_subblocks; j++) avg_min += mins[j];
        avg_min /= num_subblocks;
        if (avg_min > 0.0f) dmin = -dmin;
        
        // Temporary storage for 4-bit codes
        uint8_t q[QK_K];
        
        // Lloyd-Max iteration
        for (int iter = 0; iter < max_iter; iter++) {
            const float d_old = d;
            const float dmin_old = dmin;
            
            // Step 1: Assignment - quantize to 4-bit codes
            for (int j = 0; j < num_subblocks; j++) {
                const float scale = d * sc[j];
                const float offset = -dmin * m[j];
                
                if (scale == 0.0f) {
                    for (int l = 0; l < 32; ++l) { 
                        q[j*32 + l] = 0;
                    }
                    continue;
                }
                
                for (int l = 0; l < 32; l++) {
                    const float v = x[i*QK_K + j*32 + l];
                    const int q_int = nearest_int((v - offset) / scale);
#ifdef IS_KIMI
                    q[j*32 + l] = (uint8_t)(q_int < 0 ? 0 : (q_int > 14 ? 14 : q_int));
#else
                    q[j*32 + l] = (uint8_t)(q_int < 0 ? 0 : (q_int > 15 ? 15 : q_int));
#endif
                }
            }
            
            // Step 2: Update sub-block scales and mins (2D least squares per sub-block)
            for (int j = 0; j < num_subblocks; j++) {
                float sum_x = 0.0f;
                float sum_q = 0.0f;
                float sum_xq = 0.0f;
                float sum_qq = 0.0f;
                
                for (int l = 0; l < 32; l++) {
                    const float xv = x[i*QK_K + j*32 + l];
                    const float qv = (float)q[j*32 + l];
                    sum_x += xv;
                    sum_q += qv;
                    sum_xq += xv * qv;
                    sum_qq += qv * qv;
                }
                
                const float n = 32.0f;
                const float det = n * sum_qq - sum_q * sum_q;
                
                if (det > 0.0f) {
                    const float a = (n * sum_xq - sum_x * sum_q) / det;
                    const float b = (sum_x - a * sum_q) / n;
                    
                    if (a > 0.0f && d > 0.0f) {
                        const int sc_new = nearest_int(a / d);
                        sc[j] = (uint8_t)(sc_new < 0 ? 0 : (sc_new > 63 ? 63 : sc_new));
                    }
                    
                    if (dmin != 0.0f) {
                        const int m_new = nearest_int(-b / dmin);
                        m[j] = (uint8_t)(m_new < 0 ? 0 : (m_new > 63 ? 63 : m_new));
                    }
                    
                    set_scale_min_k4(j, y[i].scales, sc[j], m[j]);
                }
            }
            
            // Step 3: Update super-block scales (2D least squares across all sub-blocks)
            float A = 0.0f;   // Σ(sc*q)²
            float B = 0.0f;   // Σ(m*sc*q)
            float C = 0.0f;   // Σm²
            float X_d = 0.0f; // Σ(x*sc*q)
            float X_m = 0.0f; // Σ(x*m)
            
            for (int j = 0; j < num_subblocks; j++) {
                float sum_sq = 0.0f;
                float sum_q = 0.0f;
                float sum_xq = 0.0f;
                float sum_x = 0.0f;
                
                for (int l = 0; l < 32; l++) {
                    const float xv = x[i*QK_K + j*32 + l];
                    const float qv = (float)q[j*32 + l];
                    sum_sq += qv * qv;
                    sum_q += qv;
                    sum_xq += xv * qv;
                    sum_x += xv;
                }
                
                const float sc_f = (float)sc[j];
                const float m_f = (float)m[j];
                
                A += sc_f * sc_f * sum_sq;
                B += m_f * sc_f * sum_q;
                C += m_f * m_f * 32.0f;
                X_d += sc_f * sum_xq;
                X_m += m_f * sum_x;
            }
            
            const float det = A * C - B * B;
            
            if (det > 0.0f) {
                const float d_new = (C * X_d - B * X_m) / det;
                const float dmin_new = (B * X_d - A * X_m) / det;
                
                if (d_new > 0.0f) {
                    d = d_new;
                }
                if (dmin_new != 0.0f) {
                    dmin = dmin_new;
                }
            }
            
            // Check convergence
            const float delta_d = fabsf(d - d_old);
            const float delta_dmin = fabsf(dmin - dmin_old);
            
            //printf("- Iter %i: delta_d=%.6f, delta_dmin=%.6f\n", iter, delta_d, delta_dmin);
            if (delta_d < epsilon && delta_dmin < epsilon) {
                break;
            }
        }
        
        // Final assignment with converged parameters
        for (int j = 0; j < num_subblocks; j++) {
            const float scale = d * sc[j];
            const float offset = -dmin * m[j];
            
            for (int l = 0; l < 32; l++) {
                const float v = x[i*QK_K + j*32 + l];
                const int q_int = scale != 0.0f ? nearest_int((v - offset) / scale) : 0;
#ifdef IS_KIMI
                q[j*32 + l] = (uint8_t)(q_int < 0 ? 0 : (q_int > 14 ? 14 : q_int));
#else
                q[j*32 + l] = (uint8_t)(q_int < 0 ? 0 : (q_int > 15 ? 15 : q_int));
#endif
            }
        }
        
        // Store final super-block scales
        y[i].d = GGML_FP32_TO_FP16(d);
        y[i].dmin = GGML_FP32_TO_FP16(dmin);
               
        // Pack 4-bit quantized values (layout expected by dequant)
        uint8_t *qs = y[i].qs;
        for (int base = 0, out = 0; base < QK_K; base += 64, out += 32) {
            for (int l = 0; l < 32; ++l) {
                qs[out + l] = (q[base + l] & 0x0F) | ((q[base + 32 + l] & 0x0F) << 4);
            }
        }
        
        /*
        // Dequantize and check error
        float y_dequant[QK_K];
        dequantize_row_q4_K(&y[i], y_dequant, QK_K);
        printf("Block %d errors:\n", i);
        float sum_error = 0.0f;
        float sum_abs_error = 0.0f;
        for (int j = 0; j < QK_K; j++) {
            const float error = y_dequant[j] - x[i*QK_K + j];
            printf("  [%d] original=%.6f dequant=%.6f error=%.6f\n", j, x[i*QK_K + j], y_dequant[j], error);
            sum_error += error;
            sum_abs_error += fabsf(error);
        }
        const float mean_error = sum_error / QK_K;
        const float mean_abs_error = sum_abs_error / QK_K;
        printf("- Mean error         : %.6f\n", mean_error);
        printf("- Mean absolute error: %.6f\n\n", mean_abs_error);
        */

    }
}
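
In case it helps anyone checking the "Step 3" update: with u = sc[j]*q held fixed per element, it is solving the 2-parameter least-squares problem min over (d, dmin) of sum (x - d*u + dmin*m)^2, whose normal equations, written with the same A, B, C, X_d, X_m as the code (my own derivation, so treat it as a sanity check rather than gospel), are:

$$A\,d - B\,d_{min} = X_d, \qquad B\,d - C\,d_{min} = X_m \;\Rightarrow\; d = \frac{C X_d - B X_m}{AC - B^2}, \qquad d_{min} = \frac{B X_d - A X_m}{AC - B^2}$$

with A = Σ u², B = Σ u·m, C = Σ m², X_d = Σ x·u and X_m = Σ x·m, matching the determinant and update expressions above.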

It's a bit of a hacky mess currently, but it's definitely producing much lower errors than the stock Q4_K algorithm or the stock Lloyd-Max version: around 1 order of magnitude lower, as opposed to the 2 orders of magnitude lower from the "-7 hacked" Q4_0 function posted above. So it's not going to be quite as good, but hopefully near enough for those of us who get a 50-60% bump in PP for Q4_K over Q4_0!

It's going to take a long time as I've bumped the iteration count, since it doesn't seem to converge as smoothly as the stock Lloyd-Max version... I should have the results tomorrow or late tonight though.


It is called the "Lloyd-Max" algorithm (even when the bins are equally spaced like this). The original papers are here:

https://cs.nyu.edu/home/people/in_memoriam/roweis/csc2515-2006/readings/max60.pdf
http://www.cs.cmu.edu/~bhiksha/courses/mlsp.fall2010/class14/lloyd.pdf

It looks like Lloyd reinvented it 20 years later (as he doesn't reference Max).

@jukofyork
Copy link
Collaborator

jukofyork commented Nov 13, 2025

around 1 order of magnitude lower, as opposed to 2 orders of magnitude lower for the "-7 hacked" version

The code above using #define IS_KIMI:

- Iter 0: delta_d=0.000002, delta_dmin=0.000018
- Iter 1: delta_d=0.000003, delta_dmin=0.000016
- Iter 2: delta_d=0.000002, delta_dmin=0.000011
- Iter 3: delta_d=0.000002, delta_dmin=0.000017
- Iter 4: delta_d=0.000001, delta_dmin=0.000006
- Iter 5: delta_d=0.000000, delta_dmin=0.000006
- Iter 6: delta_d=0.000001, delta_dmin=0.000006
- Iter 7: delta_d=0.000001, delta_dmin=0.000007
- Iter 8: delta_d=0.000001, delta_dmin=0.000003
- Iter 9: delta_d=0.000000, delta_dmin=0.000001
- Iter 10: delta_d=0.000000, delta_dmin=0.000000
Block 0 errors:
  [0] original=0.008911 dequant=0.008969 error=0.000058
  [1] original=-0.035645 dequant=-0.035770 error=-0.000125
  [2] original=0.035645 dequant=0.035813 error=0.000168
  [3] original=0.008911 dequant=0.008969 error=0.000058
  [4] original=-0.017822 dequant=-0.017874 error=-0.000052
  [5] original=0.000000 dequant=0.000021 error=0.000021
  [6] original=-0.026733 dequant=-0.026822 error=-0.000089
  [7] original=0.008911 dequant=0.008969 error=0.000058
  [8] original=0.008911 dequant=0.008969 error=0.000058
  [9] original=0.026733 dequant=0.026865 error=0.000132
  [10] original=-0.017822 dequant=-0.017874 error=-0.000052
  ...
  [245] original=-0.025269 dequant=-0.025024 error=0.000245
  [246] original=0.012634 dequant=0.012557 error=-0.000077
  [247] original=-0.050537 dequant=-0.050078 error=0.000459
  [248] original=-0.050537 dequant=-0.050078 error=0.000459
  [249] original=-0.050537 dequant=-0.050078 error=0.000459
  [250] original=-0.012634 dequant=-0.012497 error=0.000137
  [251] original=-0.012634 dequant=-0.012497 error=0.000137
  [252] original=-0.025269 dequant=-0.025024 error=0.000245
  [253] original=-0.025269 dequant=-0.025024 error=0.000245
  [254] original=0.037842 dequant=0.037611 error=-0.000231
  [255] original=0.000000 dequant=0.000030 error=0.000030
- Mean absolute error: 0.000357

vs:

The "-7 hacked" version of quantize_row_q4_0_ref:

- Block 0: error=0.000039, lattice_offset=1
- Block 1: error=0.000032, lattice_offset=1
- Block 2: error=0.000029, lattice_offset=1

I think this is probably about as good as we can do for Q4_K though, as it can't ever be lossless due to the super-block / sub-block setup... Hopefully it's good enough.
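
To spell out the "can't ever be lossless" point as I understand it: each 32-value group in the QAT weights has its own scale, call it s_j, but Q4_K reconstructs that group as

$$\hat{x}_{j,l} = d\,\mathrm{sc}_j\,q_{j,l} - d_{min}\,m_j, \qquad \mathrm{sc}_j, m_j \in \{0,\dots,63\},$$

so all eight effective scales d*sc_j in a super-block have to be built from a single fp16 d and a 6-bit integer, which in general cannot hit eight arbitrary s_j values exactly, unlike Q4_0 where every 32-value block carries its own fp16 scale.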

@jukofyork
Copy link
Collaborator

Found an even better initialisation now:

#ifdef IS_KIMI
            scales[j] = MAX(fabsf(xmin), fabsf(xmax)) / 7.0f;
            mins[j] = -7.0f * scales[j];
#else
            scales[j] = (xmax - xmin) / 15.0f;
            mins[j] = xmin;
#endif

Labels

python python script changes

8 participants