convert : BailingMoE : fix qkv split when head_dim is 0 #12687
Conversation
Thanks a lot for fixing this so quickly! I tested it and can confirm that it fixes the issues I experienced with Ling-lite-base GGUF conversion. However, unfortunately, the resulting GGUF doesn't seem to load in llama.cpp:

root@AI:/apool/llama.cpp/build/bin# ./llama-cli -m /mradermacher/tmp/quant/Ling-lite-base.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 5016 (fb8c6eb4) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23241 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) - 23260 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 367 tensors from /mradermacher/tmp/quant/Ling-lite-base.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bailingmoe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Ling Lite Base
llama_model_loader: - kv 3: general.size_label str = 64x1.5B
llama_model_loader: - kv 4: general.license str = mit
llama_model_loader: - kv 5: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 6: bailingmoe.block_count u32 = 28
llama_model_loader: - kv 7: bailingmoe.context_length u32 = 16384
llama_model_loader: - kv 8: bailingmoe.embedding_length u32 = 2048
llama_model_loader: - kv 9: bailingmoe.feed_forward_length u32 = 5632
llama_model_loader: - kv 10: bailingmoe.attention.head_count u32 = 16
llama_model_loader: - kv 11: bailingmoe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: bailingmoe.rope.freq_base f32 = 600000.000000
llama_model_loader: - kv 13: bailingmoe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 14: bailingmoe.expert_used_count u32 = 6
llama_model_loader: - kv 15: bailingmoe.attention.key_length u32 = 0
llama_model_loader: - kv 16: bailingmoe.attention.value_length u32 = 0
llama_model_loader: - kv 17: general.file_type u32 = 1
llama_model_loader: - kv 18: bailingmoe.rope.dimension_count u32 = 128
llama_model_loader: - kv 19: bailingmoe.rope.scaling.type str = none
llama_model_loader: - kv 20: bailingmoe.leading_dense_block_count u32 = 0
llama_model_loader: - kv 21: bailingmoe.vocab_size u32 = 126464
llama_model_loader: - kv 22: bailingmoe.expert_feed_forward_length u32 = 1408
llama_model_loader: - kv 23: bailingmoe.expert_weights_scale f32 = 1.000000
llama_model_loader: - kv 24: bailingmoe.expert_count u32 = 64
llama_model_loader: - kv 25: bailingmoe.expert_shared_count u32 = 2
llama_model_loader: - kv 26: bailingmoe.expert_weights_norm bool = true
llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 28: tokenizer.ggml.pre str = bailingmoe
llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,126464] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,126464] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,125824] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "h e...
llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 126080
llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 126081
llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 126081
llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - type f32: 85 tensors
llama_model_loader: - type f16: 282 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 31.30 GiB (16.00 BPW)
load: special tokens cache size = 262
load: token to piece cache size = 0.8056 MB
print_info: arch = bailingmoe
print_info: vocab_only = 0
print_info: n_ctx_train = 16384
print_info: n_embd = 2048
print_info: n_layer = 28
print_info: n_head = 16
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 0
print_info: n_embd_head_v = 0
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 0
print_info: n_embd_v_gqa = 0
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 5632
print_info: n_expert = 64
print_info: n_expert_used = 6
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = none
print_info: freq_base_train = 600000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 16384
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 16B
print_info: model params = 16.80 B
print_info: general.name = Ling Lite Base
print_info: n_layer_dense_lead = 0
print_info: n_ff_exp = 1408
print_info: n_expert_shared = 2
print_info: expert_weights_scale = 1.0
print_info: expert_weights_norm = 1
print_info: vocab type = BPE
print_info: n_vocab = 126464
print_info: n_merges = 125824
print_info: BOS token = 126080 '<|startoftext|>'
print_info: EOS token = 126081 '<|endoftext|>'
print_info: EOT token = 126081 '<|endoftext|>'
print_info: PAD token = 126081 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 126081 '<|endoftext|>'
print_info: max token length = 154
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 32054.45 MiB
...........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 600000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (16384) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.48 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
init: failed to allocate buffer for kv cache
llama_init_from_model: failed to initialize the context: failed to initialize self-attention cache
common_init_from_params: failed to create context with model '/mradermacher/tmp/quant/Ling-lite-base.gguf'
main: error: unable to load model
Sigh, indeed, it's because the base class does this (llama.cpp/convert_hf_to_gguf.py, lines 257 to 259 at a6f32f0):
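The referenced base-class logic is roughly the following (a paraphrase from memory for illustration, not a verbatim copy of the lines at a6f32f0; write_attention_kv_lengths is a hypothetical wrapper, and gguf_writer stands in for the model's GGUFWriter instance):

def write_attention_kv_lengths(hparams: dict, gguf_writer) -> None:
    # if "head_dim" is present in config.json, it is written out verbatim,
    # even when it is an explicit 0
    head_dim = hparams.get("head_dim")
    if head_dim is not None:
        gguf_writer.add_key_length(head_dim)    # bailingmoe.attention.key_length
        gguf_writer.add_value_length(head_dim)  # bailingmoe.attention.value_length

With "head_dim": 0 in the config, this is what produces the key_length = 0 and value_length = 0 entries visible as kv 15/16 in the loader output above.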
I'll see if we can work around that.
Actually, you know what, setting … see PR#2
@ngxson I still think it's worth it to merge this as it's slightly nicer, even though the model itself is what needs fixing.
I don't have time to test this model, but it seems good to me.
 n_kv_head = self.hparams.get("num_key_value_heads")
 n_embd = self.hparams["hidden_size"]
-head_dim = self.hparams.get("head_dim", n_embd // n_head)
+head_dim = self.hparams.get("head_dim") or n_embd // n_head
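To make the difference concrete, here is a minimal runnable sketch with a hypothetical hparams dict that mirrors a config.json storing an explicit head_dim of 0:

hparams = {"hidden_size": 2048, "num_attention_heads": 16, "head_dim": 0}
n_head = hparams["num_attention_heads"]
n_embd = hparams["hidden_size"]

# dict.get only falls back when the key is absent, so an explicit 0 is kept
old_head_dim = hparams.get("head_dim", n_embd // n_head)   # -> 0
# `or` also falls back on falsy values such as 0 or None
new_head_dim = hparams.get("head_dim") or n_embd // n_head  # -> 128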
Yeah, sometimes we have issues where models exported by transformers have some keys set to None. I will discuss with the @huggingface team to see if it can be removed in the next version.
@CISC I checked with the transformers team; they said that the None value is actually set by custom code outside of the library.
More importantly, I almost forgot that we actually have Model.find_hparams in convert_hf_to_gguf.py, which is perfect for handling such cases. If you have time, can you do a pass to change the places that currently use .get to find_hparams?
That method is useful if you have multiple candidates for a value, but I don't see how it applies here?
The issue is not None, but that they set an actual 0.
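For context, find_hparam behaves roughly like this (a paraphrase for illustration, written as a free function instead of a method; not a verbatim copy of convert_hf_to_gguf.py), which is why on its own it would still hand back a stored 0:

from typing import Any, Iterable

def find_hparam(hparams: dict, keys: Iterable[str], optional: bool = False) -> Any:
    # return the value of the first candidate key that exists in hparams
    key = next((k for k in keys if k in hparams), None)
    if key is not None:
        return hparams[key]  # an explicit 0 (or None) is returned unchanged
    if optional:
        return None
    raise KeyError(f"could not find any of: {keys}")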
Hmm, OK, I misunderstood the function. But now I think it would be nice if find_hparam could handle this case too; maybe add a default_value arg to it?
I'm thinking about this because it was also the case for Gemma 3; it was a bit messy because some params were either missing or null in the config.json. It's quite possible that many models in the future will have this same behavior.
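One possible shape for that suggestion, purely as a sketch (the default_value name comes from the comment above; nothing like this exists in the codebase yet):

from typing import Any, Iterable

def find_hparam_with_default(hparams: dict, keys: Iterable[str],
                             default_value: Any = None, optional: bool = False) -> Any:
    # like find_hparam, but also falls back to default_value when the key
    # is missing or explicitly null in config.json
    key = next((k for k in keys if k in hparams), None)
    value = hparams[key] if key is not None else None
    if value is not None:
        return value
    if optional or default_value is not None:
        return default_value
    raise KeyError(f"could not find any of: {keys}")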
Missed second head_dim usage in #12678.
Cleaner assignment as well.
Edit: The Ling-lite-base model is still broken until PR#2 is merged, though.
@bartowski1182 @nicoboss
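For reference, the qkv split mentioned in the PR title depends directly on head_dim; a simplified sketch of such a split (illustrative only, not the actual BailingMoE converter code):

import torch

def split_fused_qkv(qkv: torch.Tensor, n_head: int, n_kv_head: int, head_dim: int):
    # fused rows are laid out as q (n_head * head_dim), then k and v
    # (n_kv_head * head_dim each); with head_dim == 0 every split size
    # collapses to 0 and q/k/v come out empty
    q, k, v = qkv.split(
        [n_head * head_dim, n_kv_head * head_dim, n_kv_head * head_dim], dim=0
    )
    return q, k, v

For example, with n_head = 16, n_kv_head = 4, head_dim = 128 and n_embd = 2048, the fused tensor has shape [(16 + 2 * 4) * 128, 2048] = [3072, 2048] and splits into 2048/512/512 rows.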