
Conversation

CISC (Collaborator) commented Apr 1, 2025

Missed second head_dim usage in #12678.

Cleaner assignment as well.

Edit: The Ling-lite-base model is still broken though until PR#2 is merged.

@bartowski1182 @nicoboss

@CISC requested a review from ngxson April 1, 2025 06:32
github-actions bot added the python (python script changes) label Apr 1, 2025
nicoboss (Contributor) commented Apr 1, 2025

Thanks a lot for fixing this so quickly! I tested it and can confirm that this fixes all the issues I experienced with Ling-lite-base GGUF conversion. However, unfortunately the resulting GGUF doesn't seem to load in llama.cpp:

root@AI:/apool/llama.cpp/build/bin# ./llama-cli -m /mradermacher/tmp/quant/Ling-lite-base.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 5016 (fb8c6eb4) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23241 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) - 23260 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 367 tensors from /mradermacher/tmp/quant/Ling-lite-base.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bailingmoe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Ling Lite Base
llama_model_loader: - kv   3:                         general.size_label str              = 64x1.5B
llama_model_loader: - kv   4:                            general.license str              = mit
llama_model_loader: - kv   5:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   6:                     bailingmoe.block_count u32              = 28
llama_model_loader: - kv   7:                  bailingmoe.context_length u32              = 16384
llama_model_loader: - kv   8:                bailingmoe.embedding_length u32              = 2048
llama_model_loader: - kv   9:             bailingmoe.feed_forward_length u32              = 5632
llama_model_loader: - kv  10:            bailingmoe.attention.head_count u32              = 16
llama_model_loader: - kv  11:         bailingmoe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                  bailingmoe.rope.freq_base f32              = 600000.000000
llama_model_loader: - kv  13: bailingmoe.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  14:               bailingmoe.expert_used_count u32              = 6
llama_model_loader: - kv  15:            bailingmoe.attention.key_length u32              = 0
llama_model_loader: - kv  16:          bailingmoe.attention.value_length u32              = 0
llama_model_loader: - kv  17:                          general.file_type u32              = 1
llama_model_loader: - kv  18:            bailingmoe.rope.dimension_count u32              = 128
llama_model_loader: - kv  19:               bailingmoe.rope.scaling.type str              = none
llama_model_loader: - kv  20:       bailingmoe.leading_dense_block_count u32              = 0
llama_model_loader: - kv  21:                      bailingmoe.vocab_size u32              = 126464
llama_model_loader: - kv  22:      bailingmoe.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  23:            bailingmoe.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  24:                    bailingmoe.expert_count u32              = 64
llama_model_loader: - kv  25:             bailingmoe.expert_shared_count u32              = 2
llama_model_loader: - kv  26:             bailingmoe.expert_weights_norm bool             = true
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = bailingmoe
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,126464]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,126464]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  31:                      tokenizer.ggml.merges arr[str,125824]  = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "h e...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 126080
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 126081
llama_model_loader: - kv  34:            tokenizer.ggml.padding_token_id u32              = 126081
llama_model_loader: - kv  35:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  36:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   85 tensors
llama_model_loader: - type  f16:  282 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 31.30 GiB (16.00 BPW) 
load: special tokens cache size = 262
load: token to piece cache size = 0.8056 MB
print_info: arch             = bailingmoe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 16384
print_info: n_embd           = 2048
print_info: n_layer          = 28
print_info: n_head           = 16
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 0
print_info: n_embd_head_v    = 0
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 0
print_info: n_embd_v_gqa     = 0
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 5632
print_info: n_expert         = 64
print_info: n_expert_used    = 6
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = none
print_info: freq_base_train  = 600000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 16384
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 16B
print_info: model params     = 16.80 B
print_info: general.name     = Ling Lite Base
print_info: n_layer_dense_lead   = 0
print_info: n_ff_exp             = 1408
print_info: n_expert_shared      = 2
print_info: expert_weights_scale = 1.0
print_info: expert_weights_norm  = 1
print_info: vocab type       = BPE
print_info: n_vocab          = 126464
print_info: n_merges         = 125824
print_info: BOS token        = 126080 '<|startoftext|>'
print_info: EOS token        = 126081 '<|endoftext|>'
print_info: EOT token        = 126081 '<|endoftext|>'
print_info: PAD token        = 126081 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 126081 '<|endoftext|>'
print_info: max token length = 154
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size = 32054.45 MiB
...........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 600000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (16384) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.48 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
init: failed to allocate buffer for kv cache
llama_init_from_model: failed to initialize the context: failed to initialize self-attention cache
common_init_from_params: failed to create context with model '/mradermacher/tmp/quant/Ling-lite-base.gguf'
main: error: unable to load model

CISC (Collaborator, Author) commented Apr 1, 2025

> Thanks a lot for fixing this so quickly! I tested it and can confirm that this fixes all the issues I experienced with Ling-lite-base GGUF conversion. However, unfortunately the resulting GGUF doesn't seem to load in llama.cpp:

Sigh, indeed, it's because the base class does this:

if (head_dim := self.hparams.get("head_dim")) is not None:
    self.gguf_writer.add_key_length(head_dim)
    self.gguf_writer.add_value_length(head_dim)

I'll see if we can work around that.
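To make the failure mode concrete, here is a tiny standalone sketch with hypothetical values (the real logic lives in the Model base class in convert_hf_to_gguf.py): because the guard only checks for None, an explicit 0 in the config slips through and is written as the key/value length, which matches the bailingmoe.attention.key_length = 0 metadata and the kv-cache allocation failure in the log above.

# Hypothetical illustration only, not the converter code: a config that stores
# head_dim as an explicit 0 still passes an `is not None` check.
hparams = {"hidden_size": 2048, "num_attention_heads": 16, "head_dim": 0}

if (head_dim := hparams.get("head_dim")) is not None:
    # 0 is not None, so this branch runs and a zero head size is written out
    print("key_length / value_length would be written as", head_dim)  # -> 0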

CISC (Collaborator, Author) commented Apr 1, 2025

> I'll see if we can work around that.

Actually, you know what, setting head_dim to 0 is broken; I'll submit a fix to the model instead.

See PR#2

CISC (Collaborator, Author) commented Apr 1, 2025

@ngxson I still think it's worth it to merge this as it's slightly nicer, even though the model itself is what needs fixing.

ngxson (Collaborator) left a review comment:

I don't have time to test this model, but seems good to me

  n_kv_head = self.hparams.get("num_key_value_heads")
  n_embd = self.hparams["hidden_size"]
- head_dim = self.hparams.get("head_dim", n_embd // n_head)
+ head_dim = self.hparams.get("head_dim") or n_embd // n_head
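For readers skimming the hunk, a small standalone demonstration (hypothetical dicts, not the converter itself) of why the two lines differ: dict.get(key, default) only falls back when the key is missing, while dict.get(key) or default also falls back when the key is present but set to None or 0.

# Hypothetical hparams values illustrating the hunk above.
n_embd, n_head = 2048, 16

for hparams in ({}, {"head_dim": None}, {"head_dim": 0}, {"head_dim": 96}):
    old = hparams.get("head_dim", n_embd // n_head)    # falls back only if the key is absent
    new = hparams.get("head_dim") or n_embd // n_head  # also falls back on None or 0
    print(hparams, old, new)
# {} -> 128, 128;  {'head_dim': None} -> None, 128;  {'head_dim': 0} -> 0, 128;  {'head_dim': 96} -> 96, 96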
ngxson (Collaborator):

Yeah, sometimes we have an issue where models exported by transformers have some keys set to None; will discuss with the @huggingface team to see if it can be removed in the next version

ngxson (Collaborator):

@CISC I checked with the transformers team, they said that the None value is actually set by custom code outside of the library.

More importantly, I almost forgot that we actually have Model.find_hparam in convert_hf_to_gguf.py that is perfect for handling such a case. If you have time, can you do a pass to change the places that currently use .get to find_hparam?
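For context, a rough approximation of what such a helper does (a sketch, not the exact upstream code): it returns the value of the first candidate key present in hparams, which is why it targets aliased key names rather than bad values.

# Approximate sketch of a find_hparam-style helper; the real implementation
# in convert_hf_to_gguf.py may differ.
from typing import Any, Iterable

def find_hparam(hparams: dict, keys: Iterable[str], optional: bool = False) -> Any:
    for k in keys:
        if k in hparams:
            return hparams[k]  # returns whatever is stored, even 0 or None
    if optional:
        return None
    raise KeyError(f"could not find any of: {keys}")

# typical use: the same value stored under different names across model configs
n_ctx = find_hparam({"max_position_embeddings": 16384}, ["n_ctx", "max_position_embeddings"])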

CISC (Collaborator, Author):

That method is useful if you have multiple candidates for a value, but I don't see how it applies here?

The issue is not None, but that they set an actual 0.

ngxson (Collaborator):

Hmm, OK, I misunderstood the function. But now I think it would be nice if find_hparam could handle this case, maybe by adding a default_value arg to it?

I'm thinking about this because it was also the case for Gemma 3; it was a bit messy because some params were either missing or null in the config.json. It's possible that many models in the future will have this same behavior.
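One possible shape such a default could take, sketched purely as an assumption (this is not an actual patch): let the fallback apply both when no candidate key is present and when the stored value is None, which would cover the missing-or-null config.json cases mentioned above.

# Hypothetical extension of the helper sketched earlier: `default` also applies
# when a key exists but holds None.
from typing import Any, Iterable

def find_hparam(hparams: dict, keys: Iterable[str], default: Any = None, optional: bool = False) -> Any:
    for k in keys:
        value = hparams.get(k)
        if value is not None:
            return value
    if default is not None or optional:
        return default
    raise KeyError(f"could not find any of: {keys}")

head_dim = find_hparam({"head_dim": None}, ["head_dim"], default=2048 // 16)  # -> 128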

@CISC merged commit 5936a61 into ggml-org:master Apr 1, 2025
5 checks passed
@CISC deleted the fix-bailing-qkv-split branch April 1, 2025 12:37