Conversation

im0qianqian

@im0qianqian im0qianqian commented Sep 16, 2025

@github-actions github-actions bot added the python python script changes label Sep 16, 2025
@im0qianqian im0qianqian marked this pull request as ready for review September 16, 2025 11:39
@CISC
Collaborator

CISC commented Sep 16, 2025

Thank you for the effort, but I already have a working version and will submit a PR soon.

Unfortunately I can tell that this PR is non-working.

@im0qianqian
Author

im0qianqian commented Sep 17, 2025

Here are my test results. It runs perfectly.

command:

llama-cli -m ./Ling-mini-2.0-Q4_K_M.gguf --temp 0.7

llama-cli logs:

main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) (unknown id) - 36863 MiB free
llama_model_loader: loaded meta data with 41 key-value pairs and 318 tensors from ./models/our_models/Ling-mini-2.0-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bailingmoe-v2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Ling Mini 2.0
llama_model_loader: - kv   3:                            general.version str              = 2.0
llama_model_loader: - kv   4:                           general.basename str              = Ling
llama_model_loader: - kv   5:                         general.size_label str              = mini
llama_model_loader: - kv   6:                            general.license str              = MIT License
llama_model_loader: - kv   7:                  bailingmoe-v2.block_count u32              = 20
llama_model_loader: - kv   8:               bailingmoe-v2.context_length u32              = 32768
llama_model_loader: - kv   9:             bailingmoe-v2.embedding_length u32              = 2048
llama_model_loader: - kv  10:          bailingmoe-v2.feed_forward_length u32              = 5120
llama_model_loader: - kv  11:         bailingmoe-v2.attention.head_count u32              = 16
llama_model_loader: - kv  12:      bailingmoe-v2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:               bailingmoe-v2.rope.freq_base f32              = 600000.000000
llama_model_loader: - kv  14: bailingmoe-v2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:            bailingmoe-v2.expert_used_count u32              = 8
llama_model_loader: - kv  16:         bailingmoe-v2.attention.key_length u32              = 128
llama_model_loader: - kv  17:       bailingmoe-v2.attention.value_length u32              = 128
llama_model_loader: - kv  18:         bailingmoe-v2.rope.dimension_count u32              = 64
llama_model_loader: - kv  19:            bailingmoe-v2.rope.scaling.type str              = none
llama_model_loader: - kv  20:    bailingmoe-v2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  21:                   bailingmoe-v2.vocab_size u32              = 157184
llama_model_loader: - kv  22:   bailingmoe-v2.expert_feed_forward_length u32              = 512
llama_model_loader: - kv  23:         bailingmoe-v2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  24:                 bailingmoe-v2.expert_count u32              = 256
llama_model_loader: - kv  25:          bailingmoe-v2.expert_shared_count u32              = 1
llama_model_loader: - kv  26:          bailingmoe-v2.expert_weights_norm bool             = true
llama_model_loader: - kv  27:           bailingmoe-v2.expert_gating_func u32              = 2
llama_model_loader: - kv  28:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  29:                         tokenizer.ggml.pre str              = bailing-bt2
llama_model_loader: - kv  30:                      tokenizer.ggml.tokens arr[str,157184]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,157184]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  32:                      tokenizer.ggml.merges arr[str,156635]  = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "h e...
llama_model_loader: - kv  33:                tokenizer.ggml.bos_token_id u32              = 156891
llama_model_loader: - kv  34:                tokenizer.ggml.eos_token_id u32              = 156895
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 156892
llama_model_loader: - kv  36:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  37:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  38:                    tokenizer.chat_template str              = {% set thinking_option = 'off' %}\n{{-...
llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama_model_loader: - kv  40:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  119 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   30 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 9.21 GiB (4.87 BPW)
load: printing all EOG tokens:
load:   - 156892 ('<|endoftext|>')
load:   - 156895 ('<|role_end|>')
load: special tokens cache size = 262
load: token to piece cache size = 1.0010 MB
print_info: arch             = bailingmoe-v2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 2048
print_info: n_layer          = 20
print_info: n_head           = 16
print_info: n_head_kv        = 4
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 5120
print_info: n_expert         = 256
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = none
print_info: freq_base_train  = 600000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: model type       = 16B
print_info: model params     = 16.26 B
print_info: general.name     = Ling Mini 2.0
print_info: n_layer_dense_lead   = 1
print_info: n_ff_exp             = 512
print_info: n_expert_shared      = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm  = 1
print_info: vocab type       = BPE
print_info: n_vocab          = 157184
print_info: n_merges         = 156635
print_info: BOS token        = 156891 '<|startoftext|>'
print_info: EOS token        = 156895 '<|role_end|>'
print_info: EOT token        = 156892 '<|endoftext|>'
print_info: PAD token        = 156892 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 156892 '<|endoftext|>'
print_info: EOG token        = 156895 '<|role_end|>'
print_info: max token length = 154
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 21/21 layers to GPU
load_tensors: Metal_Mapped model buffer size =  9433.81 MiB
load_tensors:   CPU_Mapped model buffer size =   172.69 MiB
...............................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 600000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4 Pro
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_load_library: using embedded metal library
ggml_metal_load_library: loaded in 0.006 sec
ggml_metal_init: GPU name:   Apple M4 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: use fusion            = true
ggml_metal_init: use shared buffers    = true
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 38654.71 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_set_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_c4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
llama_context:        CPU  output buffer size =     0.60 MiB
llama_kv_cache:      Metal KV buffer size =   160.00 MiB
llama_kv_cache: size =  160.00 MiB (  4096 cells,  20 layers,  1/1 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      Metal compute buffer size =   319.00 MiB
llama_context:        CPU compute buffer size =    12.01 MiB
llama_context: graph nodes  = 1353
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|role_end|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<role>SYSTEM</role>You are a helpful assistant<role>HUMAN</role>Hello<role>ASSISTANT</role>Hi there<role>HUMAN</role>How are you?<role>ASSISTANT</role>

system_info: n_threads = 8 (n_threads_batch = 8) / 12 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | OPENMP = 1 | REPACK = 1 |

main: interactive mode on.
sampler seed: 4258057996
sampler params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

Q & A (Ling-mini 2.0)

> Hello, do you like watching TV or playing games?
Hello! As an AI, I don't have personal preferences or the ability to watch TV or play games in the way humans do. However, I can certainly help you with information, recommendations, or even engage in a conversation about your favorite TV shows or games! What kind of content are you interested in? 😊🎮📺

If you're looking for recommendations, I can suggest some great shows or games based on your interests. Let me know what you enjoy! 😊🎮📺

Looking forward to hearing from you! 😊🎮📺

Best,
[Your Friendly AI Assistant]

Speed test results (136.17 tokens per second with Q4_K_M quantization on Apple M4 Pro):

llama_perf_sampler_print:    sampling time =       6.49 ms /   150 runs   (    0.04 ms per token, 23119.61 tokens per second)
llama_perf_context_print:        load time =     799.31 ms
llama_perf_context_print: prompt eval time =      58.75 ms /    20 tokens (    2.94 ms per token,   340.41 tokens per second)
llama_perf_context_print:        eval time =     947.34 ms /   129 runs   (    7.34 ms per token,   136.17 tokens per second)
llama_perf_context_print:       total time = 1053308.51 ms /   149 tokens
llama_perf_context_print:    graphs reused =        128
| model                           |     size |  params | backend    | threads |  test |            t/s |
| ------------------------------- | -------: | ------: | ---------- | ------: | ----: | -------------: |
| bailingmoe-v2 16B Q4_K - Medium | 9.21 GiB | 16.26 B | Metal,BLAS |       8 | pp512 | 1772.72 ± 3.49 |
| bailingmoe-v2 16B Q4_K - Medium | 9.21 GiB | 16.26 B | Metal,BLAS |       8 | tg128 |  142.66 ± 0.61 |
| bailingmoe-v2 16B Q4_K - Medium | 9.21 GiB | 16.26 B | Metal,BLAS |       8 | tg256 |  141.64 ± 1.06 |
| bailingmoe-v2 16B Q4_K - Medium | 9.21 GiB | 16.26 B | Metal,BLAS |       8 | tg512 |  132.44 ± 5.31 |
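
(The throughput table above looks like llama-bench output; assuming so, a command along the lines of `llama-bench -m ./Ling-mini-2.0-Q4_K_M.gguf -p 512 -n 128,256,512` would produce the pp512/tg128/tg256/tg512 rows. The exact invocation is not shown in this thread.)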

@CISC
Collaborator

CISC commented Sep 17, 2025

Here are my test results. It runs perfectly.

Sorry, but no, there are numerous issues with this implementation, I'll name just a few:

  • It does not work with the base model
  • It needlessly splits and permutes Q/K/V
  • It uses the wrong chat template (when not using --jinja)
  • It does not match the expert selection implementation
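
For context, the GGUF metadata shown in the log above (expert_gating_func = 2, i.e. sigmoid gating; expert_count = 256; expert_used_count = 8; expert_weights_norm = true; expert_weights_scale = 2.5) suggests the router works roughly as sketched below. This is an illustrative sketch under those assumptions, not the reference expert-selection code from either PR:

import numpy as np

def select_experts(router_logits: np.ndarray, n_used: int = 8, scale: float = 2.5):
    # sigmoid gating (expert_gating_func = 2 in the metadata above)
    probs = 1.0 / (1.0 + np.exp(-router_logits))
    # keep the top n_used of the 256 experts by gate probability
    top = np.argsort(-probs)[:n_used]
    weights = probs[top]
    # expert_weights_norm = true: renormalize the selected weights,
    # then apply expert_weights_scale = 2.5
    weights = weights / weights.sum()
    return top, weights * scale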

@fizzAI

fizzAI commented Sep 17, 2025

Here are my test results. It runs perfectly.

Sorry, but no, there are numerous issues with this implementation, I'll name just a few:

  • It does not work with the base model
  • It needlessly splits and permutes Q/K/V
  • It uses the wrong chat template (when not using --jinja)
  • It does not match the expert selection implementation

... You do realize you're talking to someone who works at/with Inclusion about their own model arch, right? No need for such needless passive aggression anyways when everyone is trying to help :'(

@CISC
Collaborator

CISC commented Sep 17, 2025

Here are my test results. It runs perfectly.

Sorry, but no, there are numerous issues with this implementation, I'll name just a few

... You do realize you're talking to someone who works at/with Inclusion about their own model arch, right? No need for such needless passive aggression anyways when everyone is trying to help :'(

Yes, and it is not passive aggression, simply stating facts.

@im0qianqian
Author

Here are my test results. It runs perfectly.

Sorry, but no, there are numerous issues with this implementation, I'll name just a few:

  • It does not work with the base model
  • It needlessly splits and permutes Q/K/V
  • It uses the wrong chat template (when not using --jinja)
  • It does not match the expert selection implementation

I understand. I've now fixed issue 2 about "It needlessly splits and permutes Q/K/V".
Whether your PR or mine gets merged, I appreciate your contribution to the open‑source adaptation of the Ling-series models.

@CISC
Collaborator

CISC commented Sep 18, 2025

I understand. I've now fixed issue 2 about "It needlessly splits and permutes Q/K/V". Whether your PR or mine gets merged, I appreciate your contribution to the open‑source adaptation of the Ling-series models.

You should not split them either, it's beneficial to have QKV fused.
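
To illustrate the fused-QKV point on the conversion side, a minimal sketch might look like the following; the helper and the tensor names are hypothetical and simplified, not the actual convert_hf_to_gguf.py code from either PR:

import numpy as np

def map_attention_tensor(hf_name: str, data: np.ndarray) -> list[tuple[str, np.ndarray]]:
    # If the checkpoint stores a fused query/key/value projection, keep it as a
    # single tensor rather than slicing (and permuting) it into separate Q/K/V.
    if "query_key_value" in hf_name:                                  # hypothetical HF-side name
        fused_name = hf_name.replace("query_key_value", "attn_qkv")  # hypothetical GGUF-side name
        return [(fused_name, data)]                                   # one fused tensor, no split
    return [(hf_name, data)]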

@im0qianqian im0qianqian requested a review from ngxson September 22, 2025 06:26
@cklsoft

cklsoft commented Sep 22, 2025

When will this PR be merged? I want to deploy GGUF-format Ling models on my macOS. :)

@im0qianqian im0qianqian requested a review from CISC as a code owner September 26, 2025 08:43
|| t.first == "_<EOT>"
|| t.first == "<|end_of_text|>"
|| t.first == "<end_of_utterance>" // smoldocling
|| t.first == "<|role_end|>" // Ling v2

Collaborator

Just wondering why this was added? It's set as eos_token and special in the tokenizer, so this should not be necessary.

Author

Hi, it's just because the llama-cli log told me that special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect.

Collaborator

Right, it's fine though, at that point it is added as EOG, so not an issue. :)

Author

Ok. Thank you.
