Conversation

im0qianqian

@im0qianqian im0qianqian commented Sep 16, 2025

@github-actions github-actions bot added the python python script changes label Sep 16, 2025
@im0qianqian im0qianqian marked this pull request as ready for review September 16, 2025 11:39
@CISC
Collaborator

CISC commented Sep 16, 2025

Thank you for the effort, but I already have a working version and will submit a PR soon.

Unfortunately I can tell that this PR is non-working.

@im0qianqian
Author

im0qianqian commented Sep 17, 2025

Here are my test results. It runs perfectly.

command:

llama-cli -m ./Ling-mini-2.0-Q4_K_M.gguf --temp 0.7

llama-cli logs:

main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) (unknown id) - 36863 MiB free
llama_model_loader: loaded meta data with 41 key-value pairs and 318 tensors from ./models/our_models/Ling-mini-2.0-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bailingmoe-v2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Ling Mini 2.0
llama_model_loader: - kv   3:                            general.version str              = 2.0
llama_model_loader: - kv   4:                           general.basename str              = Ling
llama_model_loader: - kv   5:                         general.size_label str              = mini
llama_model_loader: - kv   6:                            general.license str              = MIT License
llama_model_loader: - kv   7:                  bailingmoe-v2.block_count u32              = 20
llama_model_loader: - kv   8:               bailingmoe-v2.context_length u32              = 32768
llama_model_loader: - kv   9:             bailingmoe-v2.embedding_length u32              = 2048
llama_model_loader: - kv  10:          bailingmoe-v2.feed_forward_length u32              = 5120
llama_model_loader: - kv  11:         bailingmoe-v2.attention.head_count u32              = 16
llama_model_loader: - kv  12:      bailingmoe-v2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:               bailingmoe-v2.rope.freq_base f32              = 600000.000000
llama_model_loader: - kv  14: bailingmoe-v2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:            bailingmoe-v2.expert_used_count u32              = 8
llama_model_loader: - kv  16:         bailingmoe-v2.attention.key_length u32              = 128
llama_model_loader: - kv  17:       bailingmoe-v2.attention.value_length u32              = 128
llama_model_loader: - kv  18:         bailingmoe-v2.rope.dimension_count u32              = 64
llama_model_loader: - kv  19:            bailingmoe-v2.rope.scaling.type str              = none
llama_model_loader: - kv  20:    bailingmoe-v2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  21:                   bailingmoe-v2.vocab_size u32              = 157184
llama_model_loader: - kv  22:   bailingmoe-v2.expert_feed_forward_length u32              = 512
llama_model_loader: - kv  23:         bailingmoe-v2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  24:                 bailingmoe-v2.expert_count u32              = 256
llama_model_loader: - kv  25:          bailingmoe-v2.expert_shared_count u32              = 1
llama_model_loader: - kv  26:          bailingmoe-v2.expert_weights_norm bool             = true
llama_model_loader: - kv  27:           bailingmoe-v2.expert_gating_func u32              = 2
llama_model_loader: - kv  28:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  29:                         tokenizer.ggml.pre str              = bailing-bt2
llama_model_loader: - kv  30:                      tokenizer.ggml.tokens arr[str,157184]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,157184]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  32:                      tokenizer.ggml.merges arr[str,156635]  = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "h e...
llama_model_loader: - kv  33:                tokenizer.ggml.bos_token_id u32              = 156891
llama_model_loader: - kv  34:                tokenizer.ggml.eos_token_id u32              = 156895
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 156892
llama_model_loader: - kv  36:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  37:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  38:                    tokenizer.chat_template str              = {% set thinking_option = 'off' %}\n{{-...
llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama_model_loader: - kv  40:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  119 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   30 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 9.21 GiB (4.87 BPW)
load: printing all EOG tokens:
load:   - 156892 ('<|endoftext|>')
load:   - 156895 ('<|role_end|>')
load: special tokens cache size = 262
load: token to piece cache size = 1.0010 MB
print_info: arch             = bailingmoe-v2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 2048
print_info: n_layer          = 20
print_info: n_head           = 16
print_info: n_head_kv        = 4
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 5120
print_info: n_expert         = 256
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = none
print_info: freq_base_train  = 600000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: model type       = 16B
print_info: model params     = 16.26 B
print_info: general.name     = Ling Mini 2.0
print_info: n_layer_dense_lead   = 1
print_info: n_ff_exp             = 512
print_info: n_expert_shared      = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm  = 1
print_info: vocab type       = BPE
print_info: n_vocab          = 157184
print_info: n_merges         = 156635
print_info: BOS token        = 156891 '<|startoftext|>'
print_info: EOS token        = 156895 '<|role_end|>'
print_info: EOT token        = 156892 '<|endoftext|>'
print_info: PAD token        = 156892 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 156892 '<|endoftext|>'
print_info: EOG token        = 156895 '<|role_end|>'
print_info: max token length = 154
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 21/21 layers to GPU
load_tensors: Metal_Mapped model buffer size =  9433.81 MiB
load_tensors:   CPU_Mapped model buffer size =   172.69 MiB
...............................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 600000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4 Pro
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_load_library: using embedded metal library
ggml_metal_load_library: loaded in 0.006 sec
ggml_metal_init: GPU name:   Apple M4 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: use fusion            = true
ggml_metal_init: use shared buffers    = true
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 38654.71 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_set_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_c4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
llama_context:        CPU  output buffer size =     0.60 MiB
llama_kv_cache:      Metal KV buffer size =   160.00 MiB
llama_kv_cache: size =  160.00 MiB (  4096 cells,  20 layers,  1/1 seqs), K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      Metal compute buffer size =   319.00 MiB
llama_context:        CPU compute buffer size =    12.01 MiB
llama_context: graph nodes  = 1353
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|role_end|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<role>SYSTEM</role>You are a helpful assistant<role>HUMAN</role>Hello<role>ASSISTANT</role>Hi there<role>HUMAN</role>How are you?<role>ASSISTANT</role>

system_info: n_threads = 8 (n_threads_batch = 8) / 12 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | OPENMP = 1 | REPACK = 1 |

main: interactive mode on.
sampler seed: 4258057996
sampler params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

Q & A (Ling-mini 2.0)

> Hello, do you like watching TV or playing games?
Hello! As an AI, I don't have personal preferences or the ability to watch TV or play games in the way humans do. However, I can certainly help you with information, recommendations, or even engage in a conversation about your favorite TV shows or games! What kind of content are you interested in? 😊🎮📺

If you're looking for recommendations, I can suggest some great shows or games based on your interests. Let me know what you enjoy! 😊🎮📺

Looking forward to hearing from you! 😊🎮📺

Best,
[Your Friendly AI Assistant]

Speed test results (136.17 tokens per second with Q4_K_M quantization on Apple M4 Pro):

llama_perf_sampler_print:    sampling time =       6.49 ms /   150 runs   (    0.04 ms per token, 23119.61 tokens per second)
llama_perf_context_print:        load time =     799.31 ms
llama_perf_context_print: prompt eval time =      58.75 ms /    20 tokens (    2.94 ms per token,   340.41 tokens per second)
llama_perf_context_print:        eval time =     947.34 ms /   129 runs   (    7.34 ms per token,   136.17 tokens per second)
llama_perf_context_print:       total time = 1053308.51 ms /   149 tokens
llama_perf_context_print:    graphs reused =        128
| model                           |     size |  params | backend    | threads |  test |            t/s |
| ------------------------------- | -------: | ------: | ---------- | ------: | ----: | -------------: |
| bailingmoe-v2 16B Q4_K - Medium | 9.21 GiB | 16.26 B | Metal,BLAS |       8 | pp512 | 1772.72 ± 3.49 |
| bailingmoe-v2 16B Q4_K - Medium | 9.21 GiB | 16.26 B | Metal,BLAS |       8 | tg128 |  142.66 ± 0.61 |
| bailingmoe-v2 16B Q4_K - Medium | 9.21 GiB | 16.26 B | Metal,BLAS |       8 | tg256 |  141.64 ± 1.06 |
| bailingmoe-v2 16B Q4_K - Medium | 9.21 GiB | 16.26 B | Metal,BLAS |       8 | tg512 |  132.44 ± 5.31 |
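
(The throughput table above looks like llama-bench output; assuming so, a command along the lines of `llama-bench -m ./Ling-mini-2.0-Q4_K_M.gguf -p 512 -n 128,256,512` would produce the pp512/tg128/tg256/tg512 rows. The exact invocation is not shown in this thread.)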

@CISC
Collaborator

CISC commented Sep 17, 2025

Here are my test results. It runs perfectly.

Sorry, but no, there are numerous issues with this implementation, I'll name just a few:

  • It does not work with the base model
  • It needlessly splits and permutes Q/K/V
  • It uses the wrong chat template (when not using --jinja)
  • It does not match the expert selection implementation
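
For context, the GGUF metadata shown in the log above (expert_gating_func = 2, i.e. sigmoid gating; expert_count = 256; expert_used_count = 8; expert_weights_norm = true; expert_weights_scale = 2.5) suggests the router works roughly as sketched below. This is an illustrative sketch under those assumptions, not the reference expert-selection code from either PR:

import numpy as np

def select_experts(router_logits: np.ndarray, n_used: int = 8, scale: float = 2.5):
    # sigmoid gating (expert_gating_func = 2 in the metadata above)
    probs = 1.0 / (1.0 + np.exp(-router_logits))
    # keep the top n_used of the 256 experts by gate probability
    top = np.argsort(-probs)[:n_used]
    weights = probs[top]
    # expert_weights_norm = true: renormalize the selected weights,
    # then apply expert_weights_scale = 2.5
    weights = weights / weights.sum()
    return top, weights * scale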

@fizzAI

fizzAI commented Sep 17, 2025

Here are my test results. It runs perfectly.

Sorry, but no, there are numerous issues with this implementation, I'll name just a few:

  • It does not work with the base model
  • It needlessly splits and permutes Q/K/V
  • It uses the wrong chat template (when not using --jinja)
  • It does not match the expert selection implementation

... You do realize you're talking to someone who works at/with Inclusion about their own model arch, right? No need for such needless passive aggression anyways when everyone is trying to help :'(

@CISC
Collaborator

CISC commented Sep 17, 2025

Here are my test results. It runs perfectly.

Sorry, but no, there are numerous issues with this implementation, I'll name just a few

... You do realize you're talking to someone who works at/with Inclusion about their own model arch, right? No need for such needless passive aggression anyways when everyone is trying to help :'(

Yes, and it is not passive aggression, simply stating facts.

@im0qianqian
Author

Here are my test results. It runs perfectly.

Sorry, but no, there are numerous issues with this implementation, I'll name just a few:

  • It does not work with the base model
  • It needlessly splits and permutes Q/K/V
  • It uses the wrong chat template (when not using --jinja)
  • It does not match the expert selection implementation

I understand. I've now fixed issue 2 about "It needlessly splits and permutes Q/K/V".
Whether your PR or mine gets merged, I appreciate your contribution to the open‑source adaptation of the Ling-series models.

@CISC
Collaborator

CISC commented Sep 18, 2025

I understand. I've now fixed issue 2 about "It needlessly splits and permutes Q/K/V". Whether your PR or mine gets merged, I appreciate your contribution to the open‑source adaptation of the Ling-series models.

You should not split them either, it's beneficial to have QKV fused.
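
To illustrate the fused-QKV point on the conversion side, a minimal sketch might look like the following; the helper and the tensor names are hypothetical and simplified, not the actual convert_hf_to_gguf.py code from either PR:

import numpy as np

def map_attention_tensor(hf_name: str, data: np.ndarray) -> list[tuple[str, np.ndarray]]:
    # If the checkpoint stores a fused query/key/value projection, keep it as a
    # single tensor rather than slicing (and permuting) it into separate Q/K/V.
    if "query_key_value" in hf_name:                                  # hypothetical HF-side name
        fused_name = hf_name.replace("query_key_value", "attn_qkv")  # hypothetical GGUF-side name
        return [(fused_name, data)]                                   # one fused tensor, no split
    return [(hf_name, data)]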

@im0qianqian im0qianqian requested a review from ngxson September 22, 2025 06:26
@cklsoft

cklsoft commented Sep 22, 2025

When will this PR be merged? I want to deploy GGUF-format Ling models on my macOS. :)

@im0qianqian im0qianqian requested a review from CISC as a code owner September 26, 2025 08:43
|| t.first == "_<EOT>"
|| t.first == "<|end_of_text|>"
|| t.first == "<end_of_utterance>" // smoldocling
|| t.first == "<|role_end|>" // Ling v2

Collaborator

Just wondering why this was added? It's set as eos_token and special in the tokenizer, so this should not be necessary.

Author

Hi, it's just because the llama-cli log told me that special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect.

Collaborator

Right, it's fine though, at that point it is added as EOG, so not an issue. :)

Author

Ok. Thank you.
