Conversation

gabriellarson (Contributor)

The text model portion of moonshotai/Kimi-VL-A3B-Instruct is functionally identical to moonshotai/Moonlight-16B-A3B-Instruct, but there is an error in the model's config files: the Kimi-VL models should use "<|im_end|>" as their EOS token, not "[EOS]". Without this fix, generation was stopping after any comma ",", and I'm not really sure why.

Just wanted to get this merged before I really start working on the vision portion.
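
For reference, here's a minimal way to check what the upstream HF config declares as EOS (a sketch only; it assumes transformers is installed and the model files are reachable, and it isn't part of this PR):

    from transformers import AutoConfig, AutoTokenizer

    repo = "moonshotai/Kimi-VL-A3B-Instruct"
    cfg = AutoConfig.from_pretrained(repo, trust_remote_code=True)
    tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

    # eos_token_id as declared at the top level of config.json (may be absent)
    eos_id = getattr(cfg, "eos_token_id", None)
    print("config eos_token_id:", eos_id)
    if isinstance(eos_id, int):
        print("which maps to      :", tok.convert_ids_to_tokens(eos_id))
    # what the tokenizer itself uses as EOS
    print("tokenizer eos_token:", tok.eos_token)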

github-actions bot added the python (python script changes) label on Aug 3, 2025
CISC (Collaborator) commented Aug 3, 2025

I don't think this is the correct fix; <|im_end|> is automatically added as EOG, so the real issue is why it stops after commas (it almost sounds like it outputs ,[EOS]).

Have you tried --ignore-eos and looked at the actual tokens generated?

gabriellarson (Contributor, Author)

This GGUF was made with just the "basic" commit, without the fix; you can see that:
print_info: BOS token = 11 ','
print_info: EOS token = 11 ','
print_info: EOT token = 163586 '<|im_end|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 11 ','
print_info: EOG token = 163586 '<|im_end|>'

./llama-cli -m "E:\llama.cpp\dev\kimivl\kimi-vl-instruct\Kimi-VL-Instruct-Q4_0_basic.gguf" -p "What are the colors of the google logo?" --ignore-eos
build: 6076 (1ffa83f9) with MSVC 19.36.32534.0 for x64
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 43 key-value pairs and 430 tensors from E:\llama.cpp\dev\kimivl\kimi-vl-instruct\Kimi-VL-Instruct-Q4_0_basic.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Kimi Vl Instruct
llama_model_loader: - kv   3:                           general.finetune str              = instruct
llama_model_loader: - kv   4:                           general.basename str              = kimi-vl
llama_model_loader: - kv   5:                         general.size_label str              = 64x1.8B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Moonlight 16B A3B
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Moonshotai
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/moonshotai/Moo...
llama_model_loader: - kv  11:                               general.tags arr[str,5]       = ["agent", "video", "screenspot", "lon...
llama_model_loader: - kv  12:                      deepseek2.block_count u32              = 27
llama_model_loader: - kv  13:                   deepseek2.context_length u32              = 131072
llama_model_loader: - kv  14:                 deepseek2.embedding_length u32              = 2048
llama_model_loader: - kv  15:              deepseek2.feed_forward_length u32              = 11264
llama_model_loader: - kv  16:             deepseek2.attention.head_count u32              = 16
llama_model_loader: - kv  17:          deepseek2.attention.head_count_kv u32              = 1
llama_model_loader: - kv  18:                   deepseek2.rope.freq_base f32              = 800000.000000
llama_model_loader: - kv  19: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  20:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  21:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  22:                       deepseek2.vocab_size u32              = 163840
llama_model_loader: - kv  23:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  24:             deepseek2.attention.key_length u32              = 576
llama_model_loader: - kv  25:           deepseek2.attention.value_length u32              = 512
llama_model_loader: - kv  26:         deepseek2.attention.key_length_mla u32              = 192
llama_model_loader: - kv  27:       deepseek2.attention.value_length_mla u32              = 128
llama_model_loader: - kv  28:       deepseek2.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  29:                     deepseek2.expert_count u32              = 64
llama_model_loader: - kv  30:              deepseek2.expert_shared_count u32              = 2
llama_model_loader: - kv  31:             deepseek2.expert_weights_scale f32              = 2.446000
llama_model_loader: - kv  32:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  33:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  34:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  35:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  36:                         tokenizer.ggml.pre str              = kimi-k2
llama_model_loader: - kv  37:                      tokenizer.ggml.tokens arr[str,163840]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  38:                  tokenizer.ggml.token_type arr[i32,163840]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  39:                      tokenizer.ggml.merges arr[str,163328]  = ["Ġ Ġ", "ĠĠ ĠĠ", "Ġ t", "i n",...
llama_model_loader: - kv  40:                    tokenizer.chat_template str              = {%- for message in messages -%}{%- if...
llama_model_loader: - kv  41:               general.quantization_version u32              = 2
llama_model_loader: - kv  42:                          general.file_type u32              = 2
llama_model_loader: - type  f32:  134 tensors
llama_model_loader: - type q4_0:  295 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 8.45 GiB (4.55 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 256
load: token to piece cache size = 1.0607 MB
print_info: arch             = deepseek2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2048
print_info: n_layer          = 27
print_info: n_head           = 16
print_info: n_head_kv        = 1
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 576
print_info: n_embd_head_v    = 512
print_info: n_gqa            = 16
print_info: n_embd_k_gqa     = 576
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 11264
print_info: n_expert         = 64
print_info: n_expert_used    = 6
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 800000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 16B
print_info: model params     = 15.96 B
print_info: general.name     = Kimi Vl Instruct
print_info: n_layer_dense_lead   = 1
print_info: n_lora_q             = 0
print_info: n_lora_kv            = 512
print_info: n_embd_head_k_mla    = 192
print_info: n_embd_head_v_mla    = 128
print_info: n_ff_exp             = 1408
print_info: n_expert_shared      = 2
print_info: expert_weights_scale = 2.4
print_info: expert_weights_norm  = 1
print_info: expert_gating_func   = sigmoid
print_info: rope_yarn_log_mul    = 0.0000
print_info: vocab type       = BPE
print_info: n_vocab          = 163840
print_info: n_merges         = 163328
print_info: BOS token        = 11 ','
print_info: EOS token        = 11 ','
print_info: EOT token        = 163586 '<|im_end|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 11 ','
print_info: EOG token        = 163586 '<|im_end|>'
print_info: max token length = 512
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_REPACK model buffer size =  8169.40 MiB
load_tensors:   CPU_Mapped model buffer size =  8553.67 MiB
......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: kv_unified    = false
llama_context: freq_base     = 800000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.62 MiB
llama_kv_cache_unified:        CPU KV buffer size =   229.50 MiB
llama_kv_cache_unified: size =  229.50 MiB (  4096 cells,  27 layers,  1/1 seqs), K (f16):  121.50 MiB, V (f16):  108.00 MiB
llama_context:        CPU compute buffer size =   356.38 MiB
llama_context: graph nodes  = 1974
llama_context: graph splits = 1
common_init_from_params: added , logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|><|im_user|>user<|im_middle|>Hello<|im_end|><|im_assistant|>assistant<|im_middle|>Hi there<|im_end|><|im_user|>user<|im_middle|>How are you?<|im_end|><|im_assistant|>assistant<|im_middle|>

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: interactive mode on.
sampler seed: 2410674088
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

userWhat are the colors of the google logo?assistantAs an AI developed by Moonshot AI in 2023 and based on the latest information available up to October 2023 (before my knowledge cutoff date), the Google logo is indeed a combination of colors that have changed over time. The most recent version features a primary color palette that includes blue and green. The blue is a bright shade that can be associated with the sky or the vastness of space. The green is a vibrant and lively color that can be associated with growth and technology. These ... etc etc etc

gabriellarson (Contributor, Author)

And now that I'm looking at the "fixed" GGUF again, this one still has the comma as the BOS:

print_info: vocab type = BPE
print_info: n_vocab = 163840
print_info: n_merges = 163328
print_info: BOS token = 11 ','
print_info: EOS token = 163586 '<|im_end|>'
print_info: EOT token = 163586 '<|im_end|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 163586 '<|im_end|>'
print_info: max token length = 512
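
(If it helps while testing, llama.cpp's --override-kv flag should let you patch the wrong BOS on an already-converted GGUF without redoing the conversion; 163584 is the bos_token_id from Kimi-VL's config.json, and the file name is just the one from my run above:)

    ./llama-cli -m Kimi-VL-Instruct-Q4_0_basic.gguf --override-kv tokenizer.ggml.bos_token_id=int:163584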

gabriellarson (Contributor, Author)

OK, I understand where the problem is coming from now, @CISC.

This is the kimi-vl config.json structure; it puts the x_token_id fields inside "text_config":

{
  "architectures": [
    "KimiVLForConditionalGeneration"
  ],
  "auto_map": {
    ...
  },
  "vision_config": {
    ...
  },
  "text_config": {
    "vocab_size": 163840,
    ...
    "bos_token_id": 163584,
    "pad_token_id": 163839,
    "eos_token_id": 163585,
  },
  ...
  "model_type": "kimi_vl"
}

In gguf-py\gguf\vocab.py, it looks only at config["x_token_id"], not config["text_config"]["x_token_id"]:

    def _try_load_from_config_json(self, path: Path) -> bool:
        config_file = path / 'config.json'
        if not config_file.is_file():
            return False
        with open(config_file, encoding = 'utf-8') as f:
            config = json.load(f)
        for typ in self.special_token_types:
            self._set_special_token(typ, config.get(f'{typ}_token_id'))
        return True

So I added some extra logic to look at the text_config.
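
Roughly, the extra logic is along these lines (a sketch of the idea rather than the exact diff):

    def _try_load_from_config_json(self, path: Path) -> bool:
        config_file = path / 'config.json'
        if not config_file.is_file():
            return False
        with open(config_file, encoding = 'utf-8') as f:
            config = json.load(f)
        # Multimodal configs like Kimi-VL nest the ids under "text_config",
        # so fall back to that block when the top-level key is missing.
        text_config = config.get('text_config', {})
        for typ in self.special_token_types:
            token_id = config.get(f'{typ}_token_id')
            if token_id is None:
                token_id = text_config.get(f'{typ}_token_id')
            self._set_special_token(typ, token_id)
        return True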

CISC (Collaborator) left a review comment

Nice catch!

CISC merged commit 83bc2f2 into ggml-org:master on Aug 3, 2025 (6 checks passed).
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request on Aug 5, 2025:
… text_config) (ggml-org#15051)

* basic kimi-vl textmodel conversion
* check config["text_config"] for special tokens