mtmd : support Kimi VL model #15458
Conversation
tools/mtmd/clip.cpp

```cpp
cur = ggml_permute(ctx0, cur, 0, 2, 1, 3);
cur = ggml_cont_2d(ctx0, cur, cur->ne[0], cur->ne[1] * cur->ne[2]);
cur = build_pixel_shuffle(cur, scale_factor);
```
In the HF checkpoint, the block is called `PixelUnshuffleBlock`. I followed that when I added the comment above: `// pixel unshuffle block`. The function name is `build_pixel_shuffle`, though. Should it be `shuffle` or `unshuffle`? :) I'm personally fine with either, as long as we are consistent.
In SmolVLM (SigLIP 1) it's called "shuffle", in SigLIP 2 it's "unshuffle", and here in Kimi-VL it's called "merge_patches".
I think a more appropriate name would be `merge_patches_permute`, since it relies on permutation; that differentiates it from the patch merging done via `pool_avg_2d` (used by Gemma 3).
It should work now. No idea why Kimi-VL-A3B-Instruct outputs gibberish even with text-only input, so I removed the GGUF
```diff
-    if (!pos_embd || height * width == pos_embd->ne[1]) {
+    GGML_ASSERT(pos_embd);
+    if (height == n_per_side && width == n_per_side) {
```
@tdakhran FYI, I attempted to fix a potential bug here. For example, if the input shape is `[n_embd, 256]` and `h = 8, w = 32`, then `8 * 32 == 256`, which skips `ggml_interpolate` even when it is needed.
Thank you @ngxson, I missed this logic.
Thank you for the new models and the fixes to existing ones, @ngxson!
I would like to report that Kimi-VL-A3B-Instruct does not generate gibberish outputs in my implementation.
* convert : fix tensor naming conflict for llama 4 vision
* convert ok
* support kimi vision model
* clean up
* fix style
* fix calc number of output tokens
* refactor resize_position_embeddings
* add test case
* rename build fn
* correct a small bug
@foldl feel free to open a PR to fix it
I can also report that
Fix #14318
Support https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking-2506 (and newer "Thinking" variants) with dynamic resolution
GGUF: https://huggingface.co/ggml-org/Kimi-VL-A3B-Thinking-2506-GGUF
A large part of the code is copied from the LFM2 implementation, huge kudos to @tdakhran for the LFM2 code 😄
NOTE: Kimi-VL-A3B-Instruct generates gibberish output even in text-only mode. I have no idea why; if someone knows, please comment.